Stacks Data Utilities

stacks-data is a Python library containing a suite of utilities to accelerate development within an Ensono Stacks Data Platform. It is an integral part of the platform, supporting common tasks such as generating new data engineering workloads and running Spark jobs. stacks-data consists of:

  • Datastacks CLI - A command-line interface for data engineers, enabling interaction with Datastacks' various functions.
  • Data workload generation - Generate new data workloads based upon common templates.
  • PySpark utilities - A suite of reusable utilities to simplify development of data pipelines using Apache Spark and Python.
  • Data quality utilities - Utilities to support the data quality framework implemented in Stacks.
  • Azure utilities - Utilities to support common interactions with Azure resources from data workloads.
  • Behave utilities - Common scenarios and setup used by Behave end-to-end tests.

Setup

The following setup steps will ensure your development environment is set up correctly and install stacks-data into your Python virtual environment:

Alternatively, you can install stacks-data directly from PyPI, using:

pip install stacks-data
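As an illustration, the commands below create and activate a virtual environment before installing. This is a minimal sketch: the environment name `.venv` is an arbitrary choice, not something stacks-data requires.

```bash
# Create an isolated Python virtual environment (the name ".venv" is arbitrary)
python -m venv .venv

# Activate it (Linux/macOS; on Windows use .venv\Scripts\activate)
source .venv/bin/activate

# Install stacks-data from PyPI into the active environment
pip install stacks-data
```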

For information on utilising stacks-data from within Databricks, see development in Databricks.

Azure environment variables

Several environment variables are required by stacks-data to interact with Azure services. The environment variables you require differ depending on which processes you are running, and where you are running them from (e.g. your local machine, a Databricks cluster, or a CI/CD pipeline). A Stacks Data Platform will automatically ensure the required environment variables are made available from CI/CD pipelines and Databricks job clusters. However, to run processes from your local machine or a different Databricks cluster, you will need to configure these manually.

Storage account names

Environment variables defining the storage account names are required both for running Spark jobs and for triggering end-to-end tests, so they should be defined wherever you run these tasks (e.g. a local machine or a Databricks cluster).

| Environment variable name | Description | Example value |
| --- | --- | --- |
| ADLS_ACCOUNT | Azure Data Lake Storage account name. | amidostacksdeveuwdeadls |
| CONFIG_BLOB_ACCOUNT | Blob Storage account name used for config data. | amidostacksdeveuwdeconfi |
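For local development, these can simply be exported in your shell before running Spark jobs or end-to-end tests. A minimal sketch, using the example values above as placeholders:

```bash
# Placeholder values - replace with the storage account names from your deployment
export ADLS_ACCOUNT=amidostacksdeveuwdeadls
export CONFIG_BLOB_ACCOUNT=amidostacksdeveuwdeconfi
```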

Running Spark jobs

To run Spark jobs, you need to define the storage account names as well as the environment variables below. If you are developing Spark jobs from within Databricks, these variables will need to be set on your cluster and should reference values from the key vault (see PySpark development in Databricks).

| Environment variable name | Description | Example value |
| --- | --- | --- |
| AZURE_TENANT_ID | Directory (tenant) ID for the Azure Active Directory application. | 00000000-0000-0000-0000-000000000000 |
| AZURE_CLIENT_ID | Service principal application ID. | 00000000-0000-0000-0000-000000000000 |
| AZURE_CLIENT_SECRET | Service principal secret. | secretValue123456 |
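When running Spark jobs from a local machine, these can be exported alongside the storage account names. A minimal sketch with placeholder values; in practice the secret should be sourced from a key vault or secret store rather than typed in plain text:

```bash
# Service principal credentials used to authenticate with Azure.
# Placeholder values - replace with your own tenant ID, client ID and secret.
export AZURE_TENANT_ID=00000000-0000-0000-0000-000000000000
export AZURE_CLIENT_ID=00000000-0000-0000-0000-000000000000
export AZURE_CLIENT_SECRET=secretValue123456
```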

Running end-to-end tests

To trigger end-to-end tests, you need to define the storage account names as well as the environment variables below. Running end-to-end tests requires access to various Azure resources, for example to prepare and tidy up test data and to trigger Data Factory pipelines. AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_CLIENT_SECRET may also be provided to authenticate with Azure; alternatively, if running the tests locally, you can authenticate by signing in to the Azure CLI.

| Environment variable name | Description | Example value |
| --- | --- | --- |
| AZURE_SUBSCRIPTION_ID | Azure subscription ID. | 00000000-0000-0000-0000-000000000000 |
| AZURE_RESOURCE_GROUP_NAME | Name of the resource group. | amido-stacks-dev-euw-de |
| AZURE_DATA_FACTORY_NAME | Name of the Data Factory resource. | amido-stacks-dev-euw-de |
| AZURE_REGION_NAME | Azure region in which the platform is deployed. | West Europe |
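For example, to run the tests from a local machine you might export the variables below and sign in with the Azure CLI instead of supplying service principal credentials. A minimal sketch with placeholder values:

```bash
# Placeholder values - replace with the details of your Stacks deployment
export AZURE_SUBSCRIPTION_ID=00000000-0000-0000-0000-000000000000
export AZURE_RESOURCE_GROUP_NAME=amido-stacks-dev-euw-de
export AZURE_DATA_FACTORY_NAME=amido-stacks-dev-euw-de
export AZURE_REGION_NAME="West Europe"

# Authenticate interactively with the Azure CLI (an alternative to setting
# AZURE_TENANT_ID / AZURE_CLIENT_ID / AZURE_CLIENT_SECRET)
az login
```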