This section covers the steps required to start developing an Ensono Stacks Azure Data Platform from your machine:
- Make sure you have installed the applications listed in the local development requirements.
- Ensure that Poetry is added to your PATH.
Poetry will be used to create a Python virtual environment for the project and to install the project's dependencies (including stacks-data). A make command has been created to assist with the initial setup, and to install other development tools such as pre-commit.
You may wish to enable the virtualenvs.in-project configuration setting in Poetry - this ensures that the Python virtual environment for the project is created within the project directory, which can simplify management and integration with your IDE. To set this, run `poetry config virtualenvs.in-project true`.
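After setting it, you can confirm the value took effect - reading a setting back with `poetry config <key>` is standard Poetry CLI behaviour:

```shell
# Verify the setting; this should print "true"
poetry config virtualenvs.in-project
```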
To set up your local development environment, run the provided make command, then enter the Poetry virtual environment shell.
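As a sketch of these two steps (the make target name here is an assumption - check the project's Makefile for the actual target; `poetry shell` is the standard Poetry command for entering the virtual environment):

```shell
# Set up the local development environment (assumed target name)
make setup

# Enter the Poetry virtual environment shell
poetry shell
```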
(Optional) Azure connection
To interact with Azure resources when developing, including running the end-to-end tests, you must:
- Sign in to the Azure CLI
- Set the following environment variables:
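The exact variable names depend on the project's configuration, but as an illustration (the names below are assumptions, not confirmed by this document), an Azure connection typically needs values such as:

```shell
# Hypothetical variable names -- check the project's documentation
# or Makefile for the exact variables it expects
export AZURE_SUBSCRIPTION_ID="<your-subscription-id>"
export AZURE_TENANT_ID="<your-tenant-id>"
export AZURE_CLIENT_ID="<service-principal-app-id>"
export AZURE_CLIENT_SECRET="<service-principal-secret>"
```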
make commands are provided to assist with running tests while developing locally. See testing for further details on these tests.
If you encounter PATH-related issues with Poetry when running the tests, we recommend installing Poetry using pipx rather than the official installer.
To run the unit tests, run the following command:
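For example (the target name is an assumption - check the project's Makefile for the exact target):

```shell
# Run the unit tests (assumed target name)
make test
```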
Running the end-to-end tests will involve executing Data Factory pipelines in Azure. Ensure you have set up the Azure connection, then run:
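As a sketch (again, the target name is an assumption - check the project's Makefile for the exact target):

```shell
# Run the end-to-end tests against Azure (assumed target name)
make e2e_test
```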
Running end-to-end tests from your local machine may require additional permissions in Azure. If the tests fail whilst clearing up directories, ensure that your user has the Storage Blob Data Contributor role assigned on the relevant storage account (or at subscription scope). You may also need to configure the storage account's firewall rules to allow your IP address.
Code quality checks
Pre-commit is used for code quality and linting checks on the project. It can be run using:
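For example, assuming the standard pre-commit CLI (rather than a project-specific make target):

```shell
# Run all configured pre-commit hooks against every file in the repository
pre-commit run --all-files
```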
(Optional) PySpark development in Databricks
When developing with PySpark, you may wish to either:
- Run scripts locally using a local Spark installation, or
- Run scripts on a Databricks cluster, through Databricks Repos.
To run scripts within a Databricks cluster, you will need to:
- Ensure the stacks-data library is installed on the cluster.
- Add the additional Azure environment variables - the values can be set as per the Data Factory linked service (see adf_linked_services.tf).
- Ensure the user has appropriate permissions for the required Azure resources.
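For the first step, stacks-data can be installed through the cluster's Libraries UI, or as a notebook-scoped sketch (in a Databricks notebook cell this is usually run with the `%pip` magic, i.e. `%pip install stacks-data`):

```shell
# Install the stacks-data library from PyPI
pip install stacks-data
```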
Azure Data Factory Development
A core component of the Ensono Stacks Data Platform is Azure Data Factory, which is used for ingest activities, pipeline orchestration and scheduling. When an instance of Data Factory has been deployed, its intuitive user interface can be used for reviewing, monitoring and editing resources.
While resources can be edited directly through the UI, the approach used in Stacks is to manage all resources through infrastructure-as-code using Terraform. This allows full CI/CD capabilities and control over changes across environments. Developers may use Data Factory's UI to assist in the development of new resources, and then transpose these into the project repository.
The following resource types will typically be added for new data workloads:
|Resource type|Stacks workload types|Defined in|Notes|
|---|---|---|---|
|Linked services|Ingest| |Refer to Microsoft documentation for up-to-date details on connector types supported by Data Factory, and Terraform documentation for adding custom linked services. Core linked services are added during deployment of shared resources.|
|Datasets|Ingest| |Refer to Terraform documentation for adding custom datasets. Core datasets are added during deployment of shared resources.|
|Pipelines|Ingest & Processing| |Pipelines are deployed using the Terraform `azurerm_resource_group_template_deployment` type. These refer to a JSON file containing the pipeline definition. The pipeline definition JSON can be obtained after creating pipelines interactively through the Data Factory UI. If editing a pipeline in the Data Factory UI, click the Code button (braces icon) to view the pipeline definition JSON.|
|Triggers|Ingest| |Refer to Terraform documentation for adding triggers, e.g. tumbling window triggers.|
Changes made to Data Factory resources directly through the UI will be overwritten the next time the deployment pipelines run. Make all updates within the project repository so that they are not lost.
Once you have set up your local development environment, you can continue with the Getting Started tutorial by deploying the shared resources.