Data Ingest Pipeline Deployment
This section provides an overview of generating a new data ingest pipeline workload and deploying it into an Ensono Stacks Data Platform, using the Datastacks CLI.
This guide assumes the following are in place:
- A deployed Ensono Stacks Data Platform
- Development environment set up
- Deployed shared resources
- A data source to ingest from. The steps below are based on using the Azure SQL example data source
This process will deploy the following resources into the project:
- Azure Data Factory resources (defined in Terraform / ARM)
  - Linked service
  - Dataset
  - Pipeline
  - Trigger
- Data ingest config files (JSON)
- Azure DevOps CI/CD pipeline (YAML)
- (optional) Spark job and config file for data quality tests (Python)
- Template unit tests (Python)
- Template end-to-end tests (Python, Behave)
Data source pre-requisites
Details required for connecting to the data source will need to be stored securely (i.e. not in the source code) and to be referenced dynamically by the deployment pipeline. This approach also allows for different versions of the data source to be used in different environments (for example non-prod / prod versions). The examples below require the following details to be set for the Azure SQL sample database in each environment:
Azure DevOps variable
Azure DevOps variables are accessed dynamically during deployments, so they are used for the details needed to create the linked service in Data Factory.
sql_connection
: connection string for the database, for example Data Source=amidostacksdeveuwdesql.database.windows.net;Initial Catalog=exampledb;User ID=user;Integrated Security=False;Encrypt=True;Connection Timeout=30;
Key Vault secret
The password will need to be accessed dynamically by Data Factory on each connection, and should therefore be stored in the Key Vault linked to the factory.
sql-password
: password to use with the connection string
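Before deploying, you may wish to confirm that the secret and connection details resolve correctly. The snippet below is a minimal, illustrative check rather than part of the generated workload: the Key Vault URL is a placeholder, the server, database and user are taken from the example connection string above, and it assumes the azure-identity, azure-keyvault-secrets and pyodbc packages (plus the Microsoft ODBC Driver 18 for SQL Server) are available.

```python
# Illustrative sanity check of the data source pre-requisites (not part of the
# generated workload). The Key Vault URL is a placeholder; the SQL details are
# taken from the example connection string above.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
import pyodbc

KEY_VAULT_URL = "https://<your-key-vault-name>.vault.azure.net"  # placeholder
SQL_SERVER = "amidostacksdeveuwdesql.database.windows.net"
SQL_DATABASE = "exampledb"
SQL_USER = "user"

# Retrieve the password stored as the 'sql-password' Key Vault secret
credential = DefaultAzureCredential()
secret_client = SecretClient(vault_url=KEY_VAULT_URL, credential=credential)
sql_password = secret_client.get_secret("sql-password").value

# Open a test connection using an ODBC-style equivalent of the connection string
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server={SQL_SERVER};Database={SQL_DATABASE};"
    f"UID={SQL_USER};PWD={sql_password};Encrypt=yes;"
)
with pyodbc.connect(conn_str, timeout=30) as conn:
    print("Connection OK:", conn.cursor().execute("SELECT 1").fetchone()[0] == 1)
```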
Step 1: Create feature branch
Before creating a new workload using Datastacks, open the project locally and create a new branch for the workload being created, e.g.:
git checkout -b feat/my-new-ingest-pipeline
Step 2: Prepare the Datastacks config file
Datastacks requires a YAML config file for generating a new ingest workload - see Datastacks configuration for further details.
Create a new YAML file and populate the values relevant to your new ingest pipeline. The example below will create an ingest workload named Ingest_AzureSql_MyNewExample, and connect using the data source connection details as specified in Data source pre-requisites above.
#######################
# Required parameters #
#######################
# Data pipeline configurations
dataset_name: AzureSql_MyNewExample
pipeline_description: "Ingest from demo Azure SQL database using ingest config file."
data_source_type: azure_sql
data_source_password_key_vault_secret_name: sql-password
data_source_connection_string_variable_name: sql_connection
# Azure DevOps configurations
ado_variable_groups_nonprod:
- amido-stacks-de-pipeline-nonprod
- stacks-credentials-nonprod-kv
ado_variable_groups_prod:
- amido-stacks-de-pipeline-prod
- stacks-credentials-prod-kv
#######################
# Optional parameters #
#######################
# Workload config
window_start_default: 2010-01-01
window_end_default: 2010-01-31
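Before running generation, a quick sanity check of this file can catch typos in the required keys listed above. The snippet below is purely illustrative (it is not a Datastacks command); replace the path with the location of your config file.

```python
# Illustrative pre-flight check of the Datastacks config file, covering the
# required parameters shown above. Not a Datastacks command.
import yaml  # pip install pyyaml

REQUIRED_KEYS = {
    "dataset_name",
    "pipeline_description",
    "data_source_type",
    "data_source_password_key_vault_secret_name",
    "data_source_connection_string_variable_name",
    "ado_variable_groups_nonprod",
    "ado_variable_groups_prod",
}

with open("path_to_config_file/my_config.yaml") as f:
    config = yaml.safe_load(f)

missing = REQUIRED_KEYS - set(config)
if missing:
    raise SystemExit(f"Missing required config keys: {sorted(missing)}")
print(f"Config OK: dataset_name={config['dataset_name']}")
```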
Step 3: Generate project artifacts using Datastacks
Use the Datastacks CLI to generate the artifacts for the new workload, using the prepared config file (replacing path_to_config_file/my_config.yaml with the appropriate path):
# Activate virtual environment
poetry shell
# Generate resources for an ingest pipeline (without data quality steps)
datastacks generate ingest --config="path_to_config_file/my_config.yaml"
# Generate resources for an ingest pipeline (with added data quality steps)
datastacks generate ingest --config="path_to_config_file/my_config.yaml" --data-quality
This will add new project artifacts for the workload under de_workloads/ingest/Ingest_AzureSql_MyNewExample, based on the ingest workload templates. Review the resources that have been generated.
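For example, a short script like the illustrative one below (run from the repository root) will print the generated files so you can review them:

```python
# List the artifacts generated for the new workload (run from the repo root)
from pathlib import Path

workload_dir = Path("de_workloads/ingest/Ingest_AzureSql_MyNewExample")
for path in sorted(p for p in workload_dir.rglob("*") if p.is_file()):
    print(path.relative_to(workload_dir))
```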
The default ingest workload generated by Datastacks is based upon ingesting from an Azure SQL data source. For the purposes of this getting started example, you can leave the generated resources as they are. See ingest data source types for further information on adapting the workload for other data source types.
The default ingest workload contains an example tumbling window trigger, which defaults to a 'Stopped' state. This is defined in the Terraform resource in data_factory/adf_triggers.tf, and can be modified based on your requirements.
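If you need to start the stopped trigger for ad-hoc testing in a non-production environment without editing the Terraform, one option is the Azure SDK. The sketch below is illustrative only: the subscription, resource group, factory and trigger names are placeholders, the method names can vary between azure-mgmt-datafactory versions, and any state changed this way may be reverted the next time the deployment pipeline runs.

```python
# Illustrative sketch: starting a stopped Data Factory trigger via the Azure
# SDK (azure-identity + azure-mgmt-datafactory). All names are placeholders.
# Trigger state is managed in Terraform, so changes made here may be reverted
# on the next deployment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"            # placeholder
RESOURCE_GROUP = "<nonprod-resource-group>"      # placeholder
FACTORY_NAME = "<data-factory-name>"             # placeholder
TRIGGER_NAME = "<tumbling-window-trigger-name>"  # placeholder

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
adf.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, TRIGGER_NAME).wait()
trigger = adf.triggers.get(RESOURCE_GROUP, FACTORY_NAME, TRIGGER_NAME)
print("Trigger state:", trigger.properties.runtime_state)
```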
Step 4: Update ingest configuration
Configuration of the data that the workload will ingest from the source is specified in the workload's config/ingest_sources/ingest_config.json file - see data ingest configuration for further details on this file. For the example data source, update the contents of the file with the following:
{
"data_source_name": "Ingest_AzureSql_MyNewExample",
"data_source_type": "azure_sql",
"enabled": true,
"ingest_entities": [
{
"version": 1,
"display_name": "movies.movies_metadata",
"enabled": true,
"schema": "movies",
"table": "movies_metadata",
"columns": "[adult], [belongs_to_collection], [budget], [genres], [homepage], [id], [imdb_id], [original_language], [original_title], [overview], [popularity], [poster_path], [production_companies], [production_countries], [release_date], [revenue], [runtime], [spoken_languages], [status], [tagline], [title], [video], [vote_average], [vote_count]",
"load_type": "full",
"delta_date_column": null,
"delta_upsert_key": null
},
{
"version": 1,
"display_name": "movies.ratings_small",
"enabled": true,
"schema": "movies",
"table": "ratings_small",
"columns": "[userId], [movieId], [rating], [timestamp]",
"load_type": "full",
"delta_date_column": null,
"delta_upsert_key": null
},
{
"version": 1,
"display_name": "movies.keywords",
"enabled": true,
"schema": "movies",
"table": "keywords",
"columns": "[id], [keywords]",
"load_type": "full",
"delta_date_column": null,
"delta_upsert_key": null
},
{
"version": 1,
"display_name": "movies.links",
"enabled": true,
"schema": "movies",
"table": "links",
"columns": "[movieId], [imdbId], [tmdbId]",
"load_type": "full",
"delta_date_column": null,
"delta_upsert_key": null
}
]
}
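If you amend this file further, a short script like the illustrative one below (run from the workload directory; not part of the generated workload) can catch obvious mistakes. It assumes that 'delta' is the alternative load_type to 'full', as implied by the delta_date_column and delta_upsert_key fields.

```python
# Illustrative validation of ingest_config.json: every enabled entity should
# define the fields the ingest pipeline relies on, and (assumed) 'delta' loads
# should specify a delta_date_column. Run from the workload directory.
import json

with open("config/ingest_sources/ingest_config.json") as f:
    config = json.load(f)

for entity in config["ingest_entities"]:
    if not entity.get("enabled"):
        continue
    for field in ("display_name", "schema", "table", "columns", "load_type"):
        assert entity.get(field), f"{entity.get('display_name')}: missing '{field}'"
    if entity["load_type"] == "delta":
        assert entity.get("delta_date_column"), (
            f"{entity['display_name']}: delta loads need a delta_date_column"
        )

print(f"{config['data_source_name']}: ingest config looks valid")
```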
Step 5: Update end-to-end tests
The end-to-end tests are designed to run the ingest pipeline in a controlled fashion to ensure it functions as expected. Open the test feature file for the workload (tests/end_to_end/features/azure_data_ingest.feature) and update the parameters to reflect the data entities expected to be ingested. In our example, we will use the entities specified in the config file above, i.e.:
|{"window_start" : "2010-01-01", "window_end": "2010-01-31"}|["movies.keywords", "movies.links", "movies.movies_metadata", "movies.ratings_small"]|
Step 6: Deploy new workload in non-production environment
The generated workload includes a template Azure DevOps CI/CD pipeline, defined in the YAML file de-ingest-ado-pipeline.yaml. This should be added as the definition for a new pipeline in Azure DevOps.
- Sign-in to your Azure DevOps organization and go to your project.
- Go to Pipelines, and then select New pipeline.
- Name the new pipeline to match the name of your new workload, e.g. de-ingest-azuresql-mynewexample.
- For the pipeline definition, specify the YAML file in the project repository feature branch (e.g. de-ingest-ado-pipeline.yaml) and save.
- The new pipeline will require access to any Azure DevOps pipeline variable groups specified in the Datastacks config file. Under each variable group, go to 'Pipeline permissions' and add the new pipeline.
- Run the new pipeline.
Running this pipeline in Azure DevOps will deploy the artifacts into the non-production (nonprod) environment and run tests. If successful, the generated resources will now be available in the nonprod Ensono Stacks environment.
Step 7: Review deployed resources
If successful, the new resources will now be deployed into the non-production resource group in Azure - these can be viewed through the Azure Portal or CLI.
The Azure Data Factory resources can be viewed through the Data Factory UI. You may also wish to run/debug the newly generated pipeline from here (see Microsoft documentation).
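The deployed factory can also be inspected, and the new pipeline run, programmatically via the Azure SDK. The sketch below is illustrative rather than part of the generated workload: the resource names are placeholders, and the pipeline parameter names are assumed from the workload's window defaults rather than taken from the generated definition, so check the generated pipeline for the parameters it actually expects.

```python
# Illustrative sketch: list the deployed pipelines and start a run of the new
# ingest pipeline via the Azure SDK (azure-identity + azure-mgmt-datafactory).
# Resource names are placeholders; parameter names are assumed, not confirmed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"        # placeholder
RESOURCE_GROUP = "<nonprod-resource-group>"  # placeholder
FACTORY_NAME = "<data-factory-name>"         # placeholder

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# List the pipelines now deployed in the factory
for pipeline in adf.pipelines.list_by_factory(RESOURCE_GROUP, FACTORY_NAME):
    print(pipeline.name)

# Start a run of the new ingest pipeline (assumed parameter names)
run = adf.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "Ingest_AzureSql_MyNewExample",
    parameters={"window_start": "2010-01-01", "window_end": "2010-01-31"},
)
print("Started pipeline run:", run.run_id)
```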
The structure of the data platform and Data Factory resources are defined in the project's code repository, and deployed through the Azure DevOps pipelines. Changes to Data Factory resources directly through the UI will lead to them being overwritten when deployment pipelines are next run. See Data Factory development quickstart for further information on updating Data Factory resources.
Continue to make any further amendments required to the new workload, re-running the DevOps pipeline as required. If including data quality checks, update the workload's ingest_dq config file in the repository with details of the checks required on the data (see data quality configuration for further details).
Step 8: Deploy new workload in further environments
In the example pipeline templates:
- Deployment to the non-production (nonprod) environment is triggered on a feature branch when a pull request is open
- Deployment to the production (prod) environment is triggered on merging to the main branch, followed by manual approval of the release step.
For any data platform, it is recommended that the processes for deploying and releasing across environments are agreed and documented, ensuring sufficient review and quality assurance of any new workloads. The template CI/CD pipelines provided are based upon two platform environments (nonprod and prod), but these may be amended depending upon the specific requirements of your project and organisation.
Next steps
Now that you have ingested some data into the bronze data lake layer, you can generate a data processing pipeline to transform and model the data.