
Datastacks CLI

The Datastacks CLI is a command-line interface for data engineers, built upon the Stacks Data Python library. Its features include generating all the resources required for new data engineering workloads, such as data ingest and data processing pipelines, from templates.

Using the Datastacks CLI

# Option 1: Run Datastacks CLI using Poetry's interactive shell (recommended for local development)
poetry shell
datastacks --help

# Option 2: Run Datastacks CLI using poetry run (recommended where Poetry shell cannot be used, e.g. CI/CD pipelines)
poetry run datastacks --help

Generating data workloads

Datastacks can be used to generate all the resources required for a new data engineering workload, for example a data ingest or data processing pipeline. All resources for the workload are created from templates.

The deployment architecture section shows the workflow for using Datastacks to generate a new workload. The getting started section includes step-by-step instructions on deploying a new ingest or processing workload using Datastacks.

Commands

  • generate: Top-level command for generating resources for a new data workload.
    • ingest: Subcommand to generate a new data ingest workload, using the provided configuration file. An optional flag (--data-quality or -dq) can be included to specify whether to include data quality components in the workload.
    • processing: Subcommand to generate a new data processing workload, using the provided configuration file. An optional flag (--data-quality or -dq) can be included to specify whether to include data quality components in the workload.

Examples

# Activate virtual environment
poetry shell

# Generate resources for an ingest workload
datastacks generate ingest --config="de_workloads/generate_examples/test_config_ingest.yaml"

# Generate resources for an ingest workload, with added data quality steps
datastacks generate ingest --config="de_workloads/generate_examples/test_config_ingest.yaml" --data-quality

# Generate resources for a processing workload
datastacks generate processing --config="de_workloads/generate_examples/test_config_processing.yaml"

# Generate resources for a processing workload, with added data quality steps
datastacks generate processing --config="de_workloads/generate_examples/test_config_processing.yaml" --data-quality

Configuration

To generate a new data engineering workload, the Datastacks CLI takes a path to a config file. This config file should be in YAML format and contain the configuration values specified in the tables below. Sample config files are included in the de_workloads/generate_examples folder.

All workloads

| Config field | Description | Required? | Format | Default value | Example value |
| --- | --- | --- | --- | --- | --- |
| pipeline_description | Description of the pipeline to be created. Will be used for the Data Factory pipeline description. | Yes | String | n/a | "Ingest from demo Azure SQL database using ingest config file." |
| ado_variable_groups_nonprod | List of required variable groups in the non-production environment. | Yes | List[String] | n/a | ["amido-stacks-de-pipeline-nonprod", "stacks-credentials-nonprod-kv"] |
| ado_variable_groups_prod | List of required variable groups in the production environment. | Yes | List[String] | n/a | ["amido-stacks-de-pipeline-prod", "stacks-credentials-prod-kv"] |
| default_arm_deployment_mode | Deployment mode for Terraform. | No | String | "Incremental" | Incremental |
| stacks_data_package_version | Version of the stacks-data Python package on PyPI to install on the job cluster. | No | String (SemVer pattern) | Latest available package at the time of generation | 0.1.2 |
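
Taken together, these fields form the top of any workload config file. The following is an illustrative sketch only, assembled from the example values in the table above; the sample files in the de_workloads/generate_examples folder are the authoritative reference.

# Common fields required for all workload types (example values from the table above)
pipeline_description: "Ingest from demo Azure SQL database using ingest config file."
ado_variable_groups_nonprod:
  - amido-stacks-de-pipeline-nonprod
  - stacks-credentials-nonprod-kv
ado_variable_groups_prod:
  - amido-stacks-de-pipeline-prod
  - stacks-credentials-prod-kv

# Optional fields; omit these to use the documented defaults
default_arm_deployment_mode: Incremental
stacks_data_package_version: 0.1.2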

Ingest workloads

| Config field | Description | Required? | Format | Default value | Example value |
| --- | --- | --- | --- | --- | --- |
| dataset_name | Dataset name, used to derive pipeline and linked service names, e.g. AzureSql_Example. | Yes | String | n/a | azure_sql_demo |
| data_source_password_key_vault_secret_name | Secret name of the data source password in Key Vault. | Yes | String | n/a | sql-password |
| data_source_connection_string_variable_name | Variable name for the connection string. | Yes | String | n/a | sql_connection |
| data_source_type | Data source type. | Yes | String. Allowed values¹: "azure_sql" | n/a | azure_sql |
| bronze_container | Name of container for landing ingested data. | No | String | raw | raw |
| key_vault_linked_service_name | Name of the Key Vault linked service in Data Factory. | No | String | ls_KeyVault | ls_KeyVault |
| trigger_start | Start datetime for the Data Factory pipeline trigger. | No | Datetime | n/a | 2010-01-01T00:00:00Z |
| trigger_end | End datetime for the Data Factory pipeline trigger. | No | Datetime | n/a | 2011-12-31T23:59:59Z |
| trigger_frequency | Frequency for the Data Factory pipeline trigger. | No | String. Allowed values: "Minute", "Hour", "Day", "Week", "Month" | "Month" | Month |
| trigger_interval | Interval value for the Data Factory pipeline trigger. | No | Integer | 1 | 1 |
| trigger_delay | Delay between Data Factory pipeline triggers, formatted HH:mm:ss. | No | String | "02:00:00" | 02:00:00 |
| window_start_default | Default window start date in the Data Factory pipeline. | No | Date | "2010-01-01" | 2010-01-01 |
| window_end_default | Default window end date in the Data Factory pipeline. | No | Date | "2010-01-31" | 2010-01-31 |
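
Combined with the common fields above, a minimal ingest config might look like the following sketch. The values are the example values from the table; test_config_ingest.yaml in de_workloads/generate_examples is the authoritative sample.

# Required ingest-specific fields (appended to the common fields above)
dataset_name: azure_sql_demo
data_source_password_key_vault_secret_name: sql-password
data_source_connection_string_variable_name: sql_connection
data_source_type: azure_sql

# Optional trigger and window settings (example values shown)
trigger_start: 2010-01-01T00:00:00Z
trigger_end: 2011-12-31T23:59:59Z
trigger_frequency: Month
trigger_interval: 1
trigger_delay: "02:00:00"
window_start_default: "2010-01-01"
window_end_default: "2010-01-31"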

Processing workloads

| Config field | Description | Required? | Format | Default value | Example value |
| --- | --- | --- | --- | --- | --- |
| pipeline_name | Name of the data pipeline / workload. | Yes | String | n/a | processing_demo |
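
A processing config is correspondingly small: as an illustrative sketch, it adds a single required field to the common fields above. test_config_processing.yaml in de_workloads/generate_examples is the authoritative sample.

# Required processing-specific field (appended to the common fields above)
pipeline_name: processing_demo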

Footnotes

  1. Additional data source types will be supported in the future; see ingest data source types.