Azure Data Factory Setup Guide¶
Note
Quick Summary: Deploy and manage ADF pipeline artifacts for the CSA-in-a-Box platform, including linked services, datasets, pipelines, triggers, and CI/CD integration with Purview lineage.
📑 Table of Contents¶
- 🏗️ Architecture
- 📎 Prerequisites
- 📁 Pipeline Artifacts
- 📦 Deployment
- ⚙️ Linked Service Configuration
- ⚙️ Trigger Configuration
- ⚙️ Pipeline Parameters
- 🔄 CI/CD Integration
- 📊 Purview Lineage
- 🔧 Troubleshooting
🏗️ Architecture¶
ADF orchestrates the batch data pipeline:
graph TD
Landing["Landing Container<br/>(CSV/Parquet)"] --> Ingest["pl_ingest_to_bronze<br/>⏰ hourly trigger"]
Ingest --> Bronze["Bronze<br/>(raw Delta in ADLS)"]
Bronze --> Orch["pl_medallion_orchestration<br/>⏰ daily trigger"]
Orch --> dbt["pl_run_dbt_models<br/>Bronze → Silver → Gold"]
Orch --> test["dbt test<br/>(quality gates)"]
Orch --> nb["Databricks notebooks<br/>(Spark transforms)"]
dbt --> SilverGold["Silver / Gold<br/>(validated Delta in ADLS)"]
test --> SilverGold
nb --> SilverGold
📎 Prerequisites¶
- ADF instance deployed via Bicep (deploy/bicep/DLZ/main.bicep)
- ADLS Gen2 storage with landing, bronze, silver, and gold containers
- Databricks workspace with a running cluster or SQL warehouse
- Key Vault with connection secrets stored
- Azure CLI >= 2.50 with the datafactory extension (see the quick check below)
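The last prerequisite is easy to verify up front:
# Confirm the CLI version and install (or upgrade) the Data Factory extension
az version --query '"azure-cli"' -o tsv
az extension add --name datafactory --upgrade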
📁 Pipeline Artifacts¶
All ADF definitions live under domains/*/pipelines/adf/:
domains/shared/pipelines/adf/
linkedServices/
ls_adls_gen2.json # ADLS Gen2 via managed identity
ls_databricks.json # Databricks workspace
datasets/
ds_source_delimited.json # Parameterized CSV source
ds_adls_parquet.json # Parquet format
ds_adls_delta.json # Delta Lake format
pl_ingest_to_bronze.json # Raw file ingestion
pl_medallion_orchestration.json # Bronze -> Silver -> Gold
pl_run_dbt_models.json # dbt via Databricks
triggers/
tr_daily_medallion.json # Daily 06:00 UTC
tr_hourly_ingest.json # Hourly ingestion
domains/sales/pipelines/adf/
pl_sales_daily_load.json # Sales-specific daily pipeline
📦 Deployment¶
Automated deployment (recommended)¶
The deploy-adf.sh script deploys all artifacts in dependency order:
# Dry run — shows what would be deployed
./scripts/deploy/deploy-adf.sh \
--factory-name csadlzdevdf \
--resource-group rg-csadlz-dev \
--dry-run
# Deploy for real
./scripts/deploy/deploy-adf.sh \
--factory-name csadlzdevdf \
--resource-group rg-csadlz-dev
Or via Make. The target name below is an assumption; check the repository Makefile for the exact invocation:
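# Assumed target name; verify against the repository Makefile
make deploy-adf FACTORY_NAME=csadlzdevdf RESOURCE_GROUP=rg-csadlz-dev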
Important
Deployment order (handled automatically):
1. Linked Services (connections to ADLS, Databricks, Key Vault)
2. Datasets (parameterized data shapes)
3. Pipelines (orchestration logic)
4. Triggers (schedules, started automatically after creation)
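To create a single artifact by hand instead, the datafactory CLI extension exposes one create command per artifact type. A minimal sketch in the same dependency order (paths follow the layout above; depending on how the JSON files are structured, --properties may expect only the inner properties object rather than the full resource):
# 1. Linked service
az datafactory linked-service create \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --linked-service-name ls_adls_gen2 \
  --properties @domains/shared/pipelines/adf/linkedServices/ls_adls_gen2.json
# 2. Dataset
az datafactory dataset create \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --dataset-name ds_adls_delta \
  --properties @domains/shared/pipelines/adf/datasets/ds_adls_delta.json
# 3. Pipeline
az datafactory pipeline create \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --name pl_ingest_to_bronze \
  --pipeline @domains/shared/pipelines/adf/pl_ingest_to_bronze.json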
Manual deployment (Azure Portal)¶
- Open your Data Factory in the Azure Portal and launch ADF Studio
- Go to Manage > Linked Services
- Click + New and import each ls_*.json file
- Repeat for Datasets, Pipelines, and Triggers
⚙️ Linked Service Configuration¶
ls_adls_gen2 — Azure Data Lake Storage¶
Uses the ADF managed identity for authentication (no keys needed).
Required setup:
- Assign Storage Blob Data Contributor to the ADF managed identity on the ADLS storage account
- The ADF Bicep module outputs managedIdentityPrincipalId for this purpose
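With the Azure CLI, that role assignment looks roughly like this (the storage account name is a placeholder; take the principal ID from the Bicep output):
# Substitute the Bicep outputs for your deployment
PRINCIPAL_ID="<managedIdentityPrincipalId from Bicep output>"
STORAGE_ID=$(az storage account show \
  --name <storage-account-name> \
  --resource-group rg-csadlz-dev \
  --query id -o tsv)
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Storage Blob Data Contributor" \
  --scope "$STORAGE_ID"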
ls_databricks — Databricks Workspace¶
Uses Key Vault to retrieve the Databricks access token.
Required setup:
- Generate a personal access token (PAT) in Databricks
- Store it in Key Vault as secret databricks-token
- Update the linked service JSON if your workspace URL differs
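Storing the token from the CLI (the vault name is a placeholder; use the Key Vault from your deployment):
az keyvault secret set \
  --vault-name <key-vault-name> \
  --name databricks-token \
  --value "<databricks-pat>"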
⚙️ Trigger Configuration¶
| Trigger | Schedule | Pipeline | Purpose |
|---|---|---|---|
| tr_hourly_ingest | Every hour | pl_ingest_to_bronze | Pick up new landing files |
| tr_daily_medallion | Daily 06:00 UTC | pl_medallion_orchestration | Full Bronze→Silver→Gold refresh |
⌨️ Managing triggers¶
# Stop a trigger
az datafactory trigger stop \
--factory-name csadlzdevdf \
--resource-group rg-csadlz-dev \
--trigger-name tr_daily_medallion
# Start a trigger
az datafactory trigger start \
--factory-name csadlzdevdf \
--resource-group rg-csadlz-dev \
--trigger-name tr_daily_medallion
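To see which triggers are currently running, list them with their runtime state (the query path assumes the standard ARM trigger resource shape; drop --query to inspect the raw output):
# List all triggers and whether each is Started or Stopped
az datafactory trigger list \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --query "[].{name: name, state: properties.runtimeState}" -o table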
⚙️ Pipeline Parameters¶
pl_ingest_to_bronze¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| sourceContainer | string | landing | Source ADLS container |
| sourceFolder | string | — | Folder within the source container |
| targetContainer | string | bronze | Target ADLS container |
| targetFolder | string | — | Folder within the target container |
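For an ad-hoc run outside the hourly trigger, pass the parameters on the command line (the folder names below are placeholders):
# Start a one-off ingestion run and capture the run ID
RUN_ID=$(az datafactory pipeline create-run \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --name pl_ingest_to_bronze \
  --parameters '{"sourceFolder": "<landing-subfolder>", "targetFolder": "<bronze-subfolder>"}' \
  --query runId -o tsv)
# Check its status
az datafactory pipeline-run show \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --run-id "$RUN_ID"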
pl_medallion_orchestration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| environment | string | dev | Target environment (dev/staging/prod) |
| fullRefresh | bool | false | Force full rebuild (skip incremental) |
| alertWebhookUrl | string | — | Teams webhook for failure alerts |
pl_run_dbt_models¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| dbtCommand | string | run | dbt command: run, test, build, seed |
| dbtTarget | string | dev | dbt profile target |
| dbtModels | string | — | Model selector (e.g., +gld_revenue) |
| fullRefresh | bool | false | Pass --full-refresh to dbt |
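These parameters combine naturally for targeted rebuilds. For example, a full-refresh build of the revenue mart and everything upstream of it (an illustrative invocation, not a documented recipe):
az datafactory pipeline create-run \
  --factory-name csadlzdevdf \
  --resource-group rg-csadlz-dev \
  --name pl_run_dbt_models \
  --parameters '{"dbtCommand": "build", "dbtModels": "+gld_revenue", "fullRefresh": true}'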
🔄 CI/CD Integration¶
The deploy workflow (.github/workflows/deploy.yml) deploys Bicep infrastructure. After infrastructure is up, run the ADF deployment:
# Add to deploy.yml after DLZ Bicep deployment
- name: Deploy ADF pipelines
run: |
./scripts/deploy/deploy-adf.sh \
--factory-name ${{ env.FACTORY_NAME }} \
--resource-group ${{ env.RESOURCE_GROUP }}
📊 Purview Lineage¶
ADF natively pushes lineage to Microsoft Purview when configured. The Bicep module accepts a purviewAccountId parameter that wires this up automatically. See Purview integration in the architecture docs.
🔧 Troubleshooting¶
See the ADF section in TROUBLESHOOTING.md.
See also:
- ← Previous: Documentation home
- → Next: Self-Hosted IR
- ⌂ Index: Documentation home