Data Engineer Quickstart -- Your First Pipeline in 30 Minutes¶
Estimated time: 30 minutes
Difficulty: Beginner
What you'll build: An end-to-end batch pipeline that ingests a CSV file into the bronze layer via Azure Data Factory, transforms it through the medallion architecture with dbt, and exposes a query-ready gold table through a Fabric SQL endpoint.
Prerequisites¶
- An active Azure subscription with Contributor role on the target resource group
- Azure CLI 2.50+ (`az version`)
- Bicep CLI 0.25+ (`az bicep version`)
- Python 3.10+ (`python --version`)
- dbt-core 1.7+ with the `dbt-databricks` adapter (`dbt --version`)
- VS Code with the Azure and dbt extensions (recommended)
- Git 2.x and the CSA-in-a-Box repository cloned locally
Architecture Overview¶
The pipeline follows the medallion architecture -- raw data lands in bronze, is cleansed into silver, and aggregated into gold.
graph LR
CSV["Source CSV"] -->|ADF Copy Activity| Bronze["ADLS Bronze<br/>(raw Delta)"]
Bronze -->|dbt model| Silver["ADLS Silver<br/>(cleansed Delta)"]
Silver -->|dbt model| Gold["ADLS Gold<br/>(aggregated Delta)"]
Gold -->|Fabric SQL Endpoint<br/>or Synapse Serverless| Query["Query & Report"]
style CSV fill:#f5f5f5,stroke:#333
style Bronze fill:#cd7f32,stroke:#333,color:#fff
style Silver fill:#c0c0c0,stroke:#333
style Gold fill:#ffd700,stroke:#333
style Query fill:#4a90d9,stroke:#333,color:#fff

Each layer is stored as Delta tables in Azure Data Lake Storage Gen2. dbt handles all transformation logic, and ADF orchestrates the execution.
Step 1: Set Up Your Environment¶
Clone the repo and authenticate:
git clone https://github.com/your-org/csa-inabox.git
cd csa-inabox
az login
az account set --subscription "<YOUR_SUBSCRIPTION_ID>"
Install dbt and its dependencies. If the repository pins versions in its own requirements file, prefer that; otherwise a plain pip install works:
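```bash
# Minimal install, assuming no repo-pinned requirements file; versions match the prerequisites.
python -m pip install --upgrade pip
pip install "dbt-core~=1.7.0" "dbt-databricks~=1.7.0"
```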
Create or update `~/.dbt/profiles.yml` with your Databricks connection:
csa_inabox:
target: dev
outputs:
dev:
type: databricks
catalog: main
schema: dev_shared
host: "{{ env_var('DBX_HOST') }}"
http_path: "{{ env_var('DBX_HTTP_PATH') }}"
token: "{{ env_var('DBX_TOKEN') }}"
threads: 4
Set the required environment variables:
export DBX_HOST="adb-xxxxxxxxxxxx.azuredatabricks.net"
export DBX_HTTP_PATH="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx"
export DBX_TOKEN="dapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
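Before touching any models, you can verify the profile and connection. `dbt debug` validates `profiles.yml`, the adapter, and the Databricks connection; this assumes the project's `dbt_project.yml` points at the `csa_inabox` profile defined above:

```bash
cd domains/shared/dbt
dbt debug        # checks profiles.yml, required env vars, and connectivity
cd - > /dev/null
```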
Step 2: Deploy Foundation Resources¶
If you have not yet deployed the platform infrastructure, follow Tutorial 01 -- Foundation Platform to provision the Azure Landing Zone, Data Management Landing Zone, and Data Landing Zone.
If your environment is already deployed, verify the key resources exist:
az group show --name rg-csa-dlz-dev --query "{name:name, location:location}" -o table
az storage account show --name stcsadlzdev --query "{name:name, kind:kind}" -o table
az datafactory show --name adf-csa-dlz-dev --resource-group rg-csa-dlz-dev \
--query "{name:name, provisioningState:provisioningState}" -o table
Step 3: Create a Bronze Landing Zone¶
Create the ADLS container, prepare a sample CSV, and upload it:
# Create the landing container
az storage container create \
--account-name stcsadlzdev \
--name landing \
--auth-mode login
# Prepare sample data
mkdir -p temp
cat > temp/sample_customers.csv << 'EOF'
customer_id,first_name,last_name,email,signup_date,region
1,Alice,Johnson,alice.johnson@example.com,2024-01-15,East
2,Bob,Smith,bob.smith@example.com,2024-02-20,West
3,Carol,Williams,carol.williams@example.com,2024-03-10,Central
4,David,Brown,david.brown@example.com,2024-04-05,East
5,Eve,Davis,eve.davis@example.com,2024-05-12,West
EOF
# Upload to ADLS
az storage blob upload \
--account-name stcsadlzdev \
--container-name landing \
--name customers/sample_customers.csv \
--file temp/sample_customers.csv \
--auth-mode login
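Optionally, confirm the file landed where the ADF dataset will look for it:

```bash
az storage blob list \
  --account-name stcsadlzdev \
  --container-name landing \
  --prefix customers/ \
  --auth-mode login \
  --query "[].{name:name, bytes:properties.contentLength}" -o table
```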
Step 4: Configure the ADF Pipeline¶
In ADF Studio (or via the deployed Bicep artifacts), configure a linked service, datasets, and a copy activity. If you deployed via Bicep, the linked service `ls_adls_csadlz` already exists -- see ADF Setup Guide for the full artifact inventory.
Linked service -- points to your ADLS storage account:
{
  "name": "ls_adls_csadlz",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://stcsadlzdev.dfs.core.windows.net/"
    }
  }
}
Copy activity pipeline -- ingests the CSV from the landing container into the bronze container as Parquet:
{
  "name": "pl_ingest_customers_to_bronze",
  "properties": {
    "activities": [
      {
        "name": "CopyCustomersToBronze",
        "type": "Copy",
        "inputs": [{ "referenceName": "ds_landing_csv", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "ds_bronze_delta", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
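The copy activity references two datasets, `ds_landing_csv` and `ds_bronze_delta`. If your Bicep deployment did not already create them (check the artifact inventory in the ADF Setup Guide), the sketch below shows plausible definitions. The container and folder names are assumptions based on Step 3, and the CLI flags require the `datafactory` extension (verify with `az datafactory dataset create --help`):

```bash
# Sketch only: dataset shapes assumed from the pipeline above; adjust paths to your environment.
cat > temp/ds_landing_csv.json << 'EOF'
{
  "linkedServiceName": { "referenceName": "ls_adls_csadlz", "type": "LinkedServiceReference" },
  "type": "DelimitedText",
  "typeProperties": {
    "location": { "type": "AzureBlobFSLocation", "fileSystem": "landing", "folderPath": "customers" },
    "columnDelimiter": ",",
    "firstRowAsHeader": true
  }
}
EOF

cat > temp/ds_bronze_delta.json << 'EOF'
{
  "linkedServiceName": { "referenceName": "ls_adls_csadlz", "type": "LinkedServiceReference" },
  "type": "Parquet",
  "typeProperties": {
    "location": { "type": "AzureBlobFSLocation", "fileSystem": "bronze", "folderPath": "customers" }
  }
}
EOF

az datafactory dataset create --resource-group rg-csa-dlz-dev \
  --factory-name adf-csa-dlz-dev --dataset-name ds_landing_csv \
  --properties @temp/ds_landing_csv.json
az datafactory dataset create --resource-group rg-csa-dlz-dev \
  --factory-name adf-csa-dlz-dev --dataset-name ds_bronze_delta \
  --properties @temp/ds_bronze_delta.json
```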
Trigger the pipeline manually from ADF Studio, or via CLI:
az datafactory pipeline create-run \
--resource-group rg-csa-dlz-dev \
--factory-name adf-csa-dlz-dev \
--name pl_ingest_customers_to_bronze
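`create-run` returns a run ID, so you can capture it and poll the run status from the same shell:

```bash
RUN_ID=$(az datafactory pipeline create-run \
  --resource-group rg-csa-dlz-dev \
  --factory-name adf-csa-dlz-dev \
  --name pl_ingest_customers_to_bronze \
  --query runId -o tsv)

# Re-run until the status reads Succeeded (or Failed)
az datafactory pipeline-run show \
  --resource-group rg-csa-dlz-dev \
  --factory-name adf-csa-dlz-dev \
  --run-id "$RUN_ID" \
  --query "{status:status, durationMs:durationInMs}" -o table
```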
Step 5: Write Your First dbt Model¶
5.1 Create the Silver Model¶
Create `domains/shared/dbt/models/silver/slv_customers.sql`:
{{ config(materialized='table') }}
with source as (
select * from {{ source('bronze', 'customers') }}
),
cleaned as (
select
cast(customer_id as int) as customer_id,
trim(first_name) as first_name,
trim(last_name) as last_name,
lower(trim(email)) as email,
cast(signup_date as date) as signup_date,
upper(trim(region)) as region,
current_timestamp() as _loaded_at
from source
where customer_id is not null
)
select * from cleaned
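The `{{ source('bronze', 'customers') }}` reference resolves against a dbt source definition. If the repository does not already declare one, here is a minimal sketch; the `schema` value is an assumption and must match wherever the ingested bronze table is registered in your catalog:

```bash
cat > domains/shared/dbt/models/silver/_sources.yml << 'EOF'
version: 2

sources:
  - name: bronze
    schema: bronze        # assumption -- point this at the schema holding the ingested table
    tables:
      - name: customers
EOF
```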
5.2 Create the Gold Model¶
Create `domains/shared/dbt/models/gold/gld_customers_by_region.sql`:
{{ config(materialized='table') }}
with customers as (
select * from {{ ref('slv_customers') }}
)
select
region,
count(*) as customer_count,
min(signup_date) as earliest_signup,
max(signup_date) as latest_signup,
current_timestamp() as _aggregated_at
from customers
group by region
5.3 Add a Schema Test¶
Add the following to `domains/shared/dbt/models/silver/schema.yml`:
version: 2

models:
  - name: slv_customers
    description: "Cleansed customer records from the bronze layer."
    columns:
      - name: customer_id
        tests: [unique, not_null]
      - name: email
        tests: [unique, not_null]
5.4 Run dbt¶
cd domains/shared/dbt
dbt run --select slv_customers gld_customers_by_region
dbt test --select slv_customers
Expected output
Running with dbt=1.7.x
Found 2 models, 4 tests, 1 source
1 of 2 OK created sql table model dev_shared.slv_customers ......... [OK in 3.21s]
2 of 2 OK created sql table model dev_shared.gld_customers_by_region [OK in 2.45s]
Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
1 of 4 PASS not_null_slv_customers_customer_id ..................... [PASS in 1.02s]
2 of 4 PASS not_null_slv_customers_email ........................... [PASS in 1.05s]
3 of 4 PASS unique_slv_customers_customer_id ....................... [PASS in 1.10s]
4 of 4 PASS unique_slv_customers_email ............................. [PASS in 1.08s]
Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4
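As a quick sanity check before moving to the SQL endpoints, you can preview the gold table from dbt itself (`dbt show` is available from dbt 1.5 onward):

```bash
dbt show --select gld_customers_by_region --limit 5
```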
Step 6: Query the Gold Layer¶
Option A: Fabric SQL Endpoint¶
If Microsoft Fabric is configured, open the SQL endpoint for your lakehouse:
SELECT region, customer_count, earliest_signup, latest_signup
FROM dev_shared.gld_customers_by_region
ORDER BY customer_count DESC;
Option B: Synapse Serverless SQL¶
SELECT *
FROM OPENROWSET(
-- For Delta, BULK must point at the table's root folder; wildcards are not supported
BULK 'https://stcsadlzdev.dfs.core.windows.net/gold/customers_by_region/',
FORMAT = 'DELTA'
) AS gold_data
ORDER BY customer_count DESC;
Expected query results (Option A columns, derived from the five sample rows in Step 3):

| region  | customer_count | earliest_signup | latest_signup |
|---------|----------------|-----------------|---------------|
| East    | 2              | 2024-01-15      | 2024-04-05    |
| West    | 2              | 2024-02-20      | 2024-05-12    |
| Central | 1              | 2024-03-10      | 2024-03-10    |
Step 7: Validate and Next Steps¶
You now have a working end-to-end pipeline:
- Bronze -- Raw CSV data ingested via ADF Copy Activity into ADLS Gen2
- Silver -- Cleansed and typed records produced by a dbt model with schema tests
- Gold -- Region-level aggregation ready for BI consumption
Suggested Next Steps¶
- Add a schedule trigger to run the ADF pipeline on a recurring basis (a CLI sketch follows this list). See ADF Setup Guide -- Trigger Configuration.
- Add more domains by creating new dbt projects under `domains/`. See existing domains like `sales`, `finance`, and `inventory` for patterns.
- Set up data quality gates with dbt tests and Great Expectations. See Great Expectations Tutorial.
- Explore streaming for near-real-time ingestion. See Tutorial 05 -- Streaming Lambda.
- Register assets in Purview for governance and lineage tracking. See Purview Guide.
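For the first item above, here is a sketch of a daily schedule trigger created from the CLI. The trigger name and start time are placeholders, and the flags require the `datafactory` extension (verify with `az datafactory trigger create --help`):

```bash
# Sketch: daily trigger for the ingestion pipeline; name and startTime are placeholders.
cat > temp/tr_daily_ingest.json << 'EOF'
{
  "type": "ScheduleTrigger",
  "typeProperties": {
    "recurrence": {
      "frequency": "Day",
      "interval": 1,
      "startTime": "2024-06-01T06:00:00Z",
      "timeZone": "UTC"
    }
  },
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "pl_ingest_customers_to_bronze",
        "type": "PipelineReference"
      }
    }
  ]
}
EOF

az datafactory trigger create --resource-group rg-csa-dlz-dev \
  --factory-name adf-csa-dlz-dev --name tr_daily_ingest \
  --properties @temp/tr_daily_ingest.json

# New triggers are created stopped; start it explicitly.
az datafactory trigger start --resource-group rg-csa-dlz-dev \
  --factory-name adf-csa-dlz-dev --name tr_daily_ingest
```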
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| `az datafactory: command not found` | ADF CLI extension not installed | Run `az extension add --name datafactory` |
| dbt `SCHEMA_NOT_FOUND` error | Target schema missing in catalog | Run `CREATE SCHEMA IF NOT EXISTS dev_shared` in Databricks |
| ADF copy activity fails with 403 | Storage RBAC not configured | Assign Storage Blob Data Contributor to the ADF managed identity |
| dbt `unique` test fails | Duplicate rows in source CSV | De-duplicate the source or add `distinct` in the silver model |
| `ConnectionError` in dbt | Wrong `DBX_HOST` or expired token | Verify env vars match your workspace; regenerate the token |
| `OPENROWSET` returns 0 rows | Delta log path mismatch | Confirm the `BULK` path matches the dbt materialization folder |
| Upload fails with `AuthorizationPermissionMismatch` | Missing RBAC on ADLS | Assign Storage Blob Data Contributor to your user |