Skip to content

Reference Architecture — Data Flow (Medallion)

TL;DR: Bronze stores raw, silver stores cleaned-and-conformed, gold stores business-ready. Every transition is a dbt (or Spark) job with tests, version control, and lineage to Purview. Don't skip layers; don't denormalize bronze.

The problem

A real analytics platform ingests from dozens of source systems with different schemas, qualities, freshnesses, and retention requirements — and serves dozens of consumer surfaces (BI, APIs, ML, AI agents, regulators) with conflicting needs. Without a strict layered approach, you end up with a god-mode pipeline that nobody can reason about and a single bad deploy breaks every dashboard.

Architecture

flowchart LR
    subgraph Sources[Source Systems]
        S1[OLTP DBs<br/>SQL Server, Cosmos, RDS]
        S2[SaaS APIs<br/>Salesforce, ServiceNow]
        S3[Files<br/>CSV/Parquet/JSON drops]
        S4[Streams<br/>IoT, Kafka, Event Hubs]
        S5[On-prem<br/>Teradata, Hadoop, mainframe]
    end

    subgraph Ingest[Ingestion]
        ADF[Azure Data Factory<br/>+ Self-Hosted IR]
        EH[Event Hubs<br/>+ Capture]
        ASA[Stream Analytics<br/>or Fabric RTI]
    end

    subgraph Bronze[🥉 BRONZE — Raw / Immutable]
        B1[(ADLS Gen2<br/>Delta tables<br/>partitioned by ingest_date)]
        B2[Schema = source schema<br/>+ ingestion metadata]
        B3[Retention: 7-90 days hot,<br/>archive cold]
    end

    subgraph Silver[🥈 SILVER — Cleaned / Conformed]
        Sv1[(Delta tables<br/>SCD2 where needed)]
        Sv2[Type-cast, deduped,<br/>nulls explicit, PII tagged]
        Sv3[Conformed dimensions<br/>+ surrogate keys]
        Sv4[Quality tests:<br/>Great Expectations<br/>+ dbt tests]
    end

    subgraph Gold[🥇 GOLD — Business / Aggregated]
        G1[(Star/snowflake schemas<br/>or wide flat tables)]
        G2[Business metrics<br/>+ KPIs]
        G3[ML feature tables]
        G4[Aggregates for BI]
    end

    subgraph Serve[Serving]
        BI[Power BI<br/>Direct Lake / Import]
        DAB[Data API Builder<br/>REST + GraphQL]
        AOAI[AI Agents<br/>RAG/GraphRAG]
        ML[Azure ML<br/>training + inference]
        Export[Reverse-ETL<br/>Salesforce, Sheets]
    end

    subgraph Governance[Cross-Cutting]
        Purv[Microsoft Purview<br/>catalog + lineage]
        KV[Key Vault<br/>secrets]
        LA[Log Analytics<br/>+ App Insights]
        Contracts[YAML data product<br/>contracts in git]
    end

    S1 --> ADF
    S2 --> ADF
    S3 --> ADF
    S4 --> EH
    S4 --> ASA
    S5 --> ADF

    ADF --> B1
    EH --> B1
    ASA --> B1

    B1 -- dbt run --> Sv1
    Sv1 -- dbt run --> G1

    G1 --> BI
    G1 --> DAB
    G1 --> AOAI
    G1 --> ML
    G1 --> Export

    Purv -. scans .-> B1
    Purv -. scans .-> Sv1
    Purv -. scans .-> G1
    Contracts -. validates .-> Sv1
    Contracts -. validates .-> G1
    KV -. secrets .-> ADF
    KV -. secrets .-> ASA
    KV -. secrets .-> AOAI
    LA -. logs .-> ADF
    LA -. logs .-> ASA
    LA -. logs .-> Sv1

Layer rules

Bronze (raw / immutable)

Rule Why
Schema = source schema, plus 4 metadata cols (_ingest_ts, _source, _batch_id, _file_uri) If source changes, you have what you need to replay
Append-only, never updated in place Bronze is your backup-of-record; deletes/updates compromise replay
Partitioned by ingest_date (not source date) Operational pruning matches operational queries
Stored as Delta (not raw Parquet/CSV) ACID + time travel + cheap scans
No PII redaction at this layer Source-system fidelity; redaction happens entering silver
Lifecycle: 7-90 days hot tier, then cool/archive Hot Bronze is expensive; most queries hit silver/gold

Silver (cleaned / conformed)

Rule Why
Types are correct (no string-encoded ints, no null strings) Downstream code can trust the schema
Nulls are explicit and modeled (is_known, unknown_reason) Telling "missing" from "actively unknown" matters for ML and reporting
PII is tagged (Purview classification or column-level tag) Compliance can audit access
SCD2 where business semantics demand it (customers, products, orgs) Historical reporting works
Conformed dimensions (one canonical dim_customer, not 5) Joining across domains is possible
Surrogate keys generated, natural keys preserved Schema changes upstream don't break downstream joins
Quality tests are mandatory (not "we'll add them later") Silver is what gold trusts

Gold (business / aggregated)

Rule Why
Business-friendly column names, not source-system jargon Analysts reading SQL shouldn't need a dictionary
Wide flat or star schema, optimized for the consumer Direct Lake / Power BI prefers star; ML often prefers wide flat
Definitions match official metric definitions (and link to them in dbt docs) "Revenue" means the same thing in every dashboard
Aggregations are explicit (daily_sales_by_region, not sales) Naming reveals the grain
Slowly-changing where business semantics demand, snapshotted otherwise Predictable rebuild semantics
Each gold table maps to ≥ 1 published data product contract Consumers can rely on schema + SLA

Where each Azure service fits

Service Layer Purpose
Azure Data Factory Sources → Bronze Batch ingestion, orchestration, self-hosted IR for on-prem
Event Hubs + Capture Sources → Bronze Streaming ingestion with auto-archive to ADLS
Stream Analytics / Fabric RTI Sources → Bronze (or directly to Silver for streaming gold) Stream processing, windowing, anomaly detection
ADLS Gen2 Bronze, Silver, Gold Object store for all Delta tables
Delta Lake Bronze, Silver, Gold Storage format — ACID + time travel + Z-order
Databricks Silver, Gold Spark-based dbt runs, complex transformations, ML feature engineering
Synapse Spark / Serverless SQL Silver, Gold Alternative compute, especially for ad-hoc analyst SQL over Bronze
Microsoft Fabric Lakehouse Silver, Gold Strategic forward path — Direct Lake for BI, KQL for streaming
Microsoft Purview Cross-cutting Catalog + lineage + classification + access
Key Vault Cross-cutting Secrets — never in pipelines or notebooks
Log Analytics + App Insights Cross-cutting Pipeline runs, query perf, error rates
Power BI / Fabric Gold → Serve BI dashboards, semantic models
Data API Builder Gold → Serve Auto-generated REST + GraphQL over gold tables
Azure OpenAI + AI Search Gold → Serve RAG / agents grounded in gold

Lineage

Every transformation runs in dbt (with the exception of Spark-only workloads which run in Databricks notebooks). dbt + Purview together give you:

  • Column-level lineage from bronze → silver → gold
  • "What does this column mean" docs auto-generated from meta and description
  • Upstream/downstream impact analysis when planning a schema change
  • Data freshness SLAs automated as dbt source freshness

See ADR 0013 — dbt as Canonical Transformation for the rationale.

Trade-offs

Why this pattern wins

  • Each layer has one job → easier to reason about, easier to debug
  • Bronze immutability = full replay any time
  • Silver conformance = downstream consumers trust their inputs
  • Gold business definitions = "the report says X" matches "the API returns X"
  • dbt + tests + Purview = real lineage, not Confluence diagrams

⚠️ What you give up

  • Three layers means 3× storage. Mitigate with tier'd lifecycle on bronze.
  • Latency from raw to gold can be 15-60 minutes for batch. For sub-minute, use a separate streaming-gold path (Fabric RTI / KQL DB).
  • Initial setup is heavier than "one giant ETL job." Pays back at scale 2; doesn't pay back at scale 0.5.
  • Discipline is required. Skipping silver "just for one report" creates the spaghetti you tried to avoid.

Variants

Scenario Variant
Streaming-first Bronze becomes Event Hub Capture; silver/gold built in Fabric RTI (Eventhouse / KQL DB). See Patterns — Streaming & CDC
CDC-heavy Add a CDC service (Debezium → Event Hubs, or Synapse Link for SQL) before Bronze; bronze stores the changelog as Delta CDC
ML-heavy Add a Feature Store (Databricks FS or Azure ML FS) between Silver and Gold; ML training reads from the feature store, not gold directly
AI-heavy Gold tables get vectorized into AI Search indexes; agents/RAG read from indexes, not gold tables directly
Regulated (HIPAA, PCI) Same pattern; add encryption-at-rest with CMK, column-level encryption for sensitive cols, separate "PII silver" with stricter RBAC