
Medallion Architecture Best Practices

*Figure: Bronze → Silver → Gold medallion flow showing source ingestion, validated and conformed Silver tables, business-ready Gold marts, and Great Expectations quality gates at each transition.*

Overview

The medallion architecture (also called multi-hop architecture) organizes a lakehouse into three progressive layers of data quality: Bronze (raw), Silver (cleansed), and Gold (analytics-ready). Each layer serves a distinct purpose in transforming raw source data into trusted, governed data products.

Why it matters for CSA-in-a-Box:

  • Data quality — errors are caught and corrected at well-defined boundaries
  • Governance — lineage is traceable from source to consumption
  • Reusability — Silver assets serve multiple Gold use cases
  • Auditability — raw records are preserved for compliance and replay

Architecture Diagram

flowchart LR
    subgraph Sources
        S1[ERP / SAP]
        S2[CRM / Dynamics]
        S3[APIs / Events]
        S4[Files / SFTP]
    end

    subgraph Bronze["Bronze — Raw"]
        B1[Append-only ingestion]
        B2[Schema-on-read]
        B3[Full fidelity]
    end

    subgraph Silver["Silver — Cleansed"]
        SV1[Deduplication]
        SV2[Type casting & conforming]
        SV3[SCD Type 2]
    end

    subgraph Gold["Gold — Analytics-Ready"]
        G1[Star schemas]
        G2[Aggregated metrics]
        G3[Data products]
    end

    subgraph Consumers
        C1[Power BI]
        C2[ML Models]
        C3[APIs / Apps]
    end

    S1 & S2 & S3 & S4 --> B1 & B2 & B3
    B1 & B2 & B3 -->|Quality Gates| SV1 & SV2 & SV3
    SV1 & SV2 & SV3 -->|Quality Gates| G1 & G2 & G3
    G1 & G2 & G3 --> C1 & C2 & C3

What this looks like in Azure

*Figure: Azure Storage account Overview blade — Essentials panel showing Standard performance, LRS replication, StorageV2 (general purpose v2) account kind, and the Storage browser / Data storage / Containers left-nav.* ADLS Gen2 sits on top of this account; Bronze, Silver, and Gold land in three containers within it.

Bronze Layer Best Practices

The Bronze layer is the landing zone — an exact, append-only copy of source data with added ingestion metadata. It preserves full fidelity so source issues can be debugged and downstream layers can be rebuilt by replaying the raw records.

Naming Convention

brz_{source}_{entity}

Examples:
  brz_sap_sales_orders
  brz_dynamics_contacts
  brz_api_weather_readings

Core Principles

  1. Append-only ingestion — never update or delete rows in Bronze.
  2. No transformations — store data exactly as received.
  3. Schema-on-read — use flexible types (STRING, VARIANT, JSON) when source schemas are unstable.
  4. Metadata columns — every Bronze table includes _metadata columns for lineage.

Required Metadata Columns

| Column | Type | Description |
| --- | --- | --- |
| `_ingested_at` | TIMESTAMP | UTC timestamp when the record was ingested |
| `_source_file` | STRING | Source file path, API endpoint, or topic name |
| `_batch_id` | STRING | Unique identifier for the ingestion batch/run |
| `_ingested_date` | DATE | Partition key derived from `_ingested_at` |
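
When Bronze ingestion runs on Spark rather than dbt, the same metadata columns can be stamped at write time. A minimal sketch, assuming a Delta-backed Bronze table; the landing path, table name, and batch-id scheme are illustrative:

import uuid
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
batch_id = str(uuid.uuid4())  # one id per ingestion run

# illustrative landing path and table name
raw = spark.read.json("abfss://landing@yourlake.dfs.core.windows.net/sap/sales_orders/")

bronze = (
    raw
    .withColumn("_ingested_at", F.current_timestamp())               # ingestion timestamp
    .withColumn("_source_file", F.input_file_name())                 # lineage to the source file
    .withColumn("_batch_id", F.lit(batch_id))                        # ties rows of this run together
    .withColumn("_ingested_date", F.to_date(F.col("_ingested_at")))  # partition key
)

(bronze.write
    .format("delta")
    .mode("append")                 # Bronze is append-only
    .partitionBy("_ingested_date")
    .saveAsTable("bronze.brz_sap_sales_orders"))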

Partitioning

Partition Bronze tables by _ingested_date. This enables efficient time-travel queries and simplifies retention management.

Retention Policy

Compliance Requirement

Retain raw Bronze data for a minimum of 7 years to satisfy audit and regulatory obligations. Use lifecycle policies to move older partitions to cold/archive storage tiers.
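
On ADLS Gen2, the tiering part of this policy is usually expressed as a storage lifecycle management rule rather than pipeline code. A sketch of such a rule as a Python dict, mirroring Azure's lifecycle policy schema; the prefix and day thresholds are illustrative, and the rule would be applied via the portal, CLI, or management SDK:

# Tier Bronze blobs to cool after 90 days, archive after 1 year, and delete
# only after the 7-year retention window has passed.
bronze_lifecycle_rule = {
    "enabled": True,
    "name": "bronze-retention",
    "type": "Lifecycle",
    "definition": {
        "filters": {
            "blobTypes": ["blockBlob"],
            "prefixMatch": ["bronze/"],  # illustrative container/prefix
        },
        "actions": {
            "baseBlob": {
                "tierToCool": {"daysAfterModificationGreaterThan": 90},
                "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                "delete": {"daysAfterModificationGreaterThan": 7 * 365},
            }
        },
    },
}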

Do / Don't

| ✅ Do | ❌ Don't |
| --- | --- |
| Store data exactly as received from source | Apply business logic or filtering |
| Add `_metadata` columns for lineage | Modify source column names or types |
| Partition by `_ingested_date` | Partition by business keys |
| Use append-only writes | Overwrite or upsert rows |
| Keep raw data for 7+ years | Delete raw data after Silver processing |
| Use schema-on-read for volatile sources | Enforce strict schemas on unstable sources |

Example: dbt Bronze Model

-- models/bronze/brz_sap_sales_orders.sql

{{ config(
    materialized='incremental',
    incremental_strategy='append',
    partition_by={
        "field": "_ingested_date",
        "data_type": "date",
        "granularity": "day"
    },
    tags=['bronze', 'sap']
) }}

SELECT
    -- Preserve all source columns as-is
    *,

    -- Metadata columns
    CURRENT_TIMESTAMP()                       AS _ingested_at,
    '{{ var("source_file", "unknown") }}'     AS _source_file,
    '{{ invocation_id }}'                     AS _batch_id,
    CURRENT_DATE()                            AS _ingested_date

FROM {{ source('sap_raw', 'sales_orders') }}

{% if is_incremental() %}
-- _loaded_at is assumed to be the source's load watermark; compare it against
-- the same column in the target so late-arriving batches are not skipped
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}

Silver Layer Best Practices

The Silver layer cleanses, conforms, and deduplicates Bronze data into an enterprise-aligned model. Silver tables are the single source of truth for a given entity.

Naming Convention

slv_{domain}_{entity}

Examples:
  slv_sales_orders
  slv_customers_current
  slv_products_scd

Core Principles

  1. Data cleansing — handle nulls, trim strings, cast types, standardize formats.
  2. Deduplication — resolve duplicates using deterministic keys and ordering.
  3. SCD Type 2 — track historical changes for slowly-changing dimensions.
  4. Conform to enterprise model — align column names, data types, and reference data to organization-wide standards.
  5. Quality gates — validate data between Bronze → Silver with automated tests.

Data Cleansing Patterns

| Pattern | Technique |
| --- | --- |
| Null handling | `COALESCE(field, default_value)` or reject with test |
| Deduplication | `ROW_NUMBER() OVER (PARTITION BY pk ORDER BY _ingested_at DESC)` |
| Type casting | Explicit `CAST()` / `SAFE_CAST()` |
| String normalization | `TRIM(UPPER(field))` |
| Date standardization | Convert all dates to UTC / ISO 8601 |
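
For teams running Silver transforms on Spark instead of dbt, the same patterns look roughly like this sketch (the Bronze table name and sales-order columns follow the example model below and are assumptions):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
brz = spark.table("bronze.brz_sap_sales_orders")

dedup_window = Window.partitionBy("order_id").orderBy(F.col("_ingested_at").desc())

silver = (
    brz
    .withColumn("_rn", F.row_number().over(dedup_window))                 # deduplication
    .filter("_rn = 1")
    .withColumn("order_status", F.trim(F.upper("order_status")))          # string normalization
    .withColumn("order_date", F.to_date("order_date"))                    # type casting
    .withColumn("order_total",
                F.coalesce(F.col("order_total").cast("decimal(18,2)"), F.lit(0)))  # null handling
    .drop("_rn")
)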

SCD Type 2 Pattern

slv_{domain}_{entity}_scd

Columns:
  {natural_key}        -- business key
  {attributes}         -- tracked columns
  _valid_from          -- effective start (inclusive)
  _valid_to            -- effective end (exclusive), NULL = current
  _is_current          -- BOOLEAN flag for convenience
  _hash_diff           -- hash of tracked columns for change detection
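
One way to read these columns: order every detected version of a record by load time, take the next version's `_valid_from` as this version's `_valid_to`, and mark the version with no successor as current. A sketch in PySpark, assuming a hypothetical `versions` DataFrame with one row per detected change:

from pyspark.sql import DataFrame, Window, functions as F

TRACKED_COLS = ["customer_name", "customer_segment", "region"]

def build_scd2(versions: DataFrame) -> DataFrame:
    """versions: one row per detected change of a customer record."""
    w = Window.partitionBy("customer_id").orderBy("_ingested_at")
    return (
        versions
        # change-detection hash over the tracked attributes
        .withColumn("_hash_diff", F.sha2(F.concat_ws("||", *TRACKED_COLS), 256))
        .withColumn("_valid_from", F.col("_ingested_at"))
        # the next version's start closes this one; NULL means still current
        .withColumn("_valid_to", F.lead("_valid_from").over(w))
        .withColumn("_is_current", F.col("_valid_to").isNull())
    )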

Quality Gates (Bronze → Silver)

Use dbt tests or Great Expectations to enforce quality before data enters Silver:

  • not_null on primary keys
  • unique on primary keys
  • accepted_values on enumerations
  • relationships for referential integrity
  • Custom row-count variance checks (alert if delta > 20%)
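
The row-count variance check in the last bullet typically lands as a small custom test or pre-flight script. A sketch in PySpark, with illustrative table names:

from pyspark.sql import SparkSession

def check_rowcount_variance(spark: SparkSession, bronze_table: str, silver_table: str,
                            tolerance: float = 0.20) -> None:
    """Fail the run if Silver's row count drifts from Bronze by more than `tolerance`."""
    bronze_count = spark.table(bronze_table).count()
    silver_count = spark.table(silver_table).count()
    variance = abs(bronze_count - silver_count) / max(bronze_count, 1)
    if variance > tolerance:
        raise ValueError(
            f"Row count variance {variance:.1%} between {bronze_table} ({bronze_count}) "
            f"and {silver_table} ({silver_count}) exceeds {tolerance:.0%}"
        )

# e.g. check_rowcount_variance(spark, "bronze.brz_sap_sales_orders", "silver.slv_sales_orders")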

Do / Don't

| ✅ Do | ❌ Don't |
| --- | --- |
| Deduplicate using deterministic logic | Randomly pick one of N duplicates |
| Apply SCD Type 2 for dimensions that change | Overwrite dimension history |
| Enforce data types explicitly | Rely on implicit type coercion |
| Align to enterprise naming standards | Invent new naming per project |
| Run quality tests before writing Silver | Skip validation and "fix later" |
| Document cleansing rules in dbt YAML | Bury logic in undocumented SQL |

Example: dbt Silver Incremental Model with Dedup and SCD2

-- models/silver/slv_sales_orders.sql

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={
        "field": "order_date",
        "data_type": "date",
        "granularity": "month"
    },
    tags=['silver', 'sales']
) }}

WITH source AS (
    SELECT * FROM {{ ref('brz_sap_sales_orders') }}
    {% if is_incremental() %}
    WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM {{ this }})
    {% endif %}
),

deduplicated AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY _ingested_at DESC
        ) AS _rn
    FROM source
)

SELECT
    -- Business keys
    CAST(order_id AS INT64)                         AS order_id,
    CAST(customer_id AS INT64)                      AS customer_id,

    -- Cleansed attributes
    TRIM(UPPER(order_status))                       AS order_status,
    CAST(order_date AS DATE)                        AS order_date,
    COALESCE(CAST(order_total AS NUMERIC), 0)       AS order_total,
    COALESCE(CAST(currency_code AS STRING), 'USD')  AS currency_code,

    -- Lineage
    _ingested_at,
    _batch_id,
    CURRENT_TIMESTAMP()                             AS _transformed_at

FROM deduplicated
WHERE _rn = 1

-- models/silver/slv_customers_scd.sql

{{ config(
    materialized='incremental',
    unique_key='_surrogate_key',
    incremental_strategy='merge',
    tags=['silver', 'scd2', 'customers']
) }}

WITH incoming AS (
    SELECT
        customer_id,
        customer_name,
        customer_segment,
        region,
        {{ dbt_utils.generate_surrogate_key([
            'customer_id', 'customer_name', 'customer_segment', 'region'
        ]) }}                                           AS _hash_diff,
        _ingested_at
    FROM {{ ref('brz_dynamics_contacts') }}
    {% if is_incremental() %}
    WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM {{ this }} WHERE _is_current)
    {% endif %}
),

{% if is_incremental() %}
existing AS (
    SELECT * FROM {{ this }} WHERE _is_current = TRUE
),

changes AS (
    SELECT i.*
    FROM incoming i
    LEFT JOIN existing e ON i.customer_id = e.customer_id
    WHERE e.customer_id IS NULL       -- new record
       OR e._hash_diff != i._hash_diff -- changed record
)
{% else %}
changes AS (
    SELECT * FROM incoming
)
{% endif %}

SELECT
    {{ dbt_utils.generate_surrogate_key([
        'customer_id', '_ingested_at'
    ]) }}                                               AS _surrogate_key,
    customer_id,
    customer_name,
    customer_segment,
    region,
    _hash_diff,
    _ingested_at                                        AS _valid_from,
    CAST(NULL AS TIMESTAMP)                             AS _valid_to,
    TRUE                                                AS _is_current
FROM changes
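
One caveat: the model above only inserts new versions, so the previously current row keeps `_valid_to = NULL` and `_is_current = TRUE`. Closing it out is usually a separate step (a post-hook, a snapshot, or a merge). A sketch using the Delta Lake Python API, assuming the Silver table is stored as Delta and `changes_df` holds the newly detected versions:

from pyspark.sql import DataFrame, SparkSession
from delta.tables import DeltaTable

def expire_previous_versions(spark: SparkSession, changes_df: DataFrame) -> None:
    """changes_df: the newly detected versions produced by the model above."""
    scd = DeltaTable.forName(spark, "silver.slv_customers_scd")
    (scd.alias("t")
        .merge(
            changes_df.alias("s"),
            # match only the superseded current row, not the freshly inserted version
            "t.customer_id = s.customer_id AND t._is_current = TRUE AND t._hash_diff <> s._hash_diff",
        )
        .whenMatchedUpdate(set={
            "_valid_to": "s._ingested_at",   # the new version's start closes the old one
            "_is_current": "false",
        })
        .execute())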

Gold Layer Best Practices

The Gold layer delivers analytics-ready data products — star schemas, pre-aggregated metrics, and curated datasets designed for specific business domains.

Naming Convention

gld_{domain}_{metric_or_entity}

Examples:
  gld_sales_revenue_daily
  gld_finance_gl_summary
  gld_customers_dim
  gld_sales_fact

Core Principles

  1. Star schema / dimensional modeling — follow Kimball methodology with fact and dimension tables.
  2. Pre-aggregated — compute metrics at the grain consumers need; don't force BI tools to aggregate billions of rows.
  3. Data product contracts — define schema, SLA, ownership, and freshness guarantees.
  4. Materialization decisions — choose based on query patterns and freshness requirements.

Data Product Contract

Every Gold table should have a documented contract:

# models/gold/gld_sales_revenue_daily.yml
models:
    - name: gld_sales_revenue_daily
      description: "Daily aggregated sales revenue by region and product category"
      meta:
          owner: "data-engineering@company.com"
          sla_freshness: "< 4 hours"
          grain: "date × region × product_category"
          consumers: ["Power BI Sales Dashboard", "Finance Reporting"]
          classification: "internal"
      columns:
          - name: revenue_date
            description: "Calendar date of the revenue"
            tests: [not_null]
          - name: region
            tests:
                - not_null
                - accepted_values:
                      values: ["NA", "EMEA", "APAC", "LATAM"]
          - name: total_revenue
            description: "Sum of order_total in USD"
            tests: [not_null]

Materialization Decision Guide

| Scenario | Materialization | Rationale |
| --- | --- | --- |
| Large fact tables refreshed daily | `table` | Full rebuild is fast enough; query performance critical |
| Dimensions with SCD | `incremental` (merge) | Avoid full rebuild on large dimension tables |
| Real-time dashboards, small data | `view` / materialized view | Always-fresh, low compute for small datasets |
| Expensive aggregations, low freshness need | `table` with scheduled rebuild | Pre-compute once, query many times |

Do / Don't

| ✅ Do | ❌ Don't |
| --- | --- |
| Model as star schema (facts + dimensions) | Create wide, denormalized "mega-tables" |
| Document data contracts in YAML | Publish tables without ownership or SLA |
| Pre-aggregate at the grain consumers need | Force consumers to aggregate raw Silver data |
| Use surrogate keys for dimensions | Expose source-system natural keys to consumers |
| Version Gold schemas with contract tests | Make breaking changes without notice |
| Optimize for read performance | Optimize for write performance |

Example: dbt Gold Model

-- models/gold/gld_sales_revenue_daily.sql

{{ config(
    materialized='table',
    partition_by={
        "field": "revenue_date",
        "data_type": "date",
        "granularity": "month"
    },
    cluster_by=['region', 'product_category'],
    tags=['gold', 'sales', 'daily']
) }}

WITH orders AS (
    SELECT
        o.order_date                                AS revenue_date,
        c.region,
        p.product_category,
        o.order_total                               AS revenue_amount,
        o.currency_code
    FROM {{ ref('slv_sales_orders') }} o
    INNER JOIN {{ ref('slv_customers_scd') }} c
        ON o.customer_id = c.customer_id
        AND c._is_current = TRUE
    INNER JOIN {{ ref('slv_products') }} p
        ON o.product_id = p.product_id
)

SELECT
    revenue_date,
    region,
    product_category,
    COUNT(*)                                        AS order_count,
    SUM(revenue_amount)                             AS total_revenue,
    AVG(revenue_amount)                             AS avg_order_value,
    CURRENT_TIMESTAMP()                             AS _refreshed_at

FROM orders
GROUP BY 1, 2, 3

Schema Evolution

Schema changes are inevitable. Handle them deliberately to avoid breaking downstream consumers.

Non-Breaking Changes (Safe)

These can be applied automatically:

  • Adding a new column with a default or nullable value
  • Widening a type (e.g., INT → BIGINT, VARCHAR(50) → VARCHAR(200))
  • Adding a new table

Breaking Changes (Require Coordination)

These require versioning and consumer communication:

  • Removing or renaming a column
  • Changing a column's data type (narrowing or incompatible)
  • Changing the grain of a table
  • Altering primary keys
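
A lightweight guard for the first two categories is to diff a candidate schema against the published one before deploying. A sketch in PySpark; the type-compatibility rule is deliberately simplified (any type change is flagged for review):

from pyspark.sql import DataFrame, SparkSession

def breaking_changes(spark: SparkSession, table_name: str, candidate_df: DataFrame) -> list[str]:
    """Return breaking differences between the published table and a candidate schema."""
    published = {f.name: f.dataType.simpleString() for f in spark.table(table_name).schema.fields}
    candidate = {f.name: f.dataType.simpleString() for f in candidate_df.schema.fields}

    problems = []
    for col, col_type in published.items():
        if col not in candidate:
            problems.append(f"column removed: {col}")
        elif candidate[col] != col_type:
            problems.append(f"type changed: {col} {col_type} -> {candidate[col]}")
    # columns that exist only in the candidate are additive and therefore non-breaking
    return problems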

Delta / Spark Schema Evolution

# Non-breaking: add new columns automatically
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .saveAsTable("bronze.brz_api_events")

# Breaking: overwrite schema entirely (use with caution)
df.write.format("delta") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .saveAsTable("bronze.brz_api_events")

Schema Overwrite

Only use overwriteSchema on Bronze tables, and only when a deliberate full reload from source is intended. Never use it on Silver or Gold tables without coordinating with downstream consumers.

Schema Evolution Workflow

flowchart TD
    A[Schema change detected] --> B{Breaking?}
    B -->|No| C[Apply with mergeSchema]
    B -->|Yes| D[Create versioned migration plan]
    D --> E[Notify consumers]
    E --> F[Deploy to staging]
    F --> G[Consumer validation]
    G --> H[Deploy to production]
    H --> I[Deprecate old schema after grace period]

Partitioning Strategies

Effective partitioning accelerates queries and reduces costs. Poor partitioning does the opposite.

| Layer | Partition Key | Rationale |
| --- | --- | --- |
| Bronze | `_ingested_date` | Supports time-travel and retention management |
| Silver | Business date (e.g., `order_date`) | Aligns with how Silver is queried |
| Gold | Primary date dimension (e.g., `revenue_date`) | Matches dashboard date filters |

Partition Pruning Tips

  • Always filter on partition columns in WHERE clauses.
  • Avoid functions on partition columns: `WHERE YEAR(order_date) = 2024` won't prune; use `WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'` instead, as illustrated below.
  • Use clustering (BigQuery) or Z-ordering (Delta) for secondary sort columns within partitions.
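
On a Delta-backed Silver table, the last two tips look roughly like this sketch (`OPTIMIZE ... ZORDER BY` assumes Databricks or another Delta distribution that supports it; the table and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prunes: the filter is expressed directly on the partition column
spark.sql("""
    SELECT order_id, order_total
    FROM silver.slv_sales_orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
""")

# Does NOT prune: wrapping the partition column in a function hides it from the planner
#   ... WHERE YEAR(order_date) = 2024

# Z-order frequently filtered secondary columns within each partition
spark.sql("OPTIMIZE silver.slv_sales_orders ZORDER BY (customer_id)")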

Over-Partitioning Anti-Pattern

Over-Partitioning Small Tables

Partitioning a table with less than 1 GB of data (or fewer than ~1 million rows) usually hurts performance. Each partition adds file and metadata overhead. For small tables, use materialized='table' without partitioning.

**Rule of thumb:** Only partition when each partition contains at least **1 GB** of data.

Cross-Layer Quality Gates

Quality gates are automated checkpoints between layers that prevent bad data from propagating downstream.

flowchart LR
    subgraph Bronze
        B[Raw data]
    end

    subgraph QG1["Quality Gate 1"]
        T1[not_null on PKs]
        T2[unique on PKs]
        T3[row count variance < 20%]
        T4[schema conformance]
    end

    subgraph Silver
        S[Cleansed data]
    end

    subgraph QG2["Quality Gate 2"]
        T5[referential integrity]
        T6[accepted_values on enums]
        T7[freshness < SLA]
        T8[aggregation balance checks]
    end

    subgraph Gold
        G[Analytics-ready data]
    end

    B --> QG1 --> S --> QG2 --> G

Quality Gate Implementation

# tests/quality_gates/qg_bronze_to_silver.yml
sources:
    - name: bronze
      freshness:
          warn_after: { count: 4, period: hour }
          error_after: { count: 8, period: hour }

models:
    - name: slv_sales_orders
      tests:
          - dbt_utils.equal_rowcount:
                compare_model: ref('brz_sap_sales_orders')
                # equal_rowcount is strict; in practice a custom macro wraps
                # this check with a tolerance (e.g. 5%) to allow for dedup
      columns:
          - name: order_id
            tests: [not_null, unique]
          - name: order_status
            tests:
                - accepted_values:
                      values:
                          [
                              "OPEN",
                              "SHIPPED",
                              "DELIVERED",
                              "CANCELLED",
                              "RETURNED",
                          ]
          - name: order_total
            tests:
                - dbt_utils.expression_is_true:
                      expression: ">= 0"
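
If the gate is implemented with Great Expectations instead of dbt tests, the same checks can run against the candidate Silver DataFrame before it is written. A sketch using the legacy SparkDFDataset API; newer GX releases use a different entry point, so treat this as illustrative:

from great_expectations.dataset import SparkDFDataset

def bronze_to_silver_gate(silver_df) -> None:
    """silver_df: the candidate Silver DataFrame, validated before it is written."""
    gdf = SparkDFDataset(silver_df)
    gdf.expect_column_values_to_not_be_null("order_id")
    gdf.expect_column_values_to_be_unique("order_id")
    gdf.expect_column_values_to_be_in_set(
        "order_status", ["OPEN", "SHIPPED", "DELIVERED", "CANCELLED", "RETURNED"]
    )
    gdf.expect_column_values_to_be_between("order_total", min_value=0)

    results = gdf.validate()
    if not results.success:
        raise ValueError(f"Bronze → Silver quality gate failed: {results}")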

Anti-Patterns

Skipping the Silver Layer

Problem: Transforming directly from Bronze to Gold (Bronze → Gold).

**Why it's harmful:**

- Cleansing logic gets duplicated across every Gold model
- No single source of truth for an entity
- Changes to cleansing rules require updating multiple Gold models
- Lineage becomes tangled and ungovernable

**Fix:** Always route through Silver. Even if Silver and Gold logic seem simple today, the Silver layer provides a stable, reusable foundation as complexity grows.

Transforming Data in Bronze

Problem: Applying business rules, filtering rows, or renaming columns in the Bronze layer.

**Why it's harmful:**

- Destroys the ability to replay from raw data
- Makes debugging source issues impossible
- Violates the "full fidelity" principle

**Fix:** Bronze is append-only, no transformations. All cleansing and business logic belongs in Silver.

Gold Tables Without Data Contracts

Problem: Publishing Gold tables without documented ownership, SLA, schema, or consumers.

**Why it's harmful:**

- No accountability when data is late or wrong
- Breaking changes surprise consumers
- "Shadow" Gold tables proliferate because nobody trusts the originals

**Fix:** Every Gold table must have a YAML contract specifying `owner`, `sla_freshness`, `grain`, and `consumers`. Use `dbt model contracts` to enforce column types and constraints at build time.

Over-Partitioning Small Tables

Problem: Partitioning dimension tables or low-volume tables by date.

**Why it's harmful:**

- Creates thousands of tiny files (small-file problem)
- Increases metadata overhead and query planning time
- Actually **slows down** queries instead of speeding them up

**Fix:** Only partition tables where each partition contains ≥ 1 GB. For small tables, use clustering or no partitioning.

Cross-References

  • Architecture Decision Records — see docs/adr/ for decisions on lakehouse platform, partitioning strategy, and SCD approach
  • Data Governance Best Practices — see docs/best-practices/data-governance.md for classification, access control, and lineage requirements
  • Data Engineering Guide — see docs/best-practices/data-engineering.md for pipeline orchestration, idempotency, and monitoring patterns
  • dbt Style Guide — see docs/best-practices/dbt-style-guide.md (planned) for SQL formatting, naming conventions, and project structure