DOJ Antitrust: Step-by-Step Domain Build

DOJ Antitrust: Step-by-Step Domain Build¶

This guide walks through the complete process of building the DOJ Antitrust analytics domain from scratch using CSA-in-a-Box. It serves as a reference template for constructing any new domain in the framework.

Step 1: Data Discovery and Source Identification¶

Every domain begins with identifying authoritative data sources and understanding their structure, update cadence, and access patterns.

Discovery Process¶

For the DOJ Antitrust domain, the discovery process involved:

Catalog public data sources — Identify all publicly available datasets related to antitrust enforcement
Assess data formats — Determine whether sources provide CSV, JSON, PDF, HTML, or API access
Evaluate update frequency — Annual reports vs. real-time press releases vs. quarterly court data
Check licensing and terms — Government data is generally public domain, but verify
Sample the data — Download representative files and profile the contents

Sources Identified¶

Source	Format	Access Method	Refresh
HSR Annual Reports	PDF tables, some CSV	HTTP download	Annual
Criminal enforcement charts	HTML tables, PDF	Web scraping / manual extract	Periodic
FJC Integrated Database	CSV, SAS datasets	HTTP download	Quarterly
USSC Commission Datafiles	CSV, SAS datasets	HTTP download	Annual
DOJ press releases	HTML	RSS / web scraping	Ongoing

Data Format Challenges

Government data frequently arrives in PDF tables or legacy SAS formats. Plan for format conversion in your ingestion pipeline. For the seed data approach used here, manual extraction to CSV is acceptable; for production pipelines, consider tools like Tabula (PDF) or the pandas SAS reader.

Step 2: Schema Design (Bronze / Silver / Gold)¶

The medallion architecture provides clear separation of concerns across data maturity levels.

Bronze Layer — Raw Ingestion¶

Bronze models preserve source data exactly as received, adding only ingestion metadata.

-- models/bronze/brz_criminal_enforcement.sql
WITH source AS (
    SELECT * FROM {{ ref('criminal_enforcement_seed') }}
)

SELECT
    *,
    CURRENT_TIMESTAMP() AS _loaded_at,
    'doj_criminal_charts' AS _source_system,
    '{{ invocation_id }}' AS _dbt_run_id
FROM source

Bronze conventions:

Prefix: brz_
No data type changes or transformations
Add _loaded_at, _source_system, _dbt_run_id metadata columns
Materialized as incremental or table

Silver Layer — Cleaned and Typed¶

Silver models apply data typing, deduplication, renaming to business-friendly names, and basic validation.

-- models/silver/stg_criminal_enforcement.sql
WITH source AS (
    SELECT * FROM {{ ref('brz_criminal_enforcement') }}
)

SELECT
    CAST(fiscal_year AS INT) AS fiscal_year,
    TRIM(UPPER(violation_type)) AS violation_type,
    TRIM(defendant_name) AS defendant_name,
    CAST(fine_amount AS DECIMAL(18, 2)) AS fine_amount,
    CAST(sentence_months AS INT) AS sentence_months,
    CASE
        WHEN imprisonment = 'Yes' THEN TRUE
        WHEN imprisonment = 'No' THEN FALSE
        ELSE NULL
    END AS imprisonment_imposed,
    _loaded_at,
    _source_system
FROM source
WHERE fiscal_year IS NOT NULL

Silver conventions:

Prefix: stg_
Explicit data type casting
Consistent naming (snake_case, business-friendly)
Basic filtering (remove obviously invalid records)
Materialized as table or view

Gold Layer — Business-Ready¶

Gold models implement business logic, create dimensional models, and produce analytics-ready datasets.

-- models/gold/fact_enforcement_actions.sql
WITH criminal AS (
    SELECT * FROM {{ ref('stg_criminal_enforcement') }}
),

civil AS (
    SELECT * FROM {{ ref('stg_civil_enforcement') }}
),

combined AS (
    SELECT
        fiscal_year,
        'criminal' AS enforcement_type,
        violation_type,
        defendant_name,
        fine_amount,
        sentence_months,
        imprisonment_imposed
    FROM criminal

    UNION ALL

    SELECT
        fiscal_year,
        'civil' AS enforcement_type,
        violation_type,
        defendant_name,
        remedy_value AS fine_amount,
        NULL AS sentence_months,
        FALSE AS imprisonment_imposed
    FROM civil
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['fiscal_year', 'enforcement_type',
        'defendant_name']) }} AS enforcement_action_id,
    *
FROM combined

Gold conventions:

Fact tables: fact_ prefix with surrogate keys
Dimension tables: dim_ prefix
Business logic encapsulated here, not in silver
Materialized as table

Step 3: Seed Data Creation from Public Sources¶

For domains backed by static or slowly-changing public data, dbt seeds provide a simple, version-controlled ingestion mechanism.

Creating Seed Files¶

domains/doj_antitrust/
└── seeds/
    ├── criminal_enforcement_seed.csv
    ├── hsr_filing_statistics_seed.csv
    ├── antitrust_case_filings_seed.csv
    └── schema.yml

Example seed file (criminal_enforcement_seed.csv):

fiscal_year,violation_type,defendant_name,fine_amount,sentence_months,imprisonment
2023,Price Fixing,Example Corp,50000000,0,No
2023,Bid Rigging,John Doe,0,18,Yes
2022,Market Allocation,Sample Inc,25000000,0,No

Seed schema (schema.yml):

version: 2

seeds:
    - name: criminal_enforcement_seed
      description: >
          Criminal antitrust enforcement actions from DOJ published charts.
          Source: https://www.justice.gov/atr/criminal-enforcement-fine-and-jail-charts
      config:
          column_types:
              fiscal_year: integer
              fine_amount: float
              sentence_months: integer
      columns:
          - name: fiscal_year
            description: Federal fiscal year of the enforcement action
            tests:
                - not_null
          - name: violation_type
            description: Category of antitrust violation
            tests:
                - not_null
                - accepted_values:
                      values:
                          [
                              "Price Fixing",
                              "Bid Rigging",
                              "Market Allocation",
                              "Other",
                          ]

Seeds vs. Ingestion Pipelines

Seeds are ideal for reference data, lookup tables, and slowly-changing datasets that you want version-controlled. For high-volume or frequently-updated data, use ADF pipelines to land files directly in the bronze layer.

Step 4: dbt Model Development¶

Project Structure¶

domains/doj_antitrust/
├── dbt_project.yml
├── seeds/
│   ├── criminal_enforcement_seed.csv
│   └── schema.yml
├── models/
│   ├── bronze/
│   │   ├── brz_criminal_enforcement.sql
│   │   ├── brz_hsr_filings.sql
│   │   └── schema.yml
│   ├── silver/
│   │   ├── stg_criminal_enforcement.sql
│   │   ├── stg_hsr_filings.sql
│   │   └── schema.yml
│   └── gold/
│       ├── fact_enforcement_actions.sql
│       ├── fact_merger_reviews.sql
│       ├── dim_violation_types.sql
│       └── schema.yml
├── tests/
│   └── assert_valid_fiscal_years.sql
└── analyses/
    └── enforcement_trends.sql

dbt_project.yml Configuration¶

name: doj_antitrust
version: "1.0.0"
config-version: 2

profile: "databricks"

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analyses"]

models:
    doj_antitrust:
        bronze:
            +materialized: table
            +schema: bronze
            +tags: ["bronze", "doj"]
        silver:
            +materialized: table
            +schema: silver
            +tags: ["silver", "doj"]
        gold:
            +materialized: table
            +schema: gold
            +tags: ["gold", "doj"]

Step 5: Data Quality with Flag-Don't-Drop¶

CSA-in-a-Box uses the flag-don't-drop pattern for data quality. Instead of silently dropping records that fail validation, the pattern flags them with quality indicators and preserves them for investigation.

Implementation¶

-- models/silver/stg_criminal_enforcement.sql
WITH source AS (
    SELECT * FROM {{ ref('brz_criminal_enforcement') }}
),

validated AS (
    SELECT
        *,

        -- Quality flags
        CASE
            WHEN fiscal_year IS NULL THEN 'FAIL'
            WHEN fiscal_year < 1890 OR fiscal_year > YEAR(CURRENT_DATE()) THEN 'WARN'
            ELSE 'PASS'
        END AS _dq_fiscal_year,

        CASE
            WHEN fine_amount IS NOT NULL AND fine_amount < 0 THEN 'FAIL'
            ELSE 'PASS'
        END AS _dq_fine_amount,

        CASE
            WHEN violation_type IS NULL THEN 'FAIL'
            WHEN violation_type NOT IN ('Price Fixing', 'Bid Rigging',
                'Market Allocation', 'Other') THEN 'WARN'
            ELSE 'PASS'
        END AS _dq_violation_type

    FROM source
)

SELECT
    *,
    CASE
        WHEN _dq_fiscal_year = 'FAIL'
            OR _dq_fine_amount = 'FAIL'
            OR _dq_violation_type = 'FAIL'
        THEN 'QUARANTINE'
        WHEN _dq_fiscal_year = 'WARN'
            OR _dq_violation_type = 'WARN'
        THEN 'WARNING'
        ELSE 'CLEAN'
    END AS _dq_status
FROM validated

Quality Status Levels¶

Status	Meaning	Action
`CLEAN`	All checks passed	Include in gold layer
`WARNING`	Minor issues detected	Include in gold with flag; review periodically
`QUARANTINE`	Critical issues	Exclude from gold; route to data steward

Why Flag-Don't-Drop?

Dropping records silently creates invisible data loss. Government data often contains legitimate edge cases (historical records, unusual formats). Flagging preserves the data for human review while protecting downstream analytics from bad inputs.

Step 6: Gold Layer Business Logic¶

Gold models combine cleaned silver data with business rules to produce analytics-ready tables.

Dimension Tables¶

-- models/gold/dim_violation_types.sql
SELECT DISTINCT
    {{ dbt_utils.generate_surrogate_key(['violation_type']) }} AS violation_type_id,
    violation_type,
    CASE violation_type
        WHEN 'Price Fixing' THEN 'Sherman Act Section 1'
        WHEN 'Bid Rigging' THEN 'Sherman Act Section 1'
        WHEN 'Market Allocation' THEN 'Sherman Act Section 1'
        WHEN 'Monopolization' THEN 'Sherman Act Section 2'
        WHEN 'Merger' THEN 'Clayton Act Section 7'
        ELSE 'Other'
    END AS governing_statute,
    CASE violation_type
        WHEN 'Price Fixing' THEN TRUE
        WHEN 'Bid Rigging' THEN TRUE
        WHEN 'Market Allocation' THEN TRUE
        ELSE FALSE
    END AS is_per_se_criminal
FROM {{ ref('stg_criminal_enforcement') }}
WHERE _dq_status != 'QUARANTINE'

Fact Tables with Metrics¶

-- models/gold/fact_merger_reviews.sql
SELECT
    fiscal_year,
    total_hsr_transactions_reported,
    transactions_granted_early_termination,
    second_requests_issued,
    ROUND(
        second_requests_issued * 100.0
        / NULLIF(total_hsr_transactions_reported, 0), 2
    ) AS second_request_rate_pct,
    merger_challenges_filed,
    mergers_abandoned,
    consent_decrees_entered
FROM {{ ref('stg_hsr_filings') }}
WHERE _dq_status != 'QUARANTINE'

Step 7: Data Product Contracts¶

Data product contracts define the schema, SLAs, and ownership of gold-layer datasets exposed to consumers.

# data_products/doj_enforcement_actions.yml
data_product:
    name: doj_enforcement_actions
    version: "1.0.0"
    domain: doj_antitrust
    owner: analytics-team
    description: >
        Consolidated antitrust enforcement actions combining criminal and civil
        cases from DOJ published data. Includes fines, sentences, and violation
        classifications.

    schema:
        table: fact_enforcement_actions
        columns:
            - name: enforcement_action_id
              type: STRING
              description: Surrogate key
              pii: false
            - name: fiscal_year
              type: INT
              description: Federal fiscal year
            - name: enforcement_type
              type: STRING
              description: "criminal or civil"
            - name: violation_type
              type: STRING
              description: Category of antitrust violation
            - name: fine_amount
              type: DECIMAL(18,2)
              description: Fine or remedy amount in USD

    sla:
        freshness: monthly
        completeness: ">= 95%"
        availability: "99.5%"

    classification:
        sensitivity: public
        pii: false
        export_control: none

    consumers:
        - power_bi_enforcement_dashboard
        - research_notebooks

Step 8: Analytics Notebooks¶

Databricks notebooks consume gold-layer tables for exploratory analysis and visualization.

# notebooks/enforcement_trends.py

# Load gold layer tables
enforcement = spark.table("gold.fact_enforcement_actions")
mergers = spark.table("gold.fact_merger_reviews")

# Criminal enforcement trends
criminal_trends = (
    enforcement
    .filter(col("enforcement_type") == "criminal")
    .groupBy("fiscal_year", "violation_type")
    .agg(
        count("*").alias("case_count"),
        sum("fine_amount").alias("total_fines"),
        avg("sentence_months").alias("avg_sentence_months")
    )
    .orderBy("fiscal_year")
)

display(criminal_trends)

# Merger review intensity
merger_trends = (
    mergers
    .select(
        "fiscal_year",
        "total_hsr_transactions_reported",
        "second_request_rate_pct",
        "merger_challenges_filed"
    )
    .orderBy("fiscal_year")
)

display(merger_trends)

Step 9: Connecting to Azure Services¶

Azure Data Factory Integration¶

For production deployments, replace seed-based ingestion with ADF pipelines:

graph LR
    subgraph ADF Pipeline
        A[HTTP Source<br/>FJC/USSC] --> B[Copy Activity]
        B --> C[ADLS Gen2<br/>Bronze Zone]
        C --> D[Databricks<br/>dbt run]
    end

Pipeline configuration:

Linked Service — HTTP connector to government data endpoints
Dataset — CSV/JSON format definitions matching source schemas
Copy Activity — Land raw files in bronze/doj_antitrust/ container path
Databricks Activity — Trigger dbt run --select tag:doj after ingestion

Microsoft Purview Catalog¶

Register gold-layer tables in Purview for discoverability and governance:

Classification — Apply sensitivity labels (public data in this case)
Lineage — Auto-captured from ADF and Databricks activities
Glossary — Map business terms (e.g., "HSR Filing" → fact_merger_reviews)
Access Policies — Control who can query enforcement data

Power BI Dashboards¶

Connect Power BI directly to gold-layer Delta tables via Databricks SQL endpoint or Synapse Serverless for executive dashboards showing:

Enforcement action trends over time
Fine amount distributions by violation type
Merger review pipeline metrics
Geographic distribution of cases

Summary¶

Building the DOJ Antitrust domain followed the standard CSA-in-a-Box pattern:

Step	What	Why
1. Discovery	Identify sources and formats	Know what you're working with
2. Schema design	Bronze/Silver/Gold layers	Separation of concerns
3. Seed data	CSV files from public sources	Version-controlled, reproducible
4. dbt models	SQL transformations per layer	Testable, documented
5. Data quality	Flag-don't-drop pattern	Preserve data, protect analytics
6. Gold logic	Business rules and dimensions	Analytics-ready output
7. Contracts	Schema + SLA definitions	Consumer confidence
8. Notebooks	Exploratory analytics	Derive insights
9. Azure services	ADF, Purview, Power BI	Production operations

This pattern is repeatable for any new domain. Replace the data sources and business logic, but keep the architecture consistent.