DOJ Antitrust: Step-by-Step Domain Build
DOJ Antitrust: Step-by-Step Domain Build¶
This guide walks through the complete process of building the DOJ Antitrust analytics domain from scratch using CSA-in-a-Box. It serves as a reference template for constructing any new domain in the framework.
Step 1: Data Discovery and Source Identification¶
Every domain begins with identifying authoritative data sources and understanding their structure, update cadence, and access patterns.
Discovery Process¶
For the DOJ Antitrust domain, the discovery process involved:
- Catalog public data sources — Identify all publicly available datasets related to antitrust enforcement
- Assess data formats — Determine whether sources provide CSV, JSON, PDF, HTML, or API access
- Evaluate update frequency — Annual reports vs. real-time press releases vs. quarterly court data
- Check licensing and terms — Government data is generally public domain, but verify
- Sample the data — Download representative files and profile the contents
Sources Identified¶
| Source | Format | Access Method | Refresh |
|---|---|---|---|
| HSR Annual Reports | PDF tables, some CSV | HTTP download | Annual |
| Criminal enforcement charts | HTML tables, PDF | Web scraping / manual extract | Periodic |
| FJC Integrated Database | CSV, SAS datasets | HTTP download | Quarterly |
| USSC Commission Datafiles | CSV, SAS datasets | HTTP download | Annual |
| DOJ press releases | HTML | RSS / web scraping | Ongoing |
Data Format Challenges
Government data frequently arrives in PDF tables or legacy SAS formats. Plan for format conversion in your ingestion pipeline. For the seed data approach used here, manual extraction to CSV is acceptable; for production pipelines, consider tools like Tabula (PDF) or the pandas SAS reader.
Step 2: Schema Design (Bronze / Silver / Gold)¶
The medallion architecture provides clear separation of concerns across data maturity levels.
Bronze Layer — Raw Ingestion¶
Bronze models preserve source data exactly as received, adding only ingestion metadata.
-- models/bronze/brz_criminal_enforcement.sql
WITH source AS (
SELECT * FROM {{ ref('criminal_enforcement_seed') }}
)
SELECT
*,
CURRENT_TIMESTAMP() AS _loaded_at,
'doj_criminal_charts' AS _source_system,
'{{ invocation_id }}' AS _dbt_run_id
FROM source
Bronze conventions:
- Prefix:
brz_ - No data type changes or transformations
- Add
_loaded_at,_source_system,_dbt_run_idmetadata columns - Materialized as incremental or table
Silver Layer — Cleaned and Typed¶
Silver models apply data typing, deduplication, renaming to business-friendly names, and basic validation.
-- models/silver/stg_criminal_enforcement.sql
WITH source AS (
SELECT * FROM {{ ref('brz_criminal_enforcement') }}
)
SELECT
CAST(fiscal_year AS INT) AS fiscal_year,
TRIM(UPPER(violation_type)) AS violation_type,
TRIM(defendant_name) AS defendant_name,
CAST(fine_amount AS DECIMAL(18, 2)) AS fine_amount,
CAST(sentence_months AS INT) AS sentence_months,
CASE
WHEN imprisonment = 'Yes' THEN TRUE
WHEN imprisonment = 'No' THEN FALSE
ELSE NULL
END AS imprisonment_imposed,
_loaded_at,
_source_system
FROM source
WHERE fiscal_year IS NOT NULL
Silver conventions:
- Prefix:
stg_ - Explicit data type casting
- Consistent naming (snake_case, business-friendly)
- Basic filtering (remove obviously invalid records)
- Materialized as table or view
Gold Layer — Business-Ready¶
Gold models implement business logic, create dimensional models, and produce analytics-ready datasets.
-- models/gold/fact_enforcement_actions.sql
WITH criminal AS (
SELECT * FROM {{ ref('stg_criminal_enforcement') }}
),
civil AS (
SELECT * FROM {{ ref('stg_civil_enforcement') }}
),
combined AS (
SELECT
fiscal_year,
'criminal' AS enforcement_type,
violation_type,
defendant_name,
fine_amount,
sentence_months,
imprisonment_imposed
FROM criminal
UNION ALL
SELECT
fiscal_year,
'civil' AS enforcement_type,
violation_type,
defendant_name,
remedy_value AS fine_amount,
NULL AS sentence_months,
FALSE AS imprisonment_imposed
FROM civil
)
SELECT
{{ dbt_utils.generate_surrogate_key(['fiscal_year', 'enforcement_type',
'defendant_name']) }} AS enforcement_action_id,
*
FROM combined
Gold conventions:
- Fact tables:
fact_prefix with surrogate keys - Dimension tables:
dim_prefix - Business logic encapsulated here, not in silver
- Materialized as table
Step 3: Seed Data Creation from Public Sources¶
For domains backed by static or slowly-changing public data, dbt seeds provide a simple, version-controlled ingestion mechanism.
Creating Seed Files¶
domains/doj_antitrust/
└── seeds/
├── criminal_enforcement_seed.csv
├── hsr_filing_statistics_seed.csv
├── antitrust_case_filings_seed.csv
└── schema.yml
Example seed file (criminal_enforcement_seed.csv):
fiscal_year,violation_type,defendant_name,fine_amount,sentence_months,imprisonment
2023,Price Fixing,Example Corp,50000000,0,No
2023,Bid Rigging,John Doe,0,18,Yes
2022,Market Allocation,Sample Inc,25000000,0,No
Seed schema (schema.yml):
version: 2
seeds:
- name: criminal_enforcement_seed
description: >
Criminal antitrust enforcement actions from DOJ published charts.
Source: https://www.justice.gov/atr/criminal-enforcement-fine-and-jail-charts
config:
column_types:
fiscal_year: integer
fine_amount: float
sentence_months: integer
columns:
- name: fiscal_year
description: Federal fiscal year of the enforcement action
tests:
- not_null
- name: violation_type
description: Category of antitrust violation
tests:
- not_null
- accepted_values:
values:
[
"Price Fixing",
"Bid Rigging",
"Market Allocation",
"Other",
]
Seeds vs. Ingestion Pipelines
Seeds are ideal for reference data, lookup tables, and slowly-changing datasets that you want version-controlled. For high-volume or frequently-updated data, use ADF pipelines to land files directly in the bronze layer.
Step 4: dbt Model Development¶
Project Structure¶
domains/doj_antitrust/
├── dbt_project.yml
├── seeds/
│ ├── criminal_enforcement_seed.csv
│ └── schema.yml
├── models/
│ ├── bronze/
│ │ ├── brz_criminal_enforcement.sql
│ │ ├── brz_hsr_filings.sql
│ │ └── schema.yml
│ ├── silver/
│ │ ├── stg_criminal_enforcement.sql
│ │ ├── stg_hsr_filings.sql
│ │ └── schema.yml
│ └── gold/
│ ├── fact_enforcement_actions.sql
│ ├── fact_merger_reviews.sql
│ ├── dim_violation_types.sql
│ └── schema.yml
├── tests/
│ └── assert_valid_fiscal_years.sql
└── analyses/
└── enforcement_trends.sql
dbt_project.yml Configuration¶
name: doj_antitrust
version: "1.0.0"
config-version: 2
profile: "databricks"
model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
models:
doj_antitrust:
bronze:
+materialized: table
+schema: bronze
+tags: ["bronze", "doj"]
silver:
+materialized: table
+schema: silver
+tags: ["silver", "doj"]
gold:
+materialized: table
+schema: gold
+tags: ["gold", "doj"]
Step 5: Data Quality with Flag-Don't-Drop¶
CSA-in-a-Box uses the flag-don't-drop pattern for data quality. Instead of silently dropping records that fail validation, the pattern flags them with quality indicators and preserves them for investigation.
Implementation¶
-- models/silver/stg_criminal_enforcement.sql
WITH source AS (
SELECT * FROM {{ ref('brz_criminal_enforcement') }}
),
validated AS (
SELECT
*,
-- Quality flags
CASE
WHEN fiscal_year IS NULL THEN 'FAIL'
WHEN fiscal_year < 1890 OR fiscal_year > YEAR(CURRENT_DATE()) THEN 'WARN'
ELSE 'PASS'
END AS _dq_fiscal_year,
CASE
WHEN fine_amount IS NOT NULL AND fine_amount < 0 THEN 'FAIL'
ELSE 'PASS'
END AS _dq_fine_amount,
CASE
WHEN violation_type IS NULL THEN 'FAIL'
WHEN violation_type NOT IN ('Price Fixing', 'Bid Rigging',
'Market Allocation', 'Other') THEN 'WARN'
ELSE 'PASS'
END AS _dq_violation_type
FROM source
)
SELECT
*,
CASE
WHEN _dq_fiscal_year = 'FAIL'
OR _dq_fine_amount = 'FAIL'
OR _dq_violation_type = 'FAIL'
THEN 'QUARANTINE'
WHEN _dq_fiscal_year = 'WARN'
OR _dq_violation_type = 'WARN'
THEN 'WARNING'
ELSE 'CLEAN'
END AS _dq_status
FROM validated
Quality Status Levels¶
| Status | Meaning | Action |
|---|---|---|
CLEAN | All checks passed | Include in gold layer |
WARNING | Minor issues detected | Include in gold with flag; review periodically |
QUARANTINE | Critical issues | Exclude from gold; route to data steward |
Why Flag-Don't-Drop?
Dropping records silently creates invisible data loss. Government data often contains legitimate edge cases (historical records, unusual formats). Flagging preserves the data for human review while protecting downstream analytics from bad inputs.
Step 6: Gold Layer Business Logic¶
Gold models combine cleaned silver data with business rules to produce analytics-ready tables.
Dimension Tables¶
-- models/gold/dim_violation_types.sql
SELECT DISTINCT
{{ dbt_utils.generate_surrogate_key(['violation_type']) }} AS violation_type_id,
violation_type,
CASE violation_type
WHEN 'Price Fixing' THEN 'Sherman Act Section 1'
WHEN 'Bid Rigging' THEN 'Sherman Act Section 1'
WHEN 'Market Allocation' THEN 'Sherman Act Section 1'
WHEN 'Monopolization' THEN 'Sherman Act Section 2'
WHEN 'Merger' THEN 'Clayton Act Section 7'
ELSE 'Other'
END AS governing_statute,
CASE violation_type
WHEN 'Price Fixing' THEN TRUE
WHEN 'Bid Rigging' THEN TRUE
WHEN 'Market Allocation' THEN TRUE
ELSE FALSE
END AS is_per_se_criminal
FROM {{ ref('stg_criminal_enforcement') }}
WHERE _dq_status != 'QUARANTINE'
Fact Tables with Metrics¶
-- models/gold/fact_merger_reviews.sql
SELECT
fiscal_year,
total_hsr_transactions_reported,
transactions_granted_early_termination,
second_requests_issued,
ROUND(
second_requests_issued * 100.0
/ NULLIF(total_hsr_transactions_reported, 0), 2
) AS second_request_rate_pct,
merger_challenges_filed,
mergers_abandoned,
consent_decrees_entered
FROM {{ ref('stg_hsr_filings') }}
WHERE _dq_status != 'QUARANTINE'
Step 7: Data Product Contracts¶
Data product contracts define the schema, SLAs, and ownership of gold-layer datasets exposed to consumers.
# data_products/doj_enforcement_actions.yml
data_product:
name: doj_enforcement_actions
version: "1.0.0"
domain: doj_antitrust
owner: analytics-team
description: >
Consolidated antitrust enforcement actions combining criminal and civil
cases from DOJ published data. Includes fines, sentences, and violation
classifications.
schema:
table: fact_enforcement_actions
columns:
- name: enforcement_action_id
type: STRING
description: Surrogate key
pii: false
- name: fiscal_year
type: INT
description: Federal fiscal year
- name: enforcement_type
type: STRING
description: "criminal or civil"
- name: violation_type
type: STRING
description: Category of antitrust violation
- name: fine_amount
type: DECIMAL(18,2)
description: Fine or remedy amount in USD
sla:
freshness: monthly
completeness: ">= 95%"
availability: "99.5%"
classification:
sensitivity: public
pii: false
export_control: none
consumers:
- power_bi_enforcement_dashboard
- research_notebooks
Step 8: Analytics Notebooks¶
Databricks notebooks consume gold-layer tables for exploratory analysis and visualization.
# notebooks/enforcement_trends.py
# Load gold layer tables
enforcement = spark.table("gold.fact_enforcement_actions")
mergers = spark.table("gold.fact_merger_reviews")
# Criminal enforcement trends
criminal_trends = (
enforcement
.filter(col("enforcement_type") == "criminal")
.groupBy("fiscal_year", "violation_type")
.agg(
count("*").alias("case_count"),
sum("fine_amount").alias("total_fines"),
avg("sentence_months").alias("avg_sentence_months")
)
.orderBy("fiscal_year")
)
display(criminal_trends)
# Merger review intensity
merger_trends = (
mergers
.select(
"fiscal_year",
"total_hsr_transactions_reported",
"second_request_rate_pct",
"merger_challenges_filed"
)
.orderBy("fiscal_year")
)
display(merger_trends)
Step 9: Connecting to Azure Services¶
Azure Data Factory Integration¶
For production deployments, replace seed-based ingestion with ADF pipelines:
graph LR
subgraph ADF Pipeline
A[HTTP Source<br/>FJC/USSC] --> B[Copy Activity]
B --> C[ADLS Gen2<br/>Bronze Zone]
C --> D[Databricks<br/>dbt run]
end Pipeline configuration:
- Linked Service — HTTP connector to government data endpoints
- Dataset — CSV/JSON format definitions matching source schemas
- Copy Activity — Land raw files in
bronze/doj_antitrust/container path - Databricks Activity — Trigger
dbt run --select tag:dojafter ingestion
Microsoft Purview Catalog¶
Register gold-layer tables in Purview for discoverability and governance:
- Classification — Apply sensitivity labels (public data in this case)
- Lineage — Auto-captured from ADF and Databricks activities
- Glossary — Map business terms (e.g., "HSR Filing" →
fact_merger_reviews) - Access Policies — Control who can query enforcement data
Power BI Dashboards¶
Connect Power BI directly to gold-layer Delta tables via Databricks SQL endpoint or Synapse Serverless for executive dashboards showing:
- Enforcement action trends over time
- Fine amount distributions by violation type
- Merger review pipeline metrics
- Geographic distribution of cases
Summary¶
Building the DOJ Antitrust domain followed the standard CSA-in-a-Box pattern:
| Step | What | Why |
|---|---|---|
| 1. Discovery | Identify sources and formats | Know what you're working with |
| 2. Schema design | Bronze/Silver/Gold layers | Separation of concerns |
| 3. Seed data | CSV files from public sources | Version-controlled, reproducible |
| 4. dbt models | SQL transformations per layer | Testable, documented |
| 5. Data quality | Flag-don't-drop pattern | Preserve data, protect analytics |
| 6. Gold logic | Business rules and dimensions | Analytics-ready output |
| 7. Contracts | Schema + SLA definitions | Consumer confidence |
| 8. Notebooks | Exploratory analytics | Derive insights |
| 9. Azure services | ADF, Purview, Power BI | Production operations |
This pattern is repeatable for any new domain. Replace the data sources and business logic, but keep the architecture consistent.