⚖️ Tutorial 38: DOJ Justice Analytics¶
Third-party references — publicly sourced, good-faith comparison
This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.
⚖️ Tutorial 38: DOJ Justice Analytics¶
| Difficulty | ⭐⭐ Intermediate |
| Time | ⏱️ 90-120 minutes |
| Focus | Crime Statistics, Federal Sentencing, Antitrust Enforcement, Drug Seizure Analytics |
📊 Progress Tracker¶
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 00 │ 01 │ 02 │ 03 │ 04 │ 05 │ 06 │ 07 │ 08 │ 09 │ 10 │ 11 │ 12 │ 13 │
│SETUP │BRNZE │SILVR │ GOLD │ RT │ PBI │PIPES │ GOV │MIRRR │AI/ML │TDATA │ SAS │CICD │MIGR │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 14 │ 15 │ 16 │ 17 │ 18 │ 19 │ 20 │ 21 │ 22 │ 23 │ 24 │ 25 │ 26 │ 27 │ 28 │ 29 │ 30 │ 31 │
│ SEC │ COST │PERF │ MON │SHARE │COPLT │WKBST │ GEO │ NET │SHIR │ SNW │ DB2 │MULTI │VIDEO │ MOVE │GEOLC │TRIBL │ DOT │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 32 │ 33 │ 34 │ 35 │ 36 │ 37 │ 38 │
│ USDA │ SBA │ NOAA │ EPA │ DOI │GRAPH │ DOJ │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ ✅ │ 🔵 │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘
▲
YOU ARE HERE
| Navigation | |
|---|---|
| ⬅️ Previous | 37-Graph Analytics |
| ➡️ Next | — (Final Tutorial) |
📖 Overview¶
The Department of Justice (DOJ) is the largest law enforcement agency in the world, encompassing the FBI, DEA, Bureau of Prisons, US Attorneys, and the Antitrust Division. This tutorial builds a complete analytics pipeline for four DOJ data domains using Microsoft Fabric's medallion architecture.
What You'll Build¶
graph LR
subgraph Sources["📥 Data Sources"]
FBI["FBI Crime Data Explorer<br/>cde.ucr.cjis.gov"]
USSC["US Sentencing Commission<br/>ussc.gov"]
ATR["DOJ Antitrust Division<br/>justice.gov/atr"]
DEA["Drug Enforcement Admin<br/>dea.gov"]
end
subgraph Bronze["🥉 Bronze Layer"]
B1["bronze_doj_crime_stats"]
B2["bronze_doj_federal_cases"]
B3["bronze_doj_antitrust"]
B4["bronze_doj_drug_enforcement"]
end
subgraph Silver["🥈 Silver Layer"]
S1["silver_doj_crime_stats<br/>NIBRS validated"]
S2["silver_doj_federal_cases<br/>District validated"]
S3["silver_doj_antitrust<br/>HHI classified"]
S4["silver_doj_drug_enforcement<br/>Schedule validated"]
end
subgraph Gold["🥇 Gold Layer"]
G1["gold_doj_crime_trends"]
G2["gold_doj_sentencing_analytics"]
G3["gold_doj_antitrust_metrics"]
G4["gold_doj_drug_enforcement"]
end
subgraph BI["📊 Power BI"]
PBI["DOJ Justice Analytics<br/>4-Page Dashboard"]
end
FBI --> B1 --> S1 --> G1 --> PBI
USSC --> B2 --> S2 --> G2 --> PBI
ATR --> B3 --> S3 --> G3 --> PBI
DEA --> B4 --> S4 --> G4 --> PBI Four DOJ Domains¶
| Domain | Source Agency | Key Data | Records |
|---|---|---|---|
| Crime Statistics | FBI (UCR/NIBRS) | Incidents, offenses, victims, clearances by 18,000+ agencies | ~700K/year |
| Federal Cases | US Sentencing Commission | Sentencing data across 94 districts, 13 circuits | ~70K/year |
| Antitrust | DOJ Antitrust Division | Merger filings, HSR reviews, cartel prosecutions | ~2K/year |
| Drug Enforcement | DEA | Drug seizures by type, schedule, region, quantity | ~50K/year |
📋 Prerequisites¶
Before starting this tutorial, ensure you have:
- Completed Tutorial 00: Environment Setup
- Completed Tutorial 01-03: Bronze/Silver/Gold Layers
- A Microsoft Fabric workspace with a Lakehouse
- Python 3.10+ with
numpy,faker,pandasinstalled - (Optional) FBI API key from api.usa.gov for live data
🔧 Step 1: Generate Synthetic DOJ Data¶
The DOJ data generator produces realistic records across all four domains.
1.1 Install Dependencies¶
1.2 Generate Records¶
from generators.federal.doj_generator import DOJGenerator
gen = DOJGenerator(seed=42)
# Generate 10,000 records per domain
crime_records = gen.generate_batch(count=10000, domain="crime_stats")
cases_records = gen.generate_batch(count=10000, domain="federal_cases")
antitrust_records = gen.generate_batch(count=5000, domain="antitrust")
drug_records = gen.generate_batch(count=10000, domain="drug_enforcement")
print(f"Crime stats: {len(crime_records)} records")
print(f"Federal cases: {len(cases_records)} records")
print(f"Antitrust: {len(antitrust_records)} records")
print(f"Drug enforcement: {len(drug_records)} records")
1.3 Explore a Sample Record¶
# Crime statistics record
record = gen.generate_record(domain="crime_stats")
print(f"Incident: {record['incident_id']}")
print(f"Agency: {record['agency_name']} ({record['ori_code']})")
print(f"State: {record['state_code']}")
print(f"Offense: {record['offense_description']} ({record['offense_code']})")
print(f"Category: {record['offense_category']}")
print(f"Victims: {record['victim_count']}")
print(f"Clearance: {record['clearance_status']}")
1.4 Verify with Unit Tests¶
Checkpoint: You should see 29 tests passing across crime_stats, federal_cases, antitrust, and drug_enforcement domains.
📥 Step 2: Download Real Open Data¶
The DOJ download module provides access to real, publicly available federal datasets.
2.1 FBI Crime Data Explorer¶
# Download FBI NIBRS data (requires free API key from api.usa.gov)
python -m data_generation.open_data.doj_download \
--dataset fbi_crime \
--output-dir ./data/doj \
--year 2023
2.2 All DOJ Datasets¶
# Download all available datasets
python -m data_generation.open_data.doj_download \
--dataset all \
--output-dir ./data/doj
Available Datasets¶
| Dataset | Command | Source | Auth Required |
|---|---|---|---|
| FBI Crime Data | --dataset fbi_crime | FBI CDE REST API | Free API key |
| Federal Sentencing | --dataset sentencing | USSC bulk files | No |
| Antitrust Cases | --dataset antitrust | data.gov CSV | No |
| HSR Filings | --dataset hsr | data.gov CSV | No |
| DEA Seizures | --dataset dea | DEA statistics | No |
Tip: If APIs are unavailable, the download module falls back to representative synthetic data matching published statistical distributions.
🥉 Step 3: Bronze Layer Ingestion¶
Import and run the Bronze notebook in your Fabric workspace.
3.1 Import Notebook¶
- Navigate to your Fabric Lakehouse
- Click Import notebook
- Select
notebooks/bronze/18_bronze_doj.py
3.2 Bronze Schema Overview¶
The Bronze layer ingests raw data with minimal transformation and adds metadata columns.
Crime Statistics Schema:
incident_id STRING -- UUID primary key
ori_code STRING -- Agency ORI identifier
agency_name STRING -- Reporting agency name
agency_type STRING -- City/County/State/Federal/Tribal
state_code STRING -- 2-letter state abbreviation
incident_date DATE -- Date of incident
offense_code STRING -- NIBRS offense code (30 codes)
offense_category STRING -- Persons/Property/Society
offense_description STRING -- Human-readable offense name
victim_count INTEGER -- Number of victims (0 for Society)
offender_count INTEGER -- Number of known offenders
arrest_made BOOLEAN -- Whether arrest was made
weapon_involved STRING -- None/Firearm/Knife/Other
location_type STRING -- 15 location categories
clearance_status STRING -- Cleared/Not Cleared/Exceptionally Cleared
population_group STRING -- City population size bucket
reporting_year INTEGER -- Reporting year
_bronze_ingested_at TIMESTAMP -- Ingestion timestamp
_bronze_source_file STRING -- Source file path
_bronze_batch_id STRING -- Batch identifier
_bronze_load_date DATE -- Partition key
3.3 Run the Notebook¶
Execute all cells. The notebook creates four Delta tables:
| Table | Partition | Records |
|---|---|---|
bronze_doj_crime_stats | _bronze_load_date | 10,000 |
bronze_doj_federal_cases | _bronze_load_date | 10,000 |
bronze_doj_antitrust | _bronze_load_date | 5,000 |
bronze_doj_drug_enforcement | _bronze_load_date | 10,000 |
Checkpoint: Verify all four tables appear in your Lakehouse explorer with the expected row counts.
🥈 Step 4: Silver Layer Transformations¶
The Silver notebook cleanses, validates, and enriches the Bronze data.
4.1 Import and Run¶
Import notebooks/silver/18_silver_doj.py into your Fabric workspace.
4.2 Key Transformations¶
NIBRS Offense Code Validation:
# Validate against 30 known NIBRS offense codes
VALID_NIBRS_CODES = {"09A", "09B", "09C", "100", "11A", "11B", "11C", "11D",
"120", "13A", "13B", "13C", "200", "210", "220", "23A",
"23B", "23C", "23D", "23E", "23F", "23G", "23H", "240",
"250", "26A", "26B", "270", "290", "35A", "35B"}
df = df.withColumn("nibrs_valid",
F.when(F.col("offense_code").isin(VALID_NIBRS_CODES), True)
.otherwise(False))
HHI Market Concentration Classification (2023 Merger Guidelines):
# Apply DOJ/FTC 2023 Merger Guidelines thresholds
df = df.withColumn("concentration_level",
F.when(F.col("hhi_post_merger") > 2500, "Highly Concentrated")
.when(F.col("hhi_post_merger") >= 1500, "Moderately Concentrated")
.otherwise("Unconcentrated"))
# Flag presumptively anticompetitive mergers
df = df.withColumn("merger_presumption",
F.when(
(F.col("hhi_post_merger") > 2500) & (F.col("hhi_delta") > 200), True
).when(
(F.col("hhi_post_merger") >= 1500) & (F.col("hhi_delta") > 100), True
).otherwise(False))
Federal District Validation:
# Validate against 94 federal judicial districts + DC
FEDERAL_DISTRICTS = ["SDNY", "EDNY", "NDCA", "CDCA", "NDIL", ...] # 95 total
df = df.withColumn("district_valid",
F.col("district_court").isin(FEDERAL_DISTRICTS))
Data Quality Scoring:
# DQ score (0-100) based on field completeness and validity
df = df.withColumn("_dq_score",
(F.when(F.col("nibrs_valid"), 25).otherwise(0) +
F.when(F.col("state_code").rlike("^[A-Z]{2}$"), 25).otherwise(0) +
F.when(F.col("victim_count") >= 0, 25).otherwise(0) +
F.when(F.col("clearance_status").isNotNull(), 25).otherwise(0)))
4.3 Silver Output Tables¶
| Table | Key Enrichments |
|---|---|
silver_doj_crime_stats | NIBRS validation, arrest_rate, crime_severity |
silver_doj_federal_cases | District validation, guideline compliance, departure analysis |
silver_doj_antitrust | HHI classification, merger_presumption flag |
silver_doj_drug_enforcement | Schedule validation, street_value_per_kg, outlier flags |
Checkpoint: Query
silver_doj_antitrustand verify thatmerger_presumption = trueonly for records wherehhi_post_merger > 2500 AND hhi_delta > 200.
🥇 Step 5: Gold Layer Analytics¶
The Gold notebook produces four aggregation tables optimized for Power BI Direct Lake.
5.1 Import and Run¶
Import notebooks/gold/19_gold_doj_analytics.py into your Fabric workspace.
5.2 Gold Table: Crime Trends¶
-- gold_doj_crime_trends: Crime patterns by state, category, year
SELECT state_code, offense_category, reporting_year,
COUNT(*) as incident_count,
SUM(victim_count) as total_victims,
AVG(CASE WHEN clearance_status = 'Cleared by Arrest' THEN 1.0 ELSE 0.0 END) as clearance_rate,
AVG(CASE WHEN arrest_made THEN 1.0 ELSE 0.0 END) as arrest_rate,
AVG(CASE WHEN weapon_involved = 'Firearm' THEN 1.0 ELSE 0.0 END) as firearm_rate
FROM silver_doj_crime_stats
GROUP BY state_code, offense_category, reporting_year
5.3 Gold Table: Sentencing Analytics¶
-- gold_doj_sentencing_analytics: Sentencing patterns by district
SELECT district_court, offense_category, fiscal_year,
COUNT(*) as case_count,
AVG(sentence_months) as avg_sentence_months,
PERCENTILE_APPROX(sentence_months, 0.5) as median_sentence_months,
AVG(CASE WHEN departure_type != 'None' THEN 1.0 ELSE 0.0 END) as departure_rate,
AVG(CASE WHEN plea_type = 'Guilty Plea' THEN 1.0 ELSE 0.0 END) as plea_rate,
AVG(fine_amount) as avg_fine
FROM silver_doj_federal_cases
GROUP BY district_court, offense_category, fiscal_year
5.4 Gold Table: Antitrust Metrics¶
-- gold_doj_antitrust_metrics: Market concentration and enforcement
SELECT industry_sector, industry_name, fiscal_year,
COUNT(CASE WHEN case_type = 'Merger Review' THEN 1 END) as merger_count,
AVG(CASE WHEN doj_action IN ('Challenged', 'Blocked') THEN 1.0 ELSE 0.0 END) as challenge_rate,
AVG(hhi_delta) as avg_hhi_delta,
SUM(penalty_amount_usd) as total_penalties,
COUNT(CASE WHEN cartel_type != 'None' THEN 1 END) as cartel_count
FROM silver_doj_antitrust
GROUP BY industry_sector, industry_name, fiscal_year
5.5 Gold Table: Drug Enforcement¶
-- gold_doj_drug_enforcement: DEA seizure analytics
SELECT region as dea_region, drug_type, fiscal_year,
COUNT(*) as seizure_count,
SUM(quantity_kg) as total_quantity_kg,
SUM(estimated_street_value_usd) as total_street_value,
AVG(arrests_count) as avg_arrests
FROM silver_doj_drug_enforcement
GROUP BY region, drug_type, fiscal_year
5.6 ZORDER Optimization¶
All Gold tables are ZORDER-optimized for Direct Lake query patterns:
OPTIMIZE gold_doj_crime_trends ZORDER BY (state_code, offense_category)
OPTIMIZE gold_doj_sentencing_analytics ZORDER BY (district_court, offense_category)
OPTIMIZE gold_doj_antitrust_metrics ZORDER BY (industry_sector, fiscal_year)
OPTIMIZE gold_doj_drug_enforcement ZORDER BY (dea_region, drug_type)
Checkpoint: Query each Gold table and verify aggregations produce expected metrics (clearance rates ~30-60%, plea rates ~90%+, positive seizure quantities).
🔍 Step 6: Data Quality Validation¶
6.1 Great Expectations Suites¶
Four GE suites validate data quality across all DOJ domains:
| Suite | Key Validations |
|---|---|
doj_crime_stats_suite.json | Unique incident_id, state_code regex ^[A-Z]{2}$, offense_category in set, victim_count >= 0 |
doj_federal_cases_suite.json | Unique case_id, sentence_months >= 0, defendant_age 18-75, departure_type in set |
doj_antitrust_suite.json | Unique case_id, industry_sector ^\d{2}$, HHI 0-10000, transaction_value > 0 |
doj_drug_enforcement_suite.json | Unique seizure_id, drug_type in 7 types, quantity_kg > 0, quarter 1-4 |
6.2 Running Validations¶
import great_expectations as gx
context = gx.get_context()
suite = context.get_expectation_suite("doj_crime_stats_suite")
results = context.run_checkpoint(checkpoint_name="doj_checkpoint")
print(f"Passed: {results.success}")
📊 Step 7: Power BI Direct Lake Dashboard¶
7.1 Create Semantic Model¶
- In your Fabric workspace, navigate to the Gold Lakehouse
- Click New semantic model → Name it
sm_doj_justice - Select all four
gold_doj_*tables - Set mode to Direct Lake
7.2 Key DAX Measures¶
// Crime clearance rate
Clearance Rate =
AVERAGE(gold_doj_crime_trends[clearance_rate])
// Total drug seizure street value
Total Seizure Value =
SUM(gold_doj_drug_enforcement[total_street_value])
// Antitrust challenge rate
Challenge Rate =
AVERAGE(gold_doj_antitrust_metrics[challenge_rate])
// Average sentence length
Avg Sentence =
AVERAGE(gold_doj_sentencing_analytics[avg_sentence_months])
// HHI concentration alert
High Concentration Markets =
CALCULATE(
COUNTROWS(gold_doj_antitrust_metrics),
gold_doj_antitrust_metrics[avg_hhi_delta] > 200
)
7.3 Report Pages¶
The Power BI report template (docs/powerbi/doj_justice_report.json) defines four dashboard pages:
- Crime Overview — State map, offense category bars, clearance rate cards, firearm involvement trends
- Sentencing Dashboard — District comparison bars, departure rate analysis, plea rate by offense
- Antitrust Analysis — HHI scatter plot with 2500 threshold lines, challenge rate heatmap, penalty waterfall
- Drug Enforcement — Seizure volume by drug type, street value by DEA region, quarterly trends
🔬 Step 8: Crime Analytics Deep Dive¶
Clearance Rate Analysis¶
Clearance rates vary dramatically by offense category and geography:
# Top 10 states by violent crime clearance rate
crime_trends = spark.table("lh_gold.gold_doj_crime_trends")
crime_trends.filter(F.col("offense_category") == "Persons") \
.groupBy("state_code") \
.agg(F.avg("clearance_rate").alias("avg_clearance")) \
.orderBy(F.desc("avg_clearance")) \
.show(10)
Offense Trend Analysis¶
# Year-over-year change in property crimes
crime_trends.filter(F.col("offense_category") == "Property") \
.groupBy("reporting_year") \
.agg(
F.sum("incident_count").alias("total_incidents"),
F.avg("yoy_change").alias("avg_yoy_pct")
) \
.orderBy("reporting_year") \
.show()
⚖️ Step 9: Antitrust Analytics Deep Dive¶
HHI Market Concentration¶
The Herfindahl-Hirschman Index (HHI) measures market concentration. The 2023 DOJ/FTC Merger Guidelines establish these thresholds:
| HHI Range | Classification | Enforcement Posture |
|---|---|---|
| < 1,500 | Unconcentrated | Generally no concern |
| 1,500 - 2,500 | Moderately Concentrated | Scrutiny if delta > 100 |
| > 2,500 | Highly Concentrated | Presumptive concern if delta > 200 |
# Markets most at risk of DOJ challenge
antitrust = spark.table("lh_gold.gold_doj_antitrust_metrics")
antitrust.filter(F.col("avg_hhi_delta") > 200) \
.select("industry_name", "fiscal_year", "merger_count",
"challenge_rate", "avg_hhi_delta", "total_penalties") \
.orderBy(F.desc("avg_hhi_delta")) \
.show(10)
Merger Review Pipeline¶
HSR Filing → Initial Review (30 days) → Second Request → Extended Review → Decision
│
┌───────────────────────────────┤
▼ ▼ ▼
Approved Consent Decree Challenged
│
┌─────────┤
▼ ▼
Blocked Abandoned
Challenge Rate by Industry¶
# Industries with highest DOJ challenge rates
antitrust.filter(F.col("merger_count") >= 5) \
.select("industry_name", "merger_count", "challenge_rate", "avg_hhi_delta") \
.orderBy(F.desc("challenge_rate")) \
.show(15)
💊 Step 10: Drug Enforcement Deep Dive¶
Seizure Analysis by Drug Type¶
drug = spark.table("lh_gold.gold_doj_drug_enforcement")
# Total seizures and street value by drug type
drug.groupBy("drug_type") \
.agg(
F.sum("total_quantity_kg").alias("total_kg"),
F.sum("total_street_value").alias("total_value"),
F.sum("seizure_count").alias("total_seizures")
) \
.orderBy(F.desc("total_value")) \
.show()
Regional Enforcement Patterns¶
# Top DEA regions by seizure volume
drug.groupBy("dea_region") \
.agg(
F.sum("total_quantity_kg").alias("total_kg"),
F.sum("total_street_value").alias("total_value"),
F.avg("avg_arrests").alias("avg_arrests_per_seizure")
) \
.orderBy(F.desc("total_kg")) \
.show(10)
🔗 Step 11: Cross-Domain Analysis¶
DOJ x SBA: Market Consolidation Impact on Small Business Lending¶
# Join antitrust metrics with SBA lending by industry (NAICS)
antitrust = spark.table("lh_gold.gold_doj_antitrust_metrics")
sba = spark.table("lh_gold.gold_sba_economic_impact")
cross = antitrust.join(sba,
antitrust.industry_sector == sba.naics_code,
"inner")
# Do highly concentrated markets see less small business lending?
cross.groupBy("concentration_level") \
.agg(
F.avg("total_loan_amount").alias("avg_sba_lending"),
F.avg("avg_hhi_delta").alias("avg_concentration_change")
) \
.show()
DOJ x EPA: Environmental Enforcement Correlation¶
# States with high environmental violations and DOJ prosecution rates
crime = spark.table("lh_gold.gold_doj_crime_trends")
epa = spark.table("lh_gold.gold_epa_compliance_trends")
env_crime = crime.filter(F.col("offense_category") == "Society") \
.join(epa, crime.state_code == epa.state, "inner") \
.select("state_code", "incident_count", "violation_count", "clearance_rate")
❓ Step 12: Troubleshooting¶
| Issue | Cause | Solution |
|---|---|---|
| FBI API returns 403 | Missing/invalid API key | Sign up at api.usa.gov |
weighted_choice error | Weight arrays don't sum to 1.0 | Regenerate with seed=42 to use normalized weights |
| HHI values null | Non-merger case type | HHI fields only populated for case_type = "Merger Review" |
| Bronze table empty | Source path mismatch | Check SOURCE_PATHS in notebook match your data location |
| Silver DQ score = 0 | Invalid reference data | Verify NIBRS codes and district names in generator |
| Gold ZORDER fails | Small table | ZORDER requires sufficient data volume; skip for < 1000 rows |
📚 Step 13: Published References¶
Federal Open Data Sources¶
| Resource | URL | Description |
|---|---|---|
| FBI Crime Data Explorer | https://cde.ucr.cjis.gov | NIBRS crime statistics, agency data, arrest records |
| USSC Research & Data | https://www.ussc.gov/research | Federal sentencing data files, annual reports |
| DOJ Open Data Portal | https://www.justice.gov/open/open-data | Cross-division datasets |
| DOJ Antitrust Case Filings | https://www.justice.gov/atr/antitrust-case-filings | Merger reviews, cartel prosecutions |
| 2023 Merger Guidelines | https://www.justice.gov/atr/2023-merger-guidelines | Current HHI thresholds and framework |
| DEA Data & Statistics | https://www.dea.gov/data-and-statistics | Drug seizure data, scheduling |
| BOP Statistics | https://www.bop.gov/about/statistics/ | Federal prison population data |
| BJS (Bureau of Justice Statistics) | https://bjs.ojp.gov | NCVS, prosecution, courts data |
| Vera Institute | https://github.com/vera-institute/incarceration-trends | County-level incarceration data |
| ICPSR NACJD | https://www.icpsr.umich.edu/web/pages/NACJD/index.html | 3,683 criminal justice studies |
Related Use Cases¶
- Antitrust Analytics — Deep dive into HHI concentration analysis and merger review
- Federal Justice Analytics — Crime, sentencing, and prosecution pipeline
- References — Complete curated resource catalog
✅ Step 14: Summary¶
In this tutorial you built a complete DOJ Justice Analytics pipeline:
| What You Built | Details |
|---|---|
| Data Generator | DOJGenerator with 4 domains, 29 unit tests |
| Open Data Download | 5 real federal data sources (FBI, USSC, Antitrust, HSR, DEA) |
| Bronze Layer | 4 raw ingestion tables with schema enforcement |
| Silver Layer | NIBRS validation, HHI classification, DQ scoring, dedup |
| Gold Layer | 4 aggregation tables with ZORDER optimization |
| Data Quality | 4 Great Expectations suites |
| Power BI | 4-page Direct Lake dashboard |
| Cross-Domain | DOJ x SBA, DOJ x EPA analytical joins |
Key Concepts Learned¶
- NIBRS offense codes — National Incident-Based Reporting System classification
- HHI concentration — 2023 DOJ/FTC Merger Guidelines thresholds
- Federal sentencing guidelines — Departure types, criminal history categories
- DEA drug scheduling — Controlled Substances Act schedule classification
- Cross-domain analytics — Joining federal datasets on shared dimensions (state, NAICS)
Files Created/Used¶
| File | Purpose |
|---|---|
data_generation/generators/federal/doj_generator.py | DOJ data generator |
data_generation/open_data/doj_download.py | Real data downloader |
notebooks/bronze/18_bronze_doj.py | Bronze ingestion |
notebooks/silver/18_silver_doj.py | Silver transformations |
notebooks/gold/19_gold_doj_analytics.py | Gold analytics |
validation/great_expectations/expectations/doj_*.json | GE suites (4) |
docs/powerbi/doj_justice_report.json | Power BI template |
🎉 Congratulations! You've completed Tutorial 38 — the final tutorial in the Supercharge Microsoft Fabric series. You now have a complete DOJ Justice Analytics pipeline ready for production use.
| Navigation | |
|---|---|
| ⬅️ Previous | 37-Graph Analytics |
| 🏠 Home | README |
| 📖 All Tutorials | Tutorial Index |