Skip to content

Source: examples/dot/README.md — this page is rendered live from that file.

DOT Transportation Analytics Platform

Examples > DOT

[!TIP] TL;DR — Transportation analytics covering highway safety (FARS crash data), infrastructure condition (620K+ bridges from NBI), and transit performance (900+ agencies from NTD). Includes geospatial hotspot analysis and bridge priority scoring.


📋 Table of Contents

A comprehensive transportation analytics platform built on Azure Cloud Scale Analytics (CSA), providing insights into highway safety, infrastructure conditions, transit performance, and aviation operations using official Department of Transportation data sources.


📋 Overview

The U.S. Department of Transportation oversees one of the most data-rich domains in the federal government — from 6.4 million reported crashes annually to real-time transit feeds from 900+ agencies. This platform ingests, processes, and analyzes data from NHTSA, FHWA, FAA, and FTA to enable evidence-based transportation policy, safety interventions, and infrastructure investment decisions.

✨ Key Features

  • Highway Safety Analytics: FARS fatality analysis with geospatial hotspot detection
  • Infrastructure Health Scoring: Bridge and pavement condition prioritization using NBI data
  • Transit Performance Monitoring: NTD ridership trends with on-time performance analysis
  • Aviation Operations: FAA airport capacity and delay pattern analysis
  • Predictive Models: Crash risk scoring and infrastructure deterioration forecasting
  • Interactive Dashboards: Multi-modal transportation KPIs with drill-down by state and corridor

🗄️ Data Sources

Source Agency Description URL
FARS NHTSA Fatality Analysis Reporting System — all fatal motor vehicle crashes https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars
NBI FHWA National Bridge Inventory — condition ratings for 620K+ bridges https://www.fhwa.dot.gov/bridge/nbi.cfm
HPMS FHWA Highway Performance Monitoring System — road condition and traffic https://www.fhwa.dot.gov/policyinformation/hpms.cfm
NTD FTA National Transit Database — ridership, financials, safety for all transit https://www.transit.dot.gov/ntd
ATADS FAA Air Traffic Activity Data System — airport operations counts https://aspm.faa.gov/opsnet/sys/Airport.asp
BTS OST Bureau of Transportation Statistics — cross-modal data https://www.bts.gov/browse-statistical-products-and-data

🏗️ Architecture Overview

graph TD
    subgraph "Data Sources"
        A1[NHTSA FARS API<br/>Fatal Crash Records]
        A2[FHWA NBI<br/>Bridge Inventory]
        A3[FTA NTD<br/>Transit Ridership]
        A4[FAA ATADS<br/>Airport Operations]
        A5[BTS APIs<br/>Cross-Modal Stats]
        A6[data.transportation.gov<br/>Open Data Portal]
    end

    subgraph "Ingestion Layer"
        I1[ADF Pipeline<br/>Batch Ingestion]
        I2[REST Connectors<br/>API Polling]
    end

    subgraph "Bronze Layer — Raw"
        B1[brz_fars_accidents]
        B2[brz_fars_vehicles]
        B3[brz_fars_persons]
        B4[brz_nbi_bridges]
        B5[brz_ntd_ridership]
        B6[brz_faa_operations]
    end

    subgraph "Silver Layer — Cleansed"
        S1[slv_crash_events]
        S2[slv_bridge_conditions]
        S3[slv_transit_performance]
        S4[slv_airport_operations]
    end

    subgraph "Gold Layer — Analytics"
        G1[gld_crash_hotspots]
        G2[gld_bridge_priority_index]
        G3[gld_transit_ridership_trends]
        G4[gld_airport_delay_patterns]
        G5[gld_transportation_dashboard]
    end

    subgraph "Consumption"
        C1[Safety Dashboard]
        C2[Infrastructure Reports]
        C3[Transit Scorecards]
        C4[Policy Analysis APIs]
    end

    A1 --> I1
    A2 --> I1
    A3 --> I2
    A4 --> I2
    A5 --> I2
    A6 --> I1

    I1 --> B1
    I1 --> B2
    I1 --> B3
    I1 --> B4
    I2 --> B5
    I2 --> B6

    B1 --> S1
    B2 --> S1
    B3 --> S1
    B4 --> S2
    B5 --> S3
    B6 --> S4

    S1 --> G1
    S2 --> G2
    S3 --> G3
    S4 --> G4
    S1 --> G5
    S2 --> G5
    S3 --> G5
    S4 --> G5

    G1 --> C1
    G2 --> C2
    G3 --> C3
    G5 --> C4

📎 Prerequisites

Azure Resources

  • Azure subscription with contributor access
  • Azure Data Factory or Synapse Analytics
  • Azure Data Lake Storage Gen2
  • Azure SQL Database or Synapse SQL Pool
  • Azure Key Vault for API credentials

Tools Required

  • Azure CLI (2.55.0 or later)
  • dbt CLI (1.7.0 or later)
  • Python 3.9+
  • Git

API Access

  • NHTSA FARS API (no key required — open access at https://crashviewer.nhtsa.dot.gov/CrashAPI)
  • data.transportation.gov Socrata API (app token recommended: https://data.transportation.gov)
  • BTS API key (free registration at https://www.bts.gov/developer)

🚀 Quick Start

1. Environment Setup

# Clone the repository
git clone <repository-url>
cd csa-inabox/examples/dot

# Install Python dependencies
pip install -r requirements.txt

# Install dbt packages
cd domains/dbt
dbt deps

2. Configure API Keys

# Add to Azure Key Vault or local environment
export BTS_API_KEY="your-bts-api-key"
export SOCRATA_APP_TOKEN="your-socrata-app-token"  # Optional, raises rate limits

3. Generate Sample Data

# Generate synthetic data (fallback if APIs unavailable)
python data/generators/generate_dot_data.py --output-dir domains/dbt/seeds

# Or fetch real data from APIs
python data/open-data/fetch_fars.py --years "2020,2021,2022" --states "TX,CA,FL"
python data/open-data/fetch_nbi.py --states "TX,CA,FL"
python data/open-data/fetch_ntd.py --report-year 2022

4. Deploy Infrastructure

# Configure parameters
cp deploy/params.dev.json deploy/params.local.json
# Edit params.local.json with your values

# Deploy using Azure CLI
az deployment group create \
  --resource-group rg-dot-analytics \
  --template-file ../../deploy/bicep/DLZ/main.bicep \
  --parameters @deploy/params.local.json

5. Run dbt Models

cd domains/dbt

# Test connections
dbt debug

# Load seed data
dbt seed

# Run models
dbt run

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve

💡 Sample Analytics Scenarios

1. Highway Safety Hotspot Analysis

Identify road segments with statistically significant crash clustering using FARS data, enabling targeted safety interventions.

-- Top crash corridors by severity-weighted score
SELECT
    state_name,
    county_name,
    route_id,
    total_fatalities,
    total_crashes,
    severity_weighted_score,
    drunk_driving_pct,
    pedestrian_involved_pct,
    speed_related_pct,
    LAG(severity_weighted_score) OVER (
        PARTITION BY state_name ORDER BY severity_weighted_score DESC
    ) AS prev_corridor_score
FROM gold.gld_crash_hotspots
WHERE year = 2022
    AND total_crashes >= 10
ORDER BY severity_weighted_score DESC
LIMIT 25;

2. Infrastructure Maintenance Prioritization

Score and rank 620,000+ bridges using NBI condition ratings, traffic volume, and deterioration velocity to optimize limited maintenance budgets.

-- Bridges requiring urgent attention
SELECT
    structure_number,
    facility_carried,
    state_name,
    county_name,
    year_built,
    deck_condition,
    superstructure_condition,
    substructure_condition,
    average_daily_traffic,
    sufficiency_rating,
    priority_index,
    estimated_repair_cost_usd
FROM gold.gld_bridge_priority_index
WHERE priority_index >= 80
    AND sufficiency_rating < 50
ORDER BY priority_index DESC, average_daily_traffic DESC
LIMIT 50;

3. Transit Ridership Trend Analysis

Track ridership recovery patterns post-pandemic across transit modes and metropolitan areas using NTD data.

-- Ridership recovery by mode and metro area
SELECT
    metropolitan_area,
    transit_mode,
    ridership_2019_baseline,
    ridership_current,
    recovery_pct,
    revenue_per_trip,
    cost_per_trip,
    farebox_recovery_ratio
FROM gold.gld_transit_ridership_trends
WHERE report_year = 2023
ORDER BY recovery_pct DESC;

✨ Data Products

Crash Hotspots (crash-hotspots)

  • Description: Geospatially clustered fatal crash data with severity scoring
  • Freshness: Annual (FARS final release) with monthly preliminary data
  • Coverage: 2001–present, all 50 states + DC + territories
  • API: /api/v1/crash-hotspots

Bridge Priority Index (bridge-priority)

  • Description: NBI bridge condition data with computed maintenance priority scores
  • Freshness: Annual updates (NBI submission cycle)
  • Coverage: All federally inspected bridges (~620,000 structures)
  • API: /api/v1/bridge-priority
  • Description: NTD ridership, financials, and performance metrics by agency and mode
  • Freshness: Monthly (preliminary) / Annual (final)
  • Coverage: 900+ transit agencies, all modes
  • API: /api/v1/transit-ridership

⚙️ Configuration

⚙️ dbt Profiles

Add to your ~/.dbt/profiles.yml:

dot_analytics:
  target: dev
  outputs:
    dev:
      type: databricks
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "{{ env_var('DBT_HTTP_PATH') }}"
      token: "{{ env_var('DBT_TOKEN') }}"
      schema: dot_dev
      catalog: dev
    prod:
      type: databricks
      host: "{{ env_var('DBT_HOST_PROD') }}"
      http_path: "{{ env_var('DBT_HTTP_PATH_PROD') }}"
      token: "{{ env_var('DBT_TOKEN_PROD') }}"
      schema: dot
      catalog: prod

⚙️ Environment Variables

# Required for data fetching
BTS_API_KEY=your-bts-api-key
SOCRATA_APP_TOKEN=your-socrata-app-token

# Required for dbt
DBT_HOST=your-databricks-host
DBT_HTTP_PATH=your-sql-warehouse-path
DBT_TOKEN=your-access-token

# Optional
DOT_LOG_LEVEL=INFO
DOT_BATCH_SIZE=5000

🔒 Azure Government Notes

This example is compatible with Azure Government (US) regions. When deploying to Azure Government:

  • Use usgovvirginia or usgovarizona as your Azure region
  • Update ARM/Bicep endpoint references to .usgovcloudapi.net
  • FARS and NBI data are publicly accessible from both commercial and government networks
  • Ensure data residency requirements are met for any PII in crash records (person-level FARS data)

📊 Monitoring & Alerts

  • Data Freshness: Alerts when FARS annual release or NTD monthly data is overdue
  • Data Quality: Automated dbt tests on all models with notification hooks
  • Pipeline Health: ADF pipeline run monitoring with failure alerts
  • Cost Management: Daily compute spend tracking with anomaly detection

🔧 Troubleshooting

🔧 Common Issues

  1. FARS API Pagination: The CrashAPI returns max 5,000 records per call. Use the --paginate flag in fetch scripts.
  2. NBI File Format Changes: FHWA occasionally updates field layouts. Check data/schemas/nbi_layout.json for the current mapping.
  3. NTD Data Lag: Final NTD data lags 12–18 months. Use the monthly module for preliminary figures.
  4. Large HPMS Downloads: HPMS shapefiles can exceed 10 GB. Use the --state-filter parameter.

🔗 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-data-source)
  3. Make changes and add tests
  4. Run quality checks (make lint test)
  5. Submit a pull request

🔗 License

This project is licensed under the MIT License. See LICENSE file for details.


🔗 Acknowledgments

  • NHTSA, FHWA, FTA, and FAA for maintaining comprehensive public transportation data
  • Azure Cloud Scale Analytics team for the foundational platform
  • Contributors and the open-source community


Prerequisites / Cost / Teardown

[!IMPORTANT] Cost-safety: this vertical deploys real Azure resources. Always run teardown.sh when you are done. A forgotten workshop environment can run $120-200/day.

Prerequisites

  • Azure CLI 2.50+ logged in (az login), subscription selected (az account set --subscription <id>)
  • jq installed (used by teardown enumeration)
  • Bicep CLI 0.25+ (az bicep version)
  • Contributor + User Access Administrator on target subscription (or a pre-created RG with equivalent RBAC)
  • bash scripts/deploy/validate-prerequisites.sh passes

Cost estimate (rough, East US 2)

  • While running: ~$$120-200/day (services: Synapse, Databricks, ADF, ADX, Storage, Key Vault)
  • Idle overnight: roughly half if you stop compute (Databricks autostop + Synapse pause)
  • Storage + Key Vault residual: <$5/month if you skip teardown

Numbers are indicative for a small demo dataset; production workloads vary significantly. Use az consumption usage list or Cost Management for live numbers.

Runtime

  • Deploy: ~30-45 minutes (first run; cold Bicep)
  • Teardown: ~10-15 minutes (async RG delete completes in the background)

Teardown

When finished, run the per-example teardown script. It enforces a typed DESTROY-dot confirmation, logs every step to reports/teardown/dot-<timestamp>.log, and deletes the resource group rg-dot-analytics along with any matching subscription-scope deployments.

# Interactive (recommended)
bash examples/dot/deploy/teardown.sh

# Dry run (enumerate only)
bash examples/dot/deploy/teardown.sh --dry-run

# From the repo root via Makefile
make teardown-example VERTICAL=dot
make teardown-example VERTICAL=dot DRYRUN=1

# CI automation (no prompt — only for ephemeral environments)
bash examples/dot/deploy/teardown.sh --yes

See docs/QUICKSTART.md#teardown for the platform-wide teardown flow.

Directory Structure

dot/
├── contracts/                # Data product contracts (schemas, SLOs, owners)
│   ├── crash-analytics.yaml
│   ├── highway-conditions.yaml
│   └── transit-performance.yaml
├── data/                     # Sample data + synthetic generators
│   ├── generators/
│   └── open-data/
├── deploy/                   # Deployment parameters / Bicep templates
│   ├── params.dev.json
│   ├── params.gov.json
│   └── teardown.sh
├── domains/                  # dbt models (bronze / silver / gold) and seeds
│   └── dbt/
├── notebooks/                # Synapse / Fabric / Databricks notebooks
│   ├── crash_severity_analysis.py
│   └── highway_safety_prediction.py
├── reports/                  # Power BI report templates and pbix sources
├── ARCHITECTURE.md           # Mermaid + prose architecture diagrams
└── README.md                 # This file

Expected Results

After running the medallion pipeline against the bundled seed data, the Gold layer should populate the following tables. Row counts vary with the seed-data generator parameters; the figures below are the approximate scale you should see on a default run.

Gold Table Approximate Rows Notes
gld_infrastructure_priority TODO: capture after first run Populated from Silver via dbt --select tag:gold
gld_safety_hotspots TODO: capture after first run Populated from Silver via dbt --select tag:gold
gld_transit_dashboard TODO: capture after first run Populated from Silver via dbt --select tag:gold

TODO: capture exact counts after the next end-to-end seed run. These are bounded by the seed-data generator parameters in data/generators/.