Source: `examples/epa/README.md` — this page is rendered live from that file.
CIPSEA awareness
The data in this example may be subject to CIPSEA (the Confidential Information Protection and Statistical Efficiency Act, 44 U.S.C. §§ 3561–3583) when collected from respondents under a pledge of confidentiality for exclusively statistical purposes.
Knowing and willful disclosure of identifiable CIPSEA data is a Class E felony (§ 3572) attaching to individual officers, employees, or designated agents — including cloud-operator personnel where applicable.
The architecture below is starting-point reference guidance only. Validate the specific compliance posture for your workload with your designating statistical agency and Confidentiality Officer before production use:
- CIPSEA control mapping & narrative (DRAFT — under validation)
- CIPSEA operational playbook for Azure (DRAFT — under validation)
EPA Environmental Monitoring Analytics Platform
> [!TIP]
> TL;DR — Environmental monitoring platform with real-time AQI streaming from 4,000+ stations, water safety tracking for 25,000+ systems, toxic release analysis, and environmental justice scoring with ML-based air quality prediction.
📋 Table of Contents
- Overview
- Key Features
- Data Sources
- Architecture Overview
- Real-Time AQI Streaming Architecture
- Streaming Quick Start
- Sample KQL — Real-Time AQI Alerts
- Prerequisites
- Azure Resources
- Tools Required
- API Access
- Quick Start
- 1. Environment Setup
- 2. Configure API Keys
- 3. Generate Sample Data
- 4. Deploy Infrastructure
- 5. Run dbt Models
- Sample Analytics Scenarios
- 1. Air Quality Prediction with ML
- 2. Environmental Justice Analysis
- 3. Emissions Compliance Monitoring
- Data Products
- AQI Prediction
- Environmental Justice
- Emissions Compliance
- Configuration
- dbt Profiles
- Environment Variables
- Azure Government Notes
- Monitoring & Alerts
- Troubleshooting
- Common Issues
- Contributing
- License
- Acknowledgments
A comprehensive environmental monitoring analytics platform built on Azure Cloud Scale Analytics (CSA), providing insights into air quality, water safety, toxic releases, and environmental justice using official EPA data sources — including real-time AQI sensor streaming for near-real-time air quality dashboards.
📋 Overview
The Environmental Protection Agency monitors air quality at 4,000+ stations, tracks 25,000+ drinking water systems, catalogs toxic releases from 20,000+ facilities, and manages 1,300+ Superfund sites. This platform ingests, processes, and analyzes EPA data to enable air quality prediction, environmental justice analysis, and emissions compliance monitoring. The streaming pipeline demonstrates real-time AQI sensor data flowing through Azure Event Hub for sub-minute air quality alerting.
✨ Key Features
- Real-Time Air Quality Monitoring: Live AQI data via Event Hub with ML-based prediction
- Drinking Water Safety: SDWIS violation tracking with risk-based prioritization
- Toxic Release Tracking: TRI facility emissions with trend analysis and community impact
- Superfund Site Management: Cleanup progress tracking and community exposure assessment
- Environmental Justice Analysis: Overlay pollution data with socioeconomic indicators
- Regulatory Compliance Dashboards: Facility-level NESHAP, NPDES, and RCRA compliance
🗄️ Data Sources
| Source | Description | URL |
|---|---|---|
| AirNow API | Real-time and forecast AQI for 500+ metro areas | https://docs.airnowapi.org/ |
| AQS (Air Quality System) | Historical ambient air quality data from monitors | https://aqs.epa.gov/aqsweb/documents/data_api.html |
| SDWIS | Safe Drinking Water Information System — violations & compliance | https://www.epa.gov/enviro/sdwis-search |
| TRI (Toxics Release Inventory) | Chemical releases from industrial facilities | https://www.epa.gov/toxics-release-inventory-tri-program/tri-data-and-tools |
| ECHO | Enforcement and Compliance History Online | https://echo.epa.gov/tools/web-services |
| Envirofacts API | Multi-program environmental data gateway | https://www.epa.gov/enviro/envirofacts-data-service-api |
| EJScreen | Environmental justice screening and mapping | https://www.epa.gov/ejscreen |
| Superfund / CERCLIS | NPL site locations, contaminants, and cleanup status | https://www.epa.gov/superfund/superfund-data-and-reports |
🏗️ Architecture Overview
graph TD
subgraph "Data Sources"
A1[AirNow API<br/>Real-Time AQI]
A2[AQS API<br/>Historical Air Quality]
A3[SDWIS<br/>Water System Violations]
A4[TRI<br/>Toxic Releases]
A5[ECHO<br/>Compliance History]
A6[EJScreen<br/>Environmental Justice]
A7[Superfund<br/>NPL Sites]
end
subgraph "Streaming Pipeline"
SEN[AQI Sensor<br/>Network]
EH[Azure Event Hub<br/>aqi-sensor-events]
ADX[Azure Data Explorer<br/>Hot Analytics]
ALERT[Alert Engine<br/>AQI Threshold Triggers]
RT[Real-Time AQI<br/>Dashboard]
end
subgraph "Batch Ingestion"
I1[ADF Pipeline<br/>Scheduled Loads]
I2[REST Connectors<br/>API Polling]
end
subgraph "Bronze Layer — Raw"
B1[brz_aqi_observations]
B2[brz_water_violations]
B3[brz_toxic_releases]
B4[brz_compliance_records]
B5[brz_ejscreen_indicators]
B6[brz_superfund_sites]
end
subgraph "Silver Layer — Cleansed"
S1[slv_air_quality]
S2[slv_water_safety]
S3[slv_facility_emissions]
S4[slv_compliance_status]
S5[slv_ej_demographics]
end
subgraph "Gold Layer — Analytics"
G1[gld_aqi_prediction]
G2[gld_ej_analysis]
G3[gld_emissions_compliance]
G4[gld_water_risk_index]
G5[gld_environmental_dashboard]
end
subgraph "Consumption"
C1[AQI Dashboard]
C2[EJ Mapping Tool]
C3[Compliance Reports]
C4[Public Health APIs]
end
SEN --> EH
EH --> ADX
ADX --> ALERT
ADX --> RT
A1 --> I2
A2 --> I1
A3 --> I1
A4 --> I1
A5 --> I2
A6 --> I1
A7 --> I1
I1 --> B1
I1 --> B2
I1 --> B3
I2 --> B4
I1 --> B5
I1 --> B6
B1 --> S1
B2 --> S2
B3 --> S3
B4 --> S4
B5 --> S5
S1 --> G1
S1 --> G2
S3 --> G2
S5 --> G2
S3 --> G3
S4 --> G3
S2 --> G4
S1 --> G5
S2 --> G5
S3 --> G5
G1 --> C1
G2 --> C2
G3 --> C3
G5 --> C4

⚡ Real-Time AQI Streaming Architecture
This example includes a streaming pipeline for near-real-time air quality index monitoring:
graph LR
subgraph "Sensor Network"
S1[PM2.5 Monitor<br/>Site 060371103]
S2[Ozone Monitor<br/>Site 170314201]
S3[NO2 Monitor<br/>Site 360610079]
end
subgraph "Ingestion"
SIM[AQI Sensor Simulator<br/>Python Script]
EH[Azure Event Hub<br/>aqi-sensor-events]
end
subgraph "Hot Path — Real-Time"
ADX[Azure Data Explorer]
KQL[KQL Continuous Queries]
ALERT[Azure Monitor Alerts<br/>AQI > 150 Trigger]
DASH[Power BI Real-Time<br/>AQI Heatmap]
end
subgraph "Warm Path — Batch"
SA[Stream Analytics<br/>5-min Aggregations]
ADLS[ADLS Gen2<br/>Bronze Parquet]
end
S1 --> SIM
S2 --> SIM
S3 --> SIM
SIM --> EH
EH --> ADX
ADX --> KQL
KQL --> ALERT
KQL --> DASH
EH --> SA
SA --> ADLS

🚀 Streaming Quick Start
# Start the AQI sensor simulator
python streaming/aqi_sensor_simulator.py \
--event-hub-connection "$EVENTHUB_CONNECTION_STRING" \
--sites "060371103,170314201,360610079" \
--pollutants "PM2.5,O3,NO2,CO" \
--interval-seconds 60
# Deploy ADX table and ingestion mapping
az kusto script create \
--cluster-name epa-adx \
--database-name airquality \
--resource-group rg-epa-analytics \
--script-content @streaming/adx/create_tables.kql
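The real payload schema is defined by `streaming/aqi_sensor_simulator.py`. Purely as an illustration, a minimal event builder using the field names the sample KQL in the next section queries (`site_id`, `pollutant`, `aqi_value`, `event_time`) might look like this; the actual simulator may differ:

```python
import json
import random
from datetime import datetime, timezone

def make_aqi_event(site_id: str, pollutant: str) -> str:
    """Build one JSON-encoded sensor reading for Event Hub.

    Field names are assumptions taken from the sample KQL in this README,
    not guaranteed to match the shipped simulator.
    """
    event = {
        "site_id": site_id,
        "pollutant": pollutant,
        "aqi_value": random.randint(0, 500),  # AQI is bounded 0-500
        "event_time": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

if __name__ == "__main__":
    print(make_aqi_event("060371103", "PM2.5"))
```

Each JSON string would then be sent as one Event Hub message body, so the ADX JSON ingestion mapping can route fields to columns by name.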
⚡ Sample KQL — Real-Time AQI Alerts
// Locations exceeding "Unhealthy" AQI in the last 30 minutes
AqiSensorEvents
| where ingestion_time() > ago(30m)
| summarize avg_aqi = avg(aqi_value),
max_aqi = max(aqi_value),
readings = count()
by site_id, pollutant, bin(event_time, 5m)
| where avg_aqi > 150
| extend severity = case(
avg_aqi > 300, "Hazardous",
avg_aqi > 200, "Very Unhealthy",
avg_aqi > 150, "Unhealthy",
"Moderate")
| order by avg_aqi desc
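The severity buckets in the query follow EPA's piecewise-linear AQI scale. As a sketch of how a raw concentration maps onto that scale (the breakpoints below are the pre-2024 24-hour PM2.5 table; confirm current values against EPA's AQI technical assistance document before relying on them):

```python
# AQI = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo, per breakpoint row.
PM25_BREAKPOINTS = [
    # (C_lo, C_hi, I_lo, I_hi) — pre-2024 24-h PM2.5 table (assumption: verify)
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 500.4, 301, 500),
]

def pm25_to_aqi(conc: float) -> int:
    """Convert a truncated 24-h PM2.5 concentration (ug/m3) to an AQI value."""
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    raise ValueError(f"PM2.5 concentration out of AQI range: {conc}")
```

For example, 35.5 ug/m3 lands exactly on the lower edge of the 101-150 band, which is why the query's "Unhealthy" threshold of AQI > 150 corresponds to the next band up.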
📎 Prerequisites
Azure Resources
- Azure subscription with contributor access
- Azure Data Factory or Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure Data Explorer cluster (for AQI streaming)
- Azure Event Hub namespace (for AQI streaming)
- Azure Machine Learning workspace (optional, for AQI prediction)
- Azure Key Vault for API credentials
Tools Required
- Azure CLI (2.55.0 or later)
- dbt CLI (1.7.0 or later)
- Python 3.9+
- Git
API Access
- AirNow API key (free at https://docs.airnowapi.org/account/request/)
- EPA AQS API credentials (email/key at https://aqs.epa.gov/aqsweb/documents/data_api.html)
- ECHO API (no key required — open access)
- Envirofacts (no key required — open access)
🚀 Quick Start
1. Environment Setup
# Clone the repository
git clone <repository-url>
cd csa-inabox/examples/epa
# Install Python dependencies
pip install -r requirements.txt
# Install dbt packages
cd domains/dbt
dbt deps
2. Configure API Keys
# Add to Azure Key Vault or local environment
export AIRNOW_API_KEY="your-airnow-api-key"
export AQS_EMAIL="your-email@example.com"
export AQS_KEY="your-aqs-api-key"
export EVENTHUB_CONNECTION_STRING="your-eventhub-connection" # For streaming
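Before running the fetch scripts it can help to fail fast on a missing key. A minimal check (variable names taken from this README; this helper is not part of the repo) might be:

```python
import os

# Variables the fetch scripts in this README expect (assumed names).
REQUIRED = ["AIRNOW_API_KEY", "AQS_EMAIL", "AQS_KEY"]

def missing_vars(required, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars(REQUIRED)
    if missing:
        raise SystemExit(f"Set these before fetching data: {', '.join(missing)}")
```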
3. Generate Sample Data
# Generate synthetic environmental data
python data/generators/generate_epa_data.py --output-dir domains/dbt/seeds
# Or fetch real data from APIs
python data/open-data/fetch_airnow.py --bbox "-125,24,-66,50" --parameters "PM2.5,O3"
python data/open-data/fetch_tri.py --states "TX,LA,CA" --years "2020,2021,2022"
python data/open-data/fetch_sdwis.py --states "MI,OH,PA"
python data/open-data/fetch_echo.py --program "CAA" --state "CA"
4. Deploy Infrastructure
# Configure parameters
cp deploy/params.dev.json deploy/params.local.json
# Edit params.local.json with your values
# Deploy using Azure CLI
az deployment group create \
--resource-group rg-epa-analytics \
--template-file ../../deploy/bicep/DLZ/main.bicep \
--parameters @deploy/params.local.json
5. Run dbt Models
cd domains/dbt
# Test connections
dbt debug
# Load seed data
dbt seed
# Run models
dbt run
# Run tests
dbt test
# Generate documentation
dbt docs generate
dbt docs serve
💡 Sample Analytics Scenarios
1. Air Quality Prediction with ML
Use historical AQI data combined with meteorological features to predict next-day air quality for proactive health advisories.
-- AQI prediction performance by metro area
SELECT
metro_area,
pollutant,
prediction_date,
predicted_aqi,
actual_aqi,
prediction_error,
model_confidence,
contributing_factors,
health_advisory_level
FROM gold.gld_aqi_prediction
WHERE prediction_date >= CURRENT_DATE - INTERVAL '30 days'
AND pollutant = 'PM2.5'
ORDER BY metro_area, prediction_date;
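The `prediction_error` column above can be rolled up into familiar accuracy metrics offline. A sketch of mean absolute error over (predicted, actual) pairs, with made-up sample rows standing in for `gld_aqi_prediction` output:

```python
def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    if not pairs:
        raise ValueError("no predictions to score")
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

# Illustrative (predicted_aqi, actual_aqi) rows, not real model output.
rows = [(55, 60), (102, 95), (148, 152)]
print(f"MAE: {mae(rows):.1f}")  # → MAE: 5.3
```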
2. Environmental Justice Analysis
Overlay TRI toxic release data and AQI readings with Census socioeconomic indicators to identify disproportionately burdened communities.
-- Communities with highest environmental burden
SELECT
census_tract,
state,
county,
population,
pct_minority,
pct_low_income,
avg_aqi,
tri_facilities_within_3mi,
total_chemical_releases_lbs,
superfund_sites_within_5mi,
ej_burden_score,
ej_percentile
FROM gold.gld_ej_analysis
WHERE ej_percentile >= 90
ORDER BY ej_burden_score DESC
LIMIT 50;
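The actual `ej_burden_score` and `ej_percentile` logic lives in the dbt models. Purely as an illustration of the weighted-score-plus-percentile pattern the query above relies on, under assumed indicator names and weights:

```python
def burden_score(tract, weights):
    """Weighted sum of normalized (0-1) burden indicators."""
    return sum(tract[k] * w for k, w in weights.items())

def percentile_rank(value, population):
    """Share of tracts scoring at or below `value`, as 0-100."""
    at_or_below = sum(1 for v in population if v <= value)
    return 100.0 * at_or_below / len(population)

# Hypothetical indicators and weights, not the shipped model's definitions.
weights = {"aqi_norm": 0.4, "tri_norm": 0.35, "superfund_norm": 0.25}
tracts = [
    {"aqi_norm": 0.9, "tri_norm": 0.8, "superfund_norm": 1.0},
    {"aqi_norm": 0.2, "tri_norm": 0.1, "superfund_norm": 0.0},
    {"aqi_norm": 0.5, "tri_norm": 0.4, "superfund_norm": 0.3},
]
scores = [burden_score(t, weights) for t in tracts]
# Tracts at or above the 90th percentile match the WHERE clause above.
top = [s for s in scores if percentile_rank(s, scores) >= 90]
```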
3. Emissions Compliance Monitoring
Track facility-level compliance with Clean Air Act (CAA), Clean Water Act (CWA), and RCRA regulations, identifying repeat violators and enforcement gaps.
-- Facilities with ongoing violations
SELECT
facility_id,
facility_name,
city,
state,
program_area,
violation_type,
violation_start_date,
days_in_violation,
penalties_assessed_usd,
penalties_collected_usd,
inspection_count_3yr,
compliance_status
FROM gold.gld_emissions_compliance
WHERE compliance_status = 'SIGNIFICANT_VIOLATION'
AND days_in_violation > 180
ORDER BY days_in_violation DESC;
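`days_in_violation` is, at heart, a date difference. A sketch of how such a column might be derived (an assumption about the gold model's logic, with open violations running to the current date):

```python
from datetime import date
from typing import Optional

def days_in_violation(start: date, resolved: Optional[date] = None) -> int:
    """Days a violation has been open; unresolved violations run to today."""
    end = resolved or date.today()
    return (end - start).days
```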
✨ Data Products
AQI Prediction (`aqi-prediction`)
- Description: Next-day AQI forecasts with confidence intervals by metro area
- Freshness: Daily model runs, real-time streaming for current conditions
- Coverage: 500+ metro areas, 6 criteria pollutants
- API: `/api/v1/aqi-prediction`
Environmental Justice (`ej-analysis`)
- Description: Census tract-level environmental burden scoring with demographic overlays
- Freshness: Quarterly updates (aligned with EJScreen releases)
- Coverage: All U.S. Census tracts (~74,000)
- API: `/api/v1/ej-analysis`
Emissions Compliance (`emissions-compliance`)
- Description: Facility-level regulatory compliance status across CAA, CWA, RCRA
- Freshness: Monthly updates from ECHO
- Coverage: 800,000+ regulated facilities
- API: `/api/v1/emissions-compliance`
⚙️ Configuration
⚙️ dbt Profiles
Add to your `~/.dbt/profiles.yml`:
epa_analytics:
target: dev
outputs:
dev:
type: databricks
host: "{{ env_var('DBT_HOST') }}"
http_path: "{{ env_var('DBT_HTTP_PATH') }}"
token: "{{ env_var('DBT_TOKEN') }}"
schema: epa_dev
catalog: dev
prod:
type: databricks
host: "{{ env_var('DBT_HOST_PROD') }}"
http_path: "{{ env_var('DBT_HTTP_PATH_PROD') }}"
token: "{{ env_var('DBT_TOKEN_PROD') }}"
schema: epa
catalog: prod
⚙️ Environment Variables
# Required for data fetching
AIRNOW_API_KEY=your-airnow-api-key
AQS_EMAIL=your-email@example.com
AQS_KEY=your-aqs-api-key
EVENTHUB_CONNECTION_STRING=your-eventhub-connection
# Required for dbt
DBT_HOST=your-databricks-host
DBT_HTTP_PATH=your-sql-warehouse-path
DBT_TOKEN=your-access-token
# Optional
EPA_LOG_LEVEL=INFO
EPA_BATCH_SIZE=5000
ADX_CLUSTER_URI=https://epa-adx.region.kusto.windows.net
🔒 Azure Government Notes
This example is compatible with Azure Government (US) regions. When deploying to Azure Government:
- Use `usgovvirginia` or `usgovarizona` as your Azure region
- Update ARM/Bicep endpoint references to `.usgovcloudapi.net`
- ADX and Event Hub are available in Azure Government
- AirNow and AQS APIs are publicly accessible from government networks
- ECHO data may contain enforcement-sensitive details — confirm classification with your ISSO
- EJScreen data is public and unrestricted
📊 Monitoring & Alerts
- Streaming Health: Event Hub throughput, ADX ingestion latency, alert trigger rates
- AQI Thresholds: Automated alerts when AQI exceeds 100 (Unhealthy for Sensitive Groups)
- Data Freshness: Alerts when AirNow feeds or TRI annual submissions are overdue
- Data Quality: Automated tests on pollutant ranges, geographic bounds, and completeness
- Cost Management: ADX auto-scale monitoring with spend guardrails
🔧 Troubleshooting
🔧 Common Issues
- AirNow API Rate Limits: Limited to 500 requests/hour per key. Use the `--delay` parameter and cache responses.
- AQS Data Lag: Quality-assured AQS data lags 6–12 months behind real-time AirNow data. Use AirNow for current conditions, AQS for historical analysis.
- TRI Reporting Year Lag: TRI data is reported annually with an 18-month delay. The most recent year available is typically two years prior.
- ECHO Pagination: The ECHO API returns at most 10,000 records per query. Use `--state-filter` and `--program-filter` to partition requests.
- Sensor Simulator Memory: For large-scale simulations (100+ sites), increase the `--batch-size` parameter to reduce Event Hub calls.
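For the AirNow limit, a simple client-side pacer that spaces calls at least 3600 / 500 = 7.2 seconds apart keeps a fetch loop under the quota. A sketch (not part of the repo; the injectable `clock`/`sleep` hooks exist only to make it testable):

```python
import time

class RequestPacer:
    """Space outgoing requests evenly to stay under an hourly quota."""

    def __init__(self, max_per_hour: int = 500,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = 3600.0 / max_per_hour  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self) -> None:
        """Block until the next request is allowed, then record the send time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A fetch loop would call `pacer.wait()` immediately before each HTTP request.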
🔗 Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-data-source`)
- Make changes and add tests
- Run quality checks (`make lint test`)
- Submit a pull request
🔗 License
This project is licensed under the MIT License. See LICENSE file for details.
🔗 Acknowledgments
- EPA for comprehensive environmental monitoring data and open APIs
- AirNow for real-time air quality data infrastructure
- Azure Cloud Scale Analytics team for the foundational platform
- Contributors and the open-source community
🔗 Related Documentation
- EPA Architecture — Detailed platform architecture and design decisions
- Examples Index — Overview of all CSA-in-a-Box example verticals
- Platform Architecture — Core CSA platform architecture
- Getting Started Guide — Platform setup and onboarding
- NOAA Climate Analytics — Related environmental/climate vertical
- USDA Agricultural Analytics — Related agriculture/environment vertical
Prerequisites / Cost / Teardown
> [!IMPORTANT]
> Cost-safety: this vertical deploys real Azure resources. Always run `teardown.sh` when you are done. A forgotten workshop environment can run $150-250/day.
Prerequisites
- Azure CLI 2.50+ logged in (`az login`), subscription selected (`az account set --subscription <id>`)
- `jq` installed (used by teardown enumeration)
- Bicep CLI 0.25+ (`az bicep version`)
- Contributor + User Access Administrator on target subscription (or a pre-created RG with equivalent RBAC)
- `bash scripts/deploy/validate-prerequisites.sh` passes
Cost estimate (rough, East US 2)
- While running: ~$150-250/day (services: Synapse, Databricks, Event Hub, Stream Analytics, ADX, Storage, Key Vault)
- Idle overnight: roughly half if you stop compute (Databricks autostop + Synapse pause)
- Storage + Key Vault residual: <$5/month if you skip teardown
Numbers are indicative for a small demo dataset; production workloads vary significantly. Use `az consumption usage list` or Cost Management for live numbers.
Runtime
- Deploy: ~35-50 minutes (first run; cold Bicep)
- Teardown: ~10-15 minutes (async RG delete completes in the background)
Teardown
When finished, run the per-example teardown script. It enforces a typed `DESTROY-epa` confirmation, logs every step to `reports/teardown/epa-<timestamp>.log`, and deletes the resource group `rg-epa-analytics` along with any matching subscription-scope deployments.
# Interactive (recommended)
bash examples/epa/deploy/teardown.sh
# Dry run (enumerate only)
bash examples/epa/deploy/teardown.sh --dry-run
# From the repo root via Makefile
make teardown-example VERTICAL=epa
make teardown-example VERTICAL=epa DRYRUN=1
# CI automation (no prompt — only for ephemeral environments)
bash examples/epa/deploy/teardown.sh --yes
See docs/QUICKSTART.md#teardown for the platform-wide teardown flow.
Directory Structure
epa/
├── contracts/ # Data product contracts (schemas, SLOs, owners)
│ ├── air-quality-analytics.yaml
│ ├── toxic-releases.yaml
│ └── water-systems.yaml
├── data/ # Sample data + synthetic generators
│ ├── generators/
│ └── open-data/
├── deploy/ # Deployment parameters / Bicep templates
│ ├── params.dev.json
│ ├── params.gov.json
│ └── teardown.sh
├── domains/ # dbt models (bronze / silver / gold) and seeds
│ └── dbt/
├── notebooks/ # Synapse / Fabric / Databricks notebooks
│ ├── air_quality_forecasting.py
│ └── environmental_justice_analysis.py
├── reports/ # Power BI report templates and pbix sources
├── ARCHITECTURE.md # Mermaid + prose architecture diagrams
└── README.md # This file
Expected Results
After running the medallion pipeline against the bundled seed data, the Gold layer should populate the following tables. Row counts vary with the seed-data generator parameters; the figures below are the approximate scale you should see on a default run.
| Gold Table | Approximate Rows | Notes |
|---|---|---|
| `gld_aqi_forecast` | TODO: capture after first run | Populated from Silver via `dbt --select tag:gold` |
| `gld_compliance_dashboard` | TODO: capture after first run | Populated from Silver via `dbt --select tag:gold` |
| `gld_environmental_justice` | TODO: capture after first run | Populated from Silver via `dbt --select tag:gold` |

TODO: capture exact counts after the next end-to-end seed run. These are bounded by the seed-data generator parameters in `data/generators/`.