## CIPSEA awareness
The data in this example may be subject to CIPSEA (the Confidential Information Protection and Statistical Efficiency Act, 44 U.S.C. §§ 3561–3583) when collected from respondents under a pledge of confidentiality for exclusively statistical purposes.
Knowing and willful disclosure of identifiable CIPSEA data is a Class E felony (§ 3572) attaching to individual officers, employees, or designated agents — including cloud-operator personnel where applicable.
The architecture below is starting-point reference guidance only. Validate the specific compliance posture for your workload with your designating statistical agency and Confidentiality Officer before production use:
- CIPSEA control mapping & narrative (DRAFT — under validation)
- CIPSEA operational playbook for Azure (DRAFT — under validation)
# USDA Agricultural Analytics Platform
> [!TIP]
> **TL;DR** — Agricultural analytics platform providing crop yield forecasting, SNAP enrollment analysis, and food safety risk scoring using official USDA data sources with NASS QuickStats API integration.
## 📋 Table of Contents

- Overview
- Key Features
- Data Sources
- Architecture Overview
- Prerequisites
  - Azure Resources
  - Tools Required
  - API Access
- Quick Start
  - 1. Environment Setup
  - 2. Configure API Keys
  - 3. Generate Sample Data
  - 4. Deploy Infrastructure
  - 5. Run dbt Models
- Sample Analytics Scenarios
  - 1. Crop Yield Forecasting
  - 2. SNAP Enrollment Analysis
  - 3. Food Safety Risk Assessment
  - 4. Agricultural Dashboard Summary
- Data Products
  - Crop Yields
  - SNAP Enrollment
  - Food Safety Risk
- Configuration
  - dbt Profiles
  - Environment Variables
- Monitoring & Alerts
- Development
  - Adding New Data Sources
  - Testing
- Troubleshooting
  - Common Issues
  - Logs
- Contributing
- License
- Support
- Acknowledgments
A comprehensive agricultural analytics platform built on Azure Cloud Scale Analytics (CSA), providing insights into crop yields, food safety, nutrition assistance programs, and agricultural economic indicators using official USDA data sources.
## 📋 Overview
This platform ingests, processes, and analyzes data from multiple USDA agencies to provide actionable insights for agricultural decision-making, policy analysis, and research. The platform follows the medallion architecture (Bronze → Silver → Gold) and implements modern data engineering best practices.
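The Bronze → Silver step of the medallion flow can be sketched as a simple typing-and-normalization pass. The raw field names below (`commodity_desc`, `state_alpha`, `Value`) follow NASS conventions, but the record shape and the `SilverCropYield` type are illustrative, not the repo's actual models:

```python
from dataclasses import dataclass

# Hypothetical bronze record: raw NASS payload fields arrive as strings.
bronze_record = {
    "year": "2022",
    "commodity_desc": "CORN",
    "state_alpha": "IA",
    "Value": "203.0",  # NASS returns the measured statistic as text in "Value"
}

@dataclass
class SilverCropYield:
    """Typed, conformed record for the Silver layer (illustrative schema)."""
    year: int
    commodity: str
    state_code: str
    yield_per_acre: float

def to_silver(raw: dict) -> SilverCropYield:
    # Cast types, normalize casing, and strip whitespace on the way to Silver.
    return SilverCropYield(
        year=int(raw["year"]),
        commodity=raw["commodity_desc"].strip().upper(),
        state_code=raw["state_alpha"].strip().upper(),
        yield_per_acre=float(raw["Value"].replace(",", "")),
    )

print(to_silver(bronze_record))
```

In the actual pipeline this casting lives in the dbt Silver models; the sketch only shows the kind of cleanup that happens between the layers.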
## ✨ Key Features
- Real-time Agricultural Data: Automated ingestion from USDA NASS, FNS, FSIS, and FoodData Central
- Crop Yield Analytics: Historical trends, forecasting, and regional comparisons
- Food Safety Monitoring: FSIS inspection tracking with risk scoring
- SNAP Program Analytics: Enrollment trends and demographic analysis
- Interactive Dashboards: Executive dashboards with KPIs and drill-down capabilities
- API-First Architecture: RESTful APIs for all data products
## 🗄️ Data Sources
- NASS (National Agricultural Statistics Service): Crop yields, planted acres, production data
- FNS (Food and Nutrition Service): SNAP enrollment and benefits data
- FSIS (Food Safety and Inspection Service): Meat, poultry, and egg inspection records
- FoodData Central: Nutritional information and food composition data
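As an illustration of how a NASS query is assembled, here is a minimal URL builder. The parameter names (`commodity_desc`, `state_alpha`, `statisticcat_desc`, `year__GE`, `format`) come from the public QuickStats API; the key is a placeholder:

```python
from urllib.parse import urlencode

NASS_BASE = "https://quickstats.nass.usda.gov/api/api_GET/"

def build_quickstats_url(api_key: str, commodity: str, state: str, year_ge: int) -> str:
    """Build a NASS QuickStats GET URL for yield statistics."""
    params = {
        "key": api_key,                 # your registered API key
        "commodity_desc": commodity,    # e.g. "CORN"
        "state_alpha": state,           # e.g. "IA"
        "statisticcat_desc": "YIELD",   # restrict to yield statistics
        "year__GE": str(year_ge),       # all years >= this one
        "format": "JSON",
    }
    return NASS_BASE + "?" + urlencode(params)

url = build_quickstats_url("demo-key", "CORN", "IA", 2020)
print(url)
```

The repo's `data/open-data/fetch_nass.py` script wraps this kind of request; the sketch just shows the query shape.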
## 🏗️ Architecture Overview

```mermaid
graph TD
    A[USDA APIs] --> B[Bronze Layer]
    B --> C[Silver Layer]
    C --> D[Gold Layer]
    D --> E[Analytics & Dashboards]

    subgraph "Data Sources"
        A1[NASS QuickStats API]
        A2[FNS SNAP Data]
        A3[FSIS Inspection Data]
        A4[FoodData Central API]
    end

    subgraph "Bronze Layer"
        B1[brz_crop_yields]
        B2[brz_snap_enrollment]
        B3[brz_food_inspections]
    end

    subgraph "Silver Layer"
        C1[slv_crop_yields]
        C2[slv_snap_enrollment]
        C3[slv_food_inspections]
    end

    subgraph "Gold Layer"
        D1[gld_crop_yield_forecast]
        D2[gld_snap_trends]
        D3[gld_food_safety_risk_score]
        D4[gld_agricultural_dashboard]
    end

    subgraph "Consumption"
        E1[Executive Dashboard]
        E2[Policy Analysis Reports]
        E3[Research Analytics]
        E4[Public APIs]
    end

    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B1
    B1 --> C1
    B2 --> C2
    B3 --> C3
    C1 --> D1
    C2 --> D2
    C3 --> D3
    C1 --> D4
    C2 --> D4
    C3 --> D4
    D1 --> E1
    D2 --> E1
    D3 --> E1
    D4 --> E1
    D1 --> E2
    D2 --> E2
    D3 --> E2
    D1 --> E3
    D2 --> E3
    D3 --> E3
```

## 📎 Prerequisites
### Azure Resources
- Azure subscription with contributor access
- Azure Data Factory or Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure SQL Database or Synapse SQL Pool
- Azure Key Vault for API credentials
### Tools Required
- Azure CLI (2.55.0 or later)
- dbt CLI (1.7.0 or later)
- Python 3.9+
- Git
### API Access
- USDA NASS QuickStats API key (free registration at https://quickstats.nass.usda.gov/api)
- Data.gov API key (optional, for enhanced rate limits)
## 🚀 Quick Start

### 1. Environment Setup

```bash
# Clone the repository
git clone <repository-url>
cd csa-inabox/examples/usda

# Install Python dependencies
pip install -r requirements.txt

# Install dbt packages
cd domains/dbt
dbt deps
```
### 2. Configure API Keys

```bash
# Add to Azure Key Vault or local environment
export NASS_API_KEY="your-nass-api-key"
export DATAGOV_API_KEY="your-datagov-api-key"  # Optional
```
### 3. Generate Sample Data

```bash
# Generate sample data (fallback if APIs unavailable)
python data/generators/generate_usda_data.py --output-dir domains/dbt/seeds

# Or fetch real data from APIs
python data/open-data/fetch_nass.py --api-key $NASS_API_KEY --states "IA,IL,IN" --years "2020,2021,2022"
```
### 4. Deploy Infrastructure

```bash
# Configure parameters
cp deploy/params.dev.json deploy/params.local.json
# Edit params.local.json with your values

# Deploy using Azure CLI
az deployment group create \
  --resource-group rg-usda-analytics \
  --template-file ../../deploy/bicep/DLZ/main.bicep \
  --parameters @deploy/params.local.json
```
### 5. Run dbt Models

```bash
cd domains/dbt

# Test connections
dbt debug

# Load seed data
dbt seed

# Run models
dbt run

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve
```
## 💡 Sample Analytics Scenarios

### 1. Crop Yield Forecasting

```sql
-- Get corn yield trend for Iowa
SELECT
    year,
    commodity,
    state_code,
    yield_per_acre,
    yield_3yr_avg,
    yield_pct_change
FROM gold.gld_crop_yield_forecast
WHERE commodity = 'CORN'
  AND state_code = 'IA'
  AND year >= 2018
ORDER BY year DESC;
```
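The `yield_3yr_avg` and `yield_pct_change` columns queried above can be reproduced outside SQL. A minimal sketch in plain Python, using made-up yield figures (the real values come from the Gold model):

```python
def rolling_3yr_avg(yields):
    """Trailing 3-year average; None until a full window is available."""
    out = []
    for i in range(len(yields)):
        if i < 2:
            out.append(None)
        else:
            out.append(round(sum(yields[i - 2 : i + 1]) / 3, 2))
    return out

def pct_change(yields):
    """Year-over-year percent change, mirroring yield_pct_change."""
    return [None] + [
        round((curr - prev) / prev * 100, 2)
        for prev, curr in zip(yields, yields[1:])
    ]

corn_ia = [198.0, 192.0, 204.5, 201.0]  # illustrative bu/acre, four seasons
print(rolling_3yr_avg(corn_ia))
print(pct_change(corn_ia))
```

In the warehouse the same logic is a window function (`AVG(...) OVER (ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` and a `LAG`-based percent change).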
### 2. SNAP Enrollment Analysis

```sql
-- States with highest SNAP enrollment growth
SELECT
    state_code,
    latest_enrollment,
    enrollment_change_1yr,
    enrollment_pct_change_1yr
FROM gold.gld_snap_trends
WHERE year = 2023
ORDER BY enrollment_pct_change_1yr DESC
LIMIT 10;
```
### 3. Food Safety Risk Assessment

```sql
-- High-risk food establishments
SELECT
    establishment_id,
    establishment_name,
    risk_score,
    inspection_frequency,
    violation_rate,
    last_inspection_date
FROM gold.gld_food_safety_risk_score
WHERE risk_score >= 75
ORDER BY risk_score DESC;
```
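The risk score itself is computed in `gld_food_safety_risk_score`. To give a feel for the 0-100 scale used in the query above, here is one plausible shape for such a score; the 70/30 weighting, the inputs, and the `risk_score` helper are assumptions for illustration, not the model's actual formula:

```python
def risk_score(violation_rate, days_since_inspection, max_days=365):
    """
    Hypothetical 0-100 risk score: a weighted blend of the violation rate
    (0.0-1.0) and inspection recency. 0 days since inspection contributes
    no recency risk; anything beyond max_days saturates.
    """
    recency = min(days_since_inspection / max_days, 1.0)
    score = 70 * violation_rate + 30 * recency
    return round(min(score, 100.0), 1)

# An establishment with 60% violations, last inspected ~6 months ago:
print(risk_score(0.60, 180))
```

Under this sketch, the `WHERE risk_score >= 75` filter would surface establishments with both a high violation rate and a long-overdue inspection.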
### 4. Agricultural Dashboard Summary

```sql
-- Executive summary metrics for today's snapshot
SELECT
    report_date,
    total_crop_production_value,
    snap_enrollment_total,
    food_safety_incidents,
    agricultural_employment
FROM gold.gld_agricultural_dashboard
WHERE report_date = CURRENT_DATE;
```
## ✨ Data Products

### Crop Yields (`crop-yields`)

- Description: Historical and forecasted crop yield data by commodity, state, and county
- Freshness: Daily updates
- Coverage: 2000-present, all major commodities
- API: `/api/v1/crop-yields`
### SNAP Enrollment (`snap-enrollment`)

- Description: Supplemental Nutrition Assistance Program enrollment and benefits data
- Freshness: Monthly updates
- Coverage: 2010-present, state and county level
- API: `/api/v1/snap-enrollment`
### Food Safety Risk (`food-safety-risk`)

- Description: FSIS inspection data with computed risk scores by establishment
- Freshness: Weekly updates
- Coverage: 2015-present, all inspected facilities
- API: `/api/v1/food-safety-risk`
## ⚙️ Configuration

### ⚙️ dbt Profiles

Add to your `~/.dbt/profiles.yml`:

```yaml
usda_analytics:
  target: dev
  outputs:
    dev:
      type: databricks
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "{{ env_var('DBT_HTTP_PATH') }}"
      token: "{{ env_var('DBT_TOKEN') }}"
      schema: usda_dev
      catalog: dev
    prod:
      type: databricks
      host: "{{ env_var('DBT_HOST_PROD') }}"
      http_path: "{{ env_var('DBT_HTTP_PATH_PROD') }}"
      token: "{{ env_var('DBT_TOKEN_PROD') }}"
      schema: usda
      catalog: prod
```
### ⚙️ Environment Variables

```bash
# Required for data fetching
NASS_API_KEY=your-nass-api-key
DATAGOV_API_KEY=your-datagov-api-key

# Required for dbt
DBT_HOST=your-databricks-host
DBT_HTTP_PATH=your-sql-warehouse-path
DBT_TOKEN=your-access-token

# Optional
USDA_LOG_LEVEL=INFO
USDA_BATCH_SIZE=1000
```
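A small loader can enforce the required/optional split above, failing fast when a required setting is absent and applying the documented defaults otherwise. `load_config` and its return shape are an illustrative sketch, not an existing helper in this repo:

```python
import os

def load_config():
    """Read the settings above from the environment; raise on missing
    required values, default the optional ones as documented."""
    required = ["NASS_API_KEY", "DBT_HOST", "DBT_HTTP_PATH", "DBT_TOKEN"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing required settings: {', '.join(missing)}")
    return {
        "nass_api_key": os.environ["NASS_API_KEY"],
        "datagov_api_key": os.environ.get("DATAGOV_API_KEY"),  # optional, may be None
        "log_level": os.environ.get("USDA_LOG_LEVEL", "INFO"),
        "batch_size": int(os.environ.get("USDA_BATCH_SIZE", "1000")),
    }
```

In Azure deployments the same values would typically come from Key Vault references rather than a local shell environment.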
## 📊 Monitoring & Alerts
The platform includes built-in monitoring for:
- Data Freshness: Alerts when data sources haven't updated within SLA
- Data Quality: Automated tests on all models with Slack notifications
- API Performance: Response time and error rate monitoring
- Cost Management: Daily spend alerts and optimization recommendations
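The data-freshness check can be sketched as a simple SLA comparison. The `SLA_HOURS` values mirror the update cadences listed under Data Products (daily, weekly, monthly), and `is_stale` is a hypothetical helper, not the deployed monitor:

```python
from datetime import datetime, timedelta, timezone

# Freshness SLAs per data product, derived from the documented update cadences.
SLA_HOURS = {
    "crop-yields": 24,            # daily updates
    "food-safety-risk": 7 * 24,   # weekly updates
    "snap-enrollment": 31 * 24,   # monthly updates
}

def is_stale(product, last_updated, now=None):
    """True when a data product has not refreshed within its SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > timedelta(hours=SLA_HOURS[product])

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
# Crop yields last updated 48h ago, against a 24h SLA:
print(is_stale("crop-yields", datetime(2024, 1, 8, tzinfo=timezone.utc), now))
```

A real implementation would read `last_updated` from table metadata or dbt source freshness results and route the boolean into the Slack alerting path.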
## 🚀 Development

### 🗄️ Adding New Data Sources

- Create Bronze model in `domains/dbt/models/bronze/`
- Add data quality tests in `schema.yml`
- Create corresponding Silver model with transformations
- Add to Gold aggregations as needed
- Update data contracts in `contracts/`
### 🧪 Testing

```bash
# Unit tests
pytest data/tests/

# dbt tests
dbt test

# Integration tests
pytest data/tests/integration/

# Load tests
python data/tests/load_test.py
```
## 🔧 Troubleshooting

### 🔧 Common Issues

- API Rate Limits: NASS API has rate limits. Use the `--delay` parameter in data generation scripts.
- Authentication Errors: Ensure API keys are properly configured in environment variables or Key Vault.
- dbt Connection Issues: Verify Databricks credentials and network connectivity.
- Large Data Volumes: Use incremental models and partitioning for historical data loads.
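For the rate-limit issue above, exponential backoff between retries is the standard remedy. `fetch_with_backoff` is a generic sketch (the repo's scripts expose a `--delay` flag that simply sleeps between requests instead):

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry a fetch callable with exponential backoff (1s, 2s, 4s, ...),
    re-raising after the final attempt."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError:  # stand-in for a real rate-limit error type
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated endpoint that rejects the first two calls:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

result = fetch_with_backoff(flaky, base_delay=0.01)
print(result)
```

In production you would catch the HTTP client's specific rate-limit exception and cap the total retry budget rather than retrying indefinitely.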
### 📊 Logs

- Application logs: `logs/usda-analytics.log`
- dbt logs: `domains/dbt/logs/dbt.log`
- Data pipeline logs: Azure Data Factory monitoring
## 🔗 Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-data-source`)
- Make changes and add tests
- Run quality checks (`make lint test`)
- Submit a pull request
## 🔗 License
This project is licensed under the MIT License. See LICENSE file for details.
## 🔗 Support

- Documentation: https://csa-inabox.docs.microsoft.com/usda
- Issues: Use GitHub Issues for bug reports and feature requests
- Security: Report security issues to security@contoso.com
- Community: Join our Slack channel `#csa-usda-analytics`
## 🔗 Acknowledgments
- USDA for providing comprehensive public agricultural data APIs
- Azure Cloud Scale Analytics team for the foundational platform
- Contributors and the open-source community
## 🔗 Related Documentation
- USDA Architecture — Detailed platform architecture and design decisions
- USDA Implementation Summary — Implementation details and file structure
- Examples Index — Overview of all CSA-in-a-Box example verticals
- Platform Architecture — Core CSA platform architecture
- Getting Started Guide — Platform setup and onboarding
- EPA Environmental Analytics — Related agriculture/environment vertical
- NOAA Climate Analytics — Related environmental data vertical
## Prerequisites / Cost / Teardown

> [!IMPORTANT]
> **Cost-safety**: this vertical deploys real Azure resources. Always run `teardown.sh` when you are done. A forgotten workshop environment can run $180-300/day.
### Prerequisites

- Azure CLI 2.50+ logged in (`az login`), subscription selected (`az account set --subscription <id>`)
- `jq` installed (used by teardown enumeration)
- Bicep CLI 0.25+ (`az bicep version`)
- Contributor + User Access Administrator on target subscription (or a pre-created RG with equivalent RBAC)
- `bash scripts/deploy/validate-prerequisites.sh` passes
### Cost estimate (rough, East US 2)

- While running: ~$180-300/day (services: Synapse, Databricks, ML, ADX, ADF, Cosmos DB, Storage, Key Vault)
- Idle overnight: roughly half if you stop compute (Databricks autostop + Synapse pause)
- Storage + Key Vault residual: <$5/month if you skip teardown

Numbers are indicative for a small demo dataset; production workloads vary significantly. Use `az consumption usage list` or Cost Management for live numbers.
### Runtime
- Deploy: ~45-60 minutes (first run; cold Bicep)
- Teardown: ~15-20 minutes (async RG delete completes in the background)
### Teardown

When finished, run the per-example teardown script. It enforces a typed `DESTROY-usda` confirmation, logs every step to `reports/teardown/usda-<timestamp>.log`, and deletes the resource group `rg-usda-analytics` along with any matching subscription-scope deployments.

```bash
# Interactive (recommended)
bash examples/usda/deploy/teardown.sh

# Dry run (enumerate only)
bash examples/usda/deploy/teardown.sh --dry-run

# From the repo root via Makefile
make teardown-example VERTICAL=usda
make teardown-example VERTICAL=usda DRYRUN=1

# CI automation (no prompt — only for ephemeral environments)
bash examples/usda/deploy/teardown.sh --yes
```

See `docs/QUICKSTART.md#teardown` for the platform-wide teardown flow.
## Directory Structure

```text
usda/
├── contracts/                  # Data product contracts (schemas, SLOs, owners)
│   ├── crop-yields.yaml
│   ├── food-safety-risk.yaml
│   └── snap-enrollment.yaml
├── data/                       # Sample data + synthetic generators
│   ├── generators/
│   └── open-data/
├── deploy/                     # Deployment parameters / Bicep templates
│   ├── params.dev.json
│   ├── params.gov.json
│   └── teardown.sh
├── domains/                    # dbt models (bronze / silver / gold) and seeds
│   └── dbt/
├── notebooks/                  # Synapse / Fabric / Databricks notebooks
│   ├── crop_yield_analysis.py
│   └── snap_demographics.py
├── reports/                    # Power BI report templates and pbix sources
│   └── usda_dashboard.kql
├── ARCHITECTURE.md             # Mermaid + prose architecture diagrams
├── IMPLEMENTATION_SUMMARY.md   # Build-out summary and status
└── README.md                   # This file
```
## Expected Results

After running the medallion pipeline against the bundled seed data, the Gold layer should populate the following tables. Row counts vary with the seed-data generator parameters; the figures below are the approximate scale you should see on a default run.

| Gold Table | Approximate Rows | Notes |
|---|---|---|
| `gld_agricultural_dashboard` | TODO: capture after first run | Populated from Silver via `dbt run --select tag:gold` |
| `gld_crop_yield_forecast` | TODO: capture after first run | Populated from Silver via `dbt run --select tag:gold` |
| `gld_food_safety_risk_score` | TODO: capture after first run | Populated from Silver via `dbt run --select tag:gold` |
| `gld_snap_trends` | TODO: capture after first run | Populated from Silver via `dbt run --select tag:gold` |

TODO: capture exact counts after the next end-to-end seed run. These are bounded by the seed-data generator parameters in `data/generators/`.