🌾 Tutorial 32: USDA Agriculture Analytics¶

🏠 Home > 📖 Tutorials > 🌾 USDA Agriculture Analytics

Third-party references — publicly sourced, good-faith comparison

This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.

🌾 Tutorial 32: USDA Agriculture Analytics¶


Difficulty	⭐⭐⭐ Intermediate
Time	⏱️ 90-120 minutes
Focus	USDA Crop Production Analytics, Food Safety Monitoring, Federal Agricultural Data Governance

📊 Progress Tracker¶

┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│  00  │  01  │  02  │  03  │  04  │  05  │  06  │  07  │  08  │  09  │  10  │  11  │  12  │  13  │
│SETUP │BRNZE │SILVR │ GOLD │  RT  │ PBI  │PIPES │ GOV  │MIRRR │AI/ML │TDATA │ SAS  │CICD  │MIGR  │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘

┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│  14  │  15  │  16  │  17  │  18  │  19  │  20  │  21  │  22  │  23  │  24  │  25  │  26  │  27  │  28  │  29  │  30  │  31  │
│ SEC  │ COST │PERF  │ MON  │SHARE │COPLT │WKBST │ GEO  │ NET  │SHIR  │ SNW  │ DB2  │MULTI │VIDEO │ MOVE │GEOLC │TRIBL │ DOT  │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │  ✅  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘

┌──────┬──────┬──────┬──────┬──────┐
│  32  │  33  │  34  │  35  │  36  │
│ USDA │ SBA  │ NOAA │ EPA  │ DOI  │
├──────┼──────┼──────┼──────┼──────┤
│  🔵  │  ○   │  ○   │  ○   │  ○   │
└──────┴──────┴──────┴──────┴──────┘
   ▲
YOU ARE HERE

Navigation
⬅️ Previous	31-Federal DOT/FAA Analytics
➡️ Next	33-SBA Small Business Analytics

📖 Overview¶

The United States Department of Agriculture (USDA) is one of the most data-intensive federal agencies in the world. Through its sub-agencies -- the National Agricultural Statistics Service (NASS), the Food Safety and Inspection Service (FSIS), the Food and Nutrition Service (FNS), and dozens more -- the USDA collects, validates, and publishes data that drives agricultural policy, food safety enforcement, and nutrition program administration for a nation of 330 million people.

This tutorial teaches you to build a federal agricultural analytics pipeline in Microsoft Fabric that ingests USDA crop production statistics and food safety recall records, transforms them through a domain-specific medallion architecture, and produces Gold layer KPIs and Power BI dashboards for crop trend analysis and recall monitoring.

Two data domains form the backbone of this tutorial:

Crop Production (NASS): National, state, and county-level statistics on planted acreage, harvested acreage, yield, production volume, and price received for 10 major commodities (corn, soybeans, wheat, cotton, rice, barley, oats, sorghum, hay, and potatoes). This data mirrors the structure returned by the NASS QuickStats API, the primary programmatic interface for accessing USDA agricultural statistics.
Food Safety (FSIS): Recall records issued by the Food Safety and Inspection Service covering meat, poultry, and processed food recalls. Each recall includes classification (Class I through III based on health risk severity), reason (pathogen contamination, undeclared allergens, foreign material), distribution scope, and pounds recalled. This data mirrors the FSIS Recall Case Archive.

💡 Why Agricultural Analytics on Fabric?

Unified analytics over massive national datasets: NASS QuickStats alone contains over 50 GB of agricultural statistics spanning decades. Fabric's OneLake storage and lakehouse architecture handle this scale natively without infrastructure provisioning

Cross-domain insight: Joining crop production data with food safety recall data reveals patterns invisible in either dataset alone -- for example, whether production surges in specific commodities correlate with increased recall events

Federal open data compliance: All USDA datasets used in this tutorial are publicly available under federal open data policies, with clearly defined API access patterns and download endpoints

Direct Lake for real-time decision support: Agricultural analysts and policy makers need dashboards that refresh in near real-time during critical periods (planting season, harvest season, food safety incidents) without the latency of traditional import models

⚠️ Federal Data Notice

USDA data used in this tutorial is sourced from publicly available federal datasets subject to the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) and the Food Safety Modernization Act (FSMA). When working with real USDA data:

NASS QuickStats data is aggregated -- individual farm-level data is protected under CIPSEA and is never published

FSIS recall data is public record, but company names and establishment numbers should be handled with care in analytics to avoid defamation risk

Statistical data should not be misrepresented; always cite the source and methodology

API rate limits apply to the NASS QuickStats API (see API documentation)

🎯 Learning Objectives¶

By the end of this tutorial, you will be able to:

🏗️ Architecture Diagram¶

%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#2E7D32','primaryTextColor':'#fff','primaryBorderColor':'#1B5E20','lineColor':'#43A047','secondaryColor':'#E8F5E9','tertiaryColor':'#fff'}}}%%
flowchart TB
    subgraph Sources["🌾 USDA Data Sources"]
        NASS["📊 NASS QuickStats API\n(Crop Production)"]
        FSIS["🔬 FSIS Recall Archive\n(Food Safety)"]
        SYNTH["⚙️ Synthetic Generator\n(USDAGenerator)"]
    end

    subgraph Ingestion["📥 Data Ingestion"]
        API["🔑 API Download\n(api-key required)"]
        BULK["📦 Bulk CSV/Parquet\nDownload"]
        GEN["🐍 Python Generator\n(seed=42)"]
    end

    subgraph Fabric["🔷 Microsoft Fabric (F64 Workspace)"]
        direction TB

        subgraph Bronze["🥉 Bronze Layer"]
            BRZ_CROP["bronze_usda_crop_production\n(append-only, partitioned by load_date)"]
            BRZ_FOOD["bronze_usda_food_safety\n(append-only, partitioned by load_date)"]
        end

        subgraph Silver["🥈 Silver Layer"]
            SLV_CROP["silver_usda_crop_production\n(validated, deduped, partitioned by year)"]
            SLV_FOOD["silver_usda_food_safety\n(validated, deduped, partitioned by recall_year)"]
        end

        subgraph Gold["🥇 Gold Layer"]
            GLD_CROP["gold_usda_crop_summary\n(YoY trends, rankings, flags)"]
            GLD_STATE["gold_usda_state_agriculture\n(diversity, dominance, value)"]
            GLD_FOOD["gold_usda_food_safety_dashboard\n(severity scores, rolling trends)"]
            GLD_EXEC["gold_usda_executive_summary\n(cross-domain annual view)"]
        end

        PURV["🏛️ Microsoft Purview\n(Lineage + Classification)"]
    end

    subgraph Analytics["📊 Agricultural Analytics"]
        PBI["📊 Power BI\n(Direct Lake)"]
        CROP_DASH["🗺️ Crop Production\nMap & Trends"]
        FOOD_DASH["🔴 Food Safety\nRecall Monitor"]
        EXEC_DASH["📋 Executive\nScorecard"]
    end

    NASS --> API
    FSIS --> BULK
    SYNTH --> GEN

    API --> BRZ_CROP
    BULK --> BRZ_FOOD
    GEN --> BRZ_CROP
    GEN --> BRZ_FOOD

    BRZ_CROP --> SLV_CROP
    BRZ_FOOD --> SLV_FOOD

    SLV_CROP --> GLD_CROP
    SLV_CROP --> GLD_STATE
    SLV_FOOD --> GLD_FOOD
    SLV_CROP --> GLD_EXEC
    SLV_FOOD --> GLD_EXEC

    PURV -.-> Bronze
    PURV -.-> Silver
    PURV -.-> Gold

    GLD_CROP --> PBI
    GLD_STATE --> PBI
    GLD_FOOD --> PBI
    GLD_EXEC --> PBI

    PBI --> CROP_DASH
    PBI --> FOOD_DASH
    PBI --> EXEC_DASH

    style Sources fill:#E8F5E9
    style Ingestion fill:#FFF3E0
    style Fabric fill:#E0F2F1
    style Bronze fill:#FFF8E1
    style Silver fill:#E3F2FD
    style Gold fill:#FFF3E0
    style Analytics fill:#F3E5F5

Component	Technology	Purpose
NASS QuickStats API	REST API (JSON/CSV)	Crop production, yield, acreage, and price statistics by state and county
FSIS Recall Archive	Bulk CSV/JSON download	Food safety recall records with classification, reason, and distribution
Synthetic Generator	Python (`USDAGenerator`)	Reproducible test data aligned to real schema for development and POC
Bronze	Delta Lake (append-only)	Raw records with ingestion metadata, partitioned by `_bronze_load_date`
Silver	Delta Lake (validated)	FIPS-validated, standardized, deduplicated records with DQ scores, partitioned by year
Gold	Delta Lake (aggregated)	Commodity KPIs, state profiles, recall dashboards, and executive summary
Purview	Microsoft Purview	Data lineage, federal data classification, access governance
Power BI	Direct Lake	Crop production maps, commodity trends, recall monitoring, executive scorecards

📋 Prerequisites¶

Before starting this tutorial, ensure you have:

Completed Tutorial 00: Environment Setup
Completed Tutorial 01: Bronze Layer
Completed Tutorial 02: Silver Layer
Completed Tutorial 03: Gold Layer
Fabric workspace with F64 capacity (or F2+ for development/testing)
Python 3.10+ with pandas, numpy, faker installed (for the data generator)
Familiarity with PySpark, Delta Lake, and DAX patterns from prior tutorials

Recommended (not required):

Completed Tutorial 05: Direct Lake & Power BI -- for the Power BI dashboard steps
Completed Tutorial 07: Governance & Purview -- for the governance configuration steps

💡 Two Data Paths Available

This tutorial supports two data source options:

Option A: Synthetic Data Generator -- No API key required, runs locally, produces schema-aligned test data. Best for development and POC demonstrations.

Option B: Real Open Data Download -- Uses the NASS QuickStats API (free API key required) and FSIS recall archive (no key required). Best for production-quality demonstrations with real federal data.

All downstream notebooks work identically with either data source because both produce the same schema.

🛠️ Step 1: Data Source Setup¶

Choose Option A (synthetic data) or Option B (real open data) based on your requirements. Both paths produce Parquet files with identical schemas that the Bronze ingestion notebook consumes without modification.

Option A: Synthetic Data Generation¶

The USDAGenerator produces realistic synthetic records that mirror the schema and value distributions of real USDA data. This is the fastest path to a working pipeline.

from data_generation.generators.federal.usda_generator import USDAGenerator

# Initialize with seed for reproducibility
gen = USDAGenerator(seed=42)

# Generate crop production records (NASS QuickStats format)
crop_df = gen.generate_batch(count=10000, domain="crop_production")
print(f"Crop production records: {len(crop_df):,}")
print(f"Columns: {list(crop_df.columns)}")
print(f"Commodities: {crop_df['commodity'].nunique()} unique")
print(f"States: {crop_df['state_name'].nunique()} unique")
print(f"Year range: {crop_df['year'].min()} - {crop_df['year'].max()}")

# Generate food safety records (FSIS recall format)
food_df = gen.generate_batch(count=5000, domain="food_safety")
print(f"\nFood safety records: {len(food_df):,}")
print(f"Recall classes: {food_df['recall_class'].value_counts().to_dict()}")
print(f"Product types: {food_df['product_type'].nunique()} unique")

# Export to Parquet for notebook consumption
crop_df.to_parquet("data_generation/output/usda_crop_production.parquet", index=False)
food_df.to_parquet("data_generation/output/usda_food_safety.parquet", index=False)

print("\nParquet files written to data_generation/output/")

Expected output:

Crop production records: 10,000
Columns: ['record_id', 'commodity', 'year', 'state_fips', 'state_name', ...]
Commodities: 10 unique
States: 25 unique
Year range: 2015 - 2025

Food safety records: 5,000
Recall classes: {'Class I': ~2750, 'Class II': ~1750, 'Class III': ~500}
Product types: 7 unique

Parquet files written to data_generation/output/

💡 Generator Details

The USDAGenerator supports two domains: - crop_production -- NASS QuickStats-style records with commodity, year, state FIPS, county FIPS (for county-level aggregation), statistic category (AREA PLANTED, AREA HARVESTED, YIELD, PRODUCTION, PRICE RECEIVED), value, unit, source (SURVEY/CENSUS), and aggregation level (NATIONAL/STATE/COUNTY) - food_safety -- FSIS recall-style records with recall number (FSIS-YYYY-NNN format), recall date, product type, recall class (I/II/III), reason (15 possible reasons including pathogen contamination and undeclared allergens), risk level, company, establishment number, pounds recalled, and distribution scope

See data_generation/generators/federal/usda_generator.py for the full implementation.

Option B: Real Open Data Download¶

Use the included download script to fetch real USDA data from public APIs and archives.

Step B.1: Obtain a NASS API Key¶

The NASS QuickStats API requires a free API key:

Visit https://quickstats.nass.usda.gov/api#param_define
Click "Request API Key"
Enter your email address and submit
Check your email for the API key (typically arrives within minutes)
Save the key -- you will need it for the download script

⚠️ API Rate Limits

The NASS QuickStats API limits responses to 50,000 records per query. The download script handles pagination automatically, but large date ranges for all commodities may require multiple queries. The script includes built-in rate limiting (1-second delay between requests) to avoid API throttling.

Step B.2: Run the Download Script¶

# Download NASS crop production data (requires API key)
python -m data_generation.open_data.usda_download --dataset nass \
    --api-key YOUR_NASS_API_KEY \
    --output-dir data_generation/output/

# Download FSIS recall data (no API key required)
python -m data_generation.open_data.usda_download --dataset fsis \
    --output-dir data_generation/output/

# Or download both at once
python -m data_generation.open_data.usda_download --dataset all \
    --api-key YOUR_NASS_API_KEY \
    --output-dir data_generation/output/

The download script produces Parquet files with the same column names and types as the synthetic generator, so all downstream notebooks work with either data source.

Step B.3: Verify Downloaded Data¶

import pandas as pd

crop_df = pd.read_parquet("data_generation/output/usda_crop_production.parquet")
food_df = pd.read_parquet("data_generation/output/usda_food_safety.parquet")

print(f"Crop production: {len(crop_df):,} records, {crop_df['commodity'].nunique()} commodities")
print(f"Food safety: {len(food_df):,} records, {food_df['recall_class'].nunique()} recall classes")

💡 Data Source Configuration

Full details on available USDA datasets, API endpoints, and download options are documented in data_generation/config/federal_datasets.yaml under the usda agency section.

Verification Checkpoint¶

Before proceeding, confirm you have two Parquet files in data_generation/output/:

File	Expected Records	Key Columns
`usda_crop_production.parquet`	10,000+ (synthetic) or varies (real)	commodity, year, state_fips, statisticcat_desc, value
`usda_food_safety.parquet`	5,000+ (synthetic) or varies (real)	recall_number, recall_date, product_type, recall_class, risk_level

🛠️ Step 2: Bronze Layer Ingestion¶

The Bronze layer ingests raw USDA data with minimal transformation -- only ingestion metadata columns are added. Data quality checks at this layer are limited to verifying that critical fields are populated.

2.1 Upload Data to Fabric Lakehouse¶

Upload the Parquet files from Step 1 to your Fabric lakehouse:

Open your Fabric workspace and navigate to the lh_bronze lakehouse
Click Get data > Upload files
Upload both files to Files/output/:
usda_crop_production.parquet → Files/output/usda_crop_production.parquet
usda_food_safety.parquet → Files/output/usda_food_safety.parquet

💡 Alternative: OneLake File Explorer

If you have OneLake File Explorer installed, you can drag and drop the Parquet files directly into the lakehouse file system from Windows Explorer.

2.2 Run the Bronze Ingestion Notebook¶

Open and run the Bronze ingestion notebook: notebooks/bronze/12_bronze_usda.py

The notebook performs the following operations:

# Source paths - data generator parquet output (primary)
SOURCE_PATHS = {
    "crop_production": "Files/output/usda_crop_production.parquet",
    "food_safety": "Files/output/usda_food_safety.parquet",
}

# Target Delta tables
TARGET_TABLES = {
    "crop_production": "bronze_usda_crop_production",
    "food_safety": "bronze_usda_food_safety",
}

Key operations performed by the Bronze notebook:

Schema enforcement -- Explicit StructType schemas are applied at read time for both domains. Crop production enforces 14 fields (commodity, year, state_fips, state_name, county_fips, county_name, statisticcat_desc, unit_desc, value, cv_percent, source_desc, agg_level_desc, domain_desc, reference_period_desc). Food safety enforces 14 fields (recall_id, recall_number, recall_date, product_type, recall_class, reason, risk_level, company_name, establishment_number, city, state, pounds_recalled, distribution, status).
Critical field null checks -- Crop production requires non-null: commodity, year, state_fips, state_name. Food safety requires non-null: recall_id, recall_number, recall_date.
Bronze metadata columns -- Each record receives: _bronze_ingested_at (timestamp), _bronze_source_file (input file name), _bronze_batch_id (batch identifier), _bronze_load_date (date partition key).
Delta Lake write -- Append mode with partition by _bronze_load_date and mergeSchema=true for schema evolution support.
CSV fallback -- If Parquet files are not found, the notebook falls back to CSV files in Files/open_data/usda/ (useful when loading data from alternative download paths).

2.3 Verify Bronze Ingestion¶

After running the notebook, verify the results:

# Verify crop production table
df_crop = spark.table("bronze_usda_crop_production")
print(f"Crop Production Records: {df_crop.count():,}")
print(f"Partitions: {df_crop.select('_bronze_load_date').distinct().count()}")

# Verify food safety table
df_food = spark.table("bronze_usda_food_safety")
print(f"Food Safety Records: {df_food.count():,}")
print(f"Partitions: {df_food.select('_bronze_load_date').distinct().count()}")

# Check commodity distribution
df_crop.groupBy("commodity").count().orderBy("count", ascending=False).show()

# Check recall class distribution
df_food.groupBy("recall_class").count().orderBy("count", ascending=False).show()

Expected results:

Check	Expected
Crop production record count	Matches input Parquet row count
Food safety record count	Matches input Parquet row count
Crop commodity null count	0
Food safety recall_id null count	0
Bronze metadata columns present	4 additional columns per table
Delta table version	0 (first write)

🛠️ Step 3: Silver Layer Transformation¶

The Silver layer applies domain-specific validation, standardization, deduplication, and data quality scoring to produce clean, analytics-ready tables.

3.1 Run the Silver Transformation Notebook¶

Open and run: notebooks/silver/12_silver_usda.py

The notebook reads from Bronze tables and writes to Silver tables:

# Source tables (Bronze)
SOURCE_CROP = "lh_bronze.bronze_usda_crop_production"
SOURCE_FOOD_SAFETY = "lh_bronze.bronze_usda_food_safety"

# Target tables (Silver)
TARGET_CROP = "lh_silver.silver_usda_crop_production"
TARGET_FOOD_SAFETY = "lh_silver.silver_usda_food_safety"

3.2 Crop Production Transformations¶

The Silver notebook applies these transformations to crop production data:

Schema enforcement and type casting:

df_crop_typed = df_crop_bronze \
    .withColumn("value", col("value").cast(DecimalType(18, 2))) \
    .withColumn("cv_percent", col("cv_percent").cast(DecimalType(8, 2))) \
    .withColumn("year", col("year").cast(IntegerType()))

FIPS code validation -- State FIPS codes are validated against a comprehensive lookup table of 50 US states plus territories. County FIPS codes follow the 5-digit pattern (2-digit state prefix + 3-digit county suffix).

Commodity and statistic category validation -- Commodity names are validated against the 10 supported values: CORN, SOYBEANS, WHEAT, COTTON, RICE, BARLEY, OATS, SORGHUM, HAY, POTATOES. Statistic categories are validated against: AREA PLANTED, AREA HARVESTED, YIELD, PRODUCTION, PRICE RECEIVED.

Standardization -- All string fields are uppercased and trimmed. State names are normalized via FIPS lookup to handle spelling variations in source data.

Deduplication -- Records are deduplicated on the composite key: (commodity, year, state_fips, county_fips, statisticcat_desc, source_desc).

Data quality scoring (0-100):

Component	Points	Rule
Valid commodity	20	Commodity in allowlist
Valid state FIPS	20	State FIPS in lookup table
Value not null	20	Statistic value is populated
Valid statistic category	20	Category in allowlist
Valid aggregation level	20	Level is NATIONAL, STATE, or COUNTY

3.3 Food Safety Transformations¶

Schema enforcement and type casting:

df_food_typed = df_food_bronze \
    .withColumn("recall_date", to_date(col("recall_date"))) \
    .withColumn("pounds_recalled", col("pounds_recalled").cast(DecimalType(18, 2)))

Recall class and product type validation -- Recall class validated against: Class I, Class II, Class III. Product type validated against: BEEF, POULTRY, PORK, PROCESSED, READY-TO-EAT, IMPORTED, OTHER.

Status standardization -- Status values are uppercased and validated against: OPEN, CLOSED, EXPANDED.

Deduplication -- Records are deduplicated on recall_number (unique recall identifier in FSIS-YYYY-NNN format).

Data quality scoring (0-100):

Component	Points	Rule
Valid recall class	25	Recall class in allowlist
Valid product type	25	Product type in allowlist
Pounds recalled not null	25	Recall volume is populated
Distribution not null	25	Distribution scope is populated

3.4 Verify Silver Transformations¶

# Verify crop production Silver table
df_crop_silver = spark.table("lh_silver.silver_usda_crop_production")
print(f"Silver crop records: {df_crop_silver.count():,}")

# Average data quality score
avg_dq = df_crop_silver.agg({"_dq_score": "avg"}).collect()[0][0]
print(f"Avg crop DQ score: {avg_dq:.1f}/100")

# Verify food safety Silver table
df_food_silver = spark.table("lh_silver.silver_usda_food_safety")
print(f"Silver food safety records: {df_food_silver.count():,}")

# Average data quality score
avg_food_dq = df_food_silver.agg({"_dq_score": "avg"}).collect()[0][0]
print(f"Avg food safety DQ score: {avg_food_dq:.1f}/100")

# Check DQ flag distribution
df_crop_silver.groupBy("_dq_score").count().orderBy("_dq_score", ascending=False).show()

Expected results:

Check	Expected
Crop Silver record count	Less than or equal to Bronze (duplicates removed)
Food Safety Silver record count	Less than or equal to Bronze (duplicates removed)
Crop average DQ score	90+ (synthetic data is high quality)
Food Safety average DQ score	85+ (some null distribution/pounds fields expected)
Crop partition column	`year`
Food Safety partition column	`recall_year`
Crop Z-Order	commodity, state_fips
Food Safety Z-Order	recall_class, product_type

🛠️ Step 4: Gold Layer Analytics¶

The Gold layer produces four aggregated tables optimized for Power BI Direct Lake dashboards. Each table serves a specific analytical purpose.

4.1 Run the Gold Analytics Notebook¶

Open and run: notebooks/gold/12_gold_usda_analytics.py

The notebook reads from Silver tables and writes four Gold tables:

# Target tables (Gold)
TARGET_CROP_SUMMARY = "lh_gold.gold_usda_crop_summary"
TARGET_STATE_AGRICULTURE = "lh_gold.gold_usda_state_agriculture"
TARGET_FOOD_SAFETY_DASHBOARD = "lh_gold.gold_usda_food_safety_dashboard"
TARGET_EXECUTIVE_SUMMARY = "lh_gold.gold_usda_executive_summary"

4.2 Table 1: Crop Summary (`gold_usda_crop_summary`)¶

Commodity-level production KPIs aggregated by commodity, year, state, and aggregation level.

Key metrics:

Metric	Description
`total_production`	Sum of PRODUCTION statistic values
`avg_yield`	Average of YIELD statistic values (BU/ACRE)
`total_area_planted`	Sum of AREA PLANTED values (ACRES)
`total_area_harvested`	Sum of AREA HARVESTED values (ACRES)
`avg_price_received`	Average of PRICE RECEIVED values ($/BU)
`year_over_year_change_pct`	Percentage change in total_production vs prior year
`commodity_rank`	Rank by total_production within each year
`performance_flag`	TOP_PRODUCER, DECLINING_YIELD, or EXPANDING_ACREAGE

Performance flags: - TOP_PRODUCER -- Ranked in top 5 by total production within the year - DECLINING_YIELD -- Average yield dropped more than 5% versus prior year - EXPANDING_ACREAGE -- Area planted increased more than 5% versus prior year

4.3 Table 2: State Agriculture Profile (`gold_usda_state_agriculture`)¶

State-level agricultural profile capturing diversity, dominance, and economic value.

Key metrics:

Metric	Description
`total_commodities_grown`	Count of distinct commodities per state per year
`total_production_value`	Estimated value = production volume x average price
`crop_diversity_index`	Number of distinct commodities (higher = more diverse)
`dominant_commodity`	Commodity with highest production volume in that state-year

4.4 Table 3: Food Safety Dashboard (`gold_usda_food_safety_dashboard`)¶

Recall analytics aggregated by year, quarter, recall class, and product type.

Key metrics:

Metric	Description
`total_recalls`	Count of recall events in the period
`total_pounds_recalled`	Total weight of recalled product
`severity_weighted_score`	Weighted score: Class I=3, Class II=2, Class III=1 (range: 1.0-3.0)
`pct_nationwide_distribution`	Percentage of recalls with nationwide distribution
`most_common_reason`	Most frequent recall reason in the period
`rolling_12m_recalls`	Rolling 4-quarter recall count
`rolling_12m_avg_severity`	Rolling 4-quarter average severity score
`recall_trend`	INCREASING, DECREASING, STABLE, or NEW

4.5 Table 4: Executive Summary (`gold_usda_executive_summary`)¶

Cross-domain annual view combining crop production and food safety metrics.

Key metrics:

Metric	Description
`total_crop_production_records`	Annual crop statistic record count
`distinct_commodities`	Number of commodities reported that year
`total_recall_events`	Annual recall count
`class_i_recalls`	Count of highest-severity recalls
`food_safety_index`	(1 - Class I rate) x 100; range 0-100, higher is safer
`top_3_commodities`	Three most-reported commodities
`top_3_recall_reasons`	Three most-common recall reasons

4.6 Verify Gold Layer¶

# Verify all four Gold tables
for table_name in [
    "lh_gold.gold_usda_crop_summary",
    "lh_gold.gold_usda_state_agriculture",
    "lh_gold.gold_usda_food_safety_dashboard",
    "lh_gold.gold_usda_executive_summary"
]:
    df = spark.table(table_name)
    print(f"{table_name}: {df.count():,} records, {len(df.columns)} columns")

# Verify executive summary shows cross-domain join
df_exec = spark.table("lh_gold.gold_usda_executive_summary")
df_exec.select(
    "year", "total_crop_production_records", "total_recall_events",
    "food_safety_index", "top_3_commodities"
).orderBy("year", ascending=False).show(5, truncate=False)

Expected results:

Gold Table	Expected Records	Key Verification
`gold_usda_crop_summary`	Varies (commodity x year x state x agg_level combinations)	`performance_flag` populated for qualifying records
`gold_usda_state_agriculture`	~25 states x ~11 years	`dominant_commodity` populated for each state-year
`gold_usda_food_safety_dashboard`	Varies (year x quarter x class x product combinations)	`severity_weighted_score` between 1.0 and 3.0
`gold_usda_executive_summary`	~11 rows (one per year)	`food_safety_index` between 0 and 100

🛠️ Step 5: Power BI Dashboard¶

Build a Direct Lake-connected Power BI report for agricultural analytics and food safety monitoring.

5.1 Create the Semantic Model¶

In your Fabric workspace, navigate to the lh_gold lakehouse
Click New semantic model
Select the four Gold tables:
gold_usda_crop_summary
gold_usda_state_agriculture
gold_usda_food_safety_dashboard
gold_usda_executive_summary
Name the model: sm_usda_agriculture
Click Confirm to create the Direct Lake semantic model

5.2 Define Relationships¶

In the semantic model editor, create these relationships:

From Table	From Column	To Table	To Column	Cardinality
`gold_usda_crop_summary`	year	`gold_usda_executive_summary`	year	Many-to-One
`gold_usda_state_agriculture`	year	`gold_usda_executive_summary`	year	Many-to-One
`gold_usda_food_safety_dashboard`	recall_year	`gold_usda_executive_summary`	year	Many-to-One
`gold_usda_crop_summary`	state_name	`gold_usda_state_agriculture`	state_name	Many-to-One

5.3 DAX Measures¶

Create the following DAX measures in the semantic model:

// Total US Production (all commodities, all states)
Total US Production =
CALCULATE(
    SUM(gold_usda_crop_summary[total_production]),
    gold_usda_crop_summary[agg_level_desc] = "STATE"
)

// Average National Yield (BU/ACRE)
Avg National Yield =
CALCULATE(
    AVERAGE(gold_usda_crop_summary[avg_yield]),
    gold_usda_crop_summary[agg_level_desc] = "STATE"
)

// Year-over-Year Production Change (formatted)
YoY Production Change =
VAR CurrentYear = MAX(gold_usda_crop_summary[year])
VAR CurrentProduction =
    CALCULATE(
        SUM(gold_usda_crop_summary[total_production]),
        gold_usda_crop_summary[year] = CurrentYear
    )
VAR PriorProduction =
    CALCULATE(
        SUM(gold_usda_crop_summary[total_production]),
        gold_usda_crop_summary[year] = CurrentYear - 1
    )
RETURN
    DIVIDE(CurrentProduction - PriorProduction, PriorProduction, 0)

// Total Recalls (all classes)
Total Recalls =
SUM(gold_usda_food_safety_dashboard[total_recalls])

// Class I Recall Rate (high-severity percentage)
Class I Recall Rate =
DIVIDE(
    CALCULATE(
        SUM(gold_usda_food_safety_dashboard[total_recalls]),
        gold_usda_food_safety_dashboard[recall_class] = "Class I"
    ),
    SUM(gold_usda_food_safety_dashboard[total_recalls]),
    0
)

// Food Safety Index (from executive summary)
Food Safety Index =
AVERAGE(gold_usda_executive_summary[food_safety_index])

// Severity Score (weighted average across all recalls)
Avg Severity Score =
AVERAGE(gold_usda_food_safety_dashboard[severity_weighted_score])

// Total Pounds Recalled (formatted)
Total Pounds Recalled =
FORMAT(
    SUM(gold_usda_food_safety_dashboard[total_pounds_recalled]),
    "#,##0"
) & " lbs"

// Yield Variance (current year vs 5-year average)
Yield Variance vs 5yr Avg =
VAR CurrentYear = MAX(gold_usda_crop_summary[year])
VAR CurrentYield =
    CALCULATE(
        AVERAGE(gold_usda_crop_summary[avg_yield]),
        gold_usda_crop_summary[year] = CurrentYear
    )
VAR FiveYearAvg =
    CALCULATE(
        AVERAGE(gold_usda_crop_summary[avg_yield]),
        gold_usda_crop_summary[year] >= CurrentYear - 5,
        gold_usda_crop_summary[year] < CurrentYear
    )
RETURN
    DIVIDE(CurrentYield - FiveYearAvg, FiveYearAvg, 0)

// Recall Severity Label (for conditional formatting)
Recall Severity Label =
VAR AvgSeverity = AVERAGE(gold_usda_food_safety_dashboard[severity_weighted_score])
RETURN
    SWITCH(
        TRUE(),
        AvgSeverity >= 2.5, "Critical",
        AvgSeverity >= 2.0, "High",
        AvgSeverity >= 1.5, "Moderate",
        "Low"
    )

5.4 Suggested Dashboard Layout¶

Create a multi-page Power BI report with these pages:

Page 1: Crop Production Overview

Visual	Type	Data
US Production Map	Filled map	state_name (location), total_production (color saturation)
Commodity Trend Lines	Line chart	year (x-axis), total_production (y-axis), commodity (legend)
Top Producers Table	Table	commodity_rank, commodity, state_name, total_production, avg_yield, performance_flag
YoY Change KPI	Card	YoY Production Change measure
Avg Yield KPI	Card	Avg National Yield measure

Page 2: Food Safety Monitor

Visual	Type	Data
Recall Trend Area Chart	Stacked area chart	recall_year + recall_quarter (x-axis), total_recalls (y-axis), recall_class (legend)
Recall Reason Treemap	Treemap	most_common_reason (category), total_recalls (values)
Severity Score Gauge	Gauge	Avg Severity Score measure (min=1, max=3)
Pounds Recalled Bar Chart	Clustered bar	product_type (axis), total_pounds_recalled (values), risk_level (legend)
Active Recalls Table	Table	recall_year, recall_class, product_type, open_recalls, severity_weighted_score, recall_trend

Page 3: State Agriculture Scorecard

Visual	Type	Data
State Diversity Map	Filled map	state_name (location), crop_diversity_index (color)
Dominant Commodity Table	Matrix	state_name (rows), year (columns), dominant_commodity (values)
Production Value Bar	Clustered bar	state_name (axis), total_production_value (values)
Diversity vs Value Scatter	Scatter	crop_diversity_index (x), total_production_value (y), state_name (details)

Page 4: Executive Summary

Visual	Type	Data
Food Safety Index KPI	Card	Food Safety Index measure
Annual Metrics Table	Table	year, total_crop_production_records, total_recall_events, food_safety_index, top_3_commodities
Cross-Domain Trend	Dual-axis line	year (x), total_crop_production_records (left axis), total_recall_events (right axis)

5.5 Verify Dashboard¶

After building the report, verify:

Direct Lake mode is active (check semantic model properties -- should not show "DirectQuery fallback")
All four pages render correctly with data
Slicers for year, commodity, state, and recall class filter across pages
YoY calculations show correct trends
Food Safety Index ranges between 0-100

🛠️ Step 6: Data Quality with Great Expectations¶

Configure Great Expectations validation suites for both USDA data domains.

6.1 Crop Production Expectations¶

# Great Expectations suite: usda_crop_production
crop_expectations = {
    "suite_name": "usda_crop_production",
    "expectations": [
        # Non-null critical fields
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "commodity"},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "year"},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "state_fips"},
        },
        # Commodity allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "commodity",
                "value_set": [
                    "CORN", "SOYBEANS", "WHEAT", "COTTON", "RICE",
                    "BARLEY", "OATS", "SORGHUM", "HAY", "POTATOES"
                ],
            },
        },
        # FIPS format validation (2-digit state code)
        {
            "expectation_type": "expect_column_values_to_match_regex",
            "kwargs": {
                "column": "state_fips",
                "regex": r"^\d{2}$",
            },
        },
        # County FIPS format (5-digit, when present)
        {
            "expectation_type": "expect_column_values_to_match_regex",
            "kwargs": {
                "column": "county_fips",
                "regex": r"^\d{5}$",
                "mostly": 0.4,  # Only ~40% of records are county-level
            },
        },
        # Year range validation
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {
                "column": "year",
                "min_value": 2000,
                "max_value": 2030,
            },
        },
        # Value must be positive
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {
                "column": "value",
                "min_value": 0,
                "strict_min": False,
                "mostly": 0.99,
            },
        },
        # Statistic category allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "statisticcat_desc",
                "value_set": [
                    "AREA PLANTED", "AREA HARVESTED", "YIELD",
                    "PRODUCTION", "PRICE RECEIVED"
                ],
            },
        },
        # Aggregation level allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "agg_level_desc",
                "value_set": ["NATIONAL", "STATE", "COUNTY"],
            },
        },
    ],
}

6.2 Food Safety Expectations¶

# Great Expectations suite: usda_food_safety
food_safety_expectations = {
    "suite_name": "usda_food_safety",
    "expectations": [
        # Non-null critical fields
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "recall_number"},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "recall_date"},
        },
        # Recall number format: FSIS-YYYY-NNN
        {
            "expectation_type": "expect_column_values_to_match_regex",
            "kwargs": {
                "column": "recall_number",
                "regex": r"^FSIS-\d{4}-\d{3}$",
            },
        },
        # Recall class allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "recall_class",
                "value_set": ["Class I", "Class II", "Class III"],
            },
        },
        # Product type allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "product_type",
                "value_set": [
                    "BEEF", "POULTRY", "PORK", "PROCESSED",
                    "READY-TO-EAT", "IMPORTED", "OTHER"
                ],
            },
        },
        # Risk level allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "risk_level",
                "value_set": ["HIGH", "MEDIUM", "LOW"],
            },
        },
        # Pounds recalled must be positive when present
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {
                "column": "pounds_recalled",
                "min_value": 0,
                "strict_min": True,
                "mostly": 0.85,  # ~10-15% may have null pounds
            },
        },
        # Recall number uniqueness (after Silver dedup)
        {
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": "recall_number"},
        },
        # Status allowlist
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "status",
                "value_set": ["OPEN", "CLOSED", "EXPANDED"],
            },
        },
    ],
}

6.3 Run Validation¶

# Run Great Expectations checkpoint for both suites
great_expectations checkpoint run usda_crop_production_checkpoint
great_expectations checkpoint run usda_food_safety_checkpoint

6.4 Verify Data Quality¶

After running the checkpoints, review the Data Docs report:

Crop production suite: all expectations passing at >95% threshold
Food safety suite: all expectations passing at >95% threshold
FIPS format validation shows zero failures
Commodity/recall class allowlists show zero out-of-set values
Recall number uniqueness confirmed after Silver dedup

🛠️ Step 7: Governance & Compliance¶

Configure Microsoft Purview governance for federal agricultural data classification, lineage tracking, and compliance with USDA data handling requirements.

7.1 Purview Classification for Agricultural Data¶

Configure sensitivity labels and data classification in Microsoft Purview for USDA data assets:

# Purview classification configuration for USDA data
purview_classification = {
    "workspace": "ws_federal_agriculture",
    "sensitivity_label": "Federal - Public Statistical Data",
    "data_classifications": {
        "bronze_usda_crop_production": {
            "classification": "Federal Statistical Data",
            "sensitivity": "Public",  # NASS data is publicly available
            "source_agency": "USDA-NASS",
            "governing_law": "CIPSEA",
            "retention_policy": "Permanent (statistical archive)",
            "pii_columns": [],  # No PII in crop production data
        },
        "bronze_usda_food_safety": {
            "classification": "Federal Regulatory Data",
            "sensitivity": "Public",  # FSIS recall data is public record
            "source_agency": "USDA-FSIS",
            "governing_law": "FSMA",
            "retention_policy": "7 years (regulatory records)",
            "pii_columns": [],  # Company names are public; no individual PII
            "sensitive_columns": ["company_name", "establishment_number"],
        },
    },
    "lineage_tracking": {
        "source_to_bronze": "Parquet/CSV ingestion with schema enforcement",
        "bronze_to_silver": "FIPS validation, standardization, dedup, DQ scoring",
        "silver_to_gold": "Aggregation, KPI computation, cross-domain joins",
        "gold_to_pbi": "Direct Lake semantic model connection",
    },
}

7.2 CIPSEA Compliance (Crop Production)¶

The Confidential Information Protection and Statistical Efficiency Act (CIPSEA) governs how federal statistical agencies collect, protect, and publish data.

Requirement	Implementation
No individual farm data	NASS publishes only aggregated statistics; individual responses are never released. The QuickStats API returns only pre-aggregated data at national, state, or county levels.
Suppression rules	NASS suppresses data where publication could identify individual respondents. Records with `(D)` in value fields indicate suppressed data. Handle these as nulls in ingestion.
Citation requirements	When publishing analytics derived from NASS data, cite: "Source: USDA National Agricultural Statistics Service, QuickStats"
No commercial misrepresentation	Do not use NASS data to create misleading commodity price forecasts or market manipulation tools

7.3 FSMA Compliance (Food Safety)¶

The Food Safety Modernization Act (FSMA) shifted US food safety focus from response to prevention.

Requirement	Implementation
Public recall transparency	FSIS recall data is public record. Your analytics pipeline helps fulfill FSMA's transparency goals by making recall patterns discoverable.
Timely reporting	Configure Fabric Activator alerts to notify stakeholders when new Class I recalls appear in the pipeline
Establishment accountability	Establishment numbers link to FSIS-inspected facilities. Handle with care in public-facing dashboards -- while data is public, aggregated patterns should not unfairly single out individual facilities
Record retention	FSMA requires 7-year retention for food safety records. Configure Delta Lake retention policies accordingly.

7.4 Data Retention Configuration¶

# Delta Lake retention policy for federal data compliance
retention_policies = {
    # Crop production: permanent archive (statistical records)
    "bronze_usda_crop_production": {
        "vacuum_retain_hours": 8760,  # 365 days for Delta log cleanup
        "data_retention": "permanent",
        "rationale": "Federal statistical archive under CIPSEA"
    },
    # Food safety: 7-year retention per FSMA
    "bronze_usda_food_safety": {
        "vacuum_retain_hours": 8760,
        "data_retention": "7 years",
        "rationale": "Regulatory records under FSMA Section 103"
    },
    # Silver and Gold: match source retention
    "silver_usda_*": {
        "vacuum_retain_hours": 168,  # 7 days for Delta log
        "data_retention": "matches source policy",
    },
    "gold_usda_*": {
        "vacuum_retain_hours": 168,
        "data_retention": "matches source policy",
    },
}

# Apply VACUUM to maintain Delta table health
for table_name in [
    "lh_bronze.bronze_usda_crop_production",
    "lh_bronze.bronze_usda_food_safety"
]:
    spark.sql(f"VACUUM {table_name} RETAIN 8760 HOURS")
    print(f"VACUUM applied: {table_name}")

7.5 Purview Lineage Verification¶

After running the full pipeline (Bronze → Silver → Gold → Power BI), verify in Microsoft Purview:

All six USDA Delta tables appear in the Purview data catalog
Lineage graph shows the complete flow from source files through Bronze, Silver, Gold, and into the Power BI semantic model
Classification labels are applied: "Federal Statistical Data" for crop production, "Federal Regulatory Data" for food safety
The company_name and establishment_number columns in food safety tables are flagged as sensitive (not PII, but commercially sensitive)
Access audit logs show which users accessed USDA data assets

🔧 Troubleshooting¶

Symptom	Likely Cause	Solution
`FileNotFoundException` when running Bronze notebook	Parquet files not uploaded to correct lakehouse path	Verify files exist at `Files/output/usda_crop_production.parquet` and `Files/output/usda_food_safety.parquet` in the lh_bronze lakehouse
Zero records after Bronze ingestion	Schema mismatch between generator output and notebook schema	Re-generate data with the same `USDAGenerator` version; check column names match exactly
NASS API returns 403 Forbidden	Invalid or expired API key	Request a new key at quickstats.nass.usda.gov; keys are free but must be requested
NASS API returns empty result set	Query parameters too restrictive	Broaden the query: remove county-level filter, expand year range, or reduce commodity specificity
Crop DQ score below 80 for many records	State FIPS codes not in validation lookup	The Silver notebook validates against 50 states; check if source data includes territories (PR, GU, VI) not in the lookup table
Food safety dedup removes too many records	Duplicate `recall_number` values in source data	This is expected for the synthetic generator with large batch sizes; real FSIS data has unique recall numbers
Gold `food_safety_index` shows 100.0 for all years	No food safety data joined to executive summary	Verify that `recall_year` in food safety Silver matches `year` in crop Silver (both should be integers)
Power BI Direct Lake falls back to DirectQuery	Gold tables not optimized	Run `OPTIMIZE ... ZORDER BY` on all Gold tables; reduce column count if table exceeds Direct Lake limits
`year_over_year_change_pct` is null	Only one year of data for that commodity-state combination	Expected for the first year in the dataset; ensure source data spans multiple years
Great Expectations FIPS regex fails	FIPS codes not zero-padded	Verify that state_fips is stored as a 2-character string, not an integer (e.g., "01" not 1)
Purview lineage graph incomplete	Notebook jobs not captured by Purview scanner	Ensure Purview integration is enabled at the workspace level and re-scan after running the full pipeline

📋 Best Practices¶

Use the synthetic generator for development, real data for production demos. The USDAGenerator produces schema-aligned data instantly, but real NASS QuickStats data carries the weight and nuance of actual agricultural patterns. For customer-facing demonstrations, real data is more compelling.
Validate FIPS codes at the Silver layer, not Bronze. Bronze is append-only raw storage. FIPS validation, commodity allowlisting, and other domain rules belong in Silver where they can be version-controlled and audited without re-ingesting raw data.
Partition crop data by year, food safety by recall_year. Agricultural queries are almost always time-bounded ("How did corn yields compare in 2023 vs 2024?"). Year-based partitioning ensures partition pruning on the most common filter dimension.
Z-Order on the most common filter-then-join columns. Crop summary is Z-Ordered on (commodity, year, state_name). Food safety dashboard is Z-Ordered on (recall_year, recall_quarter, recall_class). These match the most common slicer combinations in Power BI.
Use performance flags to drive dashboard interactivity. The performance_flag column in gold_usda_crop_summary (TOP_PRODUCER, DECLINING_YIELD, EXPANDING_ACREAGE) enables conditional formatting and drill-through actions in Power BI without complex DAX calculations.
Monitor the food safety severity score as an early warning indicator. A rising severity_weighted_score trend in gold_usda_food_safety_dashboard signals an increasing proportion of Class I recalls -- configure a Fabric Activator alert when the rolling 12-month average exceeds 2.5.
Cite USDA data sources in all published reports. Federal statistical data carries citation requirements. Include "Source: USDA-NASS QuickStats" and "Source: USDA-FSIS Recall Archive" in dashboard footers and exported reports.
Handle NASS suppressed values explicitly. Real NASS data uses (D) to indicate data suppressed to avoid disclosing individual farm operations. In the Bronze layer, these values will be read as strings. In Silver, cast them to null and flag with a DQ flag (SUPPRESSED_VALUE).
Keep food safety establishment numbers out of public dashboards. While FSIS recall data is public, aggregating recall frequency by establishment number in a public-facing dashboard could create legal risk. Use establishment data for internal analysis only.
Run the full pipeline on a schedule during growing season. If using real NASS data, configure a Fabric pipeline to refresh crop data monthly during the growing season (April-October) when NASS publishes frequent survey updates.

✅ Summary¶

Congratulations! You have built a federal agricultural analytics pipeline in Microsoft Fabric that processes USDA crop production and food safety data through a complete medallion architecture.

What You Accomplished¶

Configured a Fabric workspace for federal agricultural data analytics with appropriate classification and governance
Generated or downloaded USDA data spanning two domains: NASS crop production statistics and FSIS food safety recall records
Built a Bronze ingestion pipeline with explicit schema enforcement, critical field null checks, and ingestion metadata tracking
Transformed data through Silver layer processing with FIPS code validation, commodity standardization, recall class validation, deduplication, and data quality scoring (0-100)
Created four Gold layer analytics tables: crop summary with YoY trends and performance flags, state agriculture profiles with diversity metrics, a food safety dashboard with severity-weighted scoring and rolling trends, and a cross-domain executive summary with a food safety index
Built a Power BI Direct Lake dashboard with crop production maps, commodity trend lines, recall monitoring visualizations, and executive scorecards
Implemented Great Expectations data quality suites for both USDA data domains
Configured Microsoft Purview governance with federal data classification (CIPSEA, FSMA) and lineage tracking

Key Takeaways¶

Concept	Key Point
Federal Open Data	USDA datasets are publicly available under federal open data policies; the QuickStats API requires a free key while FSIS data is freely downloadable
Dual-Domain Pipeline	A single pipeline architecture handles both structured statistical data (crop production) and event-based data (food safety recalls) through domain-specific validation rules
Data Quality Scoring	Each Silver record receives a 0-100 DQ score based on domain-specific validation rules, enabling quality-aware analytics in Gold and Power BI
Performance Flags	Gold layer flags (TOP_PRODUCER, DECLINING_YIELD, EXPANDING_ACREAGE) turn raw aggregations into actionable insights without complex DAX
Food Safety Index	The inverse Class I recall rate (0-100) provides a single-number food safety health metric for executive dashboards
Federal Compliance	CIPSEA protects individual farm-level data; FSMA requires 7-year retention and transparency for recall records

🚀 Next Steps¶

Continue your learning journey:

Next Tutorial: Tutorial 33: SBA Small Business Analytics -- Build small business lending analytics pipelines with SBA PPP loan data, 7(a) loan programs, and NAICS industry analysis.

Related Tutorials: - Tutorial 01: Bronze Layer -- Foundational Bronze layer patterns used in this tutorial - Tutorial 02: Silver Layer -- Silver layer validation and standardization patterns - Tutorial 03: Gold Layer -- Gold layer aggregation and KPI computation patterns - Tutorial 05: Direct Lake & Power BI -- Deep dive into Direct Lake semantic models and Power BI connectivity - Tutorial 07: Governance & Purview -- Advanced Purview configuration for federal data classification - Tutorial 34: NOAA Weather & Climate Analytics -- Federal weather data analytics with potential cross-domain joins to crop production

📚 Resources¶

Resource	Link
USDA Data Portal	usda.gov/topics/data
NASS QuickStats API	quickstats.nass.usda.gov/api
FSIS Recall Archive	fsis.usda.gov/recalls
CIPSEA (Public Law 107-347)	bjs.gov/cipsea
FSMA Overview	fda.gov/fsma
Fabric Direct Lake	Microsoft Learn
Bronze Notebook	`notebooks/bronze/12_bronze_usda.py`
Silver Notebook	`notebooks/silver/12_silver_usda.py`
Gold Notebook	`notebooks/gold/12_gold_usda_analytics.py`
Data Generator	`data_generation/generators/federal/usda_generator.py`
Open Data Download	`data_generation/open_data/usda_download.py`
Dataset Config	`data_generation/config/federal_datasets.yaml`
Unit Tests	`validation/unit_tests/federal/test_usda_generator.py`

Previous	Up	Next
⬅️ 31-Federal DOT/FAA Analytics	📖 Tutorials Index	33-SBA Small Business Analytics ➡️

Questions or issues? Open an issue in the GitHub repository

Tutorial 32 of 36 in the Microsoft Fabric Casino POC Series

← Previous31 — Federal DOT/FAARead more →Next →33 — SBA Small BusinessRead more →

🌾 Tutorial 32: USDA Agriculture Analytics¶

🌾 Tutorial 32: USDA Agriculture Analytics¶

📊 Progress Tracker¶

📖 Overview¶

🎯 Learning Objectives¶

🏗️ Architecture Diagram¶

📋 Prerequisites¶

🛠️ Step 1: Data Source Setup¶

Option A: Synthetic Data Generation¶

Option B: Real Open Data Download¶

Step B.1: Obtain a NASS API Key¶

Step B.2: Run the Download Script¶

Step B.3: Verify Downloaded Data¶

Verification Checkpoint¶

🛠️ Step 2: Bronze Layer Ingestion¶

2.1 Upload Data to Fabric Lakehouse¶

2.2 Run the Bronze Ingestion Notebook¶

2.3 Verify Bronze Ingestion¶

🛠️ Step 3: Silver Layer Transformation¶

3.1 Run the Silver Transformation Notebook¶

3.2 Crop Production Transformations¶

3.3 Food Safety Transformations¶

3.4 Verify Silver Transformations¶

🛠️ Step 4: Gold Layer Analytics¶

4.1 Run the Gold Analytics Notebook¶

4.2 Table 1: Crop Summary (gold_usda_crop_summary)¶

4.3 Table 2: State Agriculture Profile (gold_usda_state_agriculture)¶

4.4 Table 3: Food Safety Dashboard (gold_usda_food_safety_dashboard)¶

4.5 Table 4: Executive Summary (gold_usda_executive_summary)¶

4.6 Verify Gold Layer¶

🛠️ Step 5: Power BI Dashboard¶

5.1 Create the Semantic Model¶

5.2 Define Relationships¶

5.3 DAX Measures¶

5.4 Suggested Dashboard Layout¶

5.5 Verify Dashboard¶

🛠️ Step 6: Data Quality with Great Expectations¶

6.1 Crop Production Expectations¶

6.2 Food Safety Expectations¶

6.3 Run Validation¶

6.4 Verify Data Quality¶

🛠️ Step 7: Governance & Compliance¶

7.1 Purview Classification for Agricultural Data¶

7.2 CIPSEA Compliance (Crop Production)¶

7.3 FSMA Compliance (Food Safety)¶

7.4 Data Retention Configuration¶

7.5 Purview Lineage Verification¶

🔧 Troubleshooting¶

📋 Best Practices¶

✅ Summary¶

What You Accomplished¶

Key Takeaways¶

🚀 Next Steps¶

📚 Resources¶

🧭 Navigation¶

4.2 Table 1: Crop Summary (`gold_usda_crop_summary`)¶

4.3 Table 2: State Agriculture Profile (`gold_usda_state_agriculture`)¶

4.4 Table 3: Food Safety Dashboard (`gold_usda_food_safety_dashboard`)¶

4.5 Table 4: Executive Summary (`gold_usda_executive_summary`)¶