Great Expectations Tutorial (CSA-0074)¶
Tip
TL;DR - Stand up a Great Expectations 1.x project, write an expectation suite against ADLS Gen2 Bronze/Silver Delta tables, run a checkpoint as a PR gate, and plug the result into the CSA-in-a-Box governance runner (csa_platform/governance/dataquality/ge_runner.py). A fully runnable example lives at csa_platform/governance/dataquality/ge_example/.
Great Expectations (GE) is the data-quality backbone of CSA-in-a-Box. The platform's governance runner already has a stable interface for executing expectation suites; this tutorial takes you from empty folder to green PR in eight numbered steps.
Table of Contents¶
- 0. Prerequisites
- 1. Install and verify
- 2. Connect to ADLS Gen2 Bronze and Silver
- 3. Author an Expectation Suite
- 4. Create a Checkpoint
- 5. Run via CLI and the programmatic API
- 6. Read the Data Docs
- 7. Integrate with the CSA governance runner
- 8. Wire it into CI as a PR gate
- 9. Related
0. Prerequisites¶
- Python 3.10+
- An Azure subscription with ADLS Gen2 (you can follow along locally with Pandas - a concrete ADLS snippet is included at the end of each step).
- Access to the CSA-in-a-Box repo. From the repo root:

  pip install -e ".[tutorials]"

  The tutorials extra pins great-expectations>=1.0.0. GE is intentionally not a core dependency of the repo - it is ~200 MB installed and only the data-quality surface needs it.
- The runnable companion example for this tutorial lives at csa_platform/governance/dataquality/ge_example/. Every snippet in this tutorial is exercised by ge_example/tests/test_ge_demo.py.
1. Install and verify¶
- Create the project directory.

  mkdir -p csa_platform/governance/dataquality/ge_example
  cd csa_platform/governance/dataquality/ge_example

  (The repo already contains this folder. Re-run the next step in a scratch directory if you prefer to start from empty.)
- Install GE via the tutorials extra.

  pip install -e ".[tutorials]"

- Point GE at the config shipped with the example.

  import great_expectations as gx

  context = gx.get_context(project_root_dir="csa_platform/governance/dataquality/ge_example")
  print(type(context).__name__)  # -> FileDataContext

  Using the shipped great_expectations.yml rather than the default gx init output keeps the CSA governance runner happy: stores, checkpoints, and data_docs all land under predictable paths.
2. Connect to ADLS Gen2 Bronze and Silver¶
Great Expectations 1.x supports two realistic paths for CSA-in-a-Box: Spark for production jobs and Pandas for fast local iteration.
2a. Spark datasource (Databricks)¶
Inside a Databricks notebook (where spark is pre-bound):
import great_expectations as gx
context = gx.get_context(project_root_dir="<repo>/csa_platform/governance/dataquality/ge_example")
data_source = context.data_sources.add_spark(name="adls_bronze")
data_asset = data_source.add_dataframe_asset(name="noaa_observations_bronze")
batch_def = data_asset.add_batch_definition_whole_dataframe(name="whole_df")
# ADLS Gen2 Delta table read through Unity Catalog:
bronze_df = spark.table("csa_bronze.noaa.observations")
# The DataFrame is passed at checkpoint run time via batch_parameters.
2b. Pandas-on-Parquet (local dev / CI smoke tests)¶
When you want a fast feedback loop on your laptop:
import pandas as pd
import great_expectations as gx
context = gx.get_context(mode="ephemeral")
df = pd.read_parquet("data/seed/observations_bronze.parquet")
data_source = context.data_sources.add_pandas(name="csa_local")
data_asset = data_source.add_dataframe_asset(name="noaa_observations_bronze")
batch_def = data_asset.add_batch_definition_whole_dataframe(name="whole_df")
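Before authoring a full suite, you can sanity-check a single expectation directly against a batch. A minimal sketch using the Pandas datasource above:

from great_expectations import expectations as gxe

# Resolve a concrete batch from the batch definition, then validate one
# expectation against it - handy for a fast feedback loop.
batch = batch_def.get_batch(batch_parameters={"dataframe": df})
result = batch.validate(gxe.ExpectColumnValuesToNotBeNull(column="station_id"))
print(result.success)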
Silver tables reuse the same pattern - just bind a different DataFrame. Referential-integrity expectations (e.g. every Silver station_id exists in the station catalog) are authored in a second suite that joins the two in Spark before validation.
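A hedged sketch of that pre-join (the table names are illustrative, not the repo's confirmed catalog paths; left_anti keeps only orphaned observations):

# Illustrative table names - substitute your own Unity Catalog paths.
silver_df = spark.table("csa_silver.noaa.observations")
stations_df = spark.table("csa_silver.noaa.station_catalog")

# Rows whose station_id has no match in the catalog. An expectation such as
# ExpectTableRowCountToEqual(value=0) on this DataFrame enforces the constraint.
orphans = silver_df.join(stations_df, on="station_id", how="left_anti")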
3. Author an Expectation Suite¶
The shipped suite covers four categories:
- Column types - expect_column_to_exist for each required column.
- Null rates - expect_column_values_to_not_be_null for the primary key.
- Value domains - expect_column_values_to_be_in_set for enumerations, expect_column_values_to_be_between for numeric ranges.
- Referential integrity - enforced by a companion suite that joins the observations to the station catalog (authored in the same folder).
Build the suite programmatically¶
import great_expectations as gx
from great_expectations import expectations as gxe
suite = gx.ExpectationSuite(name="noaa_observations_suite")
# Shape
suite.add_expectation(gxe.ExpectTableRowCountToBeBetween(min_value=1, max_value=100_000))
suite.add_expectation(gxe.ExpectColumnToExist(column="station_id"))
suite.add_expectation(gxe.ExpectColumnToExist(column="observation_datetime"))
suite.add_expectation(gxe.ExpectColumnToExist(column="air_temperature_c"))
# Null rates
suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="station_id"))
suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="observation_datetime"))
# Domains and ranges (mirrors examples/noaa/contracts/ocean-buoys.yaml)
suite.add_expectation(gxe.ExpectColumnValuesToBeInSet(column="station_type", value_set=["Buoy", "C-MAN"]))
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(column="latitude", min_value=-90, max_value=90))
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(column="longitude", min_value=-180, max_value=180))
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(column="air_temperature_c", min_value=-60, max_value=60, mostly=0.99))
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(column="pressure_hpa", min_value=870, max_value=1084, mostly=0.99))
context.suites.add(suite)
Or load from JSON¶
CSA-in-a-Box ships expectation suites as JSON artifacts so they are diff-friendly in code review. See ge_demo.py::_build_suite for the exact load-from-JSON pattern the tests exercise.
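If you want to prototype that pattern outside the repo, here is a minimal sketch. The JSON layout ({"name": ..., "expectations": [{"type": ..., "kwargs": ...}]}) is illustrative only - the shipped artifact's exact schema is whatever _build_suite reads:

import json

import great_expectations as gx
from great_expectations import expectations as gxe

# Illustrative JSON layout; the shipped artifact's schema is defined by ge_demo.py.
with open("suites/noaa_observations_suite.json") as f:
    spec = json.load(f)

suite = gx.ExpectationSuite(name=spec["name"])
for item in spec["expectations"]:
    # Look up the expectation class by name and rehydrate it from its kwargs.
    suite.add_expectation(getattr(gxe, item["type"])(**item["kwargs"]))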
4. Create a Checkpoint¶
A Checkpoint binds a batch definition to a suite and a set of actions (rebuild Data Docs, post a Slack notification, publish to a metrics endpoint, ...).
import great_expectations as gx
validation_definition = gx.ValidationDefinition(
data=batch_def,
suite=suite,
name="noaa_observations_validation",
)
context.validation_definitions.add(validation_definition)
checkpoint = gx.Checkpoint(
name="daily_quality",
validation_definitions=[validation_definition],
result_format={"result_format": "COMPLETE"},
)
context.checkpoints.add(checkpoint)
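The actions list mentioned above is where side effects hang off the run. As a minimal variant of the checkpoint just defined, GX's built-in UpdateDataDocsAction rebuilds Data Docs after every run:

from great_expectations.checkpoint import UpdateDataDocsAction

# Same checkpoint as above, plus an action that refreshes Data Docs on each run.
checkpoint = gx.Checkpoint(
    name="daily_quality",
    validation_definitions=[validation_definition],
    actions=[UpdateDataDocsAction(name="update_data_docs")],
    result_format={"result_format": "COMPLETE"},
)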
The declarative twin of this checkpoint is shipped at checkpoints/daily_quality.yml so operators can read the same intent from YAML.
5. Run via CLI and the programmatic API¶
Programmatic¶
result = checkpoint.run(batch_parameters={"dataframe": df})
assert result.success, "daily_quality checkpoint failed"
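When a run fails, the result object can be summarized in-line, without opening Data Docs:

# Print a human-readable, expectation-by-expectation summary for quick triage.
print(result.describe())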
From the repo CLI¶
python csa_platform/governance/dataquality/ge_example/ge_demo.py
# -> PASS expect_table_row_count_to_be_between
# PASS expect_column_to_exist ...
# Checkpoint: noaa_observations_suite rows=24 expectations=12/12 success=True
Exit code is 0 when every expectation passes and 1 otherwise, so the demo is usable as a pre-commit hook or a CI step out of the box.
One-off smoke test (no repo required)¶
from csa_platform.governance.dataquality.ge_example.ge_demo import run_demo
print(run_demo(rows=24, verbose=True))
# DemoResult(success=True, total_expectations=12, ...)
6. Read the Data Docs¶
Great Expectations' HTML Data Docs are the single most valuable diagnostic the toolchain ships. Build them with:
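# Rebuild every configured Data Docs site from the stored validation results.
context.build_data_docs()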
Data Docs land under csa_platform/governance/dataquality/ge_example/uncommitted/data_docs/local_site/index.html. Open that file in a browser to see:
- Suite overview (every expectation, current pass / fail status).
- Validation history (every checkpoint run, linked to the suite version).
- Profiling reports (distributions, null rates, uniqueness).
In production, the site_builder target is usually an Azure Static Web App or an ADLS public container - swap the store_backend under data_docs_sites.local_site in great_expectations.yml.
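A hedged sketch of such a swap, assuming GX's TupleAzureBlobStoreBackend and an environment-variable substitution for the connection string (the container name is illustrative):

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: \$web  # static-website container (illustrative)
      connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    site_index_builder:
      class_name: DefaultSiteIndexBuilder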
7. Integrate with the CSA governance runner¶
CSA-in-a-Box ships a runner at csa_platform/governance/dataquality/ge_runner.py that:
- Loads the declarative rules from quality-rules.yaml.
- Delegates to a real GE checkpoint when great_expectations is importable and sample data is not injected.
- Falls back to an in-memory evaluator for unit tests / CI where a Spark cluster is unavailable.
To wire the tutorial's checkpoint into the runner:
- Add a great_expectations.suites[] entry to quality-rules.yaml (the fallback evaluator will pick it up automatically).
- Drop the checkpoint YAML at the path indicated by GE_CHECKPOINT_DIR (default: csa_platform/great_expectations/checkpoints/).
- From a Databricks notebook with an active SparkSession, invoke the runner as sketched below.
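The snippet below is a hypothetical sketch - the real entry point and its signature live in ge_runner.py, and the helper name and parameters here (run_quality_rules, the spark keyword) are illustrative only:

# Hypothetical sketch: the actual runner API is defined in ge_runner.py.
from csa_platform.governance.dataquality import ge_runner

# Pass the active SparkSession so the runner can resolve Delta tables;
# the helper name and parameters below are illustrative, not confirmed.
results = ge_runner.run_quality_rules(spark=spark)
assert all(r.success for r in results)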
The runner reports every suite outcome through the shared structlog surface (csa_platform.governance.common.logging) so checkpoint runs show up alongside the rest of the governance telemetry in Log Analytics.
8. Wire it into CI as a PR gate¶
Add a single job to the existing governance workflow (or to a new data-quality.yml):
name: Data Quality
on:
pull_request:
paths:
- "csa_platform/governance/dataquality/**"
- "domains/**/models/**"
jobs:
checkpoint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install tutorials extra
run: pip install -e ".[tutorials]"
- name: Run checkpoint
run: |
python csa_platform/governance/dataquality/ge_example/ge_demo.py
- name: Run suite tests
run: pytest csa_platform/governance/dataquality/ge_example/tests/ -v
The job fails the PR when any expectation fails. Data Docs are built locally during the run and can be uploaded as an artifact for review.
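A hedged sketch of that upload step, using actions/upload-artifact@v4 (the path mirrors the local_site location from step 6):

      - name: Upload Data Docs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: data-docs
          path: csa_platform/governance/dataquality/ge_example/uncommitted/data_docs/local_site/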
9. Related¶
- csa_platform/governance/dataquality/ge_example/ - runnable example referenced throughout.
- csa_platform/governance/dataquality/ge_runner.py - platform checkpoint runner with fallback evaluator.
- docs/runbooks/dbt-ci.md - dbt PR gate this tutorial complements.
- examples/noaa/contracts/ocean-buoys.yaml - contract that the shipped expectation suite mirrors.
- Great Expectations 1.x docs.