Data Quality Migration Guide: IDQ to Great Expectations + dbt Tests + Purview¶
A comprehensive guide for migrating Informatica Data Quality (IDQ) profiles, scorecards, rules, and standardization to Azure-native data quality tools.
Overview¶
Informatica Data Quality (IDQ) is a mature data quality platform offering profiling, standardization, matching, address validation, scorecards, and exception management. The Azure-native replacement is a combination of purpose-built tools, each handling a specific aspect of data quality:
| IDQ capability | Azure replacement | Role |
|---|---|---|
| Data profiling | Purview data profiling + Great Expectations profiler | Automated column statistics and distribution analysis |
| Data quality rules | dbt tests (schema + custom) + Great Expectations | Rule-as-code with CI/CD integration |
| Scorecards | dbt test results + Power BI dashboard | Custom dashboards on test metadata |
| Standardization | dbt models (SQL-based cleansing) | SQL transformations for format normalization |
| Address validation | Azure Maps API or third-party (Melissa, SmartyStreets) | API-based validation |
| Duplicate detection | dbt dedup models + Azure ML (fuzzy matching) | SQL-based or ML-based |
| Reference data management | dbt seeds + Azure SQL reference tables | CSV-based or database-based reference data |
| Exception management | dbt test failures + Power Automate workflows | Automated exception routing |
Architecture comparison¶
graph TB
subgraph "IDQ Architecture"
IDQ_Profile[Data Profile] --> IDQ_Score[Scorecard]
IDQ_Rule[DQ Rules] --> IDQ_Score
IDQ_Std[Standardization] --> IDQ_Rule
IDQ_Addr[Address Validation] --> IDQ_Rule
IDQ_Match[Match/Merge] --> IDQ_Rule
IDQ_Exc[Exception Manager] --> IDQ_Score
end
subgraph "Azure Architecture"
Purview[Purview Profiling] --> PBI[Power BI DQ Dashboard]
dbt_Test[dbt Tests] --> PBI
GE[Great Expectations] --> PBI
dbt_Model[dbt Cleansing Models] --> dbt_Test
AzMaps[Azure Maps API] --> dbt_Model
dbt_Dedup[dbt Dedup Models] --> dbt_Test
PA[Power Automate] --> PBI
end
IDQ profiles to Purview + Great Expectations¶
What IDQ profiling provides¶
IDQ profiles analyze source data to produce:
- Column-level statistics (null count, unique count, min, max, mean, standard deviation)
- Value frequency distributions
- Pattern analysis (format detection)
- Data type inference
- Cross-column dependency analysis
Purview data profiling¶
Purview automated scanning includes profiling capabilities:
Purview Scan -> Data Map -> Column Statistics
- Null percentage
- Distinct value count
- Min / Max values
- Value distribution (sample)
- Data classification (PII, financial, health)
Setup: Configure a Purview scan rule set to enable profiling:
- Navigate to Purview Studio -> Data Map -> Scan rule sets
- Create a custom scan rule set with profiling enabled
- Assign the rule set to your data source scan
- Schedule the scan (daily or weekly)
Great Expectations profiling¶
For deeper profiling beyond Purview's capabilities, use Great Expectations:
# Generate an automated expectation suite from a data profile
# (Great Expectations fluent datasource API; adjust calls for your GX version)
import great_expectations as gx

context = gx.get_context()

# Register the Azure SQL source and the table to profile
datasource = context.sources.add_or_update_sql(
    name="azure_sql",
    connection_string="mssql+pyodbc://...",
)
asset = datasource.add_table_asset(name="stg_customers", table_name="stg_customers")

# Profile the table with the onboarding data assistant
result = context.assistants.onboarding.run(
    batch_request=asset.build_batch_request()
)
suite = result.get_expectation_suite(expectation_suite_name="stg_customers_profiled")
# Generates expectations like:
# - expect_column_values_to_not_be_null (for columns with 0% null in profile)
# - expect_column_values_to_be_unique (for columns with 100% distinct)
# - expect_column_values_to_be_between (for numeric columns with min/max)
# - expect_column_value_lengths_to_be_between (for string columns)
Migration approach¶
| IDQ profile element | Azure equivalent | Implementation |
|---|---|---|
| Column null analysis | Purview profiling + dbt not_null test | Automatic with Purview scan; codified with dbt |
| Unique analysis | Purview distinct count + dbt unique test | |
| Value distribution | Great Expectations expect_column_distinct_values_to_be_in_set | |
| Pattern analysis | Great Expectations expect_column_values_to_match_regex | |
| Data type validation | dbt schema tests + cast() in models | |
| Cross-column dependency | Custom dbt test or Great Expectations expect_column_pair_values_* expectations | See the sketch below |
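The cross-column dependency row can be covered by a custom dbt test. A minimal sketch, assuming an illustrative stg_addresses model with postal_code and state columns:
-- tests/assert_postal_code_maps_to_single_state.sql
-- Cross-column dependency check: each postal_code should map to exactly one state.
-- The test fails if any postal_code appears with more than one state value.
SELECT
    postal_code,
    COUNT(DISTINCT state) AS state_count
FROM {{ ref('stg_addresses') }}
GROUP BY postal_code
HAVING COUNT(DISTINCT state) > 1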
IDQ scorecards to dbt test results + Power BI¶
What IDQ scorecards provide¶
IDQ scorecards aggregate data quality metrics into visual dashboards:
- Quality score per table, column, or business domain
- Trend analysis (quality over time)
- Drill-down from score to failing records
- Threshold-based alerts (quality drops below X%)
Building the Azure equivalent¶
Step 1: dbt tests produce structured results
Every dbt test run writes structured results to dbt's artifacts (run_results.json, with test definitions in manifest.json):
# models/staging/_stg__models.yml
version: 2
models:
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: email
tests:
- not_null
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"
- name: phone
tests:
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^\\+?[1-9]\\d{1,14}$"
config:
severity: warn
- name: status
tests:
- accepted_values:
values: ["active", "inactive", "pending", "suspended"]
- name: created_date
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: "'2010-01-01'"
max_value: "CURRENT_DATE"
Step 2: Store test results for trending
Create an elementary-compatible results table or use dbt artifacts:
-- models/meta/fct_dbt_test_results.sql
-- This model reads dbt test results from the artifacts table
-- Populated by dbt Cloud's built-in artifact storage or elementary package
{{ config(materialized='incremental', unique_key='test_execution_id') }}
SELECT
test_execution_id,
test_name,
model_name,
column_name,
test_status, -- 'pass', 'fail', 'warn', 'error'
failures, -- count of failing rows
rows_tested, -- total rows tested
CAST(failures AS FLOAT) / NULLIF(rows_tested, 0) AS failure_rate,
1.0 - (CAST(failures AS FLOAT) / NULLIF(rows_tested, 0)) AS quality_score,
executed_at
FROM {{ ref('elementary_test_results') }}
{% if is_incremental() %}
WHERE executed_at > (SELECT MAX(executed_at) FROM {{ this }})
{% endif %}
Step 3: Power BI scorecard dashboard
Build a Power BI report on fct_dbt_test_results:
| Dashboard component | Power BI implementation |
|---|---|
| Overall quality score | Card visual: AVG(quality_score) with conditional formatting |
| Quality trend | Line chart: quality_score by executed_at (daily) |
| Quality by domain | Bar chart: AVG(quality_score) grouped by schema/model |
| Failing tests detail | Table: test_name, model_name, failures, failure_rate |
| Threshold alerts | Data Activator or Power Automate trigger on score < threshold |
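If the report needs a pre-aggregated source, a small rollup model over fct_dbt_test_results can feed the trend and domain visuals. A minimal sketch (the model name is illustrative):
-- models/meta/fct_dbt_quality_daily.sql
-- Daily quality score per model, consumed by the Power BI trend and domain visuals
SELECT
    model_name,
    CAST(executed_at AS DATE) AS run_date,
    AVG(quality_score) AS avg_quality_score,
    SUM(failures) AS total_failures
FROM {{ ref('fct_dbt_test_results') }}
GROUP BY model_name, CAST(executed_at AS DATE)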
IDQ rules to dbt tests¶
Rule type mapping¶
| IDQ rule type | dbt test equivalent | Example |
|---|---|---|
| Not null | not_null (built-in) | tests: [not_null] |
| Unique | unique (built-in) | tests: [unique] |
| Referential integrity | relationships (built-in) | tests: [relationships: {to: ref('dim_customer'), field: customer_id}] |
| Accepted values | accepted_values (built-in) | tests: [accepted_values: {values: ['A', 'B', 'C']}] |
| Range check | dbt_expectations.expect_column_values_to_be_between | min_value / max_value |
| Pattern match (regex) | dbt_expectations.expect_column_values_to_match_regex | regex pattern |
| Length check | dbt_expectations.expect_column_value_lengths_to_be_between | min_length / max_length |
| Conditional rule | Custom dbt test (SQL) | See below |
| Cross-column rule | Custom dbt test (SQL) | See below |
| Aggregate rule | Custom dbt test (SQL) | See below |
| Freshness check | dbt source freshness | freshness: {warn_after: {count: 24, period: hour}} |
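The freshness row is configured on the source definition rather than as a column test. A minimal sketch (source, table, and column names are illustrative):
# models/staging/crm/_crm__sources.yml
version: 2
sources:
  - name: crm
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: orders
Run it with dbt source freshness as part of the scheduled test job.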
Custom dbt test example¶
IDQ custom rule: "If order_status = 'shipped', then ship_date must not be null"
-- tests/assert_shipped_orders_have_ship_date.sql
SELECT
order_id,
order_status,
ship_date
FROM {{ ref('stg_orders') }}
WHERE order_status = 'shipped'
AND ship_date IS NULL
dbt custom tests return the failing rows. If zero rows are returned, the test passes.
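The aggregate-rule row in the mapping table follows the same failing-rows pattern. A minimal sketch, assuming illustrative stg_orders and stg_order_lines models:
-- tests/assert_order_totals_match_line_items.sql
-- Aggregate rule: an order's stored total must equal the sum of its line items
SELECT
    o.order_id,
    o.order_total,
    SUM(l.line_amount) AS line_item_total
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_order_lines') }} l
    ON o.order_id = l.order_id
GROUP BY o.order_id, o.order_total
HAVING ABS(o.order_total - SUM(l.line_amount)) > 0.01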
Great Expectations for complex rules¶
For rules that are complex to express in SQL or need statistical validation:
# Great Expectations suite for customer data (class-based expectation API)
import great_expectations as gx

expectations = [
# Column-level expectations
gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id"),
gx.expectations.ExpectColumnValuesToBeUnique(column="customer_id"),
# Distribution expectations (statistical)
gx.expectations.ExpectColumnMeanToBeBetween(
column="order_amount", min_value=10, max_value=10000
),
gx.expectations.ExpectColumnStdevToBeBetween(
column="order_amount", min_value=5, max_value=5000
),
# Multi-column expectations
gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
column_A="end_date", column_B="start_date"
),
# Row count expectations
gx.expectations.ExpectTableRowCountToBeBetween(
min_value=1000, max_value=10000000
),
]
Standardization migration¶
IDQ standardization to dbt models¶
IDQ standardization transforms include case normalization, whitespace trimming, format standardization, and abbreviation expansion. These translate directly to SQL in dbt models:
| IDQ standardization | dbt SQL equivalent | Notes |
|---|---|---|
| Upper case | UPPER(column_name) | |
| Lower case | LOWER(column_name) | |
| Proper case | CONCAT(UPPER(LEFT(column_name, 1)), LOWER(SUBSTRING(column_name, 2, LEN(column_name)))) | |
| Trim whitespace | TRIM(column_name) or LTRIM(RTRIM(column_name)) | |
| Remove special characters | REPLACE(REPLACE(column_name, '-', ''), '.', '') | |
| Phone format | Custom macro with regex | See example below |
| Date format | CONVERT(DATE, column_name, format_code) or TRY_CAST | |
| Gender standardization | CASE statement | See example below |
| State abbreviation | dbt seed + JOIN | CSV of state names to abbreviations |
Standardization macro example¶
-- macros/standardize_phone.sql
{% macro standardize_phone(phone_column) %}
CASE
WHEN {{ phone_column }} IS NULL THEN NULL
WHEN LEN(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
{{ phone_column }}, '-', ''), '(', ''), ')', ''), ' ', ''), '+', '')) = 10
THEN CONCAT('+1',
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
{{ phone_column }}, '-', ''), '(', ''), ')', ''), ' ', ''), '+', ''))
WHEN LEN(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
{{ phone_column }}, '-', ''), '(', ''), ')', ''), ' ', ''), '+', '')) = 11
THEN CONCAT('+',
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
{{ phone_column }}, '-', ''), '(', ''), ')', ''), ' ', ''), '+', ''))
ELSE {{ phone_column }}
END
{% endmacro %}
Standardization model example¶
-- models/staging/crm/stg_crm__customers.sql
SELECT
customer_id,
TRIM(UPPER(first_name)) AS first_name,
TRIM(UPPER(last_name)) AS last_name,
LOWER(TRIM(email)) AS email,
{{ standardize_phone('phone') }} AS phone,
CASE
WHEN UPPER(TRIM(gender)) IN ('M', 'MALE', 'MAN') THEN 'M'
WHEN UPPER(TRIM(gender)) IN ('F', 'FEMALE', 'WOMAN') THEN 'F'
ELSE 'U'
END AS gender,
TRY_CAST(birth_date AS DATE) AS birth_date,
s.state_abbrev AS state_code
FROM {{ source('crm', 'customers') }} c
LEFT JOIN {{ ref('seed_state_abbreviations') }} s
ON UPPER(TRIM(c.state_name)) = UPPER(s.state_name)
Address validation migration¶
IDQ address validation¶
IDQ includes built-in address validation (Informatica Address Verification, formerly AddressDoctor) that validates, standardizes, and geocodes addresses. This is one of the few IDQ capabilities that requires a third-party replacement.
Azure-native options¶
| Option | Capability | Cost | Notes |
|---|---|---|---|
| Azure Maps (Geocoding API) | Address validation + geocoding | Consumption-based (~$4.50/1000 requests) | Good for US/international; limited standardization |
| Melissa (Data Quality APIs) | Full address validation + standardization | Subscription ($5K-$50K/year) | Closest IDQ equivalent; USPS CASS certified |
| SmartyStreets | US address validation | Subscription ($3K-$20K/year) | US-focused; USPS CASS certified |
| Google Maps Geocoding | Address validation + geocoding | Consumption-based ($5/1000 requests) | Broad international coverage |
| USPS Web Tools (free) | US address validation | Free (USPS registration) | Limited to USPS addresses; rate-limited |
Integration pattern¶
-- Step 1: dbt model identifies addresses needing validation
-- models/staging/stg_addresses__pending_validation.sql
SELECT
address_id,
street_line_1,
street_line_2,
city,
state,
postal_code,
country
FROM {{ source('crm', 'addresses') }}
WHERE validated_at IS NULL
OR validated_at < DATEADD(month, -6, GETDATE())
# Step 2: Azure Function calls validation API
# Triggered by ADF pipeline after dbt staging model runs
import os

import azure.functions as func
import requests
def validate_address(address):
response = requests.get(
"https://atlas.microsoft.com/search/address/json",
params={
"api-version": "1.0",
"subscription-key": os.environ["AZURE_MAPS_KEY"],
"query": f"{address['street_line_1']}, {address['city']}, {address['state']} {address['postal_code']}",
"countrySet": address.get("country", "US"),
}
)
result = response.json()
if result["results"]:
top = result["results"][0]
return {
"validated_street": top["address"]["streetName"],
"validated_city": top["address"]["municipality"],
"validated_state": top["address"]["countrySubdivision"],
"validated_postal": top["address"]["postalCode"],
"latitude": top["position"]["lat"],
"longitude": top["position"]["lon"],
"confidence_score": top["score"],
}
return None
-- Step 3: dbt model joins validated addresses
-- models/intermediate/int_addresses__validated.sql
SELECT
a.address_id,
COALESCE(v.validated_street, a.street_line_1) AS street_line_1,
a.street_line_2,
COALESCE(v.validated_city, a.city) AS city,
COALESCE(v.validated_state, a.state) AS state,
COALESCE(v.validated_postal, a.postal_code) AS postal_code,
v.latitude,
v.longitude,
v.confidence_score,
CASE WHEN v.confidence_score >= 0.8 THEN 'validated'
WHEN v.confidence_score >= 0.5 THEN 'partial'
ELSE 'unvalidated'
END AS validation_status
FROM {{ ref('stg_addresses__pending_validation') }} a
LEFT JOIN {{ source('validation', 'address_validation_results') }} v
ON a.address_id = v.address_id
Data profiling migration¶
IDQ profiling to Purview scanning + profiling¶
| IDQ profiling feature | Purview equivalent | Setup |
|---|---|---|
| Column statistics | Purview scan with profiling enabled | Enable in scan rule set |
| Null percentage | Purview column-level null count | Automatic with profiling |
| Distinct value count | Purview distinct count | Automatic with profiling |
| Value distribution (top N) | Purview sample values | Automatic with profiling |
| Pattern detection | Purview classification rules | Built-in classifiers for PII, dates, emails, etc. |
| Custom profiling rules | Great Expectations custom expectations | Python-based for complex rules |
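For the last row, Great Expectations can codify profile-derived thresholds that Purview does not enforce. A minimal sketch using built-in expectations (column names and thresholds are illustrative):
# Profile-style checks beyond Purview's built-in statistics
import great_expectations as gx

profile_expectations = [
    # at least 95% of customer_id values should be distinct
    gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
        column="customer_id", min_value=0.95
    ),
    # postal codes should follow a fixed format
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="postal_code", regex=r"^\d{5}(-\d{4})?$"
    ),
]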
Setting up Purview profiling¶
- Register data sources in Purview Data Map
- Create scan rule set with profiling enabled:
- Navigate to Management -> Scan rule sets -> New
- Enable "Data profiling" checkbox
- Select classification rules to apply
- Create and run scan against registered sources
- View results in Purview Data Catalog -> asset details -> Schema tab
Exception management migration¶
IDQ exception management¶
IDQ routes data quality exceptions to exception tables where data stewards review and resolve issues. The Azure equivalent:
| IDQ exception feature | Azure equivalent |
|---|---|
| Exception table | dbt test failure output table |
| Exception routing | Power Automate flow triggered on test failure |
| Steward assignment | Power Automate -> Teams/email notification |
| Resolution workflow | Power Apps form for steward review |
| Exception metrics | Power BI dashboard on exception table |
Implementation pattern¶
Step 1: Configure dbt to store test failures:
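A minimal sketch of the dbt_project.yml configuration (the dq_exceptions schema suffix is an assumption):
# dbt_project.yml
tests:
  +store_failures: true     # write the failing rows of each test to a table
  +schema: dq_exceptions    # suffix for the audit schema that holds those tables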
This creates a table for each failing test containing the failing rows.
Step 2: Create a Power Automate flow:
Trigger: Scheduled (after dbt test run)
Action 1: Query dq_exceptions schema for new failures
Action 2: If failures exist:
- Create Teams message with failure summary
- Assign to data steward based on schema/model
- Log exception to tracking table
Action 3: Update exception dashboard data
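Action 1 can be a plain warehouse query against the results model from the scorecard section, which identifies the failed tests; the failing rows themselves sit in the dq_exceptions tables. A sketch (the analytics schema name is an assumption):
-- Pull failures from the most recent runs for exception routing
SELECT test_name, model_name, column_name, failures, executed_at
FROM analytics.fct_dbt_test_results
WHERE test_status IN ('fail', 'error')
  AND executed_at >= CAST(GETDATE() AS DATE)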
Step 3: Build Power Apps form for steward review:
- View failing records from the dq_exceptions tables
- Mark records as "resolved", "accepted exception", or "false positive"
- Record resolution notes and timestamp
Migration execution timeline¶
| Phase | Duration | Activities |
|---|---|---|
| 1. Inventory IDQ rules | 2-3 weeks | Export all profiles, rules, scorecards; categorize by complexity |
| 2. Set up dbt tests | 2-3 weeks | Implement schema tests, custom tests, Great Expectations suites |
| 3. Build scorecard dashboard | 1-2 weeks | Power BI report on dbt test results |
| 4. Migrate standardization | 2-4 weeks | Convert IDQ standardization to dbt models + macros |
| 5. Set up address validation | 2-3 weeks | Integrate Azure Maps or third-party API |
| 6. Parallel run | 4-6 weeks | Run IDQ and dbt tests in parallel; reconcile results |
| 7. Cutover | 1-2 weeks | Decommission IDQ; full reliance on dbt tests + GE |
Related resources¶
- MDM Migration Guide -- For match/merge and entity resolution
- Complete Feature Mapping -- All IDQ features mapped
- Tutorial: Mapping to dbt -- Includes test creation
- Best Practices -- DQ rule discovery guidance
- Migration Playbook -- End-to-end migration guide
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team