SAS to Azure: Migration Best Practices

Audience: Migration Program Managers, Data Engineering Leads, Analytics Directors

Purpose: Workforce reskilling program, dual-running validation methodology, phased program migration strategy, output reconciliation framework, and CSA-in-a-Box as the unified analytics landing zone.

1. Workforce reskilling program

1.1 The reskilling imperative

SAS programmers are not being replaced; they are being upskilled. A SAS programmer who adds Python to their skill set becomes a more valuable analyst because they combine statistical methodology (from SAS training) with modern tooling (from Python). Frame the reskilling program as a career investment, not a displacement.

1.2 Reskilling curriculum

Phase 1: Python foundations (Weeks 1--2)

Topic	Hours	Objective	SAS programmer note
Python syntax and data types	8	Write basic Python scripts	`data _null_; put "Hello"; run;` becomes `print("Hello")`
pandas fundamentals	12	Read, filter, transform, merge DataFrames	Replaces DATA Step for most operations
numpy basics	4	Array operations, mathematical functions	Replaces SAS functions (mean, std, log, etc.)
Jupyter/Fabric notebooks	4	Interactive analysis workflow	Replaces SAS Enterprise Guide
Python functions and modules	8	Write reusable code	Replaces SAS macros
Total Phase 1	36 hours

Phase 2: Statistics and visualization (Weeks 3--4)

Topic	Hours	Objective	SAS programmer note
scipy.stats	8	Hypothesis tests, distributions	Replaces PROC FREQ (chi-square), PROC UNIVARIATE
statsmodels regression	12	Linear, logistic regression with diagnostics	Replaces PROC REG, PROC LOGISTIC
matplotlib and seaborn	8	Statistical graphics	Replaces PROC SGPLOT, SAS/GRAPH
plotly (interactive)	4	Interactive visualizations	Enhances SAS VA capabilities
pandas advanced (groupby, pivot, window)	8	Complex data manipulation	Replaces PROC MEANS, PROC TRANSPOSE, BY-group
Total Phase 2	40 hours

Phase 3: Platform and ML (Weeks 5--6)

Topic	Hours	Objective	SAS programmer note
PySpark fundamentals	12	Large-scale data processing	Replaces SAS CAS / SAS Grid for large datasets
scikit-learn	12	Machine learning pipelines	Replaces SAS Enterprise Miner
MLflow basics	8	Experiment tracking, model registry	Replaces SAS Model Manager
dbt fundamentals	8	SQL transformations, testing	Replaces SAS DI Studio
Azure ML workspace	4	Cloud ML platform	New capability
Fabric notebooks	4	Fabric-specific notebook features	Replaces SAS Studio
Total Phase 3	48 hours

Phase 4: Applied migration (Weeks 7--8)

Topic	Hours	Objective	SAS programmer note
Convert 3 real SAS programs	20	Hands-on migration practice	Use analyst's own programs
Output validation	8	Compare SAS and Python outputs	Critical skill for dual-running
Power BI fundamentals	8	Basic report building	Replaces SAS VA
Code review and collaboration	4	Git, pull requests, code standards	New workflow for most SAS teams
Total Phase 4	40 hours

Total program: 164 hours (~4 weeks full-time or 8 weeks half-time)

1.3 Training delivery recommendations

Delivery method	Best for	Cost per person	Notes
Instructor-led (in-person)	Groups of 10--20	$3K--$5K	Highest engagement; allows real-time Q&A
Instructor-led (virtual)	Distributed teams	$2K--$4K	Effective if well-structured; record sessions
Self-paced online	Individual learners	$500--$1K	Coursera, DataCamp, LinkedIn Learning
Paired programming	Post-training reinforcement	Internal cost only	Pair SAS programmer with Python mentor
Hackathon / conversion sprint	Post-training application	Internal cost only	Convert real SAS programs in a team setting

1.4 Recommended learning resources

Resource	Cost	Level	Notes
DataCamp "Python for SAS Users"	$300/year	Beginner	Specifically designed for SAS-to-Python transition
Coursera "Python for Data Science" (UMich)	$50/month	Beginner	Comprehensive; includes pandas and sklearn
"Python for SAS Users" by Randy Betancourt	$50 (book)	Intermediate	The definitive SAS-to-Python reference book
"Effective Pandas" by Matt Harrison	$40 (book)	Intermediate	Deep pandas skills
Microsoft Learn "Azure ML" path	Free	Intermediate	Azure-specific ML training
Fabric Learn path	Free	Beginner	Microsoft Fabric training

1.5 Measuring reskilling success

Metric	Target	Measurement
Python proficiency assessment	80% pass rate	Post-training assessment (technical quiz + coding exercise)
SAS-to-Python conversion rate	5 programs/analyst/month after 3 months	Track program conversions per analyst
Output validation accuracy	100% first-pass accuracy	No validation failures on production conversions
Analyst satisfaction	75% positive	Survey at 3 months and 6 months post-training
Time-to-productivity	80% of SAS productivity by Week 12	Track task completion rates

2. Dual-running validation methodology

2.1 Why dual-run

Dual-running means executing the SAS program and its Python equivalent in parallel on the same inputs for a validation period (typically 2--4 weeks per program) and comparing the outputs. This is the only reliable way to demonstrate that the migration preserves analytical correctness.

2.2 Validation levels

Level	Scope	When to use	Duration
Level 1: Summary	Compare aggregate statistics (means, sums, counts)	Low-risk reports, descriptive statistics	1 week
Level 2: Row-level sample	Compare 1,000 randomly sampled rows	Standard analytical programs	2 weeks
Level 3: Full row-level	Compare every row and every column	High-risk programs, regulatory outputs, model scoring	4 weeks
Level 4: Statistical equivalence	Formal statistical tests (paired t-test, equivalence test)	Production models, survey estimates	4+ weeks
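
As an illustration of the lightest level, a Level 1 summary check can be sketched with the standard library alone. The function names (`summarize`, `level1_compare`) and the choice of count, sum, and mean as the aggregates are assumptions made for this sketch, and the sample values are invented.

```python
import math
from statistics import fmean

def summarize(values):
    """Return count, sum, and mean of the non-missing values (None = missing)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "sum": math.fsum(present),
        "mean": fmean(present) if present else None,
    }

def _close(a, b, rel_tol):
    """Tolerant comparison that treats two missing aggregates as equal."""
    if a is None or b is None:
        return a is b
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12)

def level1_compare(sas_values, python_values, rel_tol=1e-3):
    """Compare aggregate statistics of one column from each system."""
    sas, py = summarize(sas_values), summarize(python_values)
    return {
        "count": sas["count"] == py["count"],
        "sum": _close(sas["sum"], py["sum"], rel_tol),
        "mean": _close(sas["mean"], py["mean"], rel_tol),
    }

# Invented sample column; None marks a missing value on both sides
sas_col = [10.0, 20.5, None, 30.25]
py_col = [10.0, 20.5000001, None, 30.25]
result = level1_compare(sas_col, py_col)
```

Counts are compared exactly, while sums and means use a relative tolerance, since small floating-point differences between the two systems are expected.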

2.3 Reconciliation framework

import pandas as pd
import numpy as np
from scipy import stats

def reconcile_outputs(sas_output, python_output, key_columns,
                      numeric_tolerance=0.001, report_path=None):
    """Compare SAS and Python outputs for validation.

    Args:
        sas_output: DataFrame from SAS (exported to CSV or Delta)
        python_output: DataFrame from Python
        key_columns: List of columns to join on
        numeric_tolerance: Maximum mean absolute difference, relative to the
            SAS column mean, allowed for each numeric column
        report_path: Optional path to save HTML report

    Returns:
        Dictionary with validation results
    """
    results = {'passed': True, 'details': []}

    # 1. Row count comparison
    sas_rows = len(sas_output)
    python_rows = len(python_output)
    row_match = sas_rows == python_rows
    results['details'].append({
        'check': 'Row count',
        'sas': sas_rows,
        'python': python_rows,
        'passed': row_match
    })
    if not row_match:
        results['passed'] = False

    # 2. Column comparison
    sas_cols = set(sas_output.columns)
    python_cols = set(python_output.columns)
    missing_in_python = sas_cols - python_cols
    extra_in_python = python_cols - sas_cols
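    cols_match = len(missing_in_python) == 0
    results['details'].append({
        'check': 'Column match',
        'missing_in_python': list(missing_in_python),
        'extra_in_python': list(extra_in_python),
        'passed': cols_match
    })
    # A missing column fails the overall validation
    if not cols_match:
        results['passed'] = False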

    # 3. Merge on key columns for row-level comparison
    # Parentheses make the intended grouping explicit
    common_cols = (sas_cols & python_cols) - set(key_columns)
    merged = pd.merge(sas_output, python_output, on=key_columns,
                       suffixes=('_sas', '_python'), how='outer',
                       indicator=True)

    # Check for unmatched rows
    sas_only = (merged['_merge'] == 'left_only').sum()
    python_only = (merged['_merge'] == 'right_only').sum()
    both = (merged['_merge'] == 'both').sum()
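    rows_match = sas_only == 0 and python_only == 0
    results['details'].append({
        'check': 'Row matching',
        'matched': both,
        'sas_only': sas_only,
        'python_only': python_only,
        'passed': rows_match
    })
    # Unmatched rows on either side fail the overall validation
    if not rows_match:
        results['passed'] = False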

    # 4. Numeric column comparison
    for col in common_cols:
        sas_col = f'{col}_sas'
        py_col = f'{col}_python'
        if sas_col not in merged.columns:
            continue

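        if (pd.api.types.is_numeric_dtype(merged[sas_col])
                and pd.api.types.is_numeric_dtype(merged[py_col])):
            # Drop rows where either side is missing so both value arrays
            # keep the same length and stay aligned row by row
            matched = merged[merged['_merge'] == 'both']
            paired = matched[[sas_col, py_col]].dropna().astype('float64')
            sas_vals = paired[sas_col]
            py_vals = paired[py_col]

            if len(sas_vals) == 0:
                continue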

            mae = np.mean(np.abs(sas_vals.values - py_vals.values))
            max_diff = np.max(np.abs(sas_vals.values - py_vals.values))
            sas_mean = sas_vals.mean()
            rel_diff = mae / abs(sas_mean) if sas_mean != 0 else mae

            col_passed = rel_diff < numeric_tolerance
            results['details'].append({
                'check': f'Numeric: {col}',
                'sas_mean': round(sas_mean, 6),
                'python_mean': round(py_vals.mean(), 6),
                'mae': round(mae, 8),
                'max_diff': round(max_diff, 8),
                'relative_diff': round(rel_diff, 8),
                'passed': col_passed
            })
            if not col_passed:
                results['passed'] = False

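    # Generate report (_generate_html_report is a helper that must be
    # defined separately; it is not part of this listing)
    if report_path:
        _generate_html_report(results, report_path)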

    return results

2.4 Handling expected differences

Some differences between SAS and Python are expected and acceptable:

Difference type	Cause	Acceptable?	Resolution
Floating-point precision (1e-10)	Different LAPACK implementations	Yes	Far below the 0.001 reconciliation tolerance; no action needed
Date formatting	SAS date values vs ISO dates	Yes	Standardize on ISO 8601
Missing value representation	SAS `.` vs Python `NaN` / `None`	Yes	Both represent missing; compare counts
Sort order of ties	Unstable sort algorithms	Yes	Add secondary sort keys if order matters
Random number sequences	Different PRNG implementations	Yes	Compare distributions, not individual values
Rounding at display level	SAS FORMAT vs Python round()	Yes	Compare raw values, not displayed values
Case sensitivity	SAS variable names are case-insensitive; pandas column names are case-sensitive	Maybe	Standardize case before comparison
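
Several of these differences can be normalized before comparison. The sketch below assumes the SAS output was exported to CSV with unformatted date values (days since 1960-01-01, the SAS date epoch) and the plain `.` missing marker. The helper names and the sample row are hypothetical, and special missing values such as `.A` are not handled.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date

def sas_date_to_iso(sas_days):
    """Convert a SAS date value (days since 1960-01-01) to an ISO 8601 string."""
    if sas_days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(sas_days))).isoformat()

def normalize_missing(raw):
    """Map the SAS missing marker '.' (and empty strings) to None."""
    if raw is None:
        return None
    text = str(raw).strip()
    return None if text in ("", ".") else raw

def normalize_row(row, date_columns=()):
    """Lower-case column names, unify missing values, convert SAS dates."""
    cleaned = {}
    for name, value in row.items():
        key = name.lower()
        value = normalize_missing(value)
        if key in date_columns and value is not None:
            value = sas_date_to_iso(float(value))
        cleaned[key] = value
    return cleaned

# Hypothetical exported row: 366 days after the epoch, one missing numeric
row = normalize_row({"EMP_ID": "17", "Hire_Date": "366", "Bonus": "."},
                    date_columns=("hire_date",))
```

Applying the same normalization to both outputs before reconciliation keeps the comparison focused on genuine analytical differences.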

3. Phased program migration strategy

3.1 Program classification

Before migrating, classify every SAS program:

Category	Description	Migration priority	Effort
Retire	Program no longer used or needed	Immediate	Zero (just decommission)
Trivial	Simple DATA Step + PROC MEANS/FREQ; no macros	High (quick wins)	XS--S
Standard	Moderate complexity; some macros; standard PROCs	Medium	S--M
Complex	Heavy macro usage; many PROCs; inter-program dependencies	Lower priority	M--L
Specialized	Survey procedures, clinical trial, optimization	Last or retain on SAS	L--XL
Regulatory	Output format required by regulation (FDA, Census)	Retain on SAS (initially)	N/A (keep)
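
A first-pass triage along these lines could be partly automated. The sketch below is a rough heuristic only: the procedure lists and numeric thresholds are hypothetical starting points, and whether a program is still in use or regulatory cannot be read from the code, so both are passed in as flags.

```python
import re

# Illustrative, incomplete procedure lists (assumptions for this sketch)
SPECIALIZED_PROCS = {"SURVEYMEANS", "SURVEYFREQ", "SURVEYREG",
                     "SURVEYLOGISTIC", "OPTMODEL"}
TRIVIAL_PROCS = {"MEANS", "FREQ", "PRINT", "SORT"}

def classify_sas_program(source, in_use=True, regulatory=False):
    """Assign one SAS program to a migration category.

    Thresholds are hypothetical and should be tuned per inventory.
    """
    if not in_use:
        return "Retire"
    if regulatory:
        return "Regulatory"
    procs = {p.upper() for p in re.findall(r"\bproc\s+(\w+)", source, re.I)}
    macro_defs = len(re.findall(r"%macro\b", source, re.I))
    includes = len(re.findall(r"%include\b", source, re.I))
    if procs & SPECIALIZED_PROCS:
        return "Specialized"
    if macro_defs == 0 and includes == 0 and procs <= TRIVIAL_PROCS:
        return "Trivial"
    if macro_defs > 5 or includes > 0 or len(procs) > 8:
        return "Complex"
    return "Standard"

category = classify_sas_program(
    "data a; set lib.b; run; proc means data=a; run;"
)
```

The output is a starting point for human review, not a substitute for it; comments and string literals containing `proc` will be counted like real statements.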

3.2 Migration wave planning

Wave 1 (Weeks 1-6):    Retire programs + Trivial conversions
                        Target: 30-40% of program inventory
                        Risk: Very Low
                        Validation: Level 1 (summary comparison)

Wave 2 (Weeks 6-14):   Standard conversions (reporting + descriptive)
                        Target: 30-40% of program inventory
                        Risk: Low-Medium
                        Validation: Level 2 (row-level sample)

Wave 3 (Weeks 14-24):  Complex conversions (models + ETL)
                        Target: 15-20% of program inventory
                        Risk: Medium
                        Validation: Level 3 (full row-level)

Wave 4 (Weeks 24-36):  Specialized conversions
                        Target: 5-10% of program inventory
                        Risk: High
                        Validation: Level 4 (statistical equivalence)

Retain:                 Regulatory programs stay on SAS Viya (Azure)
                        Target: 5-10% of program inventory
                        Review annually for migration readiness

3.3 Program dependency analysis

Before migrating, map program dependencies:

# Example: analyze SAS program dependencies
# Look for: %include, libname references, proc append targets,
# data set references across programs

def analyze_sas_dependencies(sas_program_dir):
    """Analyze SAS program dependencies to determine migration order.

    Programs with no dependencies should be migrated first.
    Programs depended upon by others should be migrated before dependents.
    """
    import re
    from pathlib import Path

    programs = {}
    for sas_file in Path(sas_program_dir).glob('*.sas'):
        content = sas_file.read_text(encoding='latin1')
        deps = set()

        # Find %include references
        for match in re.finditer(r"%include\s+['\"]?(.+?)['\"]?\s*;", content, re.I):
            deps.add(match.group(1))

        # Find dataset references (simplified: only two-level names on SET
        # statements; the word boundary avoids matching e.g. "offset a.b")
        for match in re.finditer(r"\bset\s+(\w+\.\w+)", content, re.I):
            deps.add(match.group(1))

        programs[sas_file.stem] = {
            'file': str(sas_file),
            'dependencies': deps,
            'lines': len(content.splitlines())
        }

    return programs
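
The analyzer above collects raw references but does not itself produce a migration order. Assuming those references have been resolved to program names (a separate step not shown here), the ordering can be sketched with the standard-library `graphlib` module (Python 3.9+). The program names below are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter

def migration_order(program_deps):
    """Return programs ordered so each appears after everything it depends on.

    program_deps maps a program name to the set of program names it needs.
    Dependencies outside the inventory (e.g. raw libraries) are ignored.
    """
    known = set(program_deps)
    graph = {name: {d for d in deps if d in known}
             for name, deps in program_deps.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] lists the programs that form the cycle
        raise ValueError(f"Circular dependency: {exc.args[1]}") from exc

# Hypothetical inventory
deps = {
    "load_claims": set(),
    "clean_claims": {"load_claims"},
    "score_model": {"clean_claims", "ref_tables"},  # ref_tables not in inventory
    "report": {"score_model", "clean_claims"},
}
order = migration_order(deps)
```

A circular dependency raises `ValueError`, which signals that the programs involved need to be migrated together or refactored first.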

4. Output reconciliation framework

4.1 Automated reconciliation pipeline

Set up automated reconciliation as part of the migration CI/CD:

# .github/workflows/validate-migration.yml
name: SAS Migration Validation
on:
    pull_request:
        paths:
            - "models/migrated/**"

jobs:
    validate:
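        runs-on: ubuntu-latest
        strategy:
            fail-fast: false
            matrix:
                # Example program names (from section 5.3); replace with your inventory
                program: [survey_analysis, claims_processing, model_scoring]
        steps: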
            - uses: actions/checkout@v4

            - name: Run Python migration
              run: |
                  cd models/migrated
                  python run_migration.py --program ${{ matrix.program }}

            - name: Compare outputs
              run: |
                  python scripts/reconcile.py \
                    --sas-output data/sas_baseline/${{ matrix.program }}.csv \
                    --python-output data/python_output/${{ matrix.program }}.csv \
                    --tolerance 0.001 \
                    --report reports/${{ matrix.program }}_validation.html

            - name: Upload validation report
              if: always()  # upload the report even if an earlier step failed
              uses: actions/upload-artifact@v4
              with:
                  name: validation-${{ matrix.program }}
                  path: reports/
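
The workflow calls `scripts/reconcile.py`, which is not shown in this guide. The sketch below covers only its command-line surface, under the assumption that it wraps `reconcile_outputs` from section 2.3. The `--key` flag and the `respondent_id` column are hypothetical additions, since the workflow does not pass join keys.

```python
import argparse

def build_parser():
    """Argument parser matching the flags used in the workflow above."""
    parser = argparse.ArgumentParser(
        description="Reconcile SAS and Python outputs")
    parser.add_argument("--sas-output", required=True,
                        help="CSV exported from SAS")
    parser.add_argument("--python-output", required=True,
                        help="CSV written by the migrated program")
    parser.add_argument("--tolerance", type=float, default=0.001)
    parser.add_argument("--report", default=None,
                        help="Optional HTML report path")
    # Hypothetical extra flag: join column, repeatable for composite keys
    parser.add_argument("--key", action="append", default=[],
                        help="Join column (repeatable)")
    return parser

def exit_code(results):
    """Map a reconciliation result dict to a process exit code (0 = pass)."""
    return 0 if results.get("passed") else 1

# Parse an explicit argument list so the sketch runs without a shell
args = build_parser().parse_args([
    "--sas-output", "data/sas_baseline/survey_analysis.csv",
    "--python-output", "data/python_output/survey_analysis.csv",
    "--tolerance", "0.001",
    "--key", "respondent_id",
])
```

Returning a non-zero exit code on failure is what makes the workflow step fail, which in turn blocks the pull request.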

4.2 Reconciliation metrics dashboard

Track migration progress and validation results in Power BI:

Metric	Description	Target
Programs migrated	Count of programs converted to Python	Per wave targets
Programs validated	Count that passed dual-run validation	100% before decommission
Validation failures	Programs with output differences beyond tolerance	0 at cutover
Lines of SAS code migrated	Total SAS code lines converted	Track velocity
SAS licenses reduced	License count or cost reduction achieved	Per SAS contract terms
Analyst productivity	Tasks completed per analyst per week	>= 80% of SAS baseline by Week 12
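
The count-based metrics in this table can be derived from a simple program inventory before they reach Power BI. The sketch below assumes a hypothetical `ProgramStatus` record per program; the field names and sample data are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ProgramStatus:
    name: str
    sas_lines: int
    migrated: bool = False
    validated: bool = False          # passed dual-run validation
    failed_validation: bool = False  # differences beyond tolerance

def migration_metrics(inventory):
    """Aggregate dashboard metrics from a list of ProgramStatus records."""
    migrated = [p for p in inventory if p.migrated]
    return {
        "programs_migrated": len(migrated),
        "programs_validated": sum(p.validated for p in migrated),
        "validation_failures": sum(p.failed_validation for p in migrated),
        "sas_lines_migrated": sum(p.sas_lines for p in migrated),
        # Only validated programs are candidates for decommissioning
        "ready_to_decommission": [p.name for p in migrated if p.validated],
    }

# Hypothetical inventory
inventory = [
    ProgramStatus("survey_analysis", 100, migrated=True, validated=True),
    ProgramStatus("claims_processing", 250, migrated=True, failed_validation=True),
    ProgramStatus("model_scoring", 400),
]
metrics = migration_metrics(inventory)
```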

5. CSA-in-a-Box as the unified analytics landing zone

5.1 Why CSA-in-a-Box for SAS migration

CSA-in-a-Box provides the complete Azure landing zone that a SAS migration requires:

Migration need	CSA-in-a-Box component	Why it matters
Where do Delta tables live?	ADLS Gen2 + OneLake + medallion architecture	Pre-configured bronze/silver/gold layers
How are datasets governed?	Purview + Unity Catalog + data-product contracts	Replaces SAS metadata server governance
How do Python notebooks run?	Fabric capacity + Databricks workspace	Managed compute with auto-scaling
How are models managed?	Azure ML workspace + MLflow	Replaces SAS Model Manager
How are reports delivered?	Power BI Premium + Direct Lake	Replaces SAS Visual Analytics
How is ETL orchestrated?	ADF pipelines + dbt project scaffolding	Replaces SAS Data Integration Studio
How is compliance maintained?	NIST 800-53, FedRAMP, CMMC, HIPAA YAMLs	Controls mapped and auditable in IaC
How is the platform deployed?	Bicep modules + `make deploy-dev`	Repeatable, version-controlled deployment

5.2 Deployment for SAS migration

# Step 1: Deploy the Data Management Landing Zone
make deploy-dmlz ENV=prod

# Step 2: Deploy a Data Landing Zone for the migrated analytics
make deploy-dlz ENV=prod DOMAIN=analytics

# Step 3: Provision Azure ML workspace
make deploy-ml ENV=prod

# Step 4: Configure Fabric capacity and workspace
# (Follow docs/QUICKSTART.md)

# Step 5: Deploy SAS Viya on AKS (if hybrid)
# (Follow tutorial-sas-viya-azure.md)

5.3 Folder structure for migrated programs

csa-inabox/
├── domains/
│   └── analytics/                    # Migrated SAS domain
│       ├── notebooks/
│       │   ├── survey_analysis.py    # Migrated SAS program
│       │   ├── claims_processing.py
│       │   └── model_scoring.py
│       ├── dbt/
│       │   ├── models/
│       │   │   ├── staging/          # SAS DATA Step cleaning
│       │   │   ├── intermediate/     # SAS multi-step logic
│       │   │   └── gold/             # SAS PROC SUMMARY outputs
│       │   ├── seeds/                # SAS PROC FORMAT lookups
│       │   └── macros/               # SAS macro equivalents
│       ├── pipelines/
│       │   └── adf/                  # SAS scheduling equivalents
│       └── data-products/
│           └── agency_summary/
│               └── contract.yaml     # Data product contract

6. Change management

6.1 Communication plan

Audience	Message	Frequency	Channel
Executive leadership	Migration progress, cost savings, risk status	Monthly	Executive dashboard + briefing
Analytics team	Training schedule, program migration timeline, support resources	Weekly	Team meetings + email
SAS programmers	Reskilling benefits, career growth, support commitment	Biweekly	1:1 meetings + team sessions
End users (report consumers)	Transition timeline, training on Power BI, no disruption commitment	Monthly	Email + lunch-and-learn
ISSO / Security	ATO status, compliance evidence, risk register	Monthly	Security review meetings

6.2 Resistance management

Resistance pattern	Root cause	Response
"SAS is better for statistics"	Comfort with familiar tool	Show side-by-side equivalence; acknowledge SAS strengths in niche areas
"Python is not validated"	Regulatory concern	Reference the R Consortium pilot submissions to FDA as precedent for open-source tooling; demonstrate IQ/OQ/PQ for Python
"I'll be replaced"	Job security fear	Frame as upskilling; show SAS+Python analysts are more valuable
"Migration will break production"	Risk aversion	Demonstrate dual-running validation; point to rollback plan
"SAS has better support"	Vendor relationship comfort	Show Azure/Microsoft support model; community support advantages

7. Post-migration operations

7.1 Python code standards for former SAS teams

Establish coding standards that feel familiar to SAS programmers:

# Standard header for migrated programs
"""
Program: survey_analysis.py
Purpose: Quarterly employee survey engagement analysis
Migrated from: /sas/programs/survey/quarterly_analysis.sas
Original author: J. Smith (SAS)
Migration author: J. Smith (Python)
Migration date: 2026-05-15
Validation: Level 3 (full row-level), passed 2026-05-28
Schedule: Monthly, 1st business day (ADF trigger)

Change log:
  2026-05-15  Initial migration from SAS
  2026-06-01  Added Copilot integration for NLQ
"""
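
A header standard is easier to keep when it is checked automatically. The sketch below uses the standard-library `ast` module to read a module docstring and report missing fields. Which fields count as required is an assumption made for this example.

```python
import ast

# Assumed set of mandatory header fields for this sketch
REQUIRED_FIELDS = ("Program", "Purpose", "Migrated from",
                   "Migration date", "Validation")

def missing_header_fields(source):
    """Return the required header fields absent from a module docstring."""
    docstring = ast.get_docstring(ast.parse(source)) or ""
    present = {line.split(":", 1)[0].strip()
               for line in docstring.splitlines() if ":" in line}
    return [field for field in REQUIRED_FIELDS if field not in present]

# Sample module text with the Validation line deliberately left out
sample = '''"""
Program: survey_analysis.py
Purpose: Quarterly employee survey engagement analysis
Migrated from: /sas/programs/survey/quarterly_analysis.sas
Migration date: 2026-05-15
"""
import logging
'''
missing = missing_header_fields(sample)
```

A check like this could run in code review or CI so that a migrated program cannot merge without recording its validation level.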

7.2 Monitoring migrated programs

# Standard monitoring pattern for migrated programs
import logging
import time
from datetime import datetime

logger = logging.getLogger(__name__)

def run_with_monitoring(program_name, func, *args, **kwargs):
    """Wrapper for migrated SAS programs with standard monitoring."""
    start_time = time.time()
    logger.info(f"Starting {program_name} at {datetime.now().isoformat()}")

    try:
        result = func(*args, **kwargs)
        elapsed = time.time() - start_time
        logger.info(f"Completed {program_name} in {elapsed:.1f}s")

        # Log to Azure Monitor
        # (replaces SAS log monitoring)
        return result

    except Exception:
        elapsed = time.time() - start_time
        # logger.exception records the message plus the full traceback
        logger.exception(f"Failed {program_name} after {elapsed:.1f}s")
        raise

8. Key success factors

Executive sponsorship. The CIO or CDO must own the migration decision and communicate it consistently.
Reskilling investment. Budget 4--8 weeks of training per SAS programmer; this is the highest-ROI investment in the migration.
Dual-running discipline. Every migrated program must pass validation before the SAS version is decommissioned.
Incremental value delivery. Each migration wave should deliver measurable value (cost savings, new capability, performance improvement).
SAS retention for the right reasons. Keep SAS where there is a genuine technical or regulatory gap; do not keep SAS due to resistance or inertia.
CSA-in-a-Box as the landing zone. Deploy the complete platform first; migrate into a well-governed environment, not ad-hoc Azure resources.
Celebrate wins. Publicly recognize analysts who successfully convert complex SAS programs to Python.
Measure and report. Track migration metrics weekly; share progress with leadership monthly.

Maintainers: csa-inabox core team

Last updated: 2026-04-30
