Pipeline & Transform Migration: Foundry to Azure¶
A technical deep-dive for data engineers migrating Palantir Foundry pipelines, transforms, and scheduling to Azure Data Factory, dbt, and Microsoft Fabric.
1. Overview of Foundry pipeline architecture¶
Palantir Foundry organizes data transformation through a layered system that couples compute, orchestration, and governance into a single proprietary surface.
| Foundry component | Purpose |
|---|---|
| Pipeline Builder | Visual drag-and-drop ETL designer with type-safe transforms, health checks, and LLM-assisted transform generation |
| Code Repositories | Web-based IDE for Python, Java, SQL, and Mesa transforms with Git branching, pull requests, and CI/CD |
| Python transforms | PySpark jobs wrapped in Foundry decorators (@transform, @transform_df, @transform.using) with automatic dependency tracking |
| SQL transforms | Declarative SQL with automatic input/output dataset binding |
| Java transforms | Full Spark Java API for complex or performance-critical workloads |
| Incremental computation | Process only changed rows or files since the last successful run |
| Streaming | Flink-based low-latency ingestion and transformation |
| Scheduling | Cron-based, dependency-based, and event-based triggers |
| Data Expectations | Assertion-based quality checks on pipeline outputs |
| Health Checks | Monitoring dashboards with alert routing to email, PagerDuty, and Slack |
| LLM transforms | Classification, sentiment analysis, summarization, entity extraction, and translation using integrated LLMs |
The key architectural constraint is that every component above runs inside Foundry's proprietary runtime. Code, scheduling definitions, quality rules, and monitoring configuration are stored in Foundry-specific formats that cannot be exported to standard tooling without rewriting.
2. Azure pipeline architecture¶
The Azure equivalent separates concerns into purpose-built services, each best-in-class for its function.
flowchart LR
subgraph Ingestion
ADF[Azure Data Factory]
EH[Event Hubs]
end
subgraph Storage
ADLS[ADLS Gen2 / OneLake]
end
subgraph Transformation
dbt[dbt Core / dbt Cloud]
NB[Fabric / Databricks Notebooks]
DBX[Databricks Spark Jobs]
end
subgraph Quality
DBTT[dbt Tests]
GE[Great Expectations]
DC[Data Contracts]
end
subgraph Orchestration
ADFT[ADF Triggers & Pipelines]
GHA[GitHub Actions CI/CD]
end
subgraph Monitoring
AM[Azure Monitor]
AI[App Insights]
WH[Webhook Alerts]
end
subgraph AI Enrichment
AOI[Azure OpenAI]
AIF[Azure AI Foundry]
end
ADF -->|Copy to Bronze| ADLS
EH -->|Stream| ADLS
ADLS --> dbt
ADLS --> NB
ADLS --> DBX
dbt --> DBTT
NB --> GE
DBTT --> DC
ADFT -->|Schedule| ADF
ADFT -->|Trigger| dbt
GHA -->|Deploy| dbt
ADF -->|On failure| WH
WH --> AM
dbt --> AOI
NB --> AIF
Why this is better: Each service can be replaced independently. dbt models are portable SQL. ADF pipelines are ARM/Bicep templates stored in Git. Notebooks run on any Spark-compatible engine. There is no single vendor whose removal would require rewriting everything.
3. Pipeline Builder to ADF migration¶
Foundry's Pipeline Builder provides a visual canvas for connecting data sources, applying transforms, and scheduling execution. Azure Data Factory provides an equivalent visual designer with broader connector support.
Foundry Pipeline Builder pattern¶
In Foundry, a typical ingestion pipeline is configured through the visual builder and produces a graph of dataset dependencies. Behind the scenes, it generates transform code and schedules. The pipeline cannot be exported in a standard, tool-neutral format.
Azure Data Factory equivalent¶
CSA-in-a-Box provides a reference ADF pipeline for batch ingestion at domains/shared/pipelines/adf/pl_ingest_to_bronze.json:
{
"name": "pl_ingest_to_bronze",
"properties": {
"description": "Generic batch ingestion pipeline. Copies data from source to ADLS raw/bronze container with metadata logging.",
"activities": [
{
"name": "SetIngestionTimestamp",
"type": "SetVariable",
"typeProperties": {
"variableName": "ingestionTimestamp",
"value": "@utcnow('yyyy-MM-ddTHH:mm:ssZ')"
}
},
{
"name": "CopyToRaw",
"type": "Copy",
"dependsOn": [
{
"activity": "SetIngestionTimestamp",
"dependencyConditions": ["Succeeded"]
}
],
"typeProperties": {
"source": { "type": "DelimitedTextSource" },
"sink": {
"type": "ParquetSink",
"storeSettings": { "type": "AzureBlobFSWriteSettings" }
}
}
}
],
"parameters": {
"sourceContainer": {
"type": "String",
"defaultValue": "source-data"
},
"sourceFolderPath": { "type": "String" },
"domainName": { "type": "String", "defaultValue": "shared" },
"entityName": { "type": "String" }
}
}
}
The master orchestration pipeline (pl_medallion_orchestration) chains ingestion with dbt transforms through the full Bronze, Silver, and Gold medallion sequence:
ForEachEntity_IngestToBronze (parallel, batch=4)
--> RunDbt_Bronze (tag:bronze)
--> RunDbt_Silver (tag:silver)
--> RunDbt_Gold (tag:gold)
--> OnFailure: WebActivity alert via webhook
CSA-in-a-Box evidence: domains/shared/pipelines/adf/pl_medallion_orchestration.json
Migration checklist¶
| Foundry Pipeline Builder feature | ADF equivalent | Notes |
|---|---|---|
| Drag-and-drop canvas | ADF visual authoring | Near-identical experience |
| Source connectors | 100+ ADF connectors | Broader connector coverage |
| Transform steps | ADF Data Flows / dbt models | dbt preferred for SQL transforms |
| Dependency graph | ADF activity dependencies | dependencyConditions in JSON |
| Parameterization | ADF pipeline parameters | Expression language: @pipeline().parameters.x |
| Error handling | ADF failure dependencies + WebActivity | Webhook alerts on failure |
| Version control | ADF Git integration (Azure Repos or GitHub) | ARM templates committed to repo |
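The parameterization row deserves a concrete illustration. Below is a minimal Python sketch, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder subscription, resource group, and factory names, that starts pl_ingest_to_bronze with explicit parameter values and polls the resulting run:

```python
# Sketch only: resource names below are placeholders, not CSA-in-a-Box values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUB, RG, FACTORY = "<subscription-id>", "<resource-group>", "<factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUB)

# Start pl_ingest_to_bronze, supplying the same parameters declared
# in the pipeline JSON above.
run = client.pipelines.create_run(
    RG, FACTORY, "pl_ingest_to_bronze",
    parameters={
        "sourceContainer": "source-data",
        "sourceFolderPath": "sales/2026/04",
        "domainName": "shared",
        "entityName": "orders",
    },
)

# Poll the run status (Queued -> InProgress -> Succeeded/Failed).
status = client.pipeline_runs.get(RG, FACTORY, run.run_id)
print(status.status)
```

The same `create_run` call is what a CI job or manual trigger would wrap, which is why ADF parameters should carry everything that varies between environments.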
4. Python transform migration¶
Foundry Python transforms use PySpark wrapped in proprietary decorators that handle input/output dataset binding, dependency tracking, and incremental computation. Migrating requires removing the decorator layer and replacing it with dbt models, Fabric notebooks, or Databricks jobs.
Before: Foundry Python transform¶
# Foundry Code Repository - Python transform
from transforms.api import transform_df, Input, Output
@transform_df(
Output("/Company/datasets/silver/customers_cleaned"),
source=Input("/Company/datasets/bronze/raw_customers"),
)
def compute(source):
from pyspark.sql import functions as F
return (
source
.dropDuplicates(["customer_id"])
.withColumn("first_name", F.upper(F.trim(F.col("first_name"))))
.withColumn("last_name", F.upper(F.trim(F.col("last_name"))))
.withColumn("email", F.lower(F.trim(F.col("email"))))
.withColumn("country_code", F.upper(F.coalesce(F.col("country"), F.lit("US"))))
.filter(F.col("customer_id").isNotNull())
)
After: dbt SQL model (recommended path)¶
The same logic expressed as a dbt incremental model. This is the CSA-in-a-Box standard pattern from domains/shared/dbt/models/silver/slv_customers.sql:
{{
config(
materialized='incremental',
unique_key='customer_sk',
incremental_strategy='merge',
file_format='delta',
tags=['silver', 'customers'],
on_schema_change='fail'
)
}}
with bronze as (
select * from {{ ref('brz_customers') }}
{% if is_incremental() %}
where _dbt_loaded_at > (select max(_dbt_loaded_at) from {{ this }})
{% endif %}
),
deduped as (
select
*,
row_number() over (
partition by customer_id
order by _source_modified_at desc, _dbt_loaded_at desc
) as _row_num
from bronze
),
cleaned as (
select
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_sk,
cast(customer_id as string) as customer_id,
trim(upper(coalesce(first_name, ''))) as first_name,
trim(upper(coalesce(last_name, ''))) as last_name,
trim(lower(coalesce(email, ''))) as email,
trim(upper(coalesce(country, 'US'))) as country_code,
now() as _dbt_loaded_at,
'{{ invocation_id }}' as _dbt_run_id
from deduped
where _row_num = 1
)
select * from cleaned
After: Fabric/Databricks notebook (when PySpark is required)¶
For transforms that genuinely require Python (ML feature engineering, complex UDFs, API calls), use a Fabric or Databricks notebook:
# Fabric / Databricks notebook - no proprietary decorators
from pyspark.sql import functions as F
# Read from Delta Lake directly - no Foundry Input() wrapper
bronze_df = spark.read.format("delta").table("bronze.raw_customers")
cleaned = (
bronze_df
.dropDuplicates(["customer_id"])
.withColumn("first_name", F.upper(F.trim(F.col("first_name"))))
.withColumn("last_name", F.upper(F.trim(F.col("last_name"))))
.withColumn("email", F.lower(F.trim(F.col("email"))))
.withColumn("country_code", F.upper(F.coalesce(F.col("country"), F.lit("US"))))
.filter(F.col("customer_id").isNotNull())
.withColumn("_processed_at", F.current_timestamp())
)
# Write to Delta Lake directly - no Foundry Output() wrapper
cleaned.write.format("delta").mode("overwrite").saveAsTable("silver.customers_cleaned")
Key differences¶
| Foundry pattern | Azure pattern | Impact |
|---|---|---|
| @transform_df decorator | dbt config() block or plain Spark | Remove all Foundry imports |
| Input("/path/to/dataset") | {{ ref('model_name') }} or spark.read.table() | Dataset paths become dbt refs or catalog tables |
| Output("/path/to/dataset") | dbt materializes automatically or df.write.saveAsTable() | No explicit output declaration needed |
| @transform.using(param=...) | dbt {{ var('param') }} or notebook widgets | Parameters externalized (see the widget sketch after this table) |
| Foundry type system | Delta Lake schema enforcement | Schema managed by Delta + dbt on_schema_change |
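For the widget-based replacement of Foundry parameters, here is a minimal sketch for a Databricks notebook, where dbutils is available; the widget and table names are illustrative:

```python
# Databricks notebook - widgets replace @transform.using(param=...)
dbutils.widgets.text("source_table", "bronze.raw_customers")
dbutils.widgets.text("target_table", "silver.customers_cleaned")

source_table = dbutils.widgets.get("source_table")
target_table = dbutils.widgets.get("target_table")

# The transform body reads and writes catalog tables directly.
df = spark.read.format("delta").table(source_table)
df.write.format("delta").mode("overwrite").saveAsTable(target_table)
```

ADF can set these widget values through the notebook activity's baseParameters, so the same notebook serves dev and prod without code changes.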
5. SQL transform migration to dbt¶
Foundry SQL transforms map directly to dbt models. The migration is primarily syntactic.
Before: Foundry SQL transform¶
-- Foundry SQL Transform (Code Repository)
-- Inputs and outputs declared in the Foundry UI
SELECT
order_id,
customer_id,
order_date,
SUM(line_total) AS order_total,
COUNT(*) AS line_count
FROM @input('bronze_orders')
GROUP BY order_id, customer_id, order_date
After: dbt SQL model¶
-- domains/shared/dbt/models/silver/slv_orders.sql
{{
config(
materialized='incremental',
unique_key='order_sk',
incremental_strategy='merge',
file_format='delta',
tags=['silver', 'orders']
)
}}
with source as (
select * from {{ ref('brz_orders') }}
{% if is_incremental() %}
where _dbt_loaded_at > (select max(_dbt_loaded_at) from {{ this }})
{% endif %}
),
aggregated as (
select
{{ dbt_utils.generate_surrogate_key(['order_id']) }} as order_sk,
order_id,
customer_id,
order_date,
sum(line_total) as order_total,
count(*) as line_count,
now() as _dbt_loaded_at
from source
group by order_id, customer_id, order_date
)
select * from aggregated
Translation rules¶
| Foundry SQL syntax | dbt equivalent |
|---|---|
| @input('dataset_name') | {{ ref('model_name') }} |
| @output('dataset_name') | File name is the model name |
| Foundry-managed schema | schema.yml with column definitions and tests |
| Inline quality checks | dbt tests in schema.yml or tests/ directory |
| Dataset-level scheduling | Handled by ADF triggers, not in the SQL |
6. Incremental pipeline patterns¶
Foundry's incremental computation is one of its most valuable features. It processes only changed rows or files since the last successful run. dbt provides equivalent incremental functionality through its incremental materialization strategy.
Before: Foundry incremental Python transform¶
from transforms.api import transform, Input, Output, incremental
@incremental()
@transform(
output=Output("/datasets/silver/transactions"),
source=Input("/datasets/bronze/raw_transactions"),
)
def compute(source, output):
# Foundry automatically tracks which rows/files are new
new_rows = source.dataframe()
output.write_dataframe(new_rows)
After: dbt incremental model¶
The CSA-in-a-Box project configures incremental models at the project level in dbt_project.yml:
models:
csa_analytics:
bronze:
+materialized: incremental
+file_format: delta
+schema: bronze
silver:
+materialized: incremental
+file_format: delta
+schema: silver
+incremental_strategy: merge
gold:
+materialized: table # Gold is full rebuild by default
+file_format: delta
+schema: gold
Individual models control incremental behavior:
-- Incremental merge pattern (CSA-in-a-Box standard)
{{
config(
materialized='incremental',
unique_key='_surrogate_key',
incremental_strategy='merge',
file_format='delta',
on_schema_change='fail'
)
}}
with source as (
select * from {{ source('raw_data', 'sample_customers') }}
{% if is_incremental() %}
{{ incremental_file_filter(this) }}
{% endif %}
),
staged as (
select
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} as _surrogate_key,
*,
now() as _dbt_loaded_at,
{{ source_file_path_from_metadata() }} as _source_file,
{{ source_file_modification_time() }} as _source_modified_at,
'{{ invocation_id }}' as _dbt_run_id
from source
)
select * from staged
CSA-in-a-Box evidence: domains/shared/dbt/models/bronze/brz_customers.sql
ADF incremental copy (for ingestion)¶
For source-system ingestion, ADF provides watermark-based and change-data-capture patterns:
{
"name": "IncrementalCopy",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "SELECT * FROM transactions WHERE modified_date > '@{pipeline().parameters.lastWatermark}'",
"type": "Expression"
}
}
}
}
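ADF does not persist the watermark for you; a follow-on step must record the high-water mark after each successful copy. Here is a minimal notebook sketch of that bookkeeping, assuming a hypothetical etl_control.watermarks Delta table (any durable store, including a SQL table read by a Lookup activity, works the same way):

```python
# Sketch: record the next watermark after a successful incremental copy.
from pyspark.sql import functions as F

new_rows = spark.read.format("delta").table("bronze.transactions")

# The highest modified_date actually landed becomes the next run's watermark.
next_watermark = new_rows.agg(F.max("modified_date")).first()[0]

spark.sql(f"""
    MERGE INTO etl_control.watermarks AS w
    USING (SELECT 'transactions' AS table_name,
                  TIMESTAMP'{next_watermark}' AS watermark) AS s
    ON w.table_name = s.table_name
    WHEN MATCHED THEN UPDATE SET w.watermark = s.watermark
    WHEN NOT MATCHED THEN INSERT (table_name, watermark)
                          VALUES (s.table_name, s.watermark)
""")
```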
Incremental strategy comparison¶
| Pattern | Foundry | dbt on Azure |
|---|---|---|
| Row-level | @incremental() decorator, automatic row tracking | is_incremental() with timestamp watermark |
| File-level | Automatic new-file detection | incremental_file_filter macro (CSA-in-a-Box) |
| Merge (upsert) | Implicit on primary key | incremental_strategy='merge' + unique_key |
| Append-only | Output mode configuration | incremental_strategy='append' |
| Full refresh | Manual trigger in UI | dbt run --full-refresh or ADF parameter |
| Schema evolution | Handled by Foundry runtime | on_schema_change='fail' / 'append_new_columns' |
7. Streaming migration approaches¶
Foundry uses Apache Flink for low-latency streaming pipelines. Azure provides multiple streaming options depending on latency requirements and complexity.
Option A: Event Hubs + Stream Analytics (managed, lowest effort)¶
Best for: simple transformations, windowed aggregations, real-time dashboards.
-- Azure Stream Analytics query
SELECT
System.Timestamp() AS window_end,
device_id,
AVG(temperature) AS avg_temp,
MAX(temperature) AS max_temp,
COUNT(*) AS reading_count
INTO [output-power-bi]
FROM [input-event-hub] TIMESTAMP BY event_time
GROUP BY
device_id,
TumblingWindow(minute, 5)
Option B: Spark Structured Streaming (Databricks/Fabric)¶
Best for: complex transformations, ML scoring, multi-source joins.
# Databricks or Fabric notebook
# broker and schema below are illustrative. Event Hubs exposes a
# Kafka-compatible endpoint on port 9093 (SASL auth options omitted for brevity).
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

broker = "<namespace>.servicebus.windows.net:9093"
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])
stream = (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", broker)
.option("subscribe", "raw-events")
.load()
.select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
.select("data.*")
)
# Write to Delta Lake with checkpoint for exactly-once semantics
(stream
.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/checkpoints/events")
.toTable("silver.events_streaming")
)
Option C: Fabric Real-Time Intelligence¶
Best for: organizations standardizing on Microsoft Fabric, Kusto-based analytics.
// Fabric Eventhouse KQL query
RawEvents
| where ingestion_time() > ago(5m)
| summarize avg_temp = avg(temperature),
max_temp = max(temperature),
reading_count = count()
by device_id, bin(event_time, 5m)
| where avg_temp > 80.0
Streaming migration decision matrix¶
| Requirement | Recommended Azure service |
|---|---|
| Simple aggregations, < 5 transforms | Stream Analytics |
| Complex joins, ML scoring, Python logic | Spark Structured Streaming |
| Sub-second latency, KQL analytics | Fabric Real-Time Intelligence |
| IoT device management + streaming | IoT Hub + Stream Analytics |
| Message ordering guarantees | Event Hubs with partition keys |
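On the ordering row: all events that share a partition key land on one Event Hubs partition and keep their relative order. A minimal producer sketch, assuming the azure-eventhub package and a placeholder connection string:

```python
# Sketch: partition-keyed send so per-device ordering is preserved.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="raw-events"
)

reading = {"device_id": "sensor-42", "temperature": 81.3,
           "event_time": "2026-04-30T06:00:00Z"}

with producer:
    # Events in a batch with the same partition_key go to the same partition.
    batch = producer.create_batch(partition_key=reading["device_id"])
    batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```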
8. Scheduling migration¶
Foundry supports cron-based, dependency-based, and event-based scheduling. ADF provides direct equivalents through its trigger system.
Before: Foundry scheduling¶
Foundry schedules are configured through the UI or API:
- Cron schedules: Run at fixed intervals (e.g., daily at 06:00 UTC)
- Dependency triggers: Run when upstream datasets are updated
- Event triggers: Run when external events occur (e.g., file arrival)
After: ADF trigger types¶
CSA-in-a-Box provides reference triggers at domains/shared/pipelines/adf/triggers/.
Schedule trigger (replaces Foundry cron schedules):
{
"name": "tr_daily_medallion",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2026-01-01T06:00:00Z",
"timeZone": "UTC",
"schedule": {
"hours": [6],
"minutes": [0]
}
}
},
"pipelines": [
{
"pipelineReference": {
"referenceName": "pl_medallion_orchestration",
"type": "PipelineReference"
},
"parameters": {
"environment": "prod",
"fullRefresh": false
}
}
]
}
}
CSA-in-a-Box evidence: domains/shared/pipelines/adf/triggers/tr_daily_medallion.json
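Note that a deployed trigger is inert until explicitly started. A minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder resource names:

```python
# Sketch: starting a trigger is a long-running operation in the ADF SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.triggers.begin_start(
    "<resource-group>", "<factory-name>", "tr_daily_medallion"
).result()  # blocks until the trigger is active
```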
Tumbling window trigger (replaces Foundry dependency-based scheduling):
{
"name": "tr_hourly_ingest",
"properties": {
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 1,
"startTime": "2026-01-01T00:00:00Z",
"retryPolicy": { "count": 3, "intervalInSeconds": 30 },
"dependsOn": []
}
}
}
Event-based trigger (replaces Foundry event triggers):
{
"name": "tr_file_arrival",
"properties": {
"type": "BlobEventsTrigger",
"typeProperties": {
"blobPathBeginsWith": "/landing/sales/",
"blobPathEndsWith": ".csv",
"events": ["Microsoft.Storage.BlobCreated"],
"scope": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{sa}"
}
}
}
Schedule mapping reference¶
| Foundry schedule type | ADF trigger type | Key configuration |
|---|---|---|
| Cron (fixed interval) | ScheduleTrigger | recurrence.frequency, schedule.hours |
| Dependency (upstream complete) | TumblingWindowTrigger with dependsOn | Chain triggers with dependency references |
| Event (file arrival) | BlobEventsTrigger | blobPathBeginsWith, events |
| Event (message arrival) | CustomEventsTrigger | Event Grid topic subscription |
| Manual | ADF API / workflow_dispatch in GitHub Actions | REST API or GitHub Actions manual trigger |
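For the manual row, here is a minimal sketch of dispatching the deploy-dbt.yml workflow through GitHub's REST API, assuming a token with Actions write permission and placeholder org/repo names:

```python
# Sketch: programmatic workflow_dispatch, equivalent to the "Run workflow" button.
import requests

resp = requests.post(
    "https://api.github.com/repos/<org>/<repo>/actions/workflows/deploy-dbt.yml/dispatches",
    headers={
        "Authorization": "Bearer <github-token>",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        "inputs": {"vertical": "all", "environment": "staging", "full_refresh": "false"},
    },
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```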
9. Data quality migration¶
Foundry Data Expectations allow pipeline authors to define assertions on output datasets. Azure provides richer alternatives through dbt tests, Great Expectations, and CSA-in-a-Box data contracts.
Before: Foundry Data Expectations¶
# Foundry Data Expectations
from transforms.api import transform_df, Input, Output
from transforms.expectations import expect_column_values_to_not_be_null
from transforms.expectations import expect_column_values_to_be_unique
@expect_column_values_to_not_be_null("customer_id")
@expect_column_values_to_be_unique("customer_id")
@transform_df(
Output("/datasets/silver/customers"),
source=Input("/datasets/bronze/raw_customers"),
)
def compute(source):
return source.dropDuplicates(["customer_id"])
After: dbt tests in schema.yml¶
CSA-in-a-Box defines quality tests in domains/shared/dbt/models/silver/schema.yml:
version: 2
models:
- name: slv_customers
description: >
Silver layer: Cleansed customer records. Deduped, standardized
casing, validated email format, surrogate-keyed.
columns:
- name: customer_sk
description: Surrogate key (md5 of customer_id).
tests:
- unique
- not_null
- name: customer_id
description: Natural customer identifier from Bronze.
- name: email
description: Customer email (lowercased)
- name: is_valid
description: True when the row passes every quality check.
tests:
- not_null
- name: category
tests:
- accepted_values:
values:
["ELECTRONICS", "CLOTHING", "HOME", "BOOKS", "SPORTS"]
severity: warn
After: CSA-in-a-Box data contracts¶
For cross-domain quality enforcement, CSA-in-a-Box uses data product contracts (domains/finance/data-products/invoices/contract.yaml):
apiVersion: csa.microsoft.com/v1
kind: DataProductContract
metadata:
name: finance.invoices
domain: finance
owner: finance-data-engineering@contoso.com
version: "1.0.0"
schema:
primary_key: [invoice_sk]
columns:
- name: invoice_sk
type: string
nullable: false
- name: status
type: string
nullable: false
allowed_values: [DRAFT, SENT, PAID, OVERDUE, CANCELLED, PARTIAL]
sla:
freshness_minutes: 60
valid_row_ratio: 0.97
quality_rules:
- rule: expect_column_values_to_not_be_null
column: invoice_sk
- rule: expect_column_values_to_be_unique
column: invoice_sk
- rule: expect_column_values_to_be_in_set
column: status
value_set: [DRAFT, SENT, PAID, OVERDUE, CANCELLED, PARTIAL]
- rule: expect_column_values_to_be_between
column: total_amount
min_value: 0
mostly: 0.97
Contracts are validated in CI via .github/workflows/validate-contracts.yml.
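The real validator ships in the repo; purely as an illustration of what such a gate checks, here is a minimal standalone sketch, assuming pyyaml and making no claims about the actual contract_validator internals:

```python
# Hypothetical mini-validator: structural checks on a contract.yaml file.
import sys
import yaml  # pyyaml

REQUIRED_KEYS = ("apiVersion", "kind", "metadata", "schema", "sla", "quality_rules")

def validate(path: str) -> list[str]:
    with open(path) as f:
        contract = yaml.safe_load(f)
    errors = [f"missing top-level key: {k}" for k in REQUIRED_KEYS if k not in contract]
    if contract.get("kind") != "DataProductContract":
        errors.append(f"unexpected kind: {contract.get('kind')}")
    ratio = contract.get("sla", {}).get("valid_row_ratio", 1.0)
    if not 0 < ratio <= 1:
        errors.append(f"valid_row_ratio out of range: {ratio}")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for p in problems:
        print(f"FAIL {p}")
    sys.exit(1 if problems else 0)  # non-zero exit fails the PR check
```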
After: Inline validation flags (CSA-in-a-Box pattern)¶
Rather than filtering bad data, Silver models flag invalid records so Gold models can decide:
-- From slv_customers.sql - validation pattern
validated as (
select
*,
case when customer_id is null or customer_id = ''
then true else false end as _is_missing_id,
{{ flag_invalid_email('email') }} as _is_invalid_email,
case when first_name = '' and last_name = ''
then true else false end as _is_missing_name
from cleaned
)
select
*,
not (_is_missing_id or _is_invalid_email or _is_missing_name) as is_valid,
concat_ws('; ',
case when _is_missing_id then 'customer_id missing' end,
case when _is_invalid_email then 'email failed regex validation' end,
case when _is_missing_name then 'first_name and last_name both empty' end
) as validation_errors
from validated
Quality migration comparison¶
| Foundry Expectations | Azure equivalent | Advantage |
|---|---|---|
| expect_column_values_to_not_be_null | dbt not_null test | Identical semantics |
| expect_column_values_to_be_unique | dbt unique test | Identical semantics |
| expect_column_values_to_be_in_set | dbt accepted_values test | Supports severity: warn |
| Custom expectations | dbt singular tests or dbt_expectations package | Richer test library |
| Pipeline-level health checks | dbt source freshness + Azure Monitor alerts | Broader alerting ecosystem |
| Output assertions | Data contracts with CI validation | Enforced at build time, not just runtime |
10. LLM enrichment pipelines on Azure¶
Foundry's LLM pipeline transforms provide built-in classification, sentiment analysis, summarization, entity extraction, and translation. On Azure, these capabilities are delivered through Azure OpenAI and Azure AI Foundry, integrated into data pipelines via ADF custom activities or notebooks.
Before: Foundry LLM transform¶
# Foundry LLM Transform (AIP-enabled pipeline)
from transforms.api import transform_df, Input, Output
from palantir_models.transforms import LlmTransform
@transform_df(
Output("/datasets/enriched/classified_tickets"),
source=Input("/datasets/silver/support_tickets"),
)
def compute(source):
llm = LlmTransform(model="palantir-llm-v2")
return llm.classify(
source,
input_col="ticket_text",
output_col="category",
labels=["billing", "technical", "account", "general"],
)
After: Azure OpenAI in a Fabric/Databricks notebook¶
# Fabric or Databricks notebook with Azure OpenAI
import openai
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

_client = None

def _get_client():
    # Construct the client lazily so the UDF also works on executors,
    # where driver-side objects cannot be pickled and shipped.
    global _client
    if _client is None:
        _client = openai.AzureOpenAI(
            azure_endpoint="https://<resource>.openai.azure.com/",
            api_version="2024-06-01",
            azure_ad_token_provider=get_bearer_token_provider(
                DefaultAzureCredential(),  # Managed Identity - no API keys
                "https://cognitiveservices.azure.com/.default",
            ),
        )
    return _client

def classify_ticket(text: str) -> str:
    response = _get_client().chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Classify the support ticket into exactly one category: "
                "billing, technical, account, general. "
                "Respond with only the category name."
            )},
            {"role": "user", "content": text}
        ],
        max_tokens=10,
        temperature=0.0
    )
    return response.choices[0].message.content.strip().lower()

# Register as Spark UDF
classify_udf = F.udf(classify_ticket, StringType())

# Read source data
tickets = spark.read.format("delta").table("silver.support_tickets")

# Apply classification with rate limiting via repartitioning
classified = (
    tickets
    .repartition(10)  # Control parallelism for API rate limits
    .withColumn("category", classify_udf(F.col("ticket_text")))
    .withColumn("_classified_at", F.current_timestamp())
)

classified.write.format("delta").mode("overwrite").saveAsTable("enriched.classified_tickets")
After: ADF pipeline step invoking the enrichment notebook¶
For batch LLM enrichment orchestrated through ADF, a Databricks notebook activity wraps the classification logic shown above:
{
"name": "LLM_Classify_Tickets",
"type": "DatabricksNotebook",
"typeProperties": {
"notebookPath": "/notebooks/enrich/classify_support_tickets",
"baseParameters": {
"source_table": "silver.support_tickets",
"target_table": "enriched.classified_tickets",
"model_deployment": "gpt-4o",
"batch_size": "100"
}
},
"linkedServiceName": {
"referenceName": "ls_databricks",
"type": "LinkedServiceReference"
}
}
LLM use-case mapping¶
| Foundry LLM transform | Azure implementation | Service |
|---|---|---|
| Classification | Azure OpenAI gpt-4o with structured output | Azure OpenAI |
| Sentiment analysis | Azure AI Language or OpenAI | AI Language / OpenAI |
| Summarization | Azure OpenAI gpt-4o | Azure OpenAI |
| Entity extraction | Azure AI Language NER or OpenAI | AI Language / OpenAI |
| Translation | Azure AI Translator or OpenAI | AI Translator / OpenAI |
| Embedding generation | Azure OpenAI text-embedding-3-large | Azure OpenAI |
| RAG enrichment | Azure AI Search + OpenAI | AI Foundry |
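For the embedding row, a minimal sketch that batch-generates vectors with Azure OpenAI; the endpoint and deployment name are placeholders, and the token provider reuses the Managed Identity pattern from the notebook earlier in this section:

```python
# Sketch: batch embedding generation with Azure OpenAI.
import openai
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

client = openai.AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_version="2024-06-01",
    azure_ad_token_provider=get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    ),
)

texts = ["reset my password", "invoice 123 is overdue", "api returns 500"]
response = client.embeddings.create(model="text-embedding-3-large", input=texts)
vectors = [item.embedding for item in response.data]  # one vector per input text
```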
11. CI/CD for pipelines¶
Foundry Code Repositories provide a built-in web IDE with Git branching, pull requests, and CI checks. On Azure, pipeline and transform code lives in standard Git repositories with GitHub Actions (or Azure DevOps Pipelines) for CI/CD.
CSA-in-a-Box CI/CD architecture¶
The reference CI/CD workflow deploys dbt models to Databricks across multiple verticals. From .github/workflows/deploy-dbt.yml:
name: Deploy dbt Models
on:
push:
branches: [main]
paths:
- "domains/*/dbt/**"
workflow_dispatch:
inputs:
vertical:
description: 'Vertical to deploy (or "all")'
required: true
type: choice
options: [all, usda, dot, usps, noaa, epa, commerce, interior]
environment:
description: "Target environment"
type: choice
options: [dev, staging, prod]
full_refresh:
description: "Run full refresh (drop and recreate)"
type: boolean
default: false
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install dbt-databricks==1.8.* dbt-core==1.8.*
- run: dbt deps
      - run: dbt seed --target ${{ inputs.environment || 'dev' }} # inputs are empty on push; default to dev
      - run: dbt run --target ${{ inputs.environment || 'dev' }}
      - run: dbt test --target ${{ inputs.environment || 'dev' }}
- uses: actions/upload-artifact@v4
with:
name: dbt-artifacts
path: |
target/manifest.json
target/run_results.json
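The uploaded run_results.json artifact can gate promotion in a downstream job. Here is a minimal Python sketch that fails the build on any model or test error, following dbt's published artifact schema:

```python
# Sketch: summarize dbt's run_results.json and fail CI on errors.
import json

with open("target/run_results.json") as f:
    results = json.load(f)["results"]

failed = [r["unique_id"] for r in results if r["status"] in ("error", "fail")]

print(f"{len(results)} models/tests ran, {len(failed)} failed")
for uid in failed:
    print(f"  FAILED: {uid}")

if failed:
    raise SystemExit(1)  # non-zero exit fails the CI job
```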
Data contract validation in CI¶
Quality gates enforce contracts before code reaches production:
# .github/workflows/validate-contracts.yml (simplified)
name: Validate Data Contracts
on:
pull_request:
paths:
- "domains/*/data-products/*/contract.yaml"
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install pyyaml jsonschema
- run: python -m governance.contracts.contract_validator
CI/CD comparison¶
| Foundry CI/CD feature | Azure equivalent | CSA-in-a-Box evidence |
|---|---|---|
| Code Repository (web IDE) | GitHub + VS Code / Codespaces | Standard Git workflow |
| Branch-based development | Git branches + pull requests | GitHub flow |
| CI checks on PR | GitHub Actions | .github/workflows/deploy-dbt.yml |
| Automated testing | dbt test in CI pipeline | Test step in deploy workflow |
| Environment promotion | Workflow dispatch with environment selector | dev / staging / prod targets |
| Artifact tracking | GitHub Actions artifacts | manifest.json, run_results.json |
| Contract validation | CI contract checks | .github/workflows/validate-contracts.yml |
12. Performance comparison¶
The following benchmarks are representative of typical mid-sized federal deployments (20 TB, 50-100 pipelines, 500 analytic users). Actual performance varies by workload, data volume, and SKU selection.
ETL throughput¶
| Workload | Foundry (typical) | Azure (dbt + Databricks) | Notes |
|---|---|---|---|
| 10 GB CSV ingest to Parquet | 8-12 min | 3-5 min | ADF parallel copy + ADLS Gen2 |
| 1 TB incremental merge | 15-25 min | 8-15 min | dbt merge on Delta Lake, Photon-accelerated |
| 100 SQL transforms (Silver) | 20-30 min | 10-20 min | dbt parallel execution (threads: 8) |
| Full medallion pipeline (20 TB) | 2-4 hours | 1-2 hours | ADF orchestration + dbt + Databricks |
Cost efficiency¶
| Metric | Foundry | Azure |
|---|---|---|
| Compute for 8-hour daily pipeline window | Included in license (opaque) | ~$150-300/day (Databricks Jobs Compute) |
| Additional per-seat cost for pipeline authors | $15,000-25,000/seat/year | $0 platform license (GitHub seats ~$4/user/month) |
| Storage (20 TB Delta Lake) | Included in license (opaque) | ~$400/month (ADLS Gen2 Hot) |
| Monitoring and alerting | Included | ~$50-100/month (Azure Monitor) |
Operational advantages¶
| Dimension | Foundry | Azure |
|---|---|---|
| Debugging | Foundry-specific logs, limited to web UI | Spark UI, ADF Monitor, Log Analytics, local dbt debug |
| Local development | Not supported (cloud-only IDE) | Full local dev with dbt + DuckDB or local Spark |
| Testing | Foundry-specific test framework | dbt tests, pytest, Great Expectations (industry standard) |
| Talent availability | Small, specialized pool | Large pool (dbt, Spark, SQL, Python are commodity skills) |
| Vendor portability | Locked to Foundry runtime | dbt runs on Databricks, Snowflake, BigQuery, Fabric, Postgres |
13. Common pitfalls¶
Pitfall 1: Attempting a 1:1 port of Foundry transforms¶
Problem: Teams try to replicate every Foundry Python transform as an identical PySpark notebook. This preserves the most complex, hardest-to-maintain pattern.
Solution: Audit each transform. If the logic is expressible in SQL (and 80%+ of transforms are), convert to a dbt model. Reserve notebooks for genuine Python requirements: ML features, API integrations, complex UDFs.
Pitfall 2: Ignoring dbt incremental strategies¶
Problem: Teams materialize everything as table (full rebuild), causing pipeline runtimes to grow linearly with data volume.
Solution: Use materialized='incremental' with incremental_strategy='merge' for Silver and Bronze models. Use timestamp watermarks (_dbt_loaded_at) to process only new data. CSA-in-a-Box configures this at the project level in dbt_project.yml.
Pitfall 3: Replicating Foundry's monolithic scheduling¶
Problem: Foundry's dependency-based scheduling rebuilds the entire DAG on any upstream change. Teams replicate this with a single ADF trigger that runs everything.
Solution: Use ADF's composable trigger system. Schedule ingestion hourly (tr_hourly_ingest), run the full medallion pipeline daily (tr_daily_medallion), and use event-based triggers for latency-sensitive sources. Let dbt's is_incremental() logic handle partial runs efficiently.
Pitfall 4: Not externalizing data quality¶
Problem: Teams embed quality checks inside transform code (Python assertions, try/except blocks) instead of using declarative quality frameworks.
Solution: Use dbt tests in schema.yml for column-level assertions. Use data contracts (contract.yaml) for cross-domain SLAs. Use inline validation flags (is_valid, validation_errors) instead of silently dropping bad records. This approach is auditable, version-controlled, and enforceable in CI.
Pitfall 5: Underestimating streaming migration complexity¶
Problem: Teams assume Foundry Flink pipelines can be directly ported to Spark Structured Streaming.
Solution: Evaluate latency requirements honestly. Many "streaming" pipelines in Foundry are actually micro-batch with 5-15 minute windows. If true sub-second latency is not required, use ADF scheduled triggers with short intervals or dbt incremental models. If sub-second latency is required, evaluate Stream Analytics (simplest), Spark Structured Streaming (most flexible), or Fabric Real-Time Intelligence (best for KQL-centric teams).
Pitfall 6: Skipping the local development workflow¶
Problem: Teams develop dbt models only in cloud environments, losing the fast iteration loop.
Solution: Use dbt with DuckDB locally for development and testing. The CSA-in-a-Box dbt project supports DuckDB as a development target (note the if target.type != 'duckdb' conditionals in model configs). Run dbt build locally in seconds, then deploy to Databricks via CI/CD.
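A minimal sketch of that loop using dbt's programmatic entry point (available in dbt-core 1.5+), assuming dbt-duckdb is installed and profiles.yml defines a duckdb target:

```python
# Sketch: run the silver-tagged models locally against DuckDB in seconds.
from dbt.cli.main import dbtRunner

result = dbtRunner().invoke(["build", "--target", "duckdb", "--select", "tag:silver"])
print("success:", result.success)
```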
Pitfall 7: Not mapping Foundry Health Checks to Azure Monitor¶
Problem: After migration, teams lose visibility into pipeline health because they did not set up equivalent alerting.
Solution: Configure Azure Monitor alerts for ADF pipeline failures. Use dbt source freshness tests for data-arrival monitoring. Set up webhook activities in ADF for failure notifications (see OnPipelineFailure in pl_medallion_orchestration). Route alerts to Teams, Slack, PagerDuty, or email through Azure Monitor action groups.
Summary: migration path by component¶
| Foundry component | Primary Azure target | Secondary option | CSA-in-a-Box evidence |
|---|---|---|---|
| Pipeline Builder | ADF visual designer | Fabric Data Factory | domains/shared/pipelines/adf/ |
| Code Repositories | GitHub + VS Code | Azure Repos | .github/workflows/ |
| Python transforms | dbt SQL models | Fabric/Databricks notebooks | domains/shared/dbt/models/ |
| SQL transforms | dbt SQL models | Fabric SQL | domains/shared/dbt/models/ |
| Java transforms | Databricks Spark jobs | AKS Spark Submit | N/A (use notebooks) |
| Incremental computation | dbt incremental models | ADF incremental copy | dbt_project.yml |
| Streaming | Event Hubs + Stream Analytics | Spark Structured Streaming | domains/shared/notebooks/ |
| Scheduling | ADF triggers | GitHub Actions workflow_dispatch | domains/shared/pipelines/adf/triggers/ |
| Data Expectations | dbt tests + data contracts | Great Expectations | domains/shared/dbt/models/*/schema.yml |
| Health Checks | Azure Monitor + ADF webhook alerts | dbt source freshness | pl_medallion_orchestration.json |
| LLM transforms | Azure OpenAI + notebooks | Azure AI Foundry | domains/shared/notebooks/ |
Last updated: 2026-04-30 · Maintainers: CSA-in-a-Box core team · See also: Data Integration Migration | DevOps Migration | AI & AIP Migration | Migration Playbook