Pattern — Streaming & CDC¶
TL;DR: Event Hubs for ingestion (Kafka-compatible), Stream Analytics or Fabric RTI for stateful streaming SQL, Databricks Structured Streaming for complex stateful with ML, Eventhouse / ADX for sub-second time-series queries. Synapse Link for Cosmos / SQL CDC; Debezium on Container Apps for other DBs.
Problem¶
"Streaming" covers wildly different needs: sub-second IoT telemetry, near-real-time dashboards, CDC from operational DBs, event-driven microservices, fraud scoring, log enrichment. No single Azure service is best at all of these. Picking the wrong combination produces years of regret.
Decision tree¶
flowchart TD
Start[I have streaming data] --> Q1{Source?}
Q1 -->|IoT devices, sensors| EH1[Event Hubs<br/>+ Capture for bronze]
Q1 -->|Kafka existing| EH2[Event Hubs Kafka surface<br/>or migrate Kafka clients]
Q1 -->|CDC from operational DB| Q2{Which DB?}
Q2 -->|Cosmos DB| Cosmos[Cosmos change feed<br/>+ Function trigger]
Q2 -->|Azure SQL / Synapse SQL| SLink[Synapse Link<br/>HTAP analytical store]
Q2 -->|Postgres / MySQL / SQL Server| Debezium[Debezium<br/>on Container Apps<br/>→ Event Hubs]
Q2 -->|Salesforce / SAP / etc.| ADF[ADF CDC<br/>+ Synapse Link for SAP]
EH1 --> Q3{Processing need?}
EH2 --> Q3
Cosmos --> Q3
SLink --> Q3
Debezium --> Q3
Q3 -->|Simple stateless transforms| Func[Functions / Container Apps<br/>KEDA-scaled]
Q3 -->|Stateful SQL windows| ASA[Stream Analytics<br/>or Fabric RTI Eventstream]
Q3 -->|Complex stateful + ML| DBR[Databricks Structured Streaming<br/>or DLT]
Q3 -->|Sub-second time-series queries| ADX[Fabric Eventhouse<br/>or Azure Data Explorer]
Func --> Q4{Sink?}
ASA --> Q4
DBR --> Q4
ADX --> Q4
Q4 -->|Lakehouse Delta| ADLS[(ADLS Delta)]
Q4 -->|Real-time dashboard| PBI[Power BI Direct Lake or DirectQuery to ADX]
Q4 -->|Operational write-back| Cosmos2[Cosmos / SQL]
Q4 -->|Downstream events| EH3[Another Event Hub topic] Pattern: ingestion choice¶
| Source | Recommended ingestion |
|---|---|
| Few high-rate streams | Event Hubs Standard / Premium + Capture |
| Massive multi-tenant (>1M msg/s) | Event Hubs Dedicated |
| Existing Kafka clients | Event Hubs Kafka surface (no client changes) |
| MQTT IoT devices | IoT Hub → Event Hubs |
| HTTP webhook | Event Grid or Functions HTTP trigger |
| File drops (S3, GCS, SFTP) | ADF copy or Logic Apps — not really streaming, but often confused for it |
Pattern: processing choice¶
| Need | Service |
|---|---|
| Stateless filter / enrich / route | Functions (HTTP/Event trigger) or Container Apps (KEDA) |
| Tumbling / hopping window aggregations in SQL | Stream Analytics or Fabric Eventstream |
| Stateful joins, late-data, watermarks, ML | Databricks Structured Streaming or Delta Live Tables |
| Sub-second time-series queries | Fabric Eventhouse / Azure Data Explorer (querying side, not processing) |
| Complex event processing (CEP) | Stream Analytics with referenced data, or Flink (Confluent Cloud on Azure) |
Pattern: CDC from operational databases¶
Cosmos DB — change feed (built-in, free)¶
# Functions Cosmos trigger
@app.cosmos_db_trigger(
arg_name="documents",
container_name="orders",
database_name="ecommerce",
connection="CosmosConnection",
)
def main(documents: func.DocumentList):
for doc in documents:
publish_to_event_hub(doc)
Azure SQL / Synapse SQL — Synapse Link (built-in, separate analytical store)¶
Enable on the source table; Synapse SQL Serverless can query the analytical store with no impact on OLTP. Lag ~2 minutes.
SAP — Synapse Link for SAP (or Operational Data Provisioning)¶
For SAP S/4HANA + ECC. Reads SAP CDS views or ODP queues; lands in ADLS Delta. Better than ADF SAP Table connector for high-volume tables.
Postgres / MySQL / SQL Server (other) — Debezium¶
Run Debezium connectors as Container Apps:
# Container App for Debezium PostgreSQL connector
env:
- DATABASE_HOSTNAME: pg-primary.privatelink.postgres.database.azure.com
- DATABASE_USER: debezium
- DATABASE_PASSWORD: !KeyVaultRef pg-debezium-pwd
- TABLE_INCLUDE_LIST: public.orders,public.customers
- TOPIC_PREFIX: pg-prod
- KAFKA_BOOTSTRAP_SERVERS: ehns-prod.servicebus.windows.net:9093
- KAFKA_SASL_MECHANISM: PLAIN
Debezium publishes change events to Event Hubs (Kafka surface). Downstream consumers process from Event Hubs.
Other DBs¶
- DynamoDB: DynamoDB Streams → AWS Lambda → cross-cloud Event Grid → Event Hubs (rare; usually source moves to Cosmos)
- Mainframe (DB2 z/OS): third-party (Qlik Replicate, IBM Data Replication) → Event Hubs
- Salesforce: Salesforce Change Data Capture → ADF or Logic Apps → Event Hubs
Pattern: streaming medallion¶
Real-time medallion has the same layering, faster cadence:
flowchart LR
Source[Sources] --> EH[Event Hubs<br/>+ Capture]
EH -- Capture: bronze .parquet --> Bronze[(ADLS Bronze<br/>partitioned by hour)]
EH --> ASA[Stream Analytics or<br/>Fabric Eventstream]
ASA -- aggregated --> Silver[(ADLS Delta Silver<br/>10-second windows)]
Silver --> Gold[(ADLS Delta Gold<br/>1-min aggregates)]
EH -- direct --> ADX[Fabric Eventhouse<br/>or ADX]
ADX --> PBI[Power BI<br/>real-time dashboard]
Gold --> PBI2[Power BI<br/>near-real-time dashboard]
Silver --> AML[Azure ML<br/>online inference] Bronze = Capture-output Parquet (immutable). Silver = Delta tables with watermarks. Gold = Delta aggregates. Eventhouse / ADX = fast-query layer for dashboards.
Pattern: at-least-once vs exactly-once¶
| Need | Approach |
|---|---|
| At-least-once (default) | Idempotent consumers; deduplicate on consumer side using natural keys |
| Exactly-once on a single sink | Use a sink that supports it (Delta with merge keys, Cosmos with upsert) |
| Exactly-once across multiple sinks | Hard. Usually solved with outbox pattern (write event + sink record in one DB transaction; relay reads outbox and publishes) |
Don't claim exactly-once unless you've designed for it explicitly. Most "exactly-once" production systems are "at-least-once with idempotent consumers."
Pattern: late data + watermarks¶
In Structured Streaming or Stream Analytics:
# Spark Structured Streaming — drop late data >5 min
df.withWatermark("event_time", "5 minutes") \
.groupBy(window("event_time", "1 minute"), "device_id") \
.agg(avg("temp").alias("avg_temp"))
Pick a watermark that matches your data's latency reality. Too tight = drop legitimate data. Too loose = aggregations never finalize, state grows.
Cost guidance¶
| Tier | When |
|---|---|
| Event Hubs Standard | Most workloads up to ~50 MB/s |
| Event Hubs Premium | Better latency, dedicated bandwidth, geo-DR |
| Event Hubs Dedicated | >1M msg/s sustained, predictable cost |
| Stream Analytics | Per Streaming Unit (SU); minimum 1 SU = ~$80/mo |
| Databricks Structured Streaming | Per DBU; minimum cluster usually $200+/mo |
| Fabric RTI / Eventstream | Bundled in Fabric capacity (F-SKU) |
| Azure Data Explorer | Per-cluster (D11_v2 starts ~$300/mo) |
| Fabric Eventhouse | Bundled in Fabric capacity |
Anti-patterns¶
| Anti-pattern | What to do |
|---|---|
| Polling DB every 5 seconds | Use CDC (change feed, Synapse Link, Debezium) |
| Stream Analytics for stateful ML | Use Databricks; ASA SQL stateful is limited |
| Trying to use Synapse SQL for sub-second time-series queries on TB-scale | Use ADX / Eventhouse |
| Custom Kafka cluster on AKS for new workloads | Event Hubs Kafka surface — no clusters to operate |
| Functions for high-volume stream processing | Container Apps with KEDA scales better and cheaper |
| Bronze without Capture | You'll regret it the first time you need to replay |
| Single Event Hub for all topics | Per-topic per-domain — easier ops, RBAC, retention |
Related¶
- ADR 0005 — Event Hubs over Kafka
- ADR 0018 — Fabric RTI Adapter
- Decision — Batch vs Streaming
- Decision — Kafka vs Event Hubs vs Service Bus
- Reference Architecture — Data Flow (Medallion)
- Pattern — AKS & Container Apps for Data
- Pattern — Cosmos DB
- Tutorial 05 — Streaming Lambda
- Example — IoT Streaming
- Example — Streaming
- Supercharge Microsoft Fabric — Real-Time Intelligence — Fabric RTI deep dive (Eventstream, Eventhouse, KQL)
- Supercharge Microsoft Fabric — Real-Time Analytics Tutorial — hands-on RTI tutorial