🏗️ Azure Data Factory¶
Cloud-based data integration service for creating, scheduling, and orchestrating ETL/ELT data pipelines at scale.
🌟 Service Overview¶
Azure Data Factory (ADF) is a fully managed, serverless data integration service that enables you to create data-driven workflows for orchestrating data movement and transformation at scale. It provides a code-free UI for intuitive authoring and a comprehensive platform for complex hybrid ETL, ELT, and data integration projects.
🔥 Key Value Propositions¶
- Code-free ETL: Visual pipeline designer with drag-and-drop interface
- 90+ Connectors: Built-in connectors for cloud and on-premises data sources
- Serverless Compute: Auto-scaling data flows powered by Apache Spark
- Hybrid Integration: Seamlessly connect on-premises and cloud data sources
- Enterprise-grade: CI/CD support, monitoring, and security features
🏗️ Architecture Overview¶
```mermaid
graph TB
    subgraph "Data Sources"
        OnPrem[On-premises<br/>Databases]
        Cloud[Cloud<br/>Services]
        SaaS[SaaS<br/>Applications]
        Files[File<br/>Systems]
    end
    subgraph "Azure Data Factory"
        IR[Integration<br/>Runtime]
        Pipeline[Pipelines &<br/>Activities]
        DataFlow[Mapping<br/>Data Flows]
        Control[Control<br/>Flow]
    end
    subgraph "Destinations"
        Lake[Data Lake<br/>Storage]
        DW[Synapse<br/>Analytics]
        DB[Azure SQL<br/>Database]
        Cosmos[Cosmos<br/>DB]
    end
    OnPrem --> IR
    Cloud --> IR
    SaaS --> IR
    Files --> IR
    IR --> Pipeline
    Pipeline --> DataFlow
    Pipeline --> Control
    DataFlow --> Lake
    DataFlow --> DW
    Control --> DB
    Control --> Cosmos
```
🛠️ Core Components¶
📊 Pipelines & Activities¶
What: A logical grouping of activities that together perform a unit of work in a data workflow.
Key Features:
- Copy Activity: Move data between sources and destinations
- Data Flow Activity: Transform data using visual data flows
- Stored Procedure Activity: Execute stored procedures
- Notebook Activity: Run Databricks notebooks
- Web Activity: Call custom REST endpoints
Use Cases:
- Data ingestion from multiple sources
- ETL/ELT pipeline orchestration
- Data migration and synchronization
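A minimal sketch of how activities are grouped and chained inside a pipeline (all names are hypothetical; dataset names reuse the Quick Start below). A Copy activity runs first, and a Stored Procedure activity fires only if it succeeds:
```json
{
  "name": "OrchestrationPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyOrders",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      },
      {
        "name": "LogLoadResult",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "CopyOrders", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "AzureSqlDatabase1", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "usp_LogLoad" }
      }
    ]
  }
}
```
The dependsOn block is what turns a flat list of activities into a directed workflow; Failed, Skipped, and Completed conditions are also available.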
🔄 Mapping Data Flows¶
Visual data transformation at scale without writing code.
Key Features:
- Visual Designer: Drag-and-drop transformation logic
- Spark Execution: Auto-scaled Spark clusters for processing
- 70+ Transformations: Join, aggregate, derive, filter, and more
- Debug Mode: Interactive data preview during development
Best For: Complex data transformations, data quality, aggregations
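Data flows are invoked from a pipeline through an Execute Data Flow activity; a hedged sketch, where the data flow name and cluster sizing are assumptions:
```json
{
  "name": "TransformSales",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataflow": {
      "referenceName": "SalesDataFlow",
      "type": "DataFlowReference"
    },
    "compute": {
      "computeType": "General",
      "coreCount": 8
    }
  }
}
```
Setting a TTL on the Azure IR that hosts data flows keeps a warm cluster between runs, which shortens start-up time (see the cost tips below).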
🔗 Integration Runtime¶
Compute infrastructure for data integration across different network environments.
Types:
- Azure IR: Cloud data sources and services
- Self-hosted IR: On-premises and private network sources
- Azure-SSIS IR: Lift-and-shift SSIS packages
Key Features:
- Hybrid connectivity
- High availability
- Network security
- Resource sharing
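Provisioning a self-hosted IR has two halves: register it in the factory, then install the IR node on the on-premises machine using an auth key. A sketch using the datafactory CLI extension (the IR name is an assumption; factory names reuse the Quick Start below):
```bash
# Register a self-hosted IR in the factory
az datafactory integration-runtime self-hosted create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --name SelfHostedIR

# Fetch the auth key to paste into the IR installer on-premises
az datafactory integration-runtime list-auth-key \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --integration-runtime-name SelfHostedIR
```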
🔀 Pipeline Patterns¶
Common pipeline design patterns for different scenarios.
Patterns Covered:
- Copy Pattern: Simple data movement
- Parent-Child Pattern: Pipeline orchestration
- Iterative Pattern: Loop over datasets (see the ForEach sketch after this list)
- Conditional Pattern: Branching logic
- Dependency Pattern: Activity chaining
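The iterative pattern, for example, is typically built on a ForEach activity. A hedged sketch (parameter and activity names are hypothetical) that fans a copy out over a list of tables:
```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.tableList",
      "type": "Expression"
    },
    "isSequential": false,
    "batchCount": 10,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```
Inside the loop, @item() refers to the current element of the list.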
🚀 CI/CD for Data Factory¶
Enterprise DevOps practices for data pipelines.
Capabilities:
- Git Integration: Azure DevOps or GitHub
- ARM Templates: Infrastructure as code
- Environment Promotion: Dev → Test → Prod
- Automated Testing: Pipeline validation
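Promotion between environments is typically a deployment of the ARM templates that ADF publishes to the adf_publish branch; a sketch in which the resource group, factory name, and template paths are assumptions:
```bash
# Deploy the published factory templates into the target environment
az deployment group create \
  --resource-group rg-adf-prod \
  --template-file ARMTemplateForFactory.json \
  --parameters ARMTemplateParametersForFactory.json \
  --parameters factoryName=adf-prod-factory
```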
🎯 Common Use Cases¶
📥 Data Ingestion¶
Ingest data from various sources into your data lake or warehouse.
Architecture: Source Systems → ADF → Data Lake/Synapse
Pattern: Incremental Copy
```json
{
  "name": "IncrementalCopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromSQL",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SqlSource",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "DataLakeSink",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT * FROM Orders WHERE ModifiedDate > '@{pipeline().parameters.watermark}'"
          },
          "sink": {
            "type": "ParquetSink"
          }
        }
      }
    ]
  }
}
```
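The watermark parameter above is usually supplied by a Lookup activity that reads the last high-water mark from a control table, and advanced by a Stored Procedure activity after a successful copy. A hedged sketch of the lookup; the control table and column names are hypothetical:
```json
{
  "name": "LookupOldWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "sqlReaderQuery": "SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Orders'"
    },
    "dataset": {
      "referenceName": "SqlSource",
      "type": "DatasetReference"
    }
  }
}
```
Downstream activities can then reference @activity('LookupOldWatermark').output.firstRow.WatermarkValue as the watermark value.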
🔄 ETL/ELT Workflows¶
Transform and load data for analytics workloads.
Architecture: Sources → ADF Data Flows → Data Warehouse
Pattern: Medallion Architecture
🏢 Enterprise Data Integration¶
Hybrid cloud and on-premises data integration.
Architecture: On-premises → Self-hosted IR → Cloud Services
Pattern: Hybrid Integration
🔁 Real-time Data Sync¶
Near real-time data synchronization across systems.
Architecture: Source DB → ADF → Target DB
Pattern: Change Data Capture
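One way to approximate near real-time sync without dedicated CDC infrastructure is a tumbling window trigger on a short interval, passing the window bounds into the pipeline as query filters. A hedged sketch (trigger name, pipeline name, and interval are assumptions):
```json
{
  "name": "Sync15Min",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 15,
      "startTime": "2025-01-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "SyncPipeline",
        "type": "PipelineReference"
      },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}
```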
📊 Pricing Guide¶
💰 Cost Components¶
| Component | Pricing Model | Unit | Typical Cost (USD) |
|---|---|---|---|
| Pipeline Orchestration | Per activity run | 1,000 runs | $1.00 |
| Data Movement | Per DIU-hour | DIU-hour | $0.25 |
| Data Flow (General Purpose) | Per vCore-hour | vCore-hour | $0.274 |
| Data Flow (Memory Optimized) | Per vCore-hour | vCore-hour | $0.548 |
| Self-hosted IR | Per node-hour | Node-hour | Free (IR software itself; activity runs on it are billed) |
| Azure-SSIS IR | Per node-hour | Node-hour | Varies by node size |
Prices shown are illustrative list prices and vary by region; check the official Azure Data Factory pricing page for current rates.
💡 Cost Optimization Tips¶
- Use Auto-pause for Data Flows: Set TTL to minimize idle compute costs
- Right-size DIUs: Start with default (4 DIUs) and adjust based on performance
- Batch Operations: Group multiple activities in single pipeline execution
- Self-hosted IR: Use existing on-premises compute to reduce costs
- Schedule Optimization: Run pipelines during off-peak hours when possible
- Incremental Loading: Use watermarks to process only changed data
💵 Example Monthly Cost:
Scenario: Daily ETL pipeline processing 100 GB of data
- Pipeline runs: 30 runs/month × $1.00 per 1,000 runs = $0.03
- Data movement: 100 GB/day at ~0.5 DIU-hours per GB ≈ 50 DIU-hours/day × 30 days × $0.25 = $375.00
- Data flow: 2 hours/day × 30 days × 16 vCores × $0.274 = $263.04
- Total: ~$638/month
🚀 Quick Start Guide¶
1️⃣ Create Data Factory¶
```bash
# The datafactory commands ship as an Azure CLI extension:
# az extension add --name datafactory

# Create resource group
az group create --name rg-adf-demo --location eastus

# Create data factory
az datafactory create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --location eastus
```
2️⃣ Create Linked Services¶
```json
{
  "name": "AzureSqlDatabase1",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;User ID=myuser;Password=********;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}
```
Avoid embedding credentials in connection strings for production; store the secret in Azure Key Vault and reference it from the linked service (see Security Configuration below).
3️⃣ Create Pipeline with Copy Activity¶
```json
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SourceDataset",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "SinkDataset",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource"
          },
          "sink": {
            "type": "BlobSink"
          }
        }
      }
    ]
  }
}
```
4️⃣ Trigger Pipeline¶
```bash
# Manual trigger
az datafactory pipeline create-run \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --pipeline-name CopyPipeline

# Create schedule trigger
az datafactory trigger create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --trigger-name DailyTrigger \
  --properties @trigger.json
```
🔧 Configuration & Management¶
🛡️ Security Configuration¶
Key Security Features:
- Managed Identity: Azure AD integration for passwordless authentication
- Azure Key Vault: Centralized secrets management
- Private Endpoints: Secure network connectivity
- Customer-Managed Keys: Encryption with your own keys
- IP Filtering: Restrict access by IP address
```json
{
  "name": "AzureKeyVaultLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://myvault.vault.azure.net/"
    }
  }
}
```
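With the Key Vault linked service in place, other linked services can reference secrets from it rather than embedding credentials; a hedged sketch in which the secret name is hypothetical:
```json
{
  "name": "AzureSqlViaKeyVault",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "SqlConnectionString"
      }
    }
  }
}
```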
⚡ Performance Optimization¶
Key Performance Features:
- Parallel Copy: Partition data for parallel processing
- Data Integration Units (DIU): Scale copy activity performance
- Data Flow Cluster Sizing: Optimize Spark cluster configuration
- Compression: Enable compression for data transfer
- Staging: Use staged copy for better performance
```json
{
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
      "type": "ParquetSink"
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    },
    "dataIntegrationUnits": 32
  }
}
```
📊 Monitoring & Alerts¶
Built-in Monitoring:
- Pipeline Runs: Track execution status and duration
- Activity Runs: Detailed activity-level metrics
- Trigger Runs: Monitor scheduled executions
- Integration Runtime: Resource utilization metrics
- Data Flow Debug: Interactive debugging sessions
Azure Monitor Integration:
```json
{
  "diagnosticSettings": {
    "logs": [
      {
        "category": "PipelineRuns",
        "enabled": true
      },
      {
        "category": "ActivityRuns",
        "enabled": true
      }
    ],
    "metrics": [
      {
        "category": "AllMetrics",
        "enabled": true
      }
    ]
  }
}
```
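Run history can also be queried from the CLI. A sketch that lists pipeline runs from the last 24 hours; factory names reuse the Quick Start, and the date arithmetic assumes GNU date:
```bash
# Query pipeline runs updated within the last day
az datafactory pipeline-run query-by-factory \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --last-updated-after "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --last-updated-before "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```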
🔗 Integration Patterns¶
Synapse Analytics Integration¶
Direct integration with Azure Synapse for analytics workflows.
```mermaid
graph LR
    Sources[Data Sources] --> ADF[Data Factory]
    ADF --> Lake[Data Lake]
    Lake --> Synapse[Synapse Analytics]
    Synapse --> BI[Power BI]
```
Databricks Integration¶
Execute Databricks notebooks from ADF pipelines.
```json
{
  "name": "RunDatabricksNotebook",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Users/myuser/MyNotebook",
    "baseParameters": {
      "input": "value"
    }
  },
  "linkedServiceName": {
    "referenceName": "AzureDatabricks",
    "type": "LinkedServiceReference"
  }
}
```
Event-driven Pipelines¶
Trigger pipelines based on storage events.
```json
{
  "name": "BlobEventTrigger",
  "type": "BlobEventsTrigger",
  "properties": {
    "events": [
      "Microsoft.Storage.BlobCreated"
    ],
    "scope": "/subscriptions/{subscription}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage}",
    "blobPathBeginsWith": "/container/folder/",
    "blobPathEndsWith": ".csv"
  }
}
```
🆘 Troubleshooting¶
🔍 Common Issues¶
Copy Activity Failures¶
Problem: Copy activity fails with timeout errors
Solution:
- Increase DIUs for large data volumes
- Enable parallel copy with partitioning
- Check network connectivity between source and sink
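A hedged sketch of the corresponding copy activity settings; the values are illustrative starting points, not tuned recommendations:
```json
{
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "partitionOption": "DynamicRange"
    },
    "sink": { "type": "ParquetSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
  }
}
```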
Self-hosted IR Connectivity¶
Problem: Self-hosted IR cannot connect to cloud services
Solution:
- Verify firewall rules allow outbound connections
- Check proxy configuration if applicable
- Ensure IR has internet access for Azure service endpoints
Data Flow Performance¶
Problem: Data flows run slower than expected
Solution:
- Increase compute cluster size
- Enable partition optimization
- Review transformation logic for bottlenecks
- Use data flow debug to profile performance
📞 Getting Help¶
- Azure Support: Official Microsoft support channels
- Community Forums: Stack Overflow, Microsoft Q&A
- Documentation: Microsoft Learn
- GitHub: Azure Data Factory Feedback
Last Updated: 2025-01-28 | Service Version: V2 (Current) | Documentation Status: Complete