🏗️ Azure Data Factory¶
Cloud-based data integration service for creating, scheduling, and orchestrating ETL/ELT data pipelines at scale.
🌟 Service Overview¶
Azure Data Factory (ADF) is a fully managed, serverless data integration service that enables you to create data-driven workflows for orchestrating data movement and transformation at scale. It provides a code-free UI for intuitive authoring and a comprehensive platform for complex hybrid ETL, ELT, and data integration projects.
🔥 Key Value Propositions¶
- Code-free ETL: Visual pipeline designer with drag-and-drop interface
- 90+ Connectors: Built-in connectors for cloud and on-premises data sources
- Serverless Compute: Auto-scaling data flows powered by Apache Spark
- Hybrid Integration: Seamlessly connect on-premises and cloud data sources
- Enterprise-grade: CI/CD support, monitoring, and security features
🏗️ Architecture Overview¶
```mermaid
graph TB
    subgraph "Data Sources"
        OnPrem[On-premises<br/>Databases]
        Cloud[Cloud<br/>Services]
        SaaS[SaaS<br/>Applications]
        Files[File<br/>Systems]
    end
    subgraph "Azure Data Factory"
        IR[Integration<br/>Runtime]
        Pipeline[Pipelines &<br/>Activities]
        DataFlow[Mapping<br/>Data Flows]
        Control[Control<br/>Flow]
    end
    subgraph "Destinations"
        Lake[Data Lake<br/>Storage]
        DW[Synapse<br/>Analytics]
        DB[Azure SQL<br/>Database]
        Cosmos[Cosmos<br/>DB]
    end
    OnPrem --> IR
    Cloud --> IR
    SaaS --> IR
    Files --> IR
    IR --> Pipeline
    Pipeline --> DataFlow
    Pipeline --> Control
    DataFlow --> Lake
    DataFlow --> DW
    Control --> DB
    Control --> Cosmos
```
🛠️ Core Components¶
📊 Pipelines & Activities¶
What: A logical grouping of activities that together perform a unit of work in a data workflow.
Key Features:
- Copy Activity: Move data between sources and destinations
- Data Flow Activity: Transform data using visual data flows
- Stored Procedure Activity: Execute stored procedures
- Notebook Activity: Run Databricks notebooks
- Web Activity: Call custom REST endpoints
Use Cases:
- Data ingestion from multiple sources
- ETL/ELT pipeline orchestration
- Data migration and synchronization
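A minimal sketch of how activities are grouped and chained inside a pipeline (all names are hypothetical; dataset names reuse the Quick Start below). A Copy activity runs first, and a Stored Procedure activity fires only if it succeeds:
```json
{
  "name": "OrchestrationPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyOrders",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      },
      {
        "name": "LogLoadResult",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "CopyOrders", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "AzureSqlDatabase1", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "usp_LogLoad" }
      }
    ]
  }
}
```
The dependsOn block is what turns a flat list of activities into a directed workflow; Failed, Skipped, and Completed conditions are also available.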
🔄 Mapping Data Flows¶
Visual data transformation at scale without writing code.
Key Features:
- Visual Designer: Drag-and-drop transformation logic
- Spark Execution: Auto-scaled Spark clusters for processing
- 70+ Transformations: Join, aggregate, derive, filter, and more
- Debug Mode: Interactive data preview during development
Best For: Complex data transformations, data quality, aggregations
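Data flows are invoked from a pipeline through an Execute Data Flow activity; a hedged sketch, where the data flow name and cluster sizing are assumptions:
```json
{
  "name": "TransformSales",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataflow": {
      "referenceName": "SalesDataFlow",
      "type": "DataFlowReference"
    },
    "compute": {
      "computeType": "General",
      "coreCount": 8
    }
  }
}
```
Setting a TTL on the Azure IR that hosts data flows keeps a warm cluster between runs, which shortens start-up time (see the cost tips below).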
🔗 Integration Runtime¶
Compute infrastructure for data integration across different network environments.
Types:
- Azure IR: Cloud data sources and services
- Self-hosted IR: On-premises and private network sources
- Azure-SSIS IR: Lift-and-shift SSIS packages
Key Features:
- Hybrid connectivity
- High availability
- Network security
- Resource sharing
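Provisioning a self-hosted IR has two halves: register it in the factory, then install the IR node on the on-premises machine using an auth key. A sketch using the datafactory CLI extension (the IR name is an assumption; factory names reuse the Quick Start below):
```bash
# Register a self-hosted IR in the factory
az datafactory integration-runtime self-hosted create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --name SelfHostedIR

# Fetch the auth key to paste into the IR installer on-premises
az datafactory integration-runtime list-auth-key \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --integration-runtime-name SelfHostedIR
```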
🔀 Pipeline Patterns¶
Common pipeline design patterns for different scenarios.
Patterns Covered:
- Copy Pattern: Simple data movement
- Parent-Child Pattern: Pipeline orchestration
- Iterative Pattern: Loop over datasets (see the ForEach sketch after this list)
- Conditional Pattern: Branching logic
- Dependency Pattern: Activity chaining
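The iterative pattern, for example, is typically built on a ForEach activity. A hedged sketch (parameter and activity names are hypothetical) that fans a copy out over a list of tables:
```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.tableList",
      "type": "Expression"
    },
    "isSequential": false,
    "batchCount": 10,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```
Inside the loop, @item() refers to the current element of the list.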
🚀 CI/CD for Data Factory¶
Enterprise DevOps practices for data pipelines.
Capabilities:
- Git Integration: Azure DevOps or GitHub
- ARM Templates: Infrastructure as code
- Environment Promotion: Dev → Test → Prod
- Automated Testing: Pipeline validation
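Promotion between environments is typically a deployment of the ARM templates that ADF publishes to the adf_publish branch; a sketch in which the resource group, factory name, and template paths are assumptions:
```bash
# Deploy the published factory templates into the target environment
az deployment group create \
  --resource-group rg-adf-prod \
  --template-file ARMTemplateForFactory.json \
  --parameters ARMTemplateParametersForFactory.json \
  --parameters factoryName=adf-prod-factory
```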
🎯 Common Use Cases¶
📥 Data Ingestion¶
Ingest data from various sources into your data lake or warehouse.
Architecture: Source Systems → ADF → Data Lake/Synapse
Pattern: Incremental Copy
```json
{
  "name": "IncrementalCopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromSQL",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SqlSource",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "DataLakeSink",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT * FROM Orders WHERE ModifiedDate > '@{pipeline().parameters.watermark}'"
          },
          "sink": {
            "type": "ParquetSink"
          }
        }
      }
    ]
  }
}
```
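The watermark parameter above is usually supplied by a Lookup activity that reads the last high-water mark from a control table, and advanced by a Stored Procedure activity after a successful copy. A hedged sketch of the lookup; the control table and column names are hypothetical:
```json
{
  "name": "LookupOldWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "sqlReaderQuery": "SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Orders'"
    },
    "dataset": {
      "referenceName": "SqlSource",
      "type": "DatasetReference"
    }
  }
}
```
Downstream activities can then reference @activity('LookupOldWatermark').output.firstRow.WatermarkValue as the watermark value.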
🔄 ETL/ELT Workflows¶
Transform and load data for analytics workloads.
Architecture: Sources → ADF Data Flows → Data Warehouse
Pattern: Medallion Architecture
🏢 Enterprise Data Integration¶
Hybrid cloud and on-premises data integration.
Architecture: On-premises → Self-hosted IR → Cloud Services
Pattern: Hybrid Integration
🔁 Real-time Data Sync¶
Near real-time data synchronization across systems.
Architecture: Source DB → ADF → Target DB
Pattern: Change Data Capture
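One way to approximate near real-time sync without dedicated CDC infrastructure is a tumbling window trigger on a short interval, passing the window bounds into the pipeline as query filters. A hedged sketch (trigger name, pipeline name, and interval are assumptions):
```json
{
  "name": "Sync15Min",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 15,
      "startTime": "2025-01-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "SyncPipeline",
        "type": "PipelineReference"
      },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}
```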
📊 Pricing Guide¶
💰 Cost Components¶
| Component | Pricing Model | Unit | Typical Cost (USD) |
|---|---|---|---|
| Pipeline Orchestration | Per activity run | 1,000 runs | $1.00 |
| Data Movement | Per DIU-hour | DIU-hour | $0.25 |
| Data Flow (General Purpose) | Per vCore-hour | vCore-hour | $0.274 |
| Data Flow (Memory Optimized) | Per vCore-hour | vCore-hour | $0.548 |
| Self-hosted IR | Per node-hour | Node-hour | Free (IR software itself; activity runs on it are billed) |
| Azure-SSIS IR | Per node-hour | Node-hour | Varies by node size |
Prices shown are illustrative list prices and vary by region; check the official Azure Data Factory pricing page for current rates.
💡 Cost Optimization Tips¶
- Use Auto-pause for Data Flows: Set TTL to minimize idle compute costs
- Right-size DIUs: Start with default (4 DIUs) and adjust based on performance
- Batch Operations: Group multiple activities in single pipeline execution
- Self-hosted IR: Use existing on-premises compute to reduce costs
- Schedule Optimization: Run pipelines during off-peak hours when possible
- Incremental Loading: Use watermarks to process only changed data
💵 Example Monthly Cost:
Scenario: Daily ETL pipeline processing 100 GB of data
- Pipeline runs: 30 runs/month × $1.00 per 1,000 runs = $0.03
- Data movement: 100 GB/day at ~0.5 DIU-hours per GB ≈ 50 DIU-hours/day × 30 days × $0.25 = $375.00
- Data flow: 2 hours/day × 30 days × 16 vCores × $0.274 = $263.04
- Total: ~$638/month
🚀 Quick Start Guide¶
1️⃣ Create Data Factory¶
```bash
# The datafactory commands ship as an Azure CLI extension:
# az extension add --name datafactory

# Create resource group
az group create --name rg-adf-demo --location eastus

# Create data factory
az datafactory create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --location eastus
```
2️⃣ Create Linked Services¶
```json
{
  "name": "AzureSqlDatabase1",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;User ID=myuser;Password=********;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}
```
Avoid embedding credentials in connection strings for production; store the secret in Azure Key Vault and reference it from the linked service (see Security Configuration below).
3️⃣ Create Pipeline with Copy Activity¶
```json
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SourceDataset",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "SinkDataset",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource"
          },
          "sink": {
            "type": "BlobSink"
          }
        }
      }
    ]
  }
}
```
4️⃣ Trigger Pipeline¶
```bash
# Manual trigger
az datafactory pipeline create-run \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --pipeline-name CopyPipeline

# Create schedule trigger
az datafactory trigger create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --trigger-name DailyTrigger \
  --properties @trigger.json
```
🔧 Configuration & Management¶
🛡️ Security Configuration¶
Key Security Features:
- Managed Identity: Azure AD integration for passwordless authentication
- Azure Key Vault: Centralized secrets management
- Private Endpoints: Secure network connectivity
- Customer-Managed Keys: Encryption with your own keys
- IP Filtering: Restrict access by IP address
```json
{
  "name": "AzureKeyVaultLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://myvault.vault.azure.net/"
    }
  }
}
```
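With the Key Vault linked service in place, other linked services can reference secrets from it rather than embedding credentials; a hedged sketch in which the secret name is hypothetical:
```json
{
  "name": "AzureSqlViaKeyVault",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "SqlConnectionString"
      }
    }
  }
}
```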
⚡ Performance Optimization¶
Key Performance Features:
- Parallel Copy: Partition data for parallel processing
- Data Integration Units (DIU): Scale copy activity performance
- Data Flow Cluster Sizing: Optimize Spark cluster configuration
- Compression: Enable compression for data transfer
- Staging: Use staged copy for better performance
```json
{
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
      "type": "ParquetSink"
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    },
    "dataIntegrationUnits": 32
  }
}
```
📊 Monitoring & Alerts¶
Built-in Monitoring:
- Pipeline Runs: Track execution status and duration
- Activity Runs: Detailed activity-level metrics
- Trigger Runs: Monitor scheduled executions
- Integration Runtime: Resource utilization metrics
- Data Flow Debug: Interactive debugging sessions
Azure Monitor Integration:
```json
{
  "diagnosticSettings": {
    "logs": [
      {
        "category": "PipelineRuns",
        "enabled": true
      },
      {
        "category": "ActivityRuns",
        "enabled": true
      }
    ],
    "metrics": [
      {
        "category": "AllMetrics",
        "enabled": true
      }
    ]
  }
}
```
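Run history can also be queried from the CLI. A sketch that lists pipeline runs from the last 24 hours; factory names reuse the Quick Start, and the date arithmetic assumes GNU date:
```bash
# Query pipeline runs updated within the last day
az datafactory pipeline-run query-by-factory \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --last-updated-after "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --last-updated-before "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```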
🔗 Integration Patterns¶
Synapse Analytics Integration¶
Direct integration with Azure Synapse for analytics workflows.
```mermaid
graph LR
    Sources[Data Sources] --> ADF[Data Factory]
    ADF --> Lake[Data Lake]
    Lake --> Synapse[Synapse Analytics]
    Synapse --> BI[Power BI]
```
Databricks Integration¶
Execute Databricks notebooks from ADF pipelines.
```json
{
  "name": "RunDatabricksNotebook",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Users/myuser/MyNotebook",
    "baseParameters": {
      "input": "value"
    }
  },
  "linkedServiceName": {
    "referenceName": "AzureDatabricks",
    "type": "LinkedServiceReference"
  }
}
```
Event-driven Pipelines¶
Trigger pipelines based on storage events.
```json
{
  "name": "BlobEventTrigger",
  "type": "BlobEventsTrigger",
  "properties": {
    "events": [
      "Microsoft.Storage.BlobCreated"
    ],
    "scope": "/subscriptions/{subscription}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage}",
    "blobPathBeginsWith": "/container/folder/",
    "blobPathEndsWith": ".csv"
  }
}
```
🆘 Troubleshooting¶
🔍 Common Issues¶
Copy Activity Failures¶
Problem: Copy activity fails with timeout errors
Solution:
- Increase DIUs for large data volumes
- Enable parallel copy with partitioning
- Check network connectivity between source and sink
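A hedged sketch of the corresponding copy activity settings; the values are illustrative starting points, not tuned recommendations:
```json
{
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "partitionOption": "DynamicRange"
    },
    "sink": { "type": "ParquetSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
  }
}
```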
Self-hosted IR Connectivity¶
Problem: Self-hosted IR cannot connect to cloud services
Solution:
- Verify firewall rules allow outbound connections
- Check proxy configuration if applicable
- Ensure IR has internet access for Azure service endpoints
Data Flow Performance¶
Problem: Data flows run slower than expected
Solution:
- Increase compute cluster size
- Enable partition optimization
- Review transformation logic for bottlenecks
- Use data flow debug to profile performance
📞 Getting Help¶
- Azure Support: Official Microsoft support channels
- Community Forums: Stack Overflow, Microsoft Q&A
- Documentation: Microsoft Learn
- GitHub: Azure Data Factory Feedback
Last Updated: 2025-01-28 | Service Version: V2 (Current) | Documentation Status: Complete