🏗️ Azure Data Factory


Cloud-based data integration service for creating, scheduling, and orchestrating ETL/ELT data pipelines at scale.


🌟 Service Overview

Azure Data Factory (ADF) is a fully managed, serverless data integration service that enables you to create data-driven workflows for orchestrating data movement and transformation at scale. It provides a code-free UI for intuitive authoring and a comprehensive platform for complex hybrid ETL, ELT, and data integration projects.

🔥 Key Value Propositions

  • Code-free ETL: Visual pipeline designer with drag-and-drop interface
  • 90+ Connectors: Built-in connectors for cloud and on-premises data sources
  • Serverless Compute: Auto-scaling data flows powered by Apache Spark
  • Hybrid Integration: Seamlessly connect on-premises and cloud data sources
  • Enterprise-grade: CI/CD support, monitoring, and security features

🏗️ Architecture Overview

graph TB
    subgraph "Data Sources"
        OnPrem[On-premises<br/>Databases]
        Cloud[Cloud<br/>Services]
        SaaS[SaaS<br/>Applications]
        Files[File<br/>Systems]
    end

    subgraph "Azure Data Factory"
        IR[Integration<br/>Runtime]
        Pipeline[Pipelines &<br/>Activities]
        DataFlow[Mapping<br/>Data Flows]
        Control[Control<br/>Flow]
    end

    subgraph "Destinations"
        Lake[Data Lake<br/>Storage]
        DW[Synapse<br/>Analytics]
        DB[Azure SQL<br/>Database]
        Cosmos[Cosmos<br/>DB]
    end

    OnPrem --> IR
    Cloud --> IR
    SaaS --> IR
    Files --> IR

    IR --> Pipeline
    Pipeline --> DataFlow
    Pipeline --> Control

    DataFlow --> Lake
    DataFlow --> DW
    Control --> DB
    Control --> Cosmos

🛠️ Core Components

📊 Pipelines & Activities

What: Logical grouping of activities that perform a data workflow task.

Key Features:

  • Copy Activity: Move data between sources and destinations
  • Data Flow Activity: Transform data using visual data flows
  • Stored Procedure Activity: Execute stored procedures
  • Notebook Activity: Run Databricks notebooks
  • Web Activity: Call custom REST endpoints (sketched at the end of this section)

Use Cases:

  • Data ingestion from multiple sources
  • ETL/ELT pipeline orchestration
  • Data migration and synchronization
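
For example, a minimal Web Activity sketch that posts a notification once a copy succeeds; the URL, names, and payload are placeholders, and the dependsOn shape is the same activity chaining used by the Dependency Pattern below:

{
  "name": "NotifyOnCompletion",
  "type": "WebActivity",
  "dependsOn": [
    {
      "activity": "CopyData",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "url": "https://example.com/api/notify",
    "method": "POST",
    "headers": {
      "Content-Type": "application/json"
    },
    "body": "{\"status\": \"copy-complete\"}"
  }
}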

🔄 Mapping Data Flows

Spark Powered

Visual data transformation at scale without writing code.

Key Features:

  • Visual Designer: Drag-and-drop transformation logic
  • Spark Execution: Auto-scaled Spark clusters for processing
  • 70+ Transformations: Join, aggregate, derive, filter, and more
  • Debug Mode: Interactive data preview during development

Best For: Complex data transformations, data quality, aggregations
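
Within a pipeline, a data flow runs via an Execute Data Flow activity. A minimal sketch, where the data flow name and compute sizing are illustrative:

{
  "name": "TransformOrders",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataflow": {
      "referenceName": "OrdersDataFlow",
      "type": "DataFlowReference"
    },
    "compute": {
      "computeType": "General",
      "coreCount": 8
    }
  }
}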

📖 Detailed Guide →


🔗 Integration Runtime

Hybrid

Compute infrastructure for data integration across different network environments.

Types:

  • Azure IR: Cloud data sources and services
  • Self-hosted IR: On-premises and private network sources (definition sketched below)
  • Azure-SSIS IR: Lift-and-shift SSIS packages

Key Features:

  • Hybrid connectivity
  • High availability
  • Network security
  • Resource sharing
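
Setting up a self-hosted IR is a two-step process: define the runtime in the factory, then install the IR software on an on-premises node and register it with the generated authentication key. A minimal runtime definition sketch (name and description are illustrative):

{
  "name": "OnPremSelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runtime for on-premises SQL Server sources"
  }
}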

📖 Detailed Guide →


🔀 Pipeline Patterns

Common pipeline design patterns for different scenarios.

Patterns Covered:

  • Copy Pattern: Simple data movement
  • Parent-Child Pattern: Pipeline orchestration
  • Iterative Pattern: Loop over datasets (see the ForEach sketch after this list)
  • Conditional Pattern: Branching logic
  • Dependency Pattern: Activity chaining
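
As a sketch of the Iterative Pattern, a ForEach activity fanning out over a list parameter; the tableList parameter is hypothetical and the inner Copy activity is elided for brevity:

{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.tableList",
      "type": "Expression"
    },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy"
      }
    ]
  }
}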

📖 View Patterns →


🚀 CI/CD for Data Factory

Enterprise DevOps practices for data pipelines.

Capabilities:

  • Git Integration: Azure DevOps or GitHub
  • ARM Templates: Infrastructure as code (deployment sketched after this list)
  • Environment Promotion: Dev → Test → Prod
  • Automated Testing: Pipeline validation
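
As a sketch of environment promotion: publishing from the ADF UI generates ARMTemplateForFactory.json and its parameters file in the publish branch, which can then be deployed to a higher environment with the Azure CLI (resource group and factory names here are illustrative):

# Deploy the published factory template to the test environment
az deployment group create \
  --resource-group rg-adf-test \
  --template-file ARMTemplateForFactory.json \
  --parameters ARMTemplateParametersForFactory.json \
  --parameters factoryName=adf-test-factory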

📖 CI/CD Guide →


🎯 Common Use Cases

📥 Data Ingestion

Ingest data from various sources into your data lake or warehouse.

Architecture: Source Systems → ADF → Data Lake/Synapse

Pattern: Incremental Copy Pattern

{
  "name": "IncrementalCopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromSQL",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SqlSource",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "DataLakeSink",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT * FROM Orders WHERE ModifiedDate > '@{pipeline().parameters.watermark}'"
          },
          "sink": {
            "type": "ParquetSink"
          }
        }
      }
    ]
  }
}

🔄 ETL/ELT Workflows

Transform and load data for analytics workloads.

Architecture: Sources → ADF Data Flows → Data Warehouse

Pattern: Medallion Architecture
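
A sketch of the medallion flow as ADF typically orchestrates it, using the common bronze/silver/gold layer names:

graph LR
    Sources[Data Sources] --> Bronze[Bronze<br/>Raw]
    Bronze --> Silver[Silver<br/>Cleansed]
    Silver --> Gold[Gold<br/>Curated]
    Gold --> DW[Data Warehouse]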

🏢 Enterprise Data Integration

Hybrid cloud and on-premises data integration.

Architecture: On-premises → Self-hosted IR → Cloud Services

Pattern: Hybrid Integration

🔁 Real-time Data Sync

Near real-time data synchronization across systems.

Architecture: Source DB → ADF → Target DB

Pattern: Change Data Capture
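
One common way to drive this pattern is a tumbling window trigger that hands each window's boundaries to the pipeline as parameters. A minimal sketch, where the pipeline and parameter names are placeholders:

{
  "name": "SyncEvery5Minutes",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 5,
      "startTime": "2025-01-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "SyncPipeline",
        "type": "PipelineReference"
      },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}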


📊 Pricing Guide

💰 Cost Components

| Component | Pricing Model | Unit | Typical Cost |
|---|---|---|---|
| Pipeline Orchestration | Per activity run | 1,000 runs | $1.00 |
| Data Movement | Per DIU-hour | DIU-hour | $0.25 |
| Data Flow (General Purpose) | Per vCore-hour | vCore-hour | $0.274 |
| Data Flow (Memory Optimized) | Per vCore-hour | vCore-hour | $0.548 |
| Self-hosted IR | Per node-hour | Node-hour | Free |
| Azure-SSIS IR | Per node-hour | Node-hour | Varies by size |

💡 Cost Optimization Tips

  1. Use Auto-pause for Data Flows: Set a TTL to minimize idle compute costs (sketched after this list)
  2. Right-size DIUs: Start with default (4 DIUs) and adjust based on performance
  3. Batch Operations: Group multiple activities in single pipeline execution
  4. Self-hosted IR: Use existing on-premises compute to reduce costs
  5. Schedule Optimization: Run pipelines during off-peak hours when possible
  6. Incremental Loading: Use watermarks to process only changed data
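
For tip 1, the data flow TTL lives on the Azure Integration Runtime that executes the flows; a sketch of a managed IR with a 10-minute TTL (name, region, and sizing are illustrative):

{
  "name": "DataFlowRuntime",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "East US",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}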

💵 Example Monthly Cost:

Scenario: Daily ETL pipeline processing 100GB of data

Pipeline Runs:       30 runs/month × $1/1,000 runs ≈ $0.03
Data Movement:       100 GB/day × 0.5 DIU-hours/GB × 30 days × $0.25/DIU-hour = $375.00
Data Flow:           2 hours/day × 30 days × 16 vCores × $0.274/vCore-hour = $263.04

Total: ~$638/month

📖 Detailed Pricing Guide →


🚀 Quick Start Guide

1️⃣ Create Data Factory

# Requires the Azure CLI datafactory extension
az extension add --name datafactory

# Create resource group
az group create --name rg-adf-demo --location eastus

# Create data factory
az datafactory create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --location eastus

2️⃣ Create Linked Services

{
  "name": "AzureSqlDatabase1",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;User ID=myuser;Password=********;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}

3️⃣ Create Pipeline with Copy Activity

{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SourceDataset",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "SinkDataset",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "SqlSource"
          },
          "sink": {
            "type": "BlobSink"
          }
        }
      }
    ]
  }
}
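
The pipeline above references SourceDataset and SinkDataset, which are defined separately. A sketch of the source side as an Azure SQL table dataset, reusing the linked service from step 2 (schema and table names are placeholders):

{
  "name": "SourceDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "AzureSqlDatabase1",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "schema": "dbo",
      "table": "Orders"
    }
  }
}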

4️⃣ Trigger Pipeline

# Manual trigger
az datafactory pipeline create-run \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --pipeline-name CopyPipeline

# Create schedule trigger
az datafactory trigger create \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --trigger-name DailyTrigger \
  --properties @trigger.json
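
A minimal trigger.json sketch for the schedule trigger above (the start time is illustrative):

{
  "type": "ScheduleTrigger",
  "typeProperties": {
    "recurrence": {
      "frequency": "Day",
      "interval": 1,
      "startTime": "2025-02-01T06:00:00Z",
      "timeZone": "UTC"
    }
  },
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "CopyPipeline",
        "type": "PipelineReference"
      }
    }
  ]
}

Triggers are created in a stopped state, so start the trigger once it exists:

# Start the trigger
az datafactory trigger start \
  --resource-group rg-adf-demo \
  --factory-name adf-demo-factory \
  --name DailyTrigger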

🔧 Configuration & Management

🛡️ Security Configuration

Key Security Features:

  • Managed Identity: Azure AD integration for passwordless authentication
  • Azure Key Vault: Centralized secrets management
  • Private Endpoints: Secure network connectivity
  • Customer-Managed Keys: Encryption with your own keys
  • IP Filtering: Restrict access by IP address

Example: an Azure Key Vault linked service that centralizes secrets management:

{
  "name": "AzureKeyVaultLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://myvault.vault.azure.net/"
    }
  }
}
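
Other linked services can then resolve their secrets from the vault instead of embedding them inline (compare the Quick Start connection string); a sketch where the secret name is a placeholder:

{
  "name": "AzureSqlDatabase1",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLinkedService",
          "type": "LinkedServiceReference"
        },
        "secretName": "SqlConnectionString"
      }
    }
  }
}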

📖 Security Best Practices →

⚡ Performance Optimization

Key Performance Features:

  • Parallel Copy: Partition data for parallel processing
  • Data Integration Units (DIU): Scale copy activity performance
  • Data Flow Cluster Sizing: Optimize Spark cluster configuration
  • Compression: Enable compression for data transfer
  • Staging: Use staged copy for better performance

Example: copy activity settings combining partitioned reads, staged copy, and explicit DIUs:

{
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
      "type": "ParquetSink"
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    },
    "dataIntegrationUnits": 32
  }
}

📖 Performance Tuning Guide →

📊 Monitoring & Alerts

Built-in Monitoring:

  • Pipeline Runs: Track execution status and duration
  • Activity Runs: Detailed activity-level metrics
  • Trigger Runs: Monitor scheduled executions
  • Integration Runtime: Resource utilization metrics
  • Data Flow Debug: Interactive debugging sessions

Azure Monitor Integration:

{
  "diagnosticSettings": {
    "logs": [
      {
        "category": "PipelineRuns",
        "enabled": true
      },
      {
        "category": "ActivityRuns",
        "enabled": true
      }
    ],
    "metrics": [
      {
        "category": "AllMetrics",
        "enabled": true
      }
    ]
  }
}
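
A sketch of enabling these categories from the Azure CLI, assuming an existing Log Analytics workspace (all resource IDs are placeholders):

az monitor diagnostic-settings create \
  --name adf-diagnostics \
  --resource "/subscriptions/{subscription}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{factory}" \
  --workspace "{log-analytics-workspace-resource-id}" \
  --logs '[{"category": "PipelineRuns", "enabled": true}, {"category": "ActivityRuns", "enabled": true}]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'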

📖 Monitoring Guide →


🔗 Integration Patterns

Synapse Analytics Integration

Direct integration with Azure Synapse for analytics workflows.

graph LR
    Sources[Data Sources] --> ADF[Data Factory]
    ADF --> Lake[Data Lake]
    Lake --> Synapse[Synapse Analytics]
    Synapse --> BI[Power BI]

Databricks Integration

Execute Databricks notebooks from ADF pipelines.

{
  "name": "RunDatabricksNotebook",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Users/myuser/MyNotebook",
    "baseParameters": {
      "input": "value"
    }
  },
  "linkedServiceName": {
    "referenceName": "AzureDatabricks",
    "type": "LinkedServiceReference"
  }
}

Event-driven Pipelines

Trigger pipelines based on storage events.

{
  "name": "BlobEventTrigger",
  "type": "BlobEventsTrigger",
  "properties": {
    "events": [
      "Microsoft.Storage.BlobCreated"
    ],
    "scope": "/subscriptions/{subscription}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage}",
    "blobPathBeginsWith": "/container/folder/",
    "blobPathEndsWith": ".csv"
  }
}

📖 Integration Examples →


🎓 Learning Resources

🚀 Getting Started

📖 Deep Dive Guides

🔧 Advanced Topics


🆘 Troubleshooting

🔍 Common Issues

Copy Activity Failures

Problem: Copy activity fails with timeout errors

Solution:

  • Increase DIUs for large data volumes
  • Enable parallel copy with partitioning
  • Check network connectivity between source and sink

Self-hosted IR Connectivity

Problem: Self-hosted IR cannot connect to cloud services

Solution:

  • Verify firewall rules allow outbound connections
  • Check proxy configuration if applicable
  • Ensure the IR has internet access to Azure service endpoints

Data Flow Performance

Problem: Data flows run slower than expected

Solution:

  • Increase compute cluster size
  • Enable partition optimization
  • Review transformation logic for bottlenecks
  • Use data flow debug to profile performance

📞 Getting Help

📖 Troubleshooting Guide →


🔗 Service Documentation

📊 Architecture Patterns

💻 Code Examples


Last Updated: 2025-01-28 · Service Version: V2 (Current) · Documentation Status: Complete