
🎯 Azure Data Factory Fundamentals


Master the foundational concepts of Azure Data Factory (ADF) including architecture, components, and design patterns for enterprise data integration.


🌟 What is Azure Data Factory?

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data transformation.

Key Capabilities

  • Data Movement: Copy data between 90+ supported data sources
  • Data Transformation: Transform data using compute services like Azure Databricks, HDInsight, and Synapse
  • Orchestration: Build complex workflows with conditional logic and dependencies
  • Scheduling: Execute pipelines on-demand or with automated triggers
  • Monitoring: Track pipeline execution and performance metrics

ADF vs Traditional ETL Tools

| Feature | Traditional ETL | Azure Data Factory |
| --- | --- | --- |
| Infrastructure | On-premises servers | Serverless cloud service |
| Scalability | Manual scaling | Auto-scaling |
| Pricing Model | License + hardware costs | Pay-per-use |
| Maintenance | Manual updates | Managed service |
| Integration | Limited connectors | 90+ native connectors |
| Development | Code-heavy | Visual + code options |

🏗️ Core Components

Pipelines

A pipeline is a logical grouping of activities that together perform a task.

{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

Pipeline Characteristics:

  • Contain one or more activities
  • Can be parameterized for reusability (see the sketch after this list)
  • Support dependencies and conditional execution
  • Can be triggered manually or automatically
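
As an example of the parameterization noted above, here is a minimal sketch of a pipeline that accepts a table name at run time. The parameter name, query, and dataset references are illustrative, not part of this tutorial's running example:

{
  "name": "ParameterizedCopyPipeline",
  "properties": {
    "parameters": {
      "TableName": { "type": "String", "defaultValue": "dbo.Customer" }
    },
    "activities": [
      {
        "name": "CopyTable",
        "type": "Copy",
        "inputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM @{pipeline().parameters.TableName}"
          },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}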

Activities

Activities represent processing steps within a pipeline.

Activity Categories:

  1. Data Movement Activities
     • Copy Activity: Move data between supported sources and sinks
  2. Data Transformation Activities
     • Data Flow: Transform data at scale
     • Databricks Notebook
     • HDInsight Hive/Pig/Spark
     • Stored Procedure
     • Custom Activity
  3. Control Activities (see the ForEach sketch after this list)
     • ForEach: Iterate over collections
     • If Condition: Conditional branching
     • Wait: Add delays
     • Web Activity: Call REST APIs
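
To make the control category concrete, here is a sketch of a ForEach activity that fans a copy out over a list of tables passed in as a pipeline parameter. The parameter name (TableList) and the inner activity are illustrative:

{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.TableList",
      "type": "Expression"
    },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

Setting isSequential to false lets the iterations run in parallel, which is the usual choice for independent tables.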

Datasets

Datasets are named references that identify the data you want to use within data stores, such as a table, a file, or a folder.

{
  "name": "AzureSQLDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureSQLLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "AzureSqlTable",
    "typeProperties": {
      "tableName": "dbo.Customer"
    }
  }
}

Dataset Properties:

  • Schema: Structure of the data
  • Location: Where the data resides
  • Format: Parquet, CSV, JSON, Avro, etc.
  • Partitioning: How data is organized
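
For instance, a delimited-text (CSV) dataset in Blob storage combines the location and format properties above in a single definition. The linked service, container, and file names here are illustrative:

{
  "name": "SalesCsvDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureBlobLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}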

Linked Services

Linked services define the connection information Data Factory uses to reach external resources, including both data stores and compute services.

Types of Linked Services:

  1. Data Stores: Azure Blob, Azure SQL, On-premises SQL Server
  2. Compute Services: Azure Databricks, HDInsight, Synapse
  3. Other Services: Azure Key Vault, Azure Functions

{
  "name": "AzureSQLLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;",
      "authenticationType": "ServicePrincipal",
      "servicePrincipalId": "xxxx-xxxx-xxxx-xxxx",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-password"
      },
      "tenant": "xxxx-xxxx-xxxx-xxxx"
    }
  }
}

Integration Runtime

The Integration Runtime (IR) provides the compute infrastructure for data movement and transformation.

IR Types:

  1. Azure Integration Runtime
     • Serverless compute managed by Azure
     • Data movement between cloud data stores
     • Dispatch activities to compute services
  2. Self-Hosted Integration Runtime
     • Installed on an on-premises machine or a VM
     • Access data behind firewalls
     • Private network connectivity
  3. Azure-SSIS Integration Runtime
     • Execute SSIS packages in the cloud
     • Lift-and-shift SSIS workloads

Triggers

Triggers determine when pipeline execution should start.

Trigger Types:

| Trigger Type | Description | Use Case |
| --- | --- | --- |
| Schedule | Time-based execution | Daily batch processing |
| Tumbling Window | Fixed-size, non-overlapping intervals | Hourly aggregations |
| Event-Based | Responds to events (file arrival) | Near-real-time processing |
| Manual | On-demand execution | Ad-hoc processing |
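
As a concrete example of the Schedule row above, a trigger that runs the earlier CopyPipeline daily at 2 AM UTC might look like the following sketch (the name and start time are illustrative):

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}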

🎨 Architecture Overview

High-Level Architecture

graph TB
    subgraph "Data Sources"
        A[On-Premises DBs]
        B[SaaS Applications]
        C[File Systems]
        D[Cloud Storage]
    end

    subgraph "Azure Data Factory"
        E[Self-Hosted IR]
        F[Azure IR]
        G[Pipeline Orchestration]
        H[Mapping Data Flows]
    end

    subgraph "Destinations"
        I[Azure Synapse]
        J[Azure SQL DB]
        K[Data Lake]
        L[Power BI]
    end

    A --> E
    B --> F
    C --> E
    D --> F

    E --> G
    F --> G

    G --> H
    H --> I
    H --> J
    H --> K
    K --> L

Data Factory Workflow

sequenceDiagram
    participant Source
    participant IR as Integration Runtime
    participant Pipeline
    participant DataFlow
    participant Destination

    Pipeline->>IR: Start Execution
    IR->>Source: Connect & Read Data
    Source->>IR: Return Data
    IR->>DataFlow: Pass Data for Transformation
    DataFlow->>DataFlow: Apply Transformations
    DataFlow->>IR: Return Transformed Data
    IR->>Destination: Write Data
    Destination->>Pipeline: Confirm Success

🔄 Integration Patterns

Pattern 1: Extract-Load-Transform (ELT)

Load raw data first, then transform in the destination system.

Benefits:

  • Preserve raw data for reprocessing
  • Leverage destination compute power
  • Faster initial data loading

When to Use:

  • Working with cloud data warehouses (Synapse, Snowflake)
  • Need to preserve raw data
  • Transformations are complex and compute-intensive
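
A sketch of what ELT looks like inside a pipeline: a Copy activity lands the raw data, then a Stored Procedure activity transforms it inside the destination database. The activity and procedure names below are illustrative, and the fragment shows only the activities array:

{
  "activities": [
    {
      "name": "LoadRawSales",
      "type": "Copy",
      "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
      "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
      "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "AzureSqlSink" }
      }
    },
    {
      "name": "TransformInDatabase",
      "type": "SqlServerStoredProcedure",
      "dependsOn": [
        { "activity": "LoadRawSales", "dependencyConditions": ["Succeeded"] }
      ],
      "linkedServiceName": {
        "referenceName": "AzureSQLLinkedService",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "storedProcedureName": "dbo.usp_TransformSales"
      }
    }
  ]
}

The dependsOn block is also how ADF expresses the dependencies mentioned earlier: the transform runs only after the load succeeds.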

Pattern 2: Extract-Transform-Load (ETL)

Transform data during the movement process.

Benefits:

  • Reduced storage requirements
  • Data quality validation before loading
  • Pre-aggregated data

When to Use:

  • Simple transformations
  • Limited destination storage
  • Need data validation before loading

Pattern 3: Incremental Loading

Load only new or changed data since the last execution.

{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM Sales WHERE ModifiedDate > '@{pipeline().parameters.LastRunTime}'"
  }
}

Benefits:

  • Reduced data transfer
  • Lower processing costs
  • Faster execution times
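
In practice, the LastRunTime parameter shown above is usually populated by a high-watermark lookup at the start of the pipeline. A minimal sketch, assuming an illustrative watermark table dbo.WatermarkTable and a dataset named WatermarkDataset:

{
  "name": "LookupLastWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(WatermarkValue) AS LastRunTime FROM dbo.WatermarkTable"
    },
    "dataset": {
      "referenceName": "WatermarkDataset",
      "type": "DatasetReference"
    }
  }
}

Downstream activities can then read the value with @activity('LookupLastWatermark').output.firstRow.LastRunTime, and a final step typically writes the new high watermark back after a successful load.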

Pattern 4: Event-Driven Processing

Trigger pipelines based on data arrival or business events.

Components:

  • Event Grid for file arrival detection
  • Storage Event Triggers
  • Service Bus for business events
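
For example, a storage event trigger that fires when a CSV file lands in a blob container could be defined roughly as follows; the container path and the subscription, resource group, and account placeholders are illustrative:

{
  "name": "FileArrivalTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/input/blobs/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": ["Microsoft.Storage.BlobCreated"],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}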

💼 Use Cases

Use Case 1: Data Warehouse Migration

Scenario: Migrate on-premises SQL Server data warehouse to Azure Synapse.

ADF Solution:

  • Self-hosted IR for secure connectivity
  • Incremental copy activities for large tables
  • Mapping data flows for transformation logic
  • Scheduling triggers for automated sync

Use Case 2: Multi-Cloud Data Integration

Scenario: Consolidate data from AWS, GCP, and Azure into a unified data lake.

ADF Solution:

  • Amazon S3 and Google Cloud Storage connectors
  • Azure Data Lake Storage Gen2 as destination
  • Data flows for standardization and cleansing
  • Azure Purview for metadata management

Use Case 3: Real-Time Analytics Pipeline

Scenario: Process streaming IoT data for real-time dashboards.

ADF Solution:

  • Event Hub or IoT Hub as source
  • Stream Analytics for real-time processing
  • ADF for batch aggregation and archival
  • Power BI for visualization

Use Case 4: SaaS Data Consolidation

Scenario: Integrate data from Salesforce, Dynamics 365, and ServiceNow.

ADF Solution:

  • Built-in SaaS connectors
  • Scheduled triggers for daily sync
  • Data flows for normalization
  • Azure SQL Database for consolidated view

✅ When to Use ADF

ADF is Ideal For:

  • Cloud-Native Data Integration: Moving data between Azure services
  • Hybrid Scenarios: Connecting on-premises and cloud data sources
  • Complex Orchestration: Multi-step workflows with dependencies
  • Scalable Processing: Large-volume data movement and transformation
  • Managed Service Benefits: Minimal infrastructure management

Consider Alternatives When:

  • Real-Time Streaming: Use Azure Stream Analytics or Event Hubs
  • Complex Transformations: Consider Azure Databricks or Synapse Spark
  • Simple File Transfers: AzCopy or Storage Explorer might suffice
  • Code-First Development: Azure Functions or custom applications

🎯 Key Concepts Summary

| Concept | Description | Example |
| --- | --- | --- |
| Pipeline | Logical workflow container | Daily sales data processing |
| Activity | Processing step | Copy data from blob to SQL |
| Dataset | Data reference | Customer table in SQL database |
| Linked Service | Connection definition | Azure SQL Database connection |
| Integration Runtime | Compute infrastructure | Self-hosted IR for on-premises access |
| Trigger | Execution initiator | Daily at 2 AM schedule |
| Parameter | Dynamic configuration | Date range for data extraction |


🎯 Knowledge Check

Before proceeding, ensure you understand:

  • What Azure Data Factory is and its core capabilities
  • The six main components of ADF (Pipeline, Activity, Dataset, Linked Service, IR, Trigger)
  • Difference between ETL and ELT patterns
  • When to use Azure IR vs Self-Hosted IR
  • Common ADF use cases and integration patterns

🚀 Next Steps

Now that you understand ADF fundamentals, proceed to:

02. Environment Setup - Create and configure your first Data Factory


Module Progress: 1 of 18 complete

Tutorial Version: 1.0 Last Updated: January 2025