
🎯 Azure Data Factory Fundamentals


Master the foundational concepts of Azure Data Factory (ADF) including architecture, components, and design patterns for enterprise data integration.


🌟 What is Azure Data Factory?

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data transformation.

Key Capabilities

  • Data Movement: Copy data between 90+ supported data sources
  • Data Transformation: Transform data using compute services like Azure Databricks, HDInsight, and Synapse
  • Orchestration: Build complex workflows with conditional logic and dependencies
  • Scheduling: Execute pipelines on-demand or with automated triggers
  • Monitoring: Track pipeline execution and performance metrics

ADF vs Traditional ETL Tools

| Feature | Traditional ETL | Azure Data Factory |
| --- | --- | --- |
| Infrastructure | On-premises servers | Serverless cloud service |
| Scalability | Manual scaling | Auto-scaling |
| Pricing Model | License + hardware costs | Pay-per-use |
| Maintenance | Manual updates | Managed service |
| Integration | Limited connectors | 90+ native connectors |
| Development | Code-heavy | Visual + code options |

🏗️ Core Components

Pipelines

A pipeline is a logical grouping of activities that together perform a task.

{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

Pipeline Characteristics:

  • Contain one or more activities
  • Can be parameterized for reusability (see the sketch after this list)
  • Support dependencies and conditional execution
  • Can be triggered manually or automatically
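
As an example of the parameterization noted above, here is a minimal sketch of a pipeline that accepts a table name at run time. The parameter name, query, and dataset references are illustrative, not part of this tutorial's running example:

{
  "name": "ParameterizedCopyPipeline",
  "properties": {
    "parameters": {
      "TableName": { "type": "String", "defaultValue": "dbo.Customer" }
    },
    "activities": [
      {
        "name": "CopyTable",
        "type": "Copy",
        "inputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM @{pipeline().parameters.TableName}"
          },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}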

Activities

Activities represent processing steps within a pipeline.

Activity Categories:

  1. Data Movement Activities
     • Copy Activity: Move data between supported sources and sinks
  2. Data Transformation Activities
     • Data Flow: Transform data at scale
     • Databricks Notebook
     • HDInsight Hive/Pig/Spark
     • Stored Procedure
     • Custom Activity
  3. Control Activities (see the ForEach sketch after this list)
     • ForEach: Iterate over collections
     • If Condition: Conditional branching
     • Wait: Add delays
     • Web Activity: Call REST APIs
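
To make the control category concrete, here is a sketch of a ForEach activity that fans a copy out over a list of tables passed in as a pipeline parameter. The parameter name (TableList) and the inner activity are illustrative:

{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.TableList",
      "type": "Expression"
    },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

Setting isSequential to false lets the iterations run in parallel, which is the usual choice for independent tables.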

Datasets

Datasets are named references that identify the data you want to use within data stores, such as a table, a file, or a folder.

{
  "name": "AzureSQLDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureSQLLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "AzureSqlTable",
    "typeProperties": {
      "tableName": "dbo.Customer"
    }
  }
}

Dataset Properties:

  • Schema: Structure of the data
  • Location: Where the data resides
  • Format: Parquet, CSV, JSON, Avro, etc.
  • Partitioning: How data is organized
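
For instance, a delimited-text (CSV) dataset in Blob storage combines the location and format properties above in a single definition. The linked service, container, and file names here are illustrative:

{
  "name": "SalesCsvDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureBlobLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}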

Linked Services

Linked services define the connection information Data Factory uses to reach external resources, including both data stores and compute services.

Types of Linked Services:

  1. Data Stores: Azure Blob, Azure SQL, On-premises SQL Server
  2. Compute Services: Azure Databricks, HDInsight, Synapse
  3. Other Services: Azure Key Vault, Azure Functions

{
  "name": "AzureSQLLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;",
      "authenticationType": "ServicePrincipal",
      "servicePrincipalId": "xxxx-xxxx-xxxx-xxxx",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-password"
      },
      "tenant": "xxxx-xxxx-xxxx-xxxx"
    }
  }
}

Integration Runtime

The Integration Runtime (IR) provides the compute infrastructure for data movement and transformation.

IR Types:

  1. Azure Integration Runtime
     • Serverless compute managed by Azure
     • Data movement between cloud data stores
     • Dispatch activities to compute services
  2. Self-Hosted Integration Runtime
     • Installed on an on-premises machine or a VM
     • Access data behind firewalls
     • Private network connectivity
  3. Azure-SSIS Integration Runtime
     • Execute SSIS packages in the cloud
     • Lift-and-shift SSIS workloads

Triggers

Triggers determine when pipeline execution should start.

Trigger Types:

| Trigger Type | Description | Use Case |
| --- | --- | --- |
| Schedule | Time-based execution | Daily batch processing |
| Tumbling Window | Fixed-size, non-overlapping intervals | Hourly aggregations |
| Event-Based | Responds to events (file arrival) | Near-real-time processing |
| Manual | On-demand execution | Ad-hoc processing |
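
As a concrete example of the Schedule row above, a trigger that runs the earlier CopyPipeline daily at 2 AM UTC might look like the following sketch (the name and start time are illustrative):

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}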

🎨 Architecture Overview

High-Level Architecture

graph TB
    subgraph "Data Sources"
        A[On-Premises DBs]
        B[SaaS Applications]
        C[File Systems]
        D[Cloud Storage]
    end

    subgraph "Azure Data Factory"
        E[Self-Hosted IR]
        F[Azure IR]
        G[Pipeline Orchestration]
        H[Mapping Data Flows]
    end

    subgraph "Destinations"
        I[Azure Synapse]
        J[Azure SQL DB]
        K[Data Lake]
        L[Power BI]
    end

    A --> E
    B --> F
    C --> E
    D --> F

    E --> G
    F --> G

    G --> H
    H --> I
    H --> J
    H --> K
    K --> L

Data Factory Workflow

sequenceDiagram
    participant Source
    participant IR as Integration Runtime
    participant Pipeline
    participant DataFlow
    participant Destination

    Pipeline->>IR: Start Execution
    IR->>Source: Connect & Read Data
    Source->>IR: Return Data
    IR->>DataFlow: Pass Data for Transformation
    DataFlow->>DataFlow: Apply Transformations
    DataFlow->>IR: Return Transformed Data
    IR->>Destination: Write Data
    Destination->>Pipeline: Confirm Success

🔄 Integration Patterns

Pattern 1: Extract-Load-Transform (ELT)

Load raw data first, then transform in the destination system.

Benefits:

  • Preserve raw data for reprocessing
  • Leverage destination compute power
  • Faster initial data loading

When to Use:

  • Working with cloud data warehouses (Synapse, Snowflake)
  • Need to preserve raw data
  • Transformations are complex and compute-intensive
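
A sketch of what ELT looks like inside a pipeline: a Copy activity lands the raw data, then a Stored Procedure activity transforms it inside the destination database. The activity and procedure names below are illustrative, and the fragment shows only the activities array:

{
  "activities": [
    {
      "name": "LoadRawSales",
      "type": "Copy",
      "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
      "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }],
      "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "AzureSqlSink" }
      }
    },
    {
      "name": "TransformInDatabase",
      "type": "SqlServerStoredProcedure",
      "dependsOn": [
        { "activity": "LoadRawSales", "dependencyConditions": ["Succeeded"] }
      ],
      "linkedServiceName": {
        "referenceName": "AzureSQLLinkedService",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "storedProcedureName": "dbo.usp_TransformSales"
      }
    }
  ]
}

The dependsOn block is also how ADF expresses the dependencies mentioned earlier: the transform runs only after the load succeeds.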

Pattern 2: Extract-Transform-Load (ETL)

Transform data during the movement process.

Benefits:

  • Reduced storage requirements
  • Data quality validation before loading
  • Pre-aggregated data

When to Use:

  • Simple transformations
  • Limited destination storage
  • Need data validation before loading

Pattern 3: Incremental Loading

Load only new or changed data since the last execution.

{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM Sales WHERE ModifiedDate > '@{pipeline().parameters.LastRunTime}'"
  }
}

Benefits:

  • Reduced data transfer
  • Lower processing costs
  • Faster execution times
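
In practice, the LastRunTime parameter shown above is usually populated by a high-watermark lookup at the start of the pipeline. A minimal sketch, assuming an illustrative watermark table dbo.WatermarkTable and a dataset named WatermarkDataset:

{
  "name": "LookupLastWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(WatermarkValue) AS LastRunTime FROM dbo.WatermarkTable"
    },
    "dataset": {
      "referenceName": "WatermarkDataset",
      "type": "DatasetReference"
    }
  }
}

Downstream activities can then read the value with @activity('LookupLastWatermark').output.firstRow.LastRunTime, and a final step typically writes the new high watermark back after a successful load.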

Pattern 4: Event-Driven Processing

Trigger pipelines based on data arrival or business events.

Components:

  • Event Grid for file arrival detection
  • Storage Event Triggers
  • Service Bus for business events
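
For example, a storage event trigger that fires when a CSV file lands in a blob container could be defined roughly as follows; the container path and the subscription, resource group, and account placeholders are illustrative:

{
  "name": "FileArrivalTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/input/blobs/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": ["Microsoft.Storage.BlobCreated"],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}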

💼 Use Cases

Use Case 1: Data Warehouse Migration

Scenario: Migrate on-premises SQL Server data warehouse to Azure Synapse.

ADF Solution:

  • Self-hosted IR for secure connectivity
  • Incremental copy activities for large tables
  • Mapping data flows for transformation logic
  • Scheduling triggers for automated sync

Use Case 2: Multi-Cloud Data Integration

Scenario: Consolidate data from AWS, GCP, and Azure into a unified data lake.

ADF Solution:

  • Amazon S3 and Google Cloud Storage connectors
  • Azure Data Lake Storage Gen2 as destination
  • Data flows for standardization and cleansing
  • Azure Purview for metadata management

Use Case 3: Real-Time Analytics Pipeline

Scenario: Process streaming IoT data for real-time dashboards.

ADF Solution:

  • Event Hub or IoT Hub as source
  • Stream Analytics for real-time processing
  • ADF for batch aggregation and archival
  • Power BI for visualization

Use Case 4: SaaS Data Consolidation

Scenario: Integrate data from Salesforce, Dynamics 365, and ServiceNow.

ADF Solution:

  • Built-in SaaS connectors
  • Scheduled triggers for daily sync
  • Data flows for normalization
  • Azure SQL Database for consolidated view

✅ When to Use ADF

ADF is Ideal For:

  • Cloud-Native Data Integration: Moving data between Azure services
  • Hybrid Scenarios: Connecting on-premises and cloud data sources
  • Complex Orchestration: Multi-step workflows with dependencies
  • Scalable Processing: Large-volume data movement and transformation
  • Managed Service Benefits: Minimal infrastructure management

Consider Alternatives When:

  • Real-Time Streaming: Use Azure Stream Analytics or Event Hubs
  • Complex Transformations: Consider Azure Databricks or Synapse Spark
  • Simple File Transfers: AzCopy or Storage Explorer might suffice
  • Code-First Development: Azure Functions or custom applications

🎯 Key Concepts Summary

| Concept | Description | Example |
| --- | --- | --- |
| Pipeline | Logical workflow container | Daily sales data processing |
| Activity | Processing step | Copy data from blob to SQL |
| Dataset | Data reference | Customer table in SQL database |
| Linked Service | Connection definition | Azure SQL Database connection |
| Integration Runtime | Compute infrastructure | Self-hosted IR for on-premises access |
| Trigger | Execution initiator | Daily at 2 AM schedule |
| Parameter | Dynamic configuration | Date range for data extraction |


🎯 Knowledge Check

Before proceeding, ensure you understand:

  • What Azure Data Factory is and its core capabilities
  • The six main components of ADF (Pipeline, Activity, Dataset, Linked Service, IR, Trigger)
  • Difference between ETL and ELT patterns
  • When to use Azure IR vs Self-Hosted IR
  • Common ADF use cases and integration patterns

🚀 Next Steps

Now that you understand ADF fundamentals, proceed to:

02. Environment Setup - Create and configure your first Data Factory


Module Progress: 1 of 18 complete

Tutorial Version: 1.0 Last Updated: January 2025