🎯 Azure Data Factory Fundamentals¶
Master the foundational concepts of Azure Data Factory (ADF) including architecture, components, and design patterns for enterprise data integration.
📋 Table of Contents¶
- What is Azure Data Factory?
- Core Components
- Architecture Overview
- Integration Patterns
- Use Cases
- When to Use ADF
- Next Steps
🌟 What is Azure Data Factory?¶
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data transformation.
Key Capabilities¶
- Data Movement: Copy data between 90+ supported data sources
- Data Transformation: Transform data using compute services like Azure Databricks, HDInsight, and Synapse
- Orchestration: Build complex workflows with conditional logic and dependencies
- Scheduling: Execute pipelines on-demand or with automated triggers
- Monitoring: Track pipeline execution and performance metrics
ADF vs Traditional ETL Tools¶
| Feature | Traditional ETL | Azure Data Factory |
|---|---|---|
| Infrastructure | On-premises servers | Serverless cloud service |
| Scalability | Manual scaling | Auto-scaling |
| Pricing Model | License + hardware costs | Pay-per-use |
| Maintenance | Manual updates | Managed service |
| Integration | Limited connectors | 90+ native connectors |
| Development | Code-heavy | Visual + code options |
🏗️ Core Components¶
Pipelines¶
A pipeline is a logical grouping of activities that together perform a task.
```json
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
                "outputs": [{ "referenceName": "SQLDataset", "type": "DatasetReference" }]
            }
        ]
    }
}
```
Pipeline Characteristics:
- Contains one or more activities
- Can be parameterized for reusability (see the sketch below)
- Supports dependencies and conditional execution
- Can be triggered manually or automatically
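As a minimal sketch of parameterization, the pipeline below declares a `TableName` parameter and passes it to the output dataset. The names (`ParameterizedCopyPipeline`, `TableName`, `BlobDataset`, `SQLDataset`) and the source/sink types are illustrative assumptions, not part of the earlier example; they presume `SQLDataset` itself defines a `tableName` parameter.

```json
{
    "name": "ParameterizedCopyPipeline",
    "properties": {
        "parameters": {
            "TableName": { "type": "String", "defaultValue": "dbo.Customer" }
        },
        "activities": [
            {
                "name": "CopyTable",
                "type": "Copy",
                "inputs": [{ "referenceName": "BlobDataset", "type": "DatasetReference" }],
                "outputs": [
                    {
                        "referenceName": "SQLDataset",
                        "type": "DatasetReference",
                        "parameters": { "tableName": "@pipeline().parameters.TableName" }
                    }
                ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```

At runtime the expression `@pipeline().parameters.TableName` resolves to whatever value the trigger or manual run supplies, so the same pipeline can copy any table.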
Activities¶
Activities represent processing steps within a pipeline.
Activity Categories:
- Data Movement Activities
    - Copy Activity: Move data between sources
    - Data Flow: Transform data at scale
- Data Transformation Activities
    - Databricks Notebook
    - HDInsight Hive/Pig/Spark
    - Stored Procedure
    - Custom Activity
- Control Activities (see the ForEach sketch after this list)
    - ForEach: Iterate over collections
    - If Condition: Conditional branching
    - Wait: Add delays
    - Web Activity: Call REST APIs
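As referenced above, here is a minimal sketch of a control activity: a ForEach that iterates over a list passed in as a pipeline parameter and runs one Copy activity per item. The names (`ForEachTable`, `TableList`, `SourceTableDataset`, `LakeParquetDataset`) are hypothetical placeholders for illustration.

```json
{
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "items": { "value": "@pipeline().parameters.TableList", "type": "Expression" },
        "isSequential": false,
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "inputs": [{ "referenceName": "SourceTableDataset", "type": "DatasetReference" }],
                "outputs": [{ "referenceName": "LakeParquetDataset", "type": "DatasetReference" }],
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderQuery": "SELECT * FROM @{item().name}"
                    },
                    "sink": { "type": "ParquetSink" }
                }
            }
        ]
    }
}
```

Setting `isSequential` to `false` lets the inner copies run in parallel; `@{item().name}` refers to the current element of the iterated collection.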
Datasets¶
Datasets identify data within data stores.
```json
{
    "name": "AzureSQLDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureSQLLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "AzureSqlTable",
        "typeProperties": {
            "tableName": "dbo.Customer"
        }
    }
}
```
Dataset Properties:
- Schema: Structure of the data
- Location: Where the data resides
- Format: Parquet, CSV, JSON, Avro, etc. (see the Parquet sketch below)
- Partitioning: How data is organized
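For a file-based format, a dataset pointing at Parquet files in Azure Data Lake Storage Gen2 might look like the following sketch. The linked service name `ADLSLinkedService`, the file system `raw`, and the folder path are assumptions for illustration only.

```json
{
    "name": "SalesParquetDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "ADLSLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "Parquet",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "raw",
                "folderPath": "sales/2025"
            },
            "compressionCodec": "snappy"
        }
    }
}
```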
Linked Services¶
Linked services define connection information to data sources.
Types of Linked Services:
- Data Stores: Azure Blob, Azure SQL, On-premises SQL Server
- Compute Services: Azure Databricks, HDInsight, Synapse
- Other Services: Azure Key Vault, Azure Function
```json
{
    "name": "AzureSQLLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;",
            "authenticationType": "ServicePrincipal",
            "servicePrincipalId": "xxxx-xxxx-xxxx-xxxx",
            "servicePrincipalKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "sql-password"
            },
            "tenant": "xxxx-xxxx-xxxx-xxxx"
        }
    }
}
```
Integration Runtime¶
The Integration Runtime (IR) provides the compute infrastructure for data movement and transformation.
IR Types:
- Azure Integration Runtime
    - Serverless compute
    - Data movement between cloud data stores
    - Dispatch activities to compute services
- Self-Hosted Integration Runtime (see the definition sketch after this list)
    - Installed on an on-premises machine or VM
    - Access data behind firewalls
    - Private network connectivity
- Azure-SSIS Integration Runtime
    - Execute SSIS packages in the cloud
    - Lift-and-shift SSIS workloads
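As referenced above, the JSON definition of a self-hosted IR is deliberately minimal: the actual runtime is installed on one or more machines inside your network and registered with an authentication key afterwards. The name below is illustrative.

```json
{
    "name": "OnPremSelfHostedIR",
    "properties": {
        "type": "SelfHosted",
        "description": "Runtime for reaching data sources behind the corporate firewall"
    }
}
```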
Triggers¶
Triggers determine when pipeline execution should start.
Trigger Types:
| Trigger Type | Description | Use Case |
|---|---|---|
| Schedule | Time-based execution | Daily batch processing |
| Tumbling Window | Fixed-size, non-overlapping intervals | Hourly aggregations |
| Event-Based | Responds to events (file arrival) | Real-time processing |
| Manual | On-demand execution | Ad-hoc processing |
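For example, a schedule trigger that starts a pipeline every day at 2 AM UTC might be defined as in the sketch below; it reuses the `CopyPipeline` name from the earlier example, and the start time is a placeholder.

```json
{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2025-01-01T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```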
🎨 Architecture Overview¶
High-Level Architecture¶
```mermaid
graph TB
    subgraph "Data Sources"
        A[On-Premises DBs]
        B[SaaS Applications]
        C[File Systems]
        D[Cloud Storage]
    end
    subgraph "Azure Data Factory"
        E[Self-Hosted IR]
        F[Azure IR]
        G[Pipeline Orchestration]
        H[Mapping Data Flows]
    end
    subgraph "Destinations"
        I[Azure Synapse]
        J[Azure SQL DB]
        K[Data Lake]
        L[Power BI]
    end
    A --> E
    B --> F
    C --> E
    D --> F
    E --> G
    F --> G
    G --> H
    H --> I
    H --> J
    H --> K
    K --> L
```

Data Factory Workflow¶
```mermaid
sequenceDiagram
    participant Source
    participant IR as Integration Runtime
    participant Pipeline
    participant DataFlow
    participant Destination
    Pipeline->>IR: Start Execution
    IR->>Source: Connect & Read Data
    Source->>IR: Return Data
    IR->>DataFlow: Pass Data for Transformation
    DataFlow->>DataFlow: Apply Transformations
    DataFlow->>IR: Return Transformed Data
    IR->>Destination: Write Data
    Destination->>Pipeline: Confirm Success
```

🔄 Integration Patterns¶
Pattern 1: Extract-Load-Transform (ELT)¶
Load raw data first, then transform in the destination system.
Benefits:
- Preserve raw data for reprocessing
- Leverage destination compute power
- Faster initial data loading
When to Use:
- Working with cloud data warehouses (Synapse, Snowflake)
- Need to preserve raw data
- Transformations are complex and compute-intensive
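A minimal ELT sketch: one Copy activity lands raw data in a Synapse staging table, then a Stored Procedure activity transforms it inside the warehouse. The dataset, linked service, and procedure names here are hypothetical.

```json
{
    "name": "ELTSalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "LoadRawToStaging",
                "type": "Copy",
                "inputs": [{ "referenceName": "RawSalesDataset", "type": "DatasetReference" }],
                "outputs": [{ "referenceName": "SynapseStagingDataset", "type": "DatasetReference" }],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "SqlDWSink" }
                }
            },
            {
                "name": "TransformInSynapse",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    { "activity": "LoadRawToStaging", "dependencyConditions": ["Succeeded"] }
                ],
                "linkedServiceName": {
                    "referenceName": "SynapseLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "storedProcedureName": "staging.usp_TransformSales"
                }
            }
        ]
    }
}
```

The `dependsOn` entry is what makes this ELT rather than two unrelated steps: the transformation only runs after the raw load succeeds, and it uses the warehouse's own compute.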
Pattern 2: Extract-Transform-Load (ETL)¶
Transform data during the movement process.
Benefits:
- Reduced storage requirements
- Data quality validation before loading
- Pre-aggregated data
When to Use:
- Simple transformations
- Limited destination storage
- Need data validation before loading
Pattern 3: Incremental Loading¶
Load only new or changed data since last execution.
```json
{
    "source": {
        "type": "AzureSqlSource",
        "sqlReaderQuery": "SELECT * FROM Sales WHERE ModifiedDate > '@{pipeline().parameters.LastRunTime}'"
    }
}
```
Benefits:
- Reduced data transfer
- Lower processing costs
- Faster execution times
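In practice the `LastRunTime` value is often read with a Lookup activity against a watermark table before the copy runs. The sketch below assumes a `dbo.WatermarkTable` and a `WatermarkDataset`, both illustrative names.

```json
{
    "name": "GetLastWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT MAX(WatermarkValue) AS LastRunTime FROM dbo.WatermarkTable"
        },
        "dataset": { "referenceName": "WatermarkDataset", "type": "DatasetReference" }
    }
}
```

Downstream activities can then reference the value with an expression such as `@activity('GetLastWatermark').output.firstRow.LastRunTime` and update the watermark after a successful load.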
Pattern 4: Event-Driven Processing¶
Trigger pipelines based on data arrival or business events.
Components:
- Event Grid for file arrival detection
- Storage Event Triggers
- Service Bus for business events
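A storage event trigger that fires when a CSV file lands in a container might look like the sketch below; the subscription, resource group, storage account, paths, and pipeline name are placeholders.

```json
{
    "name": "BlobArrivalTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<account>",
            "events": ["Microsoft.Storage.BlobCreated"],
            "blobPathBeginsWith": "/landing/blobs/incoming/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": true
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessNewFilePipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```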
💼 Use Cases¶
Use Case 1: Data Warehouse Migration¶
Scenario: Migrate on-premises SQL Server data warehouse to Azure Synapse.
ADF Solution:
- Self-hosted IR for secure connectivity
- Incremental copy activities for large tables
- Mapping data flows for transformation logic
- Scheduling triggers for automated sync
Use Case 2: Multi-Cloud Data Integration¶
Scenario: Consolidate data from AWS, GCP, and Azure into a unified data lake.
ADF Solution:
- Amazon S3 and Google Cloud Storage connectors
- Azure Data Lake Storage Gen2 as destination
- Data flows for standardization and cleansing
- Azure Purview for metadata management
Use Case 3: Real-Time Analytics Pipeline¶
Scenario: Process streaming IoT data for real-time dashboards.
ADF Solution:
- Event Hub or IoT Hub as source
- Stream Analytics for real-time processing
- ADF for batch aggregation and archival
- Power BI for visualization
Use Case 4: SaaS Data Consolidation¶
Scenario: Integrate data from Salesforce, Dynamics 365, and ServiceNow.
ADF Solution:
- Built-in SaaS connectors
- Scheduled triggers for daily sync
- Data flows for normalization
- Azure SQL Database for consolidated view
✅ When to Use ADF¶
ADF is Ideal For:¶
- Cloud-Native Data Integration: Moving data between Azure services
- Hybrid Scenarios: Connecting on-premises and cloud data sources
- Complex Orchestration: Multi-step workflows with dependencies
- Scalable Processing: Large-volume data movement and transformation
- Managed Service Benefits: Minimal infrastructure management
Consider Alternatives When:¶
- Real-Time Streaming: Use Azure Stream Analytics or Event Hubs
- Complex Transformations: Consider Azure Databricks or Synapse Spark
- Simple File Transfers: AzCopy or Storage Explorer might suffice
- Code-First Development: Azure Functions or custom applications
🎯 Key Concepts Summary¶
| Concept | Description | Example |
|---|---|---|
| Pipeline | Logical workflow container | Daily sales data processing |
| Activity | Processing step | Copy data from blob to SQL |
| Dataset | Data reference | Customer table in SQL database |
| Linked Service | Connection definition | Azure SQL Database connection |
| Integration Runtime | Compute infrastructure | Self-hosted IR for on-premises access |
| Trigger | Execution initiator | Daily at 2 AM schedule |
| Parameter | Dynamic configuration | Date range for data extraction |
🎯 Knowledge Check¶
Before proceeding, ensure you understand:
- What Azure Data Factory is and its core capabilities
- The six main components of ADF (Pipeline, Activity, Dataset, Linked Service, IR, Trigger)
- Difference between ETL and ELT patterns
- When to use Azure IR vs Self-Hosted IR
- Common ADF use cases and integration patterns
🚀 Next Steps¶
Now that you understand ADF fundamentals, proceed to:
→ 02. Environment Setup - Create and configure your first Data Factory
Module Progress: 1 of 18 complete
Tutorial Version: 1.0 | Last Updated: January 2025