🗺️ Data Flow Interactive Designer¶
📋 Overview¶
Interactive visual designer for creating Azure Synapse Mapping Data Flows. Build complex ETL transformations using a code-free, drag-and-drop interface with real-time data preview and optimization suggestions.
Duration: 45-60 minutes Format: Visual transformation designer Prerequisites: Understanding of ETL concepts and data transformations
🎯 Learning Objectives¶
- Design data transformations visually without code
- Implement common ETL patterns (SCD, deduplication, aggregations)
- Optimize data flows for performance
- Debug and preview data at each transformation step
- Export data flow definitions for deployment
- Understand Spark execution behind data flows
🚀 Prerequisites and Setup¶
Access Requirements¶
- Browser-Based: HTML5-compatible browser
- Sample Data: Pre-loaded datasets for testing
- Real-Time Preview: Data sampling at each step
- Validation: Automatic schema and logic checking
Transformation Components¶
const transformationComponents = {
sources: ['Azure SQL', 'Data Lake', 'Blob Storage', 'Cosmos DB'],
destinations: ['Azure SQL', 'Data Lake', 'Synapse SQL Pool'],
rowModifier: ['Filter', 'Select', 'Derived Column', 'Surrogate Key'],
schemaModifier: ['Pivot', 'Unpivot', 'Flatten', 'Parse'],
multipleInputs: ['Join', 'Union', 'Exists', 'Lookup'],
rowset: ['Aggregate', 'Sort', 'Window', 'Rank'],
formatters: ['Stringify', 'Flatten', 'Assert']
};
🎨 Designer Interface¶
Transformation Patterns¶
Pattern 1: Slowly Changing Dimension (Type 2)¶
{
"name": "SCD_Type2_Customer",
"transformations": [
{
"name": "SourceCustomers",
"type": "source",
"dataset": "CustomerDelta"
},
{
"name": "CurrentDimension",
"type": "source",
"dataset": "DimCustomer"
},
{
"name": "DeriveHashKey",
"type": "derivedColumn",
"columns": [
{
"name": "HashKey",
"expression": "sha2(256, concat(CustomerName, Address, City))"
}
]
},
{
"name": "Lookup",
"type": "lookup",
"leftStream": "DeriveHashKey",
"rightStream": "CurrentDimension",
"lookupConditions": ["CustomerID == CustomerID"],
"multiple": false
},
{
"name": "DeriveAction",
"type": "derivedColumn",
"columns": [
{
"name": "Action",
"expression": "iif(isNull(DimCustomerID), 'INSERT', iif(HashKey != CurrentHashKey, 'UPDATE', 'NOCHANGE'))"
}
]
},
{
"name": "FilterChanges",
"type": "filter",
"condition": "Action != 'NOCHANGE'"
},
{
"name": "SetDates",
"type": "derivedColumn",
"columns": [
{
"name": "EffectiveDate",
"expression": "currentTimestamp()"
},
{
"name": "IsActive",
"expression": "true()"
}
]
},
{
"name": "SinkDimension",
"type": "sink",
"dataset": "DimCustomer",
"updateMethod": "upsert",
"keys": ["CustomerID"]
}
]
}
Pattern 2: Deduplication¶
{
"name": "DeduplicateRecords",
"transformations": [
{
"name": "Source",
"type": "source"
},
{
"name": "Window",
"type": "window",
"over": "CustomerID",
"orderBy": "LastModifiedDate desc",
"windowColumns": [
{
"name": "RowNumber",
"windowFunction": "rowNumber()"
}
]
},
{
"name": "FilterFirst",
"type": "filter",
"condition": "RowNumber == 1"
},
{
"name": "Sink",
"type": "sink"
}
]
}
📚 Tutorial Examples¶
Tutorial 1: Basic Transformations¶
Step 1: Filter and Select
// Visual configuration translates to:
df = df.filter(col("Revenue") > 1000)
.select("CustomerID", "OrderDate", "Revenue");
Step 2: Derived Columns
// Add calculated fields
df = df.withColumn("TaxAmount", col("Revenue") * 0.08)
.withColumn("TotalWithTax", col("Revenue") + col("TaxAmount"));
Step 3: Aggregate
// Group and aggregate
df = df.groupBy("CustomerID", "OrderMonth")
.agg(
sum("Revenue").alias("TotalRevenue"),
count("*").alias("OrderCount")
);
Tutorial 2: Advanced Joins¶
Multiple Join Types:
{
"joins": [
{
"type": "inner",
"left": "Orders",
"right": "Customers",
"condition": "Orders.CustomerID == Customers.CustomerID"
},
{
"type": "left",
"left": "Orders",
"right": "Products",
"condition": "Orders.ProductID == Products.ProductID"
}
]
}
💡 Performance Optimization Features¶
Auto-Optimization Suggestions¶
const optimizationRules = {
partitioning: {
rule: 'Large dataset without partitioning',
suggestion: 'Add hash partitioning on join keys',
impact: 'High - 40-60% improvement'
},
broadcasting: {
rule: 'Small dimension table in join',
suggestion: 'Enable broadcast join',
impact: 'High - Eliminates shuffle'
},
caching: {
rule: 'Data flow reused multiple times',
suggestion: 'Enable sink caching',
impact: 'Medium - Reduces recomputation'
}
};
Data Preview and Debugging¶
// Preview data at any transformation step
const previewData = {
sampleSize: 100,
refreshInterval: 5000, // 5 seconds
showSchema: true,
showStatistics: true
};
// Debug mode features
const debugMode = {
rowLimits: true,
maxRows: 1000,
enableLogging: true,
showExecutionPlan: true
};
🔧 Troubleshooting¶
Common Issues¶
Issue: Schema Mismatch
// Auto-fix suggestions
const schemaFix = {
issue: 'Column type mismatch in join',
source1: 'CustomerID (string)',
source2: 'CustomerID (int)',
suggestion: 'Add cast transformation',
autoFix: 'toInteger(CustomerID)'
};
Issue: Performance Bottleneck
// Identify bottlenecks
const performanceAnalysis = {
slowStage: 'Join operation',
cause: 'Data skew on join key',
solution: 'Add salt key for distribution',
implementation: 'concat(JoinKey, "_", rand() % 10)'
};
🔗 Embedded Demo Link¶
Launch Data Flow Designer: https://demos.csa-inabox.com/dataflow-designer
📚 Additional Resources¶
💬 Feedback¶
💡 Rate the Data Flow Designer experience
- ✅ Built successful data flows - Share feedback
- ⚠️ Encountered issues - Report problem
- 💡 Feature request - Suggest improvement
Last Updated: January 2025 | Version: 1.0.0