🎯 Service Selection Guide¶
Comprehensive decision trees and guidance for selecting the right Azure services for your Cloud Scale Analytics solution.
🎯 Purpose¶
Choosing the right Azure services is critical for building successful analytics solutions. This guide provides decision trees, comparison matrices, and practical guidance to help you select the optimal services for your specific requirements.
🌳 Master Service Selection Decision Tree¶
```mermaid
flowchart TD
Start([🎯 Start: Select Azure Services])
Start --> Purpose{What is your<br/>primary purpose?}
Purpose -->|Data Processing| Processing{Processing<br/>Type?}
Purpose -->|Data Storage| Storage{Storage<br/>Type?}
Purpose -->|Data Movement| Movement{Movement<br/>Pattern?}
Purpose -->|Data Streaming| Streaming{Streaming<br/>Need?}
Processing -->|Batch Analytics| BatchType{Data<br/>Volume?}
Processing -->|Real-time| RTType{Latency<br/>Requirement?}
Processing -->|Machine Learning| MLType{ML<br/>Maturity?}
BatchType -->|< 10TB| SynapseDedicated[Synapse Dedicated SQL]
BatchType -->|10-100TB| SynapseChoice{Query<br/>Pattern?}
BatchType -->|> 100TB| SynapseServerless[Synapse Serverless SQL]
SynapseChoice -->|Ad-hoc queries| SynapseServerless
SynapseChoice -->|Consistent workload| SynapseDedicated
RTType -->|< 1 second| EventHubs[Event Hubs +<br/>Stream Analytics]
RTType -->|1-10 seconds| StreamAnalytics[Stream Analytics]
MLType -->|Starting out| SynapseSpark[Synapse Spark Pools]
MLType -->|Advanced/Production| Databricks[Azure Databricks]
Storage -->|Structured| StructuredType{ACID<br/>Requirements?}
Storage -->|Semi-structured| SemiType{Scale?}
Storage -->|Unstructured| DataLake[Data Lake Gen2]
StructuredType -->|Yes| AzureSQL[Azure SQL Database]
StructuredType -->|Analytical| SynapseDedicated
SemiType -->|Global distribution| CosmosDB[Cosmos DB]
SemiType -->|Regional| DataLake
Movement -->|Cloud to Cloud| ADF[Azure Data Factory]
Movement -->|On-prem to Cloud| ADFHybrid[Data Factory +<br/>Integration Runtime]
Movement -->|Event-driven| EventGrid[Event Grid]
Streaming -->|High throughput| EventHubs
Streaming -->|Event routing| EventGrid
Streaming -->|Processing| StreamAnalytics
style Start fill:#e1f5fe
style SynapseDedicated fill:#fff3e0
style SynapseServerless fill:#e8f5e9
style SynapseSpark fill:#fce4ec
style Databricks fill:#f3e5f5
style EventHubs fill:#e0f2f1
style StreamAnalytics fill:#fff9c4
style AzureSQL fill:#e8eaf6
style CosmosDB fill:#f3e5f5
style DataLake fill:#e8f5e9
style ADF fill:#fff3e0
style EventGrid fill:#fce4ec
```

🔄 Analytics Compute Service Selection¶
Decision Matrix: Which Compute Engine?¶
```mermaid
flowchart TD
Start([Choose Analytics Compute])
Start --> Q1{What language/skill<br/>does your team prefer?}
Q1 -->|SQL Primary| SQLPath{Workload<br/>Type?}
Q1 -->|Python/Scala/Spark| SparkPath{Use<br/>Case?}
Q1 -->|Multiple Languages| FlexPath{Enterprise<br/>Features?}
SQLPath -->|Interactive queries| ServerlessSQL[Synapse Serverless SQL]
SQLPath -->|Consistent workload| DedicatedSQL[Synapse Dedicated SQL]
SparkPath -->|Data Engineering| EngineeringChoice{Budget?}
SparkPath -->|Data Science/ML| ScienceChoice{Collaboration<br/>Needs?}
EngineeringChoice -->|Cost-optimized| HDInsight[HDInsight Spark]
EngineeringChoice -->|Performance-optimized| SynapseSpark[Synapse Spark]
ScienceChoice -->|High collaboration| Databricks[Azure Databricks]
ScienceChoice -->|Integrated workspace| SynapseSpark
FlexPath -->|Yes, unified workspace| Synapse[Synapse Analytics<br/>All Engines]
FlexPath -->|ML-focused| Databricks
style ServerlessSQL fill:#e1f5fe
style DedicatedSQL fill:#e8f5e9
style SynapseSpark fill:#fff3e0
style Databricks fill:#f3e5f5
style HDInsight fill:#fce4ec
style Synapse fill:#fff9c4
```

Detailed Comparison¶
| Feature | Synapse Dedicated SQL | Synapse Serverless SQL | Synapse Spark | Databricks | HDInsight |
|---|---|---|---|---|---|
| Primary Language | T-SQL | T-SQL | Python, Scala, Spark SQL | Python, Scala, R, SQL | Multiple |
| Pricing Model | Reserved capacity | Pay-per-query | Pay-per-use | Compute + DBU | VM-based |
| Best For | DW workloads | Ad-hoc queries | Data engineering | ML workflows | Migration |
| Auto-scaling | Manual | Automatic | Automatic | Automatic | Manual |
| Startup Time | Always on | Instant | 2-5 minutes | 5-10 minutes | 10-20 minutes |
| ML Integration | Limited | None | Built-in | Advanced (MLflow) | Custom |
Use Case to Service Mapping¶
Enterprise Data Warehousing¶
Primary Service: Synapse Dedicated SQL Pools
Configuration Guidance:
- Small (<1TB): DW100c - DW500c
- Medium (1-10TB): DW500c - DW1000c
- Large (>10TB): DW1000c+ with partitioning
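As a quick sanity check, the sizing bands above can be encoded as a small helper. This is a sketch of the guidance, not a sizing tool; the DWU ranges are starting points to validate with a POC.

```python
def suggest_dwu(data_tb: float) -> str:
    """Map an approximate data volume (TB) to a starting DWU band.

    Thresholds mirror the configuration guidance above; treat the
    result as a starting point, not a hard limit.
    """
    if data_tb < 1:
        return "DW100c-DW500c"
    if data_tb <= 10:
        return "DW500c-DW1000c"
    # Beyond ~10TB, add table partitioning alongside a larger pool
    return "DW1000c+ (with table partitioning)"


print(suggest_dwu(5))  # mid-size estate -> DW500c-DW1000c
```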
Alternative: Synapse Serverless SQL for cost-sensitive scenarios
Big Data Processing¶
Primary Service: Synapse Spark Pools or Azure Databricks
Decision Criteria:
```mermaid
flowchart LR
BigData([Big Data Processing])
BigData --> Integration{Need tight<br/>integration with<br/>SQL workloads?}
Integration -->|Yes| SynapseSpark[Synapse Spark Pools]
Integration -->|No| Collaboration{High<br/>collaboration<br/>needs?}
Collaboration -->|Yes| Databricks[Azure Databricks]
Collaboration -->|No| Budget{Budget<br/>Constraint?}
Budget -->|Cost-optimized| HDInsight[HDInsight]
Budget -->|Performance-optimized| SynapseSpark
style SynapseSpark fill:#fff3e0
style Databricks fill:#f3e5f5
style HDInsight fill:#e8f5e9
```

Machine Learning Workloads¶
Primary Service: Azure Databricks
When to Use:
- Advanced ML workflows
- MLOps requirements
- Multi-language data science teams
- Collaborative notebook environment
Alternative: Synapse Spark Pools when:
- ML is secondary to analytics
- Need unified workspace
- Simpler ML requirements
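The Spark-engine criteria from the Big Data Processing flow and the ML guidance above can be condensed into a small selector function. This is illustrative only; the inputs reduce the real decision to simplified yes/no questions.

```python
def pick_spark_engine(needs_sql_integration: bool,
                      high_collaboration: bool,
                      cost_optimized: bool) -> str:
    """Encode the Spark-engine decision flow described above.

    Simplified: real selections should also weigh skills, budget,
    and enterprise requirements.
    """
    if needs_sql_integration:
        # Tight coupling with SQL workloads favors the unified workspace
        return "Synapse Spark Pools"
    if high_collaboration:
        # Collaborative notebooks and MLflow integration
        return "Azure Databricks"
    return "HDInsight" if cost_optimized else "Synapse Spark Pools"


print(pick_spark_engine(False, True, False))  # -> Azure Databricks
```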
🗄️ Storage Service Selection¶
Storage Decision Tree¶
```mermaid
flowchart TD
Start([Choose Storage Service])
Start --> DataType{Data<br/>Structure?}
DataType -->|Structured/Relational| Relational{Transactional<br/>or Analytical?}
DataType -->|Semi-structured| SemiStructured{Access<br/>Pattern?}
DataType -->|Unstructured/Files| Files{Analytics<br/>Workload?}
Relational -->|Transactional (OLTP)| AzureSQL[Azure SQL Database]
Relational -->|Analytical (OLAP)| SynapseDedicated[Synapse Dedicated SQL]
SemiStructured -->|Key-value, document| Global{Global<br/>Distribution?}
SemiStructured -->|Time-series| TimeSeriesDB[Data Explorer]
Global -->|Yes| CosmosDB[Cosmos DB]
Global -->|No| JsonChoice{Query<br/>Complexity?}
JsonChoice -->|Simple| BlobStorage[Blob Storage]
JsonChoice -->|Complex| CosmosDB
Files -->|Yes, big data| ADLS[Data Lake Gen2]
Files -->|No, general storage| BlobStorage
style AzureSQL fill:#e8eaf6
style SynapseDedicated fill:#fff3e0
style CosmosDB fill:#f3e5f5
style ADLS fill:#e8f5e9
style BlobStorage fill:#e0f2f1
style TimeSeriesDB fill:#fce4ec
```

Storage Service Comparison¶
| Service | Data Model | Scale | Latency | ACID | Best Use Case |
|---|---|---|---|---|---|
| Data Lake Gen2 | Hierarchical files | Unlimited | High | No (Delta Lake adds it) | Big data analytics, data lakes |
| Cosmos DB | Multi-model | Very high | Very low | Yes | Global apps, real-time |
| Azure SQL | Relational | High | Low | Yes | Transactional apps |
| Blob Storage | Object storage | Unlimited | Medium | No | General purpose, archives |
| Data Explorer | Time-series | Very high | Very low | No | IoT, logs, telemetry |
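For a first pass, the storage decision tree can be reduced to a lookup function. This is a simplified sketch; real selections should weigh the full comparison table above (latency, ACID, scale) rather than data shape alone.

```python
def pick_storage(structure: str, *,
                 global_distribution: bool = False,
                 analytical: bool = False) -> str:
    """Rough mapping from data shape to a storage service,
    following the decision tree above (simplified)."""
    if structure == "relational":
        # OLAP goes to the warehouse, OLTP to Azure SQL
        return "Synapse Dedicated SQL" if analytical else "Azure SQL Database"
    if structure == "semi-structured":
        return "Cosmos DB" if global_distribution else "Data Lake Gen2"
    if structure == "time-series":
        return "Azure Data Explorer"
    # Files / unstructured data
    return "Data Lake Gen2" if analytical else "Blob Storage"


print(pick_storage("semi-structured", global_distribution=True))
```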
Storage Selection by Workload¶
Data Lake Foundation¶
Recommended: Data Lake Storage Gen2
Key Features:
- Hierarchical namespace for efficient data organization
- Fine-grained access control with ACLs
- Optimized for analytics workloads
- Native integration with Synapse and Databricks
Configuration:
```mermaid
graph TB
subgraph "Data Lake Zones"
Raw[Raw Zone<br/>Landing area]
Curated[Curated Zone<br/>Processed data]
Consumption[Consumption Zone<br/>Business-ready]
end
subgraph "Access Tiers"
Hot[Hot Tier<br/>Frequent access]
Cool[Cool Tier<br/>Infrequent access]
Archive[Archive Tier<br/>Long-term storage]
end
Raw --> Hot
Curated --> Hot
Curated --> Cool
Consumption --> Hot
Raw --> Archive
style Raw fill:#ffebee
style Curated fill:#e3f2fd
style Consumption fill:#e8f5e9
```

Real-time Applications¶
Recommended: Cosmos DB
When to Choose:
- Global distribution required
- Low latency (<10ms) needed
- Multi-model data (document, graph, key-value)
- Elastic scale required
API Selection:
| API | Use Case | Best For |
|---|---|---|
| SQL (Core) | General purpose | JSON documents, flexible queries |
| MongoDB | MongoDB compatibility | Migration from MongoDB |
| Cassandra | Wide-column | Time-series, IoT data |
| Gremlin | Graph data | Social networks, recommendations |
| Table | Key-value | Simple lookups, high throughput |
🔄 Streaming Service Selection¶
Streaming Service Decision Tree¶
```mermaid
flowchart TD
Start([Choose Streaming Service])
Start --> Purpose{Primary<br/>Purpose?}
Purpose -->|Ingestion| Throughput{Expected<br/>Throughput?}
Purpose -->|Processing| ProcessType{Processing<br/>Complexity?}
Purpose -->|Routing| RoutingType{Event<br/>Distribution?}
Throughput -->|Very High<br/>>1M events/sec| EventHubsDedicated[Event Hubs<br/>Dedicated]
Throughput -->|High<br/>100K-1M/sec| EventHubsStandard[Event Hubs<br/>Standard]
Throughput -->|Medium<br/><100K/sec| EventHubsBasic[Event Hubs<br/>Basic]
ProcessType -->|SQL-based| StreamAnalytics[Stream Analytics]
ProcessType -->|Complex/Code| SparkStreaming[Synapse Spark<br/>Structured Streaming]
RoutingType -->|Many subscribers| EventGrid[Event Grid]
RoutingType -->|Few subscribers| EventHubs[Event Hubs]
style EventHubsDedicated fill:#e1f5fe
style EventHubsStandard fill:#e8f5e9
style EventHubsBasic fill:#fff3e0
style StreamAnalytics fill:#f3e5f5
style SparkStreaming fill:#fce4ec
style EventGrid fill:#fff9c4
```

Streaming Service Comparison¶
| Service | Purpose | Throughput | Latency | Processing |
|---|---|---|---|---|
| Event Hubs | Event ingestion | Very High | Low | No |
| Stream Analytics | Stream processing | Medium | Sub-second | SQL-based |
| Event Grid | Event routing | High | Seconds | No |
| Spark Streaming | Complex processing | Medium | Seconds | Code-based |
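The ingestion thresholds from the decision tree can be sketched as a tier picker. Note that actual Event Hubs sizing is done in throughput/processing units; the event-rate cutoffs here are only the rough bands from the diagram.

```python
def pick_event_hubs_tier(events_per_sec: int) -> str:
    """Suggest an Event Hubs tier from expected ingress rate.

    Bands follow the streaming decision tree above; confirm against
    current Event Hubs quotas before committing to a tier.
    """
    if events_per_sec > 1_000_000:
        return "Event Hubs Dedicated"
    if events_per_sec >= 100_000:
        return "Event Hubs Standard"
    return "Event Hubs Basic"


print(pick_event_hubs_tier(500_000))  # -> Event Hubs Standard
```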
Streaming Architecture Patterns¶
IoT Telemetry Processing¶
Recommended Stack:
```mermaid
graph LR
Devices[IoT Devices] --> IoTHub[IoT Hub]
IoTHub --> EventHubs[Event Hubs]
EventHubs --> StreamAnalytics[Stream Analytics]
StreamAnalytics --> HotPath[Hot Path<br/>Cosmos DB]
StreamAnalytics --> ColdPath[Cold Path<br/>Data Lake]
style Devices fill:#e8f5e9
style EventHubs fill:#e1f5fe
style StreamAnalytics fill:#fff3e0
style HotPath fill:#ffebee
style ColdPath fill:#e8eaf6
```

Services:
- Ingestion: IoT Hub → Event Hubs
- Processing: Stream Analytics
- Hot Storage: Cosmos DB (real-time queries)
- Cold Storage: Data Lake Gen2 (historical analysis)
Event-Driven Microservices¶
Recommended Stack:
```mermaid
graph TB
Apps[Applications] --> EventGrid[Event Grid]
EventGrid --> Func[Azure Functions]
EventGrid --> Logic[Logic Apps]
EventGrid --> EventHubs[Event Hubs]
EventHubs --> StreamAnalytics[Stream Analytics]
style Apps fill:#e8f5e9
style EventGrid fill:#fff3e0
style Func fill:#e1f5fe
style Logic fill:#f3e5f5
style EventHubs fill:#e8eaf6
```

Services:
- Event Routing: Event Grid
- Event Processing: Azure Functions, Logic Apps
- Event Store: Event Hubs (if needed)
- Analytics: Stream Analytics (for aggregations)
🔧 Orchestration Service Selection¶
Orchestration Decision Tree¶
```mermaid
flowchart TD
Start([Choose Orchestration])
Start --> Type{Orchestration<br/>Type?}
Type -->|Data Movement| DataMovement{Source<br/>Location?}
Type -->|Workflow Automation| Workflow{Complexity?}
Type -->|Job Scheduling| Scheduling{Execution<br/>Environment?}
DataMovement -->|Cloud to Cloud| ADF[Azure Data Factory]
DataMovement -->|On-prem to Cloud| ADFWithIR[Data Factory +<br/>Self-hosted IR]
DataMovement -->|Real-time Sync| ChangeDataCapture[Change Data Capture]
Workflow -->|Simple, Low-code| LogicApps[Logic Apps]
Workflow -->|Complex, Code-based| ADFPipelines[Data Factory Pipelines]
Workflow -->|ML Workflows| MLPipelines[Azure ML Pipelines]
Scheduling -->|Spark Jobs| SynapseNotebooks[Synapse Notebooks]
Scheduling -->|SQL Jobs| SynapseSQL[Synapse SQL Jobs]
Scheduling -->|Mixed| ADF
style ADF fill:#fff3e0
style LogicApps fill:#e8f5e9
style SynapseNotebooks fill:#f3e5f5
```

Orchestration Service Comparison¶
| Service | Primary Use | Coding Required | Integration | Best For |
|---|---|---|---|---|
| Data Factory | Data integration | Optional | Excellent | ETL/ELT pipelines |
| Logic Apps | Workflow automation | No | 300+ connectors | Business workflows |
| Synapse Pipelines | Analytics orchestration | Optional | Native Synapse | Unified analytics |
| Azure Functions | Event processing | Yes | Flexible | Custom logic |
💰 Cost Optimization Guidance¶
Service Selection by Budget¶
```mermaid
flowchart TD
Start([Budget Optimization])
Start --> Priority{Primary<br/>Priority?}
Priority -->|Lowest Cost| CostOptimized{Workload<br/>Pattern?}
Priority -->|Best Performance| Performance{Use<br/>Case?}
Priority -->|Balanced| Balanced{Data<br/>Volume?}
CostOptimized -->|Sporadic queries| ServerlessSQL[Synapse Serverless SQL<br/>Pay-per-query]
CostOptimized -->|Batch processing| HDInsight[HDInsight<br/>VM-based pricing]
Performance -->|Real-time analytics| DedicatedPools[Synapse Dedicated +<br/>Stream Analytics]
Performance -->|ML at scale| Databricks[Databricks Premium]
Balanced -->|< 10TB| ServerlessSQL
Balanced -->|10-100TB| SynapseSpark[Synapse Spark Pools]
Balanced -->|> 100TB| DataLake[Serverless SQL +<br/>Data Lake Gen2]
style ServerlessSQL fill:#e8f5e9
style HDInsight fill:#fff3e0
style SynapseSpark fill:#f3e5f5
style Databricks fill:#ffebee
style DataLake fill:#e1f5fe
```

Cost Comparison Matrix¶
| Service | Pricing Model | Cost Predictability |
|---|---|---|
| Synapse Serverless | Pay-per-query | Variable |
| Synapse Dedicated | Reserved capacity | Predictable |
| Databricks | Compute + DBU | Variable |
| HDInsight | VM-based | Predictable |
| Stream Analytics | Streaming Units | Predictable |
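To compare pay-per-query against reserved capacity, a rough break-even calculation helps. The prices below are placeholder assumptions for illustration, not quotes; always check the current Azure pricing pages for your region.

```python
def monthly_cost_serverless(tb_scanned: float,
                            price_per_tb: float = 5.0) -> float:
    # price_per_tb is an assumed pay-per-query rate, not a quoted price
    return tb_scanned * price_per_tb


def monthly_cost_dedicated(hours_on: float,
                           hourly_rate: float = 1.20) -> float:
    # hourly_rate is an assumed rate for a small dedicated pool
    return hours_on * hourly_rate


def serverless_is_cheaper(tb_scanned: float, hours_on: float) -> bool:
    """True when pay-per-query beats keeping a dedicated pool running."""
    return monthly_cost_serverless(tb_scanned) < monthly_cost_dedicated(hours_on)


# Sporadic workload: 10 TB scanned/month vs a pool running 24x7 (~730 h)
print(serverless_is_cheaper(10, 730))  # sporadic usage favors serverless
```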
🎯 Quick Service Selector¶
By Primary Use Case¶
| Use Case | Recommended Services | Alternative Option |
|---|---|---|
| Enterprise DW | Synapse Dedicated SQL | Synapse Serverless SQL |
| Data Lake Analytics | Synapse Spark + Data Lake Gen2 | Databricks + Data Lake Gen2 |
| Real-time Dashboards | Stream Analytics + Event Hubs + Power BI | Synapse Spark Streaming |
| ML/AI Workloads | Databricks | Synapse Spark + Azure ML |
| IoT Analytics | IoT Hub + Event Hubs + Stream Analytics | Event Hubs + Synapse |
| Data Integration | Data Factory | Synapse Pipelines |
| Operational Analytics | Cosmos DB + Synapse Link | Azure SQL + Change Feed |
By Team Skills¶
| Team Primary Skills | Recommended Stack |
|---|---|
| SQL Developers | Synapse SQL Pools + Data Factory |
| Data Engineers | Synapse Spark + Data Lake Gen2 |
| Data Scientists | Databricks + MLflow |
| Full Stack Developers | Event Hubs + Functions + Cosmos DB |
| Mixed Skills | Synapse Analytics (all engines) |
📋 Service Selection Checklist¶
Before You Choose¶
- **Requirements Documented**
    - Data volume and growth projections
    - Latency requirements
    - Query patterns and concurrency
    - Budget constraints
    - Team skills and preferences
- **Architecture Defined**
    - Pattern selected (see Architecture Patterns)
    - Data flow mapped
    - Integration points identified
    - Security requirements documented
- **POC Planned**
    - Representative data sample prepared
    - Key use cases identified for testing
    - Success criteria defined
    - Timeline established
During POC¶
- **Performance Validated**
    - Query performance meets requirements
    - Data loading speed acceptable
    - Concurrent user testing completed
    - Scaling behavior verified
- **Costs Estimated**
    - Development costs projected
    - Production costs estimated
    - Cost optimization opportunities identified
    - Budget approval secured
- **Operations Assessed**
    - Monitoring and alerting configured
    - Backup and recovery tested
    - Security controls validated
    - Team trained on operations
Post-Selection¶
- **Implementation Roadmap**
    - Phases defined
    - Dependencies mapped
    - Resource allocation confirmed
    - Timeline agreed
- **Success Metrics**
    - KPIs defined
    - Monitoring configured
    - Regular review scheduled
    - Optimization plan created
🔗 Related Resources¶
Decision Support¶
- Service Catalog - Complete service overview
- Architecture Patterns - Pattern selection guidance
- Best Practices - Service-specific best practices
Implementation Guides¶
Cost Management¶
💡 Key Recommendations¶
Start with Serverless: For new workloads with uncertain patterns, start with Synapse Serverless SQL to minimize costs while understanding requirements.
Plan for Growth: Choose services that can scale with your needs. Start simple, but ensure your architecture can evolve.
Optimize for Your Team: Select services that match your team's skills and preferences. The best technology is the one your team can effectively operate.
POC Before Committing: Always validate your service selection with a proof of concept using representative data and workloads.
Monitor and Iterate: Service selection isn't final. Regularly review usage patterns and costs, and adjust your architecture as needs evolve.
Last Updated: 2025-01-28 · Services Covered: 15+ · Decision Trees: 8