🎯 Service Selection Guide¶
Comprehensive decision trees and guidance for selecting the right Azure services for your Cloud Scale Analytics solution.
🎯 Purpose¶
Choosing the right Azure services is critical for building successful analytics solutions. This guide provides decision trees, comparison matrices, and practical guidance to help you select the optimal services for your specific requirements.
🌳 Master Service Selection Decision Tree¶
```mermaid
flowchart TD
Start([🎯 Start: Select Azure Services])
Start --> Purpose{What is your<br/>primary purpose?}
Purpose -->|Data Processing| Processing{Processing<br/>Type?}
Purpose -->|Data Storage| Storage{Storage<br/>Type?}
Purpose -->|Data Movement| Movement{Movement<br/>Pattern?}
Purpose -->|Data Streaming| Streaming{Streaming<br/>Need?}
Processing -->|Batch Analytics| BatchType{Data<br/>Volume?}
Processing -->|Real-time| RTType{Latency<br/>Requirement?}
Processing -->|Machine Learning| MLType{ML<br/>Maturity?}
BatchType -->|< 10TB| SynapseDedicated[Synapse Dedicated SQL]
BatchType -->|10-100TB| SynapseChoice{Query<br/>Pattern?}
BatchType -->|> 100TB| SynapseServerless[Synapse Serverless SQL]
SynapseChoice -->|Ad-hoc queries| SynapseServerless
SynapseChoice -->|Consistent workload| SynapseDedicated
RTType -->|< 1 second| EventHubs[Event Hubs +<br/>Stream Analytics]
RTType -->|1-10 seconds| StreamAnalytics[Stream Analytics]
MLType -->|Starting out| SynapseSpark[Synapse Spark Pools]
MLType -->|Advanced/Production| Databricks[Azure Databricks]
Storage -->|Structured| StructuredType{ACID<br/>Requirements?}
Storage -->|Semi-structured| SemiType{Scale?}
Storage -->|Unstructured| DataLake[Data Lake Gen2]
StructuredType -->|Yes| AzureSQL[Azure SQL Database]
StructuredType -->|Analytical| SynapseDedicated
SemiType -->|Global distribution| CosmosDB[Cosmos DB]
SemiType -->|Regional| DataLake
Movement -->|Cloud to Cloud| ADF[Azure Data Factory]
Movement -->|On-prem to Cloud| ADFHybrid[Data Factory +<br/>Integration Runtime]
Movement -->|Event-driven| EventGrid[Event Grid]
Streaming -->|High throughput| EventHubs
Streaming -->|Event routing| EventGrid
Streaming -->|Processing| StreamAnalytics
style Start fill:#e1f5fe
style SynapseDedicated fill:#fff3e0
style SynapseServerless fill:#e8f5e9
style SynapseSpark fill:#fce4ec
style Databricks fill:#f3e5f5
style EventHubs fill:#e0f2f1
style StreamAnalytics fill:#fff9c4
style AzureSQL fill:#e8eaf6
style CosmosDB fill:#f3e5f5
style DataLake fill:#e8f5e9
style ADF fill:#fff3e0
style EventGrid fill:#fce4ec
```

🔄 Analytics Compute Service Selection¶
Decision Matrix: Which Compute Engine?¶
```mermaid
flowchart TD
Start([Choose Analytics Compute])
Start --> Q1{What language/skill<br/>does your team prefer?}
Q1 -->|SQL Primary| SQLPath{Workload<br/>Type?}
Q1 -->|Python/Scala/Spark| SparkPath{Use<br/>Case?}
Q1 -->|Multiple Languages| FlexPath{Enterprise<br/>Features?}
SQLPath -->|Interactive queries| ServerlessSQL[Synapse Serverless SQL]
SQLPath -->|Consistent workload| DedicatedSQL[Synapse Dedicated SQL]
SparkPath -->|Data Engineering| EngineeringChoice{Budget?}
SparkPath -->|Data Science/ML| ScienceChoice{Collaboration<br/>Needs?}
EngineeringChoice -->|Cost-optimized| HDInsight[HDInsight Spark]
EngineeringChoice -->|Performance-optimized| SynapseSpark[Synapse Spark]
ScienceChoice -->|High collaboration| Databricks[Azure Databricks]
ScienceChoice -->|Integrated workspace| SynapseSpark
FlexPath -->|Yes, unified workspace| Synapse[Synapse Analytics<br/>All Engines]
FlexPath -->|ML-focused| Databricks
style ServerlessSQL fill:#e1f5fe
style DedicatedSQL fill:#e8f5e9
style SynapseSpark fill:#fff3e0
style Databricks fill:#f3e5f5
style HDInsight fill:#fce4ec
style Synapse fill:#fff9c4
```

Detailed Comparison¶
| Feature | Synapse Dedicated SQL | Synapse Serverless SQL | Synapse Spark | Databricks | HDInsight |
|---|---|---|---|---|---|
| Primary Language | T-SQL | T-SQL | Python, Scala, Spark SQL | Python, Scala, R, SQL | Multiple |
| Pricing Model | Reserved capacity | Pay-per-query | Pay-per-use | Compute + DBU | VM-based |
| Best For | DW workloads | Ad-hoc queries | Data engineering | ML workflows | Migration |
| Auto-scaling | Manual | Automatic | Automatic | Automatic | Manual |
| Startup Time | Always on | Instant | 2-5 minutes | 5-10 minutes | 10-20 minutes |
| ML Integration | Limited | None | Built-in | Advanced (MLflow) | Custom |
Use Case to Service Mapping¶
Enterprise Data Warehousing¶
Primary Service: Synapse Dedicated SQL Pools
Configuration Guidance:
- Small (<1TB): DW100c - DW500c
- Medium (1-10TB): DW500c - DW1000c
- Large (>10TB): DW1000c+ with partitioning
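As a quick sanity check, the sizing bands above can be encoded as a small helper. This is a sketch of the guidance, not a sizing tool; the DWU ranges are starting points to validate with a POC.

```python
def suggest_dwu(data_tb: float) -> str:
    """Map an approximate data volume (TB) to a starting DWU band.

    Thresholds mirror the configuration guidance above; treat the
    result as a starting point, not a hard limit.
    """
    if data_tb < 1:
        return "DW100c-DW500c"
    if data_tb <= 10:
        return "DW500c-DW1000c"
    # Beyond ~10TB, add table partitioning alongside a larger pool
    return "DW1000c+ (with table partitioning)"


print(suggest_dwu(5))  # mid-size estate -> DW500c-DW1000c
```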
Alternative: Synapse Serverless SQL for cost-sensitive scenarios
Big Data Processing¶
Primary Service: Synapse Spark Pools or Azure Databricks
Decision Criteria:
```mermaid
flowchart LR
BigData([Big Data Processing])
BigData --> Integration{Need tight<br/>integration with<br/>SQL workloads?}
Integration -->|Yes| SynapseSpark[Synapse Spark Pools]
Integration -->|No| Collaboration{High<br/>collaboration<br/>needs?}
Collaboration -->|Yes| Databricks[Azure Databricks]
Collaboration -->|No| Budget{Budget<br/>Constraint?}
Budget -->|Cost-optimized| HDInsight[HDInsight]
Budget -->|Performance-optimized| SynapseSpark
style SynapseSpark fill:#fff3e0
style Databricks fill:#f3e5f5
style HDInsight fill:#e8f5e9
```

Machine Learning Workloads¶
Primary Service: Azure Databricks
When to Use:
- Advanced ML workflows
- MLOps requirements
- Multi-language data science teams
- Collaborative notebook environment
Alternative: Synapse Spark Pools when:
- ML is secondary to analytics
- Need unified workspace
- Simpler ML requirements
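The Spark-engine criteria from the Big Data Processing flow and the ML guidance above can be condensed into a small selector function. This is illustrative only; the inputs reduce the real decision to simplified yes/no questions.

```python
def pick_spark_engine(needs_sql_integration: bool,
                      high_collaboration: bool,
                      cost_optimized: bool) -> str:
    """Encode the Spark-engine decision flow described above.

    Simplified: real selections should also weigh skills, budget,
    and enterprise requirements.
    """
    if needs_sql_integration:
        # Tight coupling with SQL workloads favors the unified workspace
        return "Synapse Spark Pools"
    if high_collaboration:
        # Collaborative notebooks and MLflow integration
        return "Azure Databricks"
    return "HDInsight" if cost_optimized else "Synapse Spark Pools"


print(pick_spark_engine(False, True, False))  # -> Azure Databricks
```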
🗄️ Storage Service Selection¶
Storage Decision Tree¶
```mermaid
flowchart TD
Start([Choose Storage Service])
Start --> DataType{Data<br/>Structure?}
DataType -->|Structured/Relational| Relational{Transactional<br/>or Analytical?}
DataType -->|Semi-structured| SemiStructured{Access<br/>Pattern?}
DataType -->|Unstructured/Files| Files{Analytics<br/>Workload?}
Relational -->|Transactional (OLTP)| AzureSQL[Azure SQL Database]
Relational -->|Analytical (OLAP)| SynapseDedicated[Synapse Dedicated SQL]
SemiStructured -->|Key-value, document| Global{Global<br/>Distribution?}
SemiStructured -->|Time-series| TimeSeriesDB[Data Explorer]
Global -->|Yes| CosmosDB[Cosmos DB]
Global -->|No| JsonChoice{Query<br/>Complexity?}
JsonChoice -->|Simple| BlobStorage[Blob Storage]
JsonChoice -->|Complex| CosmosDB
Files -->|Yes, big data| ADLS[Data Lake Gen2]
Files -->|No, general storage| BlobStorage
style AzureSQL fill:#e8eaf6
style SynapseDedicated fill:#fff3e0
style CosmosDB fill:#f3e5f5
style ADLS fill:#e8f5e9
style BlobStorage fill:#e0f2f1
style TimeSeriesDB fill:#fce4ec
```

Storage Service Comparison¶
| Service | Data Model | Scale | Latency | ACID | Best Use Case |
|---|---|---|---|---|---|
| Data Lake Gen2 | Hierarchical files | Unlimited | High | No (Delta Lake adds it) | Big data analytics, data lakes |
| Cosmos DB | Multi-model | Very high | Very low | Yes | Global apps, real-time |
| Azure SQL | Relational | High | Low | Yes | Transactional apps |
| Blob Storage | Object storage | Unlimited | Medium | No | General purpose, archives |
| Data Explorer | Time-series | Very high | Very low | No | IoT, logs, telemetry |
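For a first pass, the storage decision tree can be reduced to a lookup function. This is a simplified sketch; real selections should weigh the full comparison table above (latency, ACID, scale) rather than data shape alone.

```python
def pick_storage(structure: str, *,
                 global_distribution: bool = False,
                 analytical: bool = False) -> str:
    """Rough mapping from data shape to a storage service,
    following the decision tree above (simplified)."""
    if structure == "relational":
        # OLAP goes to the warehouse, OLTP to Azure SQL
        return "Synapse Dedicated SQL" if analytical else "Azure SQL Database"
    if structure == "semi-structured":
        return "Cosmos DB" if global_distribution else "Data Lake Gen2"
    if structure == "time-series":
        return "Azure Data Explorer"
    # Files / unstructured data
    return "Data Lake Gen2" if analytical else "Blob Storage"


print(pick_storage("semi-structured", global_distribution=True))
```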
Storage Selection by Workload¶
Data Lake Foundation¶
Recommended: Data Lake Storage Gen2
Key Features:
- Hierarchical namespace for efficient data organization
- Fine-grained access control with ACLs
- Optimized for analytics workloads
- Native integration with Synapse and Databricks
Configuration:
```mermaid
graph TB
subgraph "Data Lake Zones"
Raw[Raw Zone<br/>Landing area]
Curated[Curated Zone<br/>Processed data]
Consumption[Consumption Zone<br/>Business-ready]
end
subgraph "Access Tiers"
Hot[Hot Tier<br/>Frequent access]
Cool[Cool Tier<br/>Infrequent access]
Archive[Archive Tier<br/>Long-term storage]
end
Raw --> Hot
Curated --> Hot
Curated --> Cool
Consumption --> Hot
Raw --> Archive
style Raw fill:#ffebee
style Curated fill:#e3f2fd
style Consumption fill:#e8f5e9
```

Real-time Applications¶
Recommended: Cosmos DB
When to Choose:
- Global distribution required
- Low latency (<10ms) needed
- Multi-model data (document, graph, key-value)
- Elastic scale required
API Selection:
| API | Use Case | Best For |
|---|---|---|
| SQL (Core) | General purpose | JSON documents, flexible queries |
| MongoDB | MongoDB compatibility | Migration from MongoDB |
| Cassandra | Wide-column | Time-series, IoT data |
| Gremlin | Graph data | Social networks, recommendations |
| Table | Key-value | Simple lookups, high throughput |
🔄 Streaming Service Selection¶
Streaming Service Decision Tree¶
```mermaid
flowchart TD
Start([Choose Streaming Service])
Start --> Purpose{Primary<br/>Purpose?}
Purpose -->|Ingestion| Throughput{Expected<br/>Throughput?}
Purpose -->|Processing| ProcessType{Processing<br/>Complexity?}
Purpose -->|Routing| RoutingType{Event<br/>Distribution?}
Throughput -->|Very High<br/>>1M events/sec| EventHubsDedicated[Event Hubs<br/>Dedicated]
Throughput -->|High<br/>100K-1M/sec| EventHubsStandard[Event Hubs<br/>Standard]
Throughput -->|Medium<br/><100K/sec| EventHubsBasic[Event Hubs<br/>Basic]
ProcessType -->|SQL-based| StreamAnalytics[Stream Analytics]
ProcessType -->|Complex/Code| SparkStreaming[Synapse Spark<br/>Structured Streaming]
RoutingType -->|Many subscribers| EventGrid[Event Grid]
RoutingType -->|Few subscribers| EventHubs[Event Hubs]
style EventHubsDedicated fill:#e1f5fe
style EventHubsStandard fill:#e8f5e9
style EventHubsBasic fill:#fff3e0
style StreamAnalytics fill:#f3e5f5
style SparkStreaming fill:#fce4ec
style EventGrid fill:#fff9c4
```

Streaming Service Comparison¶
| Service | Purpose | Throughput | Latency | Processing |
|---|---|---|---|---|
| Event Hubs | Event ingestion | Very High | Low | No |
| Stream Analytics | Stream processing | Medium | Sub-second | SQL-based |
| Event Grid | Event routing | High | Seconds | No |
| Spark Streaming | Complex processing | Medium | Seconds | Code-based |
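The ingestion thresholds from the decision tree can be sketched as a tier picker. Note that actual Event Hubs sizing is done in throughput/processing units; the event-rate cutoffs here are only the rough bands from the diagram.

```python
def pick_event_hubs_tier(events_per_sec: int) -> str:
    """Suggest an Event Hubs tier from expected ingress rate.

    Bands follow the streaming decision tree above; confirm against
    current Event Hubs quotas before committing to a tier.
    """
    if events_per_sec > 1_000_000:
        return "Event Hubs Dedicated"
    if events_per_sec >= 100_000:
        return "Event Hubs Standard"
    return "Event Hubs Basic"


print(pick_event_hubs_tier(500_000))  # -> Event Hubs Standard
```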
Streaming Architecture Patterns¶
IoT Telemetry Processing¶
Recommended Stack:
```mermaid
graph LR
Devices[IoT Devices] --> IoTHub[IoT Hub]
IoTHub --> EventHubs[Event Hubs]
EventHubs --> StreamAnalytics[Stream Analytics]
StreamAnalytics --> HotPath[Hot Path<br/>Cosmos DB]
StreamAnalytics --> ColdPath[Cold Path<br/>Data Lake]
style Devices fill:#e8f5e9
style EventHubs fill:#e1f5fe
style StreamAnalytics fill:#fff3e0
style HotPath fill:#ffebee
style ColdPath fill:#e8eaf6
```

Services:
- Ingestion: IoT Hub → Event Hubs
- Processing: Stream Analytics
- Hot Storage: Cosmos DB (real-time queries)
- Cold Storage: Data Lake Gen2 (historical analysis)
Event-Driven Microservices¶
Recommended Stack:
```mermaid
graph TB
Apps[Applications] --> EventGrid[Event Grid]
EventGrid --> Func[Azure Functions]
EventGrid --> Logic[Logic Apps]
EventGrid --> EventHubs[Event Hubs]
EventHubs --> StreamAnalytics[Stream Analytics]
style Apps fill:#e8f5e9
style EventGrid fill:#fff3e0
style Func fill:#e1f5fe
style Logic fill:#f3e5f5
style EventHubs fill:#e8eaf6
```

Services:
- Event Routing: Event Grid
- Event Processing: Azure Functions, Logic Apps
- Event Store: Event Hubs (if needed)
- Analytics: Stream Analytics (for aggregations)
🔧 Orchestration Service Selection¶
Orchestration Decision Tree¶
```mermaid
flowchart TD
Start([Choose Orchestration])
Start --> Type{Orchestration<br/>Type?}
Type -->|Data Movement| DataMovement{Source<br/>Location?}
Type -->|Workflow Automation| Workflow{Complexity?}
Type -->|Job Scheduling| Scheduling{Execution<br/>Environment?}
DataMovement -->|Cloud to Cloud| ADF[Azure Data Factory]
DataMovement -->|On-prem to Cloud| ADFWithIR[Data Factory +<br/>Self-hosted IR]
DataMovement -->|Real-time Sync| ChangeDataCapture[Change Data Capture]
Workflow -->|Simple, Low-code| LogicApps[Logic Apps]
Workflow -->|Complex, Code-based| ADFPipelines[Data Factory Pipelines]
Workflow -->|ML Workflows| MLPipelines[Azure ML Pipelines]
Scheduling -->|Spark Jobs| SynapseNotebooks[Synapse Notebooks]
Scheduling -->|SQL Jobs| SynapseSQL[Synapse SQL Jobs]
Scheduling -->|Mixed| ADF
style ADF fill:#fff3e0
style LogicApps fill:#e8f5e9
style SynapseNotebooks fill:#f3e5f5
```

Orchestration Service Comparison¶
| Service | Primary Use | Coding Required | Integration | Best For |
|---|---|---|---|---|
| Data Factory | Data integration | Optional | Excellent | ETL/ELT pipelines |
| Logic Apps | Workflow automation | No | 300+ connectors | Business workflows |
| Synapse Pipelines | Analytics orchestration | Optional | Native Synapse | Unified analytics |
| Azure Functions | Event processing | Yes | Flexible | Custom logic |
💰 Cost Optimization Guidance¶
Service Selection by Budget¶
```mermaid
flowchart TD
Start([Budget Optimization])
Start --> Priority{Primary<br/>Priority?}
Priority -->|Lowest Cost| CostOptimized{Workload<br/>Pattern?}
Priority -->|Best Performance| Performance{Use<br/>Case?}
Priority -->|Balanced| Balanced{Data<br/>Volume?}
CostOptimized -->|Sporadic queries| ServerlessSQL[Synapse Serverless SQL<br/>Pay-per-query]
CostOptimized -->|Batch processing| HDInsight[HDInsight<br/>VM-based pricing]
Performance -->|Real-time analytics| DedicatedPools[Synapse Dedicated +<br/>Stream Analytics]
Performance -->|ML at scale| Databricks[Databricks Premium]
Balanced -->|< 10TB| ServerlessSQL
Balanced -->|10-100TB| SynapseSpark[Synapse Spark Pools]
Balanced -->|> 100TB| DataLake[Serverless SQL +<br/>Data Lake Gen2]
style ServerlessSQL fill:#e8f5e9
style HDInsight fill:#fff3e0
style SynapseSpark fill:#f3e5f5
style Databricks fill:#ffebee
style DataLake fill:#e1f5fe
```

Cost Comparison Matrix¶
| Service | Pricing Model | Cost Predictability |
|---|---|---|
| Synapse Serverless | Pay-per-query | Variable |
| Synapse Dedicated | Reserved capacity | Predictable |
| Databricks | Compute + DBU | Variable |
| HDInsight | VM-based | Predictable |
| Stream Analytics | Streaming Units | Predictable |
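To compare pay-per-query against reserved capacity, a rough break-even calculation helps. The prices below are placeholder assumptions for illustration, not quotes; always check the current Azure pricing pages for your region.

```python
def monthly_cost_serverless(tb_scanned: float,
                            price_per_tb: float = 5.0) -> float:
    # price_per_tb is an assumed pay-per-query rate, not a quoted price
    return tb_scanned * price_per_tb


def monthly_cost_dedicated(hours_on: float,
                           hourly_rate: float = 1.20) -> float:
    # hourly_rate is an assumed rate for a small dedicated pool
    return hours_on * hourly_rate


def serverless_is_cheaper(tb_scanned: float, hours_on: float) -> bool:
    """True when pay-per-query beats keeping a dedicated pool running."""
    return monthly_cost_serverless(tb_scanned) < monthly_cost_dedicated(hours_on)


# Sporadic workload: 10 TB scanned/month vs a pool running 24x7 (~730 h)
print(serverless_is_cheaper(10, 730))  # sporadic usage favors serverless
```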
🎯 Quick Service Selector¶
By Primary Use Case¶
| Use Case | Recommended Services | Alternative Option |
|---|---|---|
| Enterprise DW | Synapse Dedicated SQL | Synapse Serverless SQL |
| Data Lake Analytics | Synapse Spark + Data Lake Gen2 | Databricks + Data Lake Gen2 |
| Real-time Dashboards | Stream Analytics + Event Hubs + Power BI | Synapse Spark Streaming |
| ML/AI Workloads | Databricks | Synapse Spark + Azure ML |
| IoT Analytics | IoT Hub + Event Hubs + Stream Analytics | Event Hubs + Synapse |
| Data Integration | Data Factory | Synapse Pipelines |
| Operational Analytics | Cosmos DB + Synapse Link | Azure SQL + Change Feed |
By Team Skills¶
| Team Primary Skills | Recommended Stack |
|---|---|
| SQL Developers | Synapse SQL Pools + Data Factory |
| Data Engineers | Synapse Spark + Data Lake Gen2 |
| Data Scientists | Databricks + MLflow |
| Full Stack Developers | Event Hubs + Functions + Cosmos DB |
| Mixed Skills | Synapse Analytics (all engines) |
📋 Service Selection Checklist¶
Before You Choose¶
- **Requirements Documented**
    - Data volume and growth projections
    - Latency requirements
    - Query patterns and concurrency
    - Budget constraints
    - Team skills and preferences
- **Architecture Defined**
    - Pattern selected (see Architecture Patterns)
    - Data flow mapped
    - Integration points identified
    - Security requirements documented
- **POC Planned**
    - Representative data sample prepared
    - Key use cases identified for testing
    - Success criteria defined
    - Timeline established
During POC¶
- **Performance Validated**
    - Query performance meets requirements
    - Data loading speed acceptable
    - Concurrent user testing completed
    - Scaling behavior verified
- **Costs Estimated**
    - Development costs projected
    - Production costs estimated
    - Cost optimization opportunities identified
    - Budget approval secured
- **Operations Assessed**
    - Monitoring and alerting configured
    - Backup and recovery tested
    - Security controls validated
    - Team trained on operations
Post-Selection¶
- **Implementation Roadmap**
    - Phases defined
    - Dependencies mapped
    - Resource allocation confirmed
    - Timeline agreed
- **Success Metrics**
    - KPIs defined
    - Monitoring configured
    - Regular review scheduled
    - Optimization plan created
🔗 Related Resources¶
Decision Support¶
- Service Catalog - Complete service overview
- Architecture Patterns - Pattern selection guidance
- Best Practices - Service-specific best practices
Implementation Guides¶
Cost Management¶
💡 Key Recommendations¶
Start with Serverless: For new workloads with uncertain patterns, start with Synapse Serverless SQL to minimize costs while understanding requirements.
Plan for Growth: Choose services that can scale with your needs. Start simple, but ensure your architecture can evolve.
Optimize for Your Team: Select services that match your team's skills and preferences. The best technology is the one your team can effectively operate.
POC Before Committing: Always validate your service selection with a proof of concept using representative data and workloads.
Monitor and Iterate: Service selection isn't final. Regularly review usage patterns and costs, and adjust your architecture as needs evolve.
Last Updated: 2025-01-28 · Services Covered: 15+ · Decision Trees: 8