
🎯 Service Selection Guide


Comprehensive decision trees and guidance for selecting the right Azure services for your Cloud Scale Analytics solution.


🎯 Purpose

Choosing the right Azure services is critical for building successful analytics solutions. This guide provides decision trees, comparison matrices, and practical guidance to help you select the optimal services for your specific requirements.

🌳 Master Service Selection Decision Tree

flowchart TD
    Start([🎯 Start: Select Azure Services])

    Start --> Purpose{What is your<br/>primary purpose?}

    Purpose -->|Data Processing| Processing{Processing<br/>Type?}
    Purpose -->|Data Storage| Storage{Storage<br/>Type?}
    Purpose -->|Data Movement| Movement{Movement<br/>Pattern?}
    Purpose -->|Data Streaming| Streaming{Streaming<br/>Need?}

    Processing -->|Batch Analytics| BatchType{Data<br/>Volume?}
    Processing -->|Real-time| RTType{Latency<br/>Requirement?}
    Processing -->|Machine Learning| MLType{ML<br/>Maturity?}

    BatchType -->|< 10TB| SynapseDedicated[Synapse Dedicated SQL]
    BatchType -->|10-100TB| SynapseChoice{Query<br/>Pattern?}
    BatchType -->|> 100TB| SynapseServerless[Synapse Serverless SQL]

    SynapseChoice -->|Ad-hoc queries| SynapseServerless
    SynapseChoice -->|Consistent workload| SynapseDedicated

    RTType -->|< 1 second| EventHubs[Event Hubs +<br/>Stream Analytics]
    RTType -->|1-10 seconds| StreamAnalytics[Stream Analytics]

    MLType -->|Starting out| SynapseSpark[Synapse Spark Pools]
    MLType -->|Advanced/Production| Databricks[Azure Databricks]

    Storage -->|Structured| StructuredType{ACID<br/>Requirements?}
    Storage -->|Semi-structured| SemiType{Scale?}
    Storage -->|Unstructured| DataLake[Data Lake Gen2]

    StructuredType -->|Yes| AzureSQL[Azure SQL Database]
    StructuredType -->|Analytical| SynapseDedicated

    SemiType -->|Global distribution| CosmosDB[Cosmos DB]
    SemiType -->|Regional| DataLake

    Movement -->|Cloud to Cloud| ADF[Azure Data Factory]
    Movement -->|On-prem to Cloud| ADFHybrid[Data Factory +<br/>Integration Runtime]
    Movement -->|Event-driven| EventGrid[Event Grid]

    Streaming -->|High throughput| EventHubs
    Streaming -->|Event routing| EventGrid
    Streaming -->|Processing| StreamAnalytics

    style Start fill:#e1f5fe
    style SynapseDedicated fill:#fff3e0
    style SynapseServerless fill:#e8f5e9
    style SynapseSpark fill:#fce4ec
    style Databricks fill:#f3e5f5
    style EventHubs fill:#e0f2f1
    style StreamAnalytics fill:#fff9c4
    style AzureSQL fill:#e8eaf6
    style CosmosDB fill:#f3e5f5
    style DataLake fill:#e8f5e9
    style ADF fill:#fff3e0
    style EventGrid fill:#fce4ec
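
The Batch Analytics branch of the tree above can be written as a small function. This is a minimal sketch, not an official sizing rule: the function name `select_batch_engine` and the `query_pattern` values are hypothetical, and the thresholds are simply the ones shown in the diagram.

```python
def select_batch_engine(volume_tb: float, query_pattern: str = "consistent") -> str:
    """Follow the Batch Analytics branch of the master decision tree.

    volume_tb:     approximate data volume in terabytes
    query_pattern: "ad-hoc" or "consistent"; only consulted for 10-100 TB
    """
    if volume_tb < 0:
        raise ValueError("volume_tb must be non-negative")
    if volume_tb < 10:
        return "Synapse Dedicated SQL"
    if volume_tb > 100:
        return "Synapse Serverless SQL"
    # 10-100 TB: the tree asks about the query pattern.
    if query_pattern == "ad-hoc":
        return "Synapse Serverless SQL"
    if query_pattern == "consistent":
        return "Synapse Dedicated SQL"
    raise ValueError(f"unknown query_pattern: {query_pattern!r}")


if __name__ == "__main__":
    for volume, pattern in [(5, "consistent"), (50, "ad-hoc"), (250, "consistent")]:
        print(volume, pattern, "->", select_batch_engine(volume, pattern))
```

Encoding the tree this way makes the boundary cases explicit: exactly 10 TB and exactly 100 TB both fall into the middle branch here, which is one reasonable reading of the diagram.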

🔄 Analytics Compute Service Selection

Decision Tree: Which Compute Engine?

flowchart TD
    Start([Choose Analytics Compute])

    Start --> Q1{What language/skill<br/>does your team prefer?}

    Q1 -->|SQL Primary| SQLPath{Workload<br/>Type?}
    Q1 -->|Python/Scala/Spark| SparkPath{Use<br/>Case?}
    Q1 -->|Multiple Languages| FlexPath{Enterprise<br/>Features?}

    SQLPath -->|Interactive queries| ServerlessSQL[Synapse Serverless SQL]
    SQLPath -->|Consistent workload| DedicatedSQL[Synapse Dedicated SQL]

    SparkPath -->|Data Engineering| EngineeringChoice{Budget?}
    SparkPath -->|Data Science/ML| ScienceChoice{Collaboration<br/>Needs?}

    EngineeringChoice -->|Cost-optimized| HDInsight[HDInsight Spark]
    EngineeringChoice -->|Performance-optimized| SynapseSpark[Synapse Spark]

    ScienceChoice -->|High collaboration| Databricks[Azure Databricks]
    ScienceChoice -->|Integrated workspace| SynapseSpark

    FlexPath -->|Yes, unified workspace| Synapse[Synapse Analytics<br/>All Engines]
    FlexPath -->|ML-focused| Databricks

    style ServerlessSQL fill:#e1f5fe
    style DedicatedSQL fill:#e8f5e9
    style SynapseSpark fill:#fff3e0
    style Databricks fill:#f3e5f5
    style HDInsight fill:#fce4ec
    style Synapse fill:#fff9c4
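
One way to make this tree executable is a nested dictionary that is walked one answer at a time. The sketch below assumes hypothetical answer keys (`"sql"`, `"spark"`, `"data-engineering"`, and so on) that stand in for the edge labels in the diagram; the service names are taken from the diagram itself.

```python
# Each dict is a question node; each string is a recommended service (leaf).
COMPUTE_TREE = {
    "sql": {
        "interactive": "Synapse Serverless SQL",
        "consistent": "Synapse Dedicated SQL",
    },
    "spark": {
        "data-engineering": {
            "cost-optimized": "HDInsight Spark",
            "performance-optimized": "Synapse Spark",
        },
        "data-science": {
            "high-collaboration": "Azure Databricks",
            "integrated-workspace": "Synapse Spark",
        },
    },
    "multi-language": {
        "unified-workspace": "Synapse Analytics (all engines)",
        "ml-focused": "Azure Databricks",
    },
}


def choose_compute(*answers: str) -> str:
    """Walk COMPUTE_TREE with one answer per question and return the leaf."""
    node = COMPUTE_TREE
    for answer in answers:
        if not isinstance(node, dict):
            raise ValueError(f"unexpected extra answer: {answer!r}")
        if answer not in node:
            raise ValueError(f"unknown answer {answer!r}; expected one of {sorted(node)}")
        node = node[answer]
    if isinstance(node, dict):
        raise ValueError(f"more answers needed; next options: {sorted(node)}")
    return node


if __name__ == "__main__":
    print(choose_compute("sql", "interactive"))
    print(choose_compute("spark", "data-science", "high-collaboration"))
```

Keeping the tree as data rather than as nested `if` statements means a changed recommendation only requires editing the dictionary.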

Detailed Comparison

| Feature | Synapse Dedicated SQL | Synapse Serverless SQL | Synapse Spark | Databricks | HDInsight |
|---|---|---|---|---|---|
| Primary Language | T-SQL | T-SQL | Python, Scala, Spark SQL | Python, Scala, R, SQL | Multiple |
| Pricing Model | Reserved capacity | Pay-per-query | Pay-per-use | Compute + DBU | VM-based |
| Best For | DW workloads | Ad-hoc queries | Data engineering | ML workflows | Migration |
| Auto-scaling | Manual | Automatic | Automatic | Automatic | Manual |
| Startup Time | Always on | Instant | 2-5 minutes | 5-10 minutes | 10-20 minutes |
| ML Integration | Limited | None | Built-in | Advanced (MLflow) | Custom |
| Cost (Small) | High | Low | Medium | Medium | Medium |
| Cost (Large) | Medium | Medium | Medium | High | Low |

Use Case to Service Mapping

Enterprise Data Warehousing

Primary Service: Synapse Dedicated SQL Pools

Configuration Guidance:

  • Small (<1TB): DW100c - DW500c
  • Medium (1-10TB): DW500c - DW1000c
  • Large (>10TB): DW1000c+ with partitioning

Alternative: Synapse Serverless SQL for cost-sensitive scenarios
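
The sizing bands above translate directly into a lookup. The following sketch assumes a hypothetical helper name, `suggest_dwu_range`, and treats exactly 1 TB and exactly 10 TB as "Medium"; it returns a starting range to validate in a POC, not a guaranteed fit.

```python
def suggest_dwu_range(data_tb: float) -> str:
    """Map data volume (TB) to the starting DWU range from the guidance above."""
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    if data_tb < 1:
        return "DW100c - DW500c"          # Small (<1TB)
    if data_tb <= 10:
        return "DW500c - DW1000c"         # Medium (1-10TB)
    return "DW1000c+ with partitioning"   # Large (>10TB)


if __name__ == "__main__":
    for size in (0.5, 4, 25):
        print(f"{size} TB -> {suggest_dwu_range(size)}")
```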

Big Data Processing

Primary Service: Synapse Spark Pools or Azure Databricks

Decision Criteria:

flowchart LR
    BigData([Big Data Processing])

    BigData --> Integration{Need tight<br/>integration with<br/>SQL workloads?}

    Integration -->|Yes| SynapseSpark[Synapse Spark Pools]
    Integration -->|No| Collaboration{High<br/>collaboration<br/>needs?}

    Collaboration -->|Yes| Databricks[Azure Databricks]
    Collaboration -->|No| Budget{Budget<br/>Constraint?}

    Budget -->|Cost-optimized| HDInsight[HDInsight]
    Budget -->|Performance-optimized| SynapseSpark

    style SynapseSpark fill:#fff3e0
    style Databricks fill:#f3e5f5
    style HDInsight fill:#e8f5e9
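
The three questions in this flowchart are evaluated in order, so the first "yes" short-circuits the rest. A minimal sketch, assuming a hypothetical function name and boolean parameters that mirror the three decision nodes:

```python
def choose_big_data_engine(
    tight_sql_integration: bool,
    high_collaboration: bool,
    cost_optimized: bool,
) -> str:
    """Evaluate the big data decision criteria in the order the flowchart asks them."""
    if tight_sql_integration:
        return "Synapse Spark Pools"
    if high_collaboration:
        return "Azure Databricks"
    # Budget is only considered when neither earlier question applies.
    return "HDInsight" if cost_optimized else "Synapse Spark Pools"


if __name__ == "__main__":
    print(choose_big_data_engine(False, True, cost_optimized=True))
```

Note that the ordering matters: a team with both tight SQL integration needs and high collaboration needs lands on Synapse Spark Pools under this reading, because the integration question is asked first.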

Machine Learning Workloads

Primary Service: Azure Databricks

When to Use:

  • Advanced ML workflows
  • MLOps requirements
  • Multi-language data science teams
  • Collaborative notebook environment

Alternative: Synapse Spark Pools when:

  • ML is secondary to analytics
  • You need a unified workspace
  • ML requirements are simple

🗄️ Storage Service Selection

Storage Decision Tree

flowchart TD
    Start([Choose Storage Service])

    Start --> DataType{Data<br/>Structure?}

    DataType -->|Structured/Relational| Relational{Transactional<br/>or Analytical?}
    DataType -->|Semi-structured| SemiStructured{Access<br/>Pattern?}
    DataType -->|Unstructured/Files| Files{Analytics<br/>Workload?}

    Relational -->|"Transactional (OLTP)"| AzureSQL[Azure SQL Database]
    Relational -->|"Analytical (OLAP)"| SynapseDedicated[Synapse Dedicated SQL]

    SemiStructured -->|Key-value, document| Global{Global<br/>Distribution?}
    SemiStructured -->|Time-series| TimeSeriesDB[Data Explorer]

    Global -->|Yes| CosmosDB[Cosmos DB]
    Global -->|No| JsonChoice{Query<br/>Complexity?}

    JsonChoice -->|Simple| BlobStorage[Blob Storage]
    JsonChoice -->|Complex| CosmosDB

    Files -->|Yes, big data| ADLS[Data Lake Gen2]
    Files -->|No, general storage| BlobStorage

    style AzureSQL fill:#e8eaf6
    style SynapseDedicated fill:#fff3e0
    style CosmosDB fill:#f3e5f5
    style ADLS fill:#e8f5e9
    style BlobStorage fill:#e0f2f1
    style TimeSeriesDB fill:#fce4ec
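
The storage tree can be sketched as a single function. The parameter names and string values below (`structure`, `workload`, `"oltp"`, `"document"`, and so on) are hypothetical labels for the diagram's edges, not identifiers from any Azure SDK.

```python
def choose_storage(
    structure: str,
    *,
    workload: str = "",
    global_distribution: bool = False,
    complex_queries: bool = False,
) -> str:
    """Walk the storage decision tree.

    structure: "relational", "semi-structured", or "files"
    workload:  relational      -> "oltp" or "olap"
               semi-structured -> "document" (key-value/document) or "time-series"
               files           -> "analytics" or "general"
    """
    if structure == "relational":
        if workload == "oltp":
            return "Azure SQL Database"
        if workload == "olap":
            return "Synapse Dedicated SQL"
    elif structure == "semi-structured":
        if workload == "time-series":
            return "Data Explorer"
        if workload == "document":
            # Global distribution or complex queries both lead to Cosmos DB.
            if global_distribution or complex_queries:
                return "Cosmos DB"
            return "Blob Storage"
    elif structure == "files":
        if workload == "analytics":
            return "Data Lake Gen2"
        if workload == "general":
            return "Blob Storage"
    raise ValueError(f"no branch for structure={structure!r}, workload={workload!r}")


if __name__ == "__main__":
    print(choose_storage("semi-structured", workload="document", complex_queries=True))
```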

Storage Service Comparison

| Service | Data Model | Scale | Latency | ACID | Best Use Case |
|---|---|---|---|---|---|
| Data Lake Gen2 | Hierarchical files | Unlimited | High | No natively (Delta Lake adds ACID transactions) | Big data analytics, data lakes |
| Cosmos DB | Multi-model | Very high | Very low | Yes | Global apps, real-time |
| Azure SQL | Relational | High | Low | Yes | Transactional apps |
| Blob Storage | Object storage | Unlimited | Medium | No | General purpose, archives |
| Data Explorer | Time-series | Very high | Very low | No | IoT, logs, telemetry |

Storage Selection by Workload

Data Lake Foundation

Recommended: Data Lake Storage Gen2

Key Features:

  • Hierarchical namespace for efficient data organization
  • Fine-grained access control with ACLs
  • Optimized for analytics workloads
  • Native integration with Synapse and Databricks

Configuration:

graph TB
    subgraph "Data Lake Zones"
        Raw[Raw Zone<br/>Landing area]
        Curated[Curated Zone<br/>Processed data]
        Consumption[Consumption Zone<br/>Business-ready]
    end

    subgraph "Access Tiers"
        Hot[Hot Tier<br/>Frequent access]
        Cool[Cool Tier<br/>Infrequent access]
        Archive[Archive Tier<br/>Long-term storage]
    end

    Raw --> Hot
    Curated --> Hot
    Curated --> Cool
    Consumption --> Hot
    Raw --> Archive

    style Raw fill:#ffebee
    style Curated fill:#e3f2fd
    style Consumption fill:#e8f5e9

Real-time Applications

Recommended: Cosmos DB

When to Choose:

  • Global distribution required
  • Low latency (<10ms) needed
  • Multi-model data (document, graph, key-value)
  • Elastic scale required

API Selection:

| API | Use Case | Best For |
|---|---|---|
| SQL (Core) | General purpose | JSON documents, flexible queries |
| MongoDB | MongoDB compatibility | Migration from MongoDB |
| Cassandra | Wide-column | Time-series, IoT data |
| Gremlin | Graph data | Social networks, recommendations |
| Table | Key-value | Simple lookups, high throughput |
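
As a quick illustration, the API table can be treated as a lookup keyed by the dominant data model. The keys below (`"document"`, `"mongodb-migration"`, and so on) are hypothetical labels chosen for this sketch; the values are the API names from the table.

```python
# Data model or migration scenario -> Cosmos DB API, per the table above.
COSMOS_API_BY_MODEL = {
    "document": "SQL (Core)",
    "mongodb-migration": "MongoDB",
    "wide-column": "Cassandra",
    "graph": "Gremlin",
    "key-value": "Table",
}


def choose_cosmos_api(data_model: str) -> str:
    """Return the Cosmos DB API suggested for a data model."""
    try:
        return COSMOS_API_BY_MODEL[data_model]
    except KeyError:
        raise ValueError(
            f"unknown data model {data_model!r}; expected one of {sorted(COSMOS_API_BY_MODEL)}"
        ) from None


if __name__ == "__main__":
    print(choose_cosmos_api("graph"))
```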

🔄 Streaming Service Selection

Streaming Service Decision Tree

flowchart TD
    Start([Choose Streaming Service])

    Start --> Purpose{Primary<br/>Purpose?}

    Purpose -->|Ingestion| Throughput{Expected<br/>Throughput?}
    Purpose -->|Processing| ProcessType{Processing<br/>Complexity?}
    Purpose -->|Routing| RoutingType{Event<br/>Distribution?}

    Throughput -->|Very High<br/>>1M events/sec| EventHubsDedicated[Event Hubs<br/>Dedicated]
    Throughput -->|High<br/>100K-1M/sec| EventHubsStandard[Event Hubs<br/>Standard]
    Throughput -->|Medium<br/><100K/sec| EventHubsBasic[Event Hubs<br/>Basic]

    ProcessType -->|SQL-based| StreamAnalytics[Stream Analytics]
    ProcessType -->|Complex/Code| SparkStreaming[Synapse Spark<br/>Structured Streaming]

    RoutingType -->|Many subscribers| EventGrid[Event Grid]
    RoutingType -->|Few subscribers| EventHubs[Event Hubs]

    style EventHubsDedicated fill:#e1f5fe
    style EventHubsStandard fill:#e8f5e9
    style EventHubsBasic fill:#fff3e0
    style StreamAnalytics fill:#f3e5f5
    style SparkStreaming fill:#fce4ec
    style EventGrid fill:#fff9c4
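
The ingestion branch of this tree reduces to a threshold check on expected events per second. The sketch below uses the thresholds exactly as drawn in the diagram; they are this guide's rules of thumb rather than published service limits, so check the current Event Hubs quotas before relying on them. The function name is hypothetical.

```python
def choose_event_hubs_tier(events_per_sec: int) -> str:
    """Pick an Event Hubs tier using the thresholds from the decision tree."""
    if events_per_sec < 0:
        raise ValueError("events_per_sec must be non-negative")
    if events_per_sec > 1_000_000:
        return "Event Hubs Dedicated"   # Very High: >1M events/sec
    if events_per_sec >= 100_000:
        return "Event Hubs Standard"    # High: 100K-1M events/sec
    return "Event Hubs Basic"           # Medium: <100K events/sec


if __name__ == "__main__":
    for rate in (20_000, 250_000, 3_000_000):
        print(f"{rate:>9} events/sec -> {choose_event_hubs_tier(rate)}")
```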

Streaming Service Comparison

| Service | Purpose | Throughput | Latency | Processing | Complexity |
|---|---|---|---|---|---|
| Event Hubs | Event ingestion | Very High | Low | No | Low |
| Stream Analytics | Stream processing | Medium | Sub-second | SQL-based | Medium |
| Event Grid | Event routing | High | Seconds | No | Low |
| Spark Streaming | Complex processing | Medium | Seconds | Code-based | High |

Streaming Architecture Patterns

IoT Telemetry Processing

Recommended Stack:

graph LR
    Devices[IoT Devices] --> IoTHub[IoT Hub]
    IoTHub --> EventHubs[Event Hubs]
    EventHubs --> StreamAnalytics[Stream Analytics]
    StreamAnalytics --> HotPath[Hot Path<br/>Cosmos DB]
    StreamAnalytics --> ColdPath[Cold Path<br/>Data Lake]

    style Devices fill:#e8f5e9
    style EventHubs fill:#e1f5fe
    style StreamAnalytics fill:#fff3e0
    style HotPath fill:#ffebee
    style ColdPath fill:#e8eaf6

Services:

  • Ingestion: IoT Hub → Event Hubs
  • Processing: Stream Analytics
  • Hot Storage: Cosmos DB (real-time queries)
  • Cold Storage: Data Lake Gen2 (historical analysis)

Event-Driven Microservices

Recommended Stack:

graph TB
    Apps[Applications] --> EventGrid[Event Grid]
    EventGrid --> Func[Azure Functions]
    EventGrid --> Logic[Logic Apps]
    EventGrid --> EventHubs[Event Hubs]
    EventHubs --> StreamAnalytics[Stream Analytics]

    style Apps fill:#e8f5e9
    style EventGrid fill:#fff3e0
    style Func fill:#e1f5fe
    style Logic fill:#f3e5f5
    style EventHubs fill:#e8eaf6

Services:

  • Event Routing: Event Grid
  • Event Processing: Azure Functions, Logic Apps
  • Event Store: Event Hubs (if needed)
  • Analytics: Stream Analytics (for aggregations)

🔧 Orchestration Service Selection

Orchestration Decision Tree

flowchart TD
    Start([Choose Orchestration])

    Start --> Type{Orchestration<br/>Type?}

    Type -->|Data Movement| DataMovement{Source<br/>Location?}
    Type -->|Workflow Automation| Workflow{Complexity?}
    Type -->|Job Scheduling| Scheduling{Execution<br/>Environment?}

    DataMovement -->|Cloud to Cloud| ADF[Azure Data Factory]
    DataMovement -->|On-prem to Cloud| ADFWithIR[Data Factory +<br/>Self-hosted IR]
    DataMovement -->|Real-time Sync| ChangeDataCapture[Change Data Capture]

    Workflow -->|Simple, Low-code| LogicApps[Logic Apps]
    Workflow -->|Complex, Code-based| ADFPipelines[Data Factory Pipelines]
    Workflow -->|ML Workflows| MLPipelines[Azure ML Pipelines]

    Scheduling -->|Spark Jobs| SynapseNotebooks[Synapse Notebooks]
    Scheduling -->|SQL Jobs| SynapseSQL[Synapse SQL Jobs]
    Scheduling -->|Mixed| ADF

    style ADF fill:#fff3e0
    style LogicApps fill:#e8f5e9
    style SynapseNotebooks fill:#f3e5f5
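
Because every path in the orchestration tree is exactly two questions deep, a flat table keyed by the pair of answers is enough. The key strings in this sketch are hypothetical shorthand for the edge labels; the values are the node names from the diagram.

```python
# (orchestration type, second answer) -> recommended service.
ORCHESTRATION_CHOICES = {
    ("data-movement", "cloud-to-cloud"): "Azure Data Factory",
    ("data-movement", "on-prem-to-cloud"): "Data Factory + Self-hosted IR",
    ("data-movement", "real-time-sync"): "Change Data Capture",
    ("workflow", "simple-low-code"): "Logic Apps",
    ("workflow", "complex-code-based"): "Data Factory Pipelines",
    ("workflow", "ml-workflows"): "Azure ML Pipelines",
    ("scheduling", "spark-jobs"): "Synapse Notebooks",
    ("scheduling", "sql-jobs"): "Synapse SQL Jobs",
    ("scheduling", "mixed"): "Azure Data Factory",
}


def choose_orchestration(kind: str, answer: str) -> str:
    """Look up the orchestration recommendation for a pair of answers."""
    try:
        return ORCHESTRATION_CHOICES[(kind, answer)]
    except KeyError:
        raise ValueError(f"no recommendation for {(kind, answer)!r}") from None


if __name__ == "__main__":
    print(choose_orchestration("workflow", "simple-low-code"))
```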

Orchestration Service Comparison

| Service | Primary Use | Complexity | Coding Required | Integration | Best For |
|---|---|---|---|---|---|
| Data Factory | Data integration | Medium | Optional | Excellent | ETL/ELT pipelines |
| Logic Apps | Workflow automation | Low | No | 300+ connectors | Business workflows |
| Synapse Pipelines | Analytics orchestration | Medium | Optional | Native Synapse | Unified analytics |
| Azure Functions | Event processing | High | Yes | Flexible | Custom logic |

💰 Cost Optimization Guidance

Service Selection by Budget

flowchart TD
    Start([Budget Optimization])

    Start --> Priority{Primary<br/>Priority?}

    Priority -->|Lowest Cost| CostOptimized{Workload<br/>Pattern?}
    Priority -->|Best Performance| Performance{Use<br/>Case?}
    Priority -->|Balanced| Balanced{Data<br/>Volume?}

    CostOptimized -->|Sporadic queries| ServerlessSQL[Synapse Serverless SQL<br/>Pay-per-query]
    CostOptimized -->|Batch processing| HDInsight[HDInsight<br/>VM-based pricing]

    Performance -->|Real-time analytics| DedicatedPools[Synapse Dedicated +<br/>Stream Analytics]
    Performance -->|ML at scale| Databricks[Databricks Premium]

    Balanced -->|< 10TB| ServerlessSQL
    Balanced -->|10-100TB| SynapseSpark[Synapse Spark Pools]
    Balanced -->|> 100TB| DataLake[Serverless SQL +<br/>Data Lake Gen2]

    style ServerlessSQL fill:#e8f5e9
    style HDInsight fill:#fff3e0
    style SynapseSpark fill:#f3e5f5
    style Databricks fill:#ffebee
    style DataLake fill:#e1f5fe
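
The budget tree mixes categorical answers with a volume threshold, which the sketch below handles in one function. The argument names and string values are hypothetical; the volume bands treat exactly 10 TB and 100 TB as part of the middle band.

```python
def choose_by_budget(
    priority: str,
    *,
    pattern: str = "",
    use_case: str = "",
    volume_tb: float = 0.0,
) -> str:
    """Follow the budget optimization tree.

    priority: "lowest-cost" (uses pattern), "best-performance" (uses use_case),
              or "balanced" (uses volume_tb)
    """
    if priority == "lowest-cost":
        options = {"sporadic": "Synapse Serverless SQL", "batch": "HDInsight"}
        if pattern in options:
            return options[pattern]
    elif priority == "best-performance":
        options = {
            "real-time": "Synapse Dedicated + Stream Analytics",
            "ml-at-scale": "Databricks Premium",
        }
        if use_case in options:
            return options[use_case]
    elif priority == "balanced":
        if volume_tb < 10:
            return "Synapse Serverless SQL"
        if volume_tb <= 100:
            return "Synapse Spark Pools"
        return "Serverless SQL + Data Lake Gen2"
    raise ValueError(f"no branch for priority={priority!r} with the given answers")


if __name__ == "__main__":
    print(choose_by_budget("balanced", volume_tb=40))
```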

Cost Comparison Matrix

| Service | Pricing Model | Low Usage Cost | High Usage Cost | Cost Predictability |
|---|---|---|---|---|
| Synapse Serverless | Pay-per-query | Very Low | Medium | Variable |
| Synapse Dedicated | Reserved capacity | High | Medium | Predictable |
| Databricks | Compute + DBU | Medium | High | Variable |
| HDInsight | VM-based | Low | Low | Predictable |
| Stream Analytics | Streaming Units | Low | Medium | Predictable |
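
The "Very Low" versus "High" low-usage cost of Serverless and Dedicated comes down to a break-even point: pay-per-query cost grows with data scanned, while provisioned capacity costs the same regardless of usage. The arithmetic can be sketched as below. All prices in the example are placeholders chosen for illustration, not Azure list prices, and the function names are hypothetical; substitute current rates for your region.

```python
def break_even_tb(hours_running: float, price_per_hour: float, price_per_tb: float) -> float:
    """TB scanned per month at which pay-per-query cost equals provisioned cost."""
    if price_per_tb <= 0:
        raise ValueError("price_per_tb must be positive")
    return hours_running * price_per_hour / price_per_tb


def cheaper_option(tb_scanned: float, hours_running: float,
                   price_per_hour: float, price_per_tb: float) -> str:
    """Compare monthly cost of serverless (per TB scanned) and dedicated (per hour)."""
    serverless_cost = tb_scanned * price_per_tb
    dedicated_cost = hours_running * price_per_hour
    return "serverless" if serverless_cost < dedicated_cost else "dedicated"


if __name__ == "__main__":
    # Placeholder rates: always-on pool for 730 h/month at 1.20 per hour,
    # serverless queries at 5.00 per TB scanned.
    hours, hourly, per_tb = 730, 1.20, 5.00
    print(f"break-even: {break_even_tb(hours, hourly, per_tb):.1f} TB/month")
    print("20 TB/month  ->", cheaper_option(20, hours, hourly, per_tb))
    print("400 TB/month ->", cheaper_option(400, hours, hourly, per_tb))
```

Pausing a dedicated pool lowers `hours_running` and therefore the break-even volume, which is worth modeling before committing to either option.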

🎯 Quick Service Selector

By Primary Use Case

| Use Case | Recommended Services | Alternative Option |
|---|---|---|
| Enterprise DW | Synapse Dedicated SQL | Synapse Serverless SQL |
| Data Lake Analytics | Synapse Spark + Data Lake Gen2 | Databricks + Data Lake Gen2 |
| Real-time Dashboards | Stream Analytics + Event Hubs + Power BI | Synapse Spark Streaming |
| ML/AI Workloads | Databricks | Synapse Spark + Azure ML |
| IoT Analytics | IoT Hub + Event Hubs + Stream Analytics | Event Hubs + Synapse |
| Data Integration | Data Factory | Synapse Pipelines |
| Operational Analytics | Cosmos DB + Synapse Link | Azure SQL + Change Data Capture |

By Team Skills

| Team Primary Skills | Recommended Stack | Complexity |
|---|---|---|
| SQL Developers | Synapse SQL Pools + Data Factory | Low |
| Data Engineers | Synapse Spark + Data Lake Gen2 | Medium |
| Data Scientists | Databricks + MLflow | Medium |
| Full Stack Developers | Event Hubs + Functions + Cosmos DB | High |
| Mixed Skills | Synapse Analytics (all engines) | Medium |

📋 Service Selection Checklist

Before You Choose

  • Requirements Documented
      • Data volume and growth projections
      • Latency requirements
      • Query patterns and concurrency
      • Budget constraints
      • Team skills and preferences

  • Architecture Defined
      • Pattern selected (see Architecture Patterns)
      • Data flow mapped
      • Integration points identified
      • Security requirements documented

  • POC Planned
      • Representative data sample prepared
      • Key use cases identified for testing
      • Success criteria defined
      • Timeline established

During POC

  • Performance Validated
      • Query performance meets requirements
      • Data loading speed acceptable
      • Concurrent user testing completed
      • Scaling behavior verified

  • Costs Estimated
      • Development costs projected
      • Production costs estimated
      • Cost optimization opportunities identified
      • Budget approval secured

  • Operations Assessed
      • Monitoring and alerting configured
      • Backup and recovery tested
      • Security controls validated
      • Team trained on operations

Post-Selection

  • Implementation Roadmap
      • Phases defined
      • Dependencies mapped
      • Resource allocation confirmed
      • Timeline agreed

  • Success Metrics
      • KPIs defined
      • Monitoring configured
      • Regular review scheduled
      • Optimization plan created



💡 Key Recommendations

Start with Serverless: For new workloads with uncertain patterns, start with Synapse Serverless SQL to minimize costs while understanding requirements.

Plan for Growth: Choose services that can scale with your needs. Start simple, but ensure your architecture can evolve.

Optimize for Your Team: Select services that match your team's skills and preferences. The best technology is the one your team can effectively operate.

POC Before Committing: Always validate your service selection with a proof of concept using representative data and workloads.

Monitor and Iterate: Service selection isn't final. Regularly review usage patterns and costs, and adjust your architecture as needs evolve.


Last Updated: 2025-01-28 | Services Covered: 15+ | Decision Trees: 8