Solution Components Overview



Overview

This document provides detailed specifications for all components in the Azure Real-Time Analytics solution. Each component is designed for high availability, scalability, and enterprise-grade security.


Data Ingestion Components

Azure Event Hubs

Purpose: High-throughput event streaming platform

Specifications:

| Aspect | Configuration | Notes |
|---|---|---|
| Tier | Premium | Reserved, isolated capacity |
| Processing Units (PUs) | 2-16 | Scalable on demand |
| Partitions | 32 per hub | Parallel processing |
| Retention | 7 days | Extended retention |
| Capture | Enabled, to ADLS Gen2 | Automatic backup |

Features:

  • Kafka protocol compatibility
  • Geo-disaster recovery
  • Virtual network integration
  • Managed identity authentication
  • Elastic scaling of processing units

Configuration Example:

az eventhubs namespace create \
  --name realtime-events-prod \
  --resource-group realtime-analytics-rg \
  --location eastus \
  --sku Premium \
  --capacity 4
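
Producer Example (Python):

For reference, a minimal producer sketch using the azure-eventhub SDK with managed identity authentication, as listed in the features above; the hub name and payload are illustrative:

from azure.eventhub import EventHubProducerClient, EventData
from azure.identity import DefaultAzureCredential

# Authenticate with the workload's managed identity (no connection string)
producer = EventHubProducerClient(
    fully_qualified_namespace="realtime-events-prod.servicebus.windows.net",
    eventhub_name="events.raw.v1",  # hypothetical hub name
    credential=DefaultAzureCredential(),
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"device_id": "sensor-01", "temperature": 21.5}'))
    producer.send_batch(batch)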

Confluent Kafka

Purpose: Enterprise-grade streaming platform for high-volume data

Specifications:

| Component | Configuration | Purpose |
|---|---|---|
| Cluster Type | Dedicated | Isolated resources |
| Brokers | 3-9 nodes | High availability |
| Storage | 1-10 TB per broker | Message retention |
| Replication Factor | 3 | Data durability |
| Availability Zones | Multi-AZ | Fault tolerance |

Features:

  • Schema Registry integration
  • ksqlDB for stream processing
  • Cluster linking for geo-replication
  • Role-based access control
  • Confluent Control Center monitoring

Topics Configuration:

# Production topics
events.raw.v1
  partitions: 32
  replication.factor: 3
  retention.ms: 604800000  # 7 days

events.processed.v1
  partitions: 16
  replication.factor: 3
  retention.ms: 259200000  # 3 days
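
The same topics can be created programmatically; a sketch with the confluent-kafka AdminClient, assuming a Confluent Cloud bootstrap address (authentication settings omitted for brevity):

from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder bootstrap address; SASL credentials would be added here
admin = AdminClient({"bootstrap.servers": "pkc-xxxxx.eastus.azure.confluent.cloud:9092"})

topics = [
    NewTopic("events.raw.v1", num_partitions=32, replication_factor=3,
             config={"retention.ms": "604800000"}),   # 7 days
    NewTopic("events.processed.v1", num_partitions=16, replication_factor=3,
             config={"retention.ms": "259200000"}),   # 3 days
]

# create_topics returns a dict of topic -> future; block until each completes
for topic, future in admin.create_topics(topics).items():
    future.result()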

Azure Stream Analytics

Purpose: Real-time stream processing for simple transformations

Specifications:

| Aspect | Value | Description |
|---|---|---|
| Streaming Units | 6-120 (auto-scale) | Processing capacity |
| Compatibility Level | 1.2 | Latest features |
| Output Error Policy | Drop | Error handling |
| Event Ordering | Adjust | 5-second tolerance |

Use Cases:

  • IoT telemetry processing
  • Log aggregation and filtering
  • Real-time alerting
  • Stream-to-storage routing

Processing Components

Azure Databricks

Purpose: Unified analytics platform for batch and streaming

Workspace Configuration:

| Component | Specification | Notes |
|---|---|---|
| Tier | Premium | Unity Catalog enabled |
| Runtime | 13.3 LTS ML | Long-term support |
| Cluster Policy | Job, Interactive | Enforced policies |
| Network | VNet injection | Private networking |
| Storage | ADLS Gen2 | Delta Lake storage |

Cluster Configurations:

High Concurrency Cluster

{
  "cluster_name": "shared-analytics-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 50
  },
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true",
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true"
  },
  "enable_elastic_disk": true,
  "data_security_mode": "USER_ISOLATION"
}

Streaming Job Cluster

{
  "cluster_name": "streaming-processor",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_L8s_v2",
  "num_workers": 12,
  "spark_conf": {
    "spark.streaming.backpressure.enabled": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true"
  },
  "azure_attributes": {
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "enable_local_disk_encryption": true
}
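
As a sketch of the workload this cluster runs, the snippet below reads from Event Hubs through its Kafka-compatible endpoint and appends to a bronze Delta table. The topic, checkpoint path, and target table are assumptions, SASL authentication options are omitted, and spark is the session Databricks predefines:

# Structured Streaming: Event Hubs (Kafka endpoint) -> bronze Delta table
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "realtime-events-prod.servicebus.windows.net:9093")
    .option("kafka.security.protocol", "SASL_SSL")  # SASL options omitted here
    .option("subscribe", "events.raw.v1")
    .load())

query = (raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/streaming/events_raw")  # placeholder
    .outputMode("append")
    .toTable("realtime_analytics.bronze.events"))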

Delta Lake

Purpose: ACID-compliant data lake storage layer

Configuration:

| Feature | Setting | Benefit |
|---|---|---|
| Auto Optimize | Enabled | Automatic file compaction |
| Optimize Write | Enabled | Better file sizes |
| Data Skipping | Enabled | Query performance |
| Z-Ordering | Business keys | Colocation optimization |
| Vacuum | 7-day retention | Storage cleanup |

Table Properties:

CREATE TABLE gold.customer_analytics (
  customer_id STRING NOT NULL,
  event_timestamp TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_timestamp AS DATE)),
  metrics MAP<STRING, DOUBLE>,
  aggregations STRUCT<...>
)
USING DELTA
PARTITIONED BY (event_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.enableChangeDataFeed' = 'true',
  'delta.deletedFileRetentionDuration' = 'interval 7 days'
);
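
The Z-ordering and vacuum settings above translate into routine maintenance; a sketch of a scheduled maintenance job, where the choice of customer_id as the Z-order key is an assumption:

# Periodic Delta maintenance, e.g. run from a scheduled Databricks job
spark.sql("OPTIMIZE gold.customer_analytics ZORDER BY (customer_id)")

# 168 hours matches the 7-day retention configured above
spark.sql("VACUUM gold.customer_analytics RETAIN 168 HOURS")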

Azure Data Factory

Purpose: Orchestration and ETL pipeline management

Components:

| Component | Purpose | Count |
|---|---|---|
| Pipelines | Workflow orchestration | 50+ |
| Linked Services | Connection management | 25+ |
| Datasets | Data definitions | 100+ |
| Triggers | Schedule/event-based | 30+ |
| Integration Runtimes | Compute environments | 3 |

Pipeline Patterns:

  • Incremental data loading (sketched below)
  • Full refresh with validation
  • Change data capture (CDC)
  • Event-driven processing
  • Scheduled batch jobs
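
A sketch of the incremental-loading pattern, written here as the Databricks step an ADF pipeline would invoke; the ops.watermarks control table and its columns are assumptions:

# Watermark-driven incremental load; ops.watermarks is a hypothetical control table
last_wm = spark.sql(
    "SELECT MAX(loaded_until) AS wm FROM ops.watermarks WHERE source = 'events'"
).first()["wm"]

new_rows = (spark.read.table("realtime_analytics.bronze.events")
            .where(f"event_timestamp > '{last_wm}'"))

new_rows.write.mode("append").saveAsTable("realtime_analytics.silver.validated_events")

# Advance the watermark only after the write succeeds
new_max = new_rows.agg({"event_timestamp": "max"}).first()[0]
spark.sql(f"UPDATE ops.watermarks SET loaded_until = '{new_max}' WHERE source = 'events'")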

Storage Components

Azure Data Lake Storage Gen2

Purpose: Scalable, secure data lake storage

Configuration:

| Feature | Setting | Purpose |
|---|---|---|
| Tier | Premium | Low latency |
| Replication | ZRS | Zone redundancy |
| Hierarchical Namespace | Enabled | Directory operations |
| Encryption | Customer-managed keys | Data security |
| Lifecycle Management | Hot/Cool/Archive | Cost optimization |

Container Structure:

realtime-analytics/
├── bronze/              # Raw ingested data
│   ├── events/
│   ├── logs/
│   └── telemetry/
├── silver/              # Cleansed data
│   ├── validated/
│   ├── enriched/
│   └── deduplicated/
├── gold/                # Business-ready data
│   ├── aggregations/
│   ├── dimensions/
│   └── facts/
└── checkpoints/         # Streaming checkpoints
    └── streaming/
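
To illustrate how data moves through this layout, a bronze-to-silver cleansing hop might look like the following; the storage account, dedup key, and validation rule are hypothetical:

# Hypothetical bronze -> silver hop (PySpark); <account> is a placeholder
base = "abfss://realtime-analytics@<account>.dfs.core.windows.net"

bronze = spark.read.format("delta").load(f"{base}/bronze/events")

silver = (bronze
          .dropDuplicates(["event_id"])            # assumed dedup key
          .where("event_timestamp IS NOT NULL"))   # basic validation rule

silver.write.format("delta").mode("append").save(f"{base}/silver/deduplicated/events")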

Lifecycle Policies:

{
  "rules": [
    {
      "name": "MoveBronzeToCool",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 90
            }
          }
        },
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["bronze/"]
        }
      }
    }
  ]
}

Unity Catalog

Purpose: Unified governance for data and AI assets

Hierarchy:

Catalog: realtime_analytics
├── Schema: bronze
│   ├── Tables: events, logs, telemetry
│   └── Volumes: raw_files
├── Schema: silver
│   ├── Tables: validated_events, enriched_data
│   └── Views: latest_events
└── Schema: gold
    ├── Tables: customer_metrics, product_analytics
    ├── Views: executive_dashboard
    └── Functions: calculate_metrics()

Security Model:

-- Grant access to data engineers
GRANT USE CATALOG ON CATALOG realtime_analytics TO `data-engineers`;
GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA realtime_analytics.silver TO `data-engineers`;

-- Grant read-only access to analysts
GRANT USE CATALOG ON CATALOG realtime_analytics TO `analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA realtime_analytics.gold TO `analysts`;

Analytics Components

Azure OpenAI

Purpose: AI-powered analytics and insights

Deployments:

| Model | Version | Purpose | RPM Limit |
|---|---|---|---|
| GPT-4 | 0125-Preview | Advanced reasoning | 10K |
| GPT-3.5-Turbo | 0125 | High throughput | 60K |
| Text-Embedding-3-Large | 3.0 | Vector search | 100K |

Use Cases:

  • Natural language query generation
  • Automated data insights
  • Anomaly explanation
  • Report summarization
  • Predictive analytics enhancement

Integration Example:

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-15-preview"
)

# Generate SQL from natural language
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "system",
        "content": "Generate SQL for Azure Databricks Delta tables"
    }, {
        "role": "user",
        "content": "Show top 10 customers by revenue last month"
    }]
)

MLflow

Purpose: Machine learning lifecycle management

Components:

| Component | Purpose | Storage |
|---|---|---|
| Tracking Server | Experiment tracking | Azure SQL |
| Model Registry | Model versioning | ADLS Gen2 |
| Artifacts Store | Model artifacts | ADLS Gen2 |

Model Management:

import mlflow
from mlflow.tracking import MlflowClient

# Configure MLflow
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

# Log experiment
with mlflow.start_run(run_name="customer_churn_v3"):
    mlflow.log_params({"max_depth": 10, "learning_rate": 0.01})
    mlflow.log_metrics({"accuracy": 0.94, "f1_score": 0.92})
    # The Unity Catalog registry requires a three-level model name
    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="realtime_analytics.gold.customer_churn")
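
Loading a registered version back for scoring might then look like this; the version number and input DataFrame are illustrative:

# Load version 1 of the registered model for batch scoring
churn_model = mlflow.pyfunc.load_model(
    "models:/realtime_analytics.gold.customer_churn/1"
)
predictions = churn_model.predict(features_df)  # features_df: assumed input DataFrame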

Power BI

Purpose: Business intelligence and visualization

Configuration:

| Component | Specification | Purpose |
|---|---|---|
| Capacity | P1 (8 v-cores) | Premium features |
| Mode | Direct Lake | Real-time queries |
| Gateway | Not required | Direct connection |
| Refresh | Real-time | Live data |

Features:

  • Direct Lake from Delta tables
  • Composite models
  • Incremental refresh
  • Query folding
  • RLS (Row-Level Security)

Governance Components

Azure Purview

Purpose: Data governance and discovery

Features:

| Feature | Status | Purpose |
|---|---|---|
| Data Catalog | ✅ Active | Asset discovery |
| Data Lineage | ✅ Active | Impact analysis |
| Data Classification | ✅ Active | Sensitivity labeling |
| Scanning | ✅ Automated | Metadata extraction |

Scanned Sources:

  • Azure Data Lake Storage Gen2
  • Azure Databricks (Unity Catalog)
  • Azure SQL Database
  • Power BI datasets

Azure Policy

Purpose: Governance and compliance enforcement

Policies:

{
  "policyName": "Require-Private-Endpoints",
  "description": "Enforce private endpoints for all data services",
  "effect": "Deny",
  "resources": [
    "Microsoft.Storage/storageAccounts",
    "Microsoft.Databricks/workspaces",
    "Microsoft.EventHub/namespaces"
  ]
}

Security Components

Azure Key Vault

Purpose: Secrets and certificate management

Stored Secrets:

  • Database connection strings
  • Service principal credentials
  • API keys and tokens
  • Encryption keys
  • SSL certificates

Access Policies:

# Grant Databricks access to secrets
az keyvault set-policy \
  --name realtime-kv-prod \
  --object-id <databricks-msi> \
  --secret-permissions get list
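
From application code, secrets can then be resolved at runtime with the azure-keyvault-secrets SDK; the secret name below is hypothetical:

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Resolve secrets via managed identity rather than embedding them in config
client = SecretClient(
    vault_url="https://realtime-kv-prod.vault.azure.net",
    credential=DefaultAzureCredential(),
)
conn_string = client.get_secret("sql-connection-string").value  # hypothetical name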

Azure Private Link

Purpose: Private network connectivity

Endpoints:

| Service | Endpoint Type | Purpose |
|---|---|---|
| ADLS Gen2 | Private | Storage access |
| Event Hubs | Private | Event streaming |
| Key Vault | Private | Secret access |
| Databricks | Private | Workspace access |

Monitoring Components

Azure Monitor

Purpose: Platform monitoring and alerting

Components:

  • Application Insights (application telemetry)
  • Log Analytics (centralized logging)
  • Metrics (performance monitoring)
  • Alerts (proactive notifications)
  • Workbooks (custom dashboards)

Key Metrics:

// Event Hubs throughput
AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTHUB"
| where MetricName == "IncomingMessages"
| summarize Total = sum(Total) by bin(TimeGenerated, 5m)

Datadog

Purpose: Advanced APM and infrastructure monitoring

Integrations:

  • Azure Monitor integration
  • Databricks metrics
  • Custom application metrics
  • Synthetic monitoring
  • Real user monitoring (RUM)
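
Custom application metrics can be emitted from pipeline code through DogStatsD; a minimal sketch with the datadog Python package, where the metric name and tags are illustrative:

from datadog import initialize, statsd

# Point at a local DogStatsD agent (these are the library defaults)
initialize(statsd_host="localhost", statsd_port=8125)

# Emit a custom gauge, e.g. after each micro-batch completes
statsd.gauge("pipeline.events_processed", 1250,
             tags=["env:prod", "component:databricks"])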

Component Dependencies

Dependency Matrix

graph TB
    subgraph Ingestion["Ingestion Layer"]
        EH[Event Hubs]
        K[Kafka]
    end

    subgraph Processing["Processing Layer"]
        DB[Databricks]
        ADF[Data Factory]
    end

    subgraph Storage["Storage Layer"]
        ADLS[ADLS Gen2]
        DL[Delta Lake]
        UC[Unity Catalog]
    end

    subgraph Analytics["Analytics Layer"]
        ML[MLflow]
        AI[Azure OpenAI]
        PBI[Power BI]
    end

    subgraph Security["Security Layer"]
        KV[Key Vault]
        AAD[Azure AD]
        PL[Private Link]
    end

    EH --> DB
    K --> DB
    DB --> ADLS
    DB --> DL
    DL --> ADLS
    DB --> ML
    DB --> AI
    DL --> PBI
    UC --> DB
    KV --> DB
    KV --> ADF
    AAD --> DB
    AAD --> PBI
    PL --> ADLS
    PL --> EH

Critical Path

  1. Data Ingestion: Event Hubs/Kafka → Databricks
  2. Processing: Databricks → Delta Lake → ADLS Gen2
  3. Governance: Unity Catalog → Access Control
  4. Analytics: Power BI → Direct Lake → Delta Tables
  5. Security: Azure AD → Key Vault → Private Link

Component Sizing Guide

Small Deployment (Dev/Test)

| Component | Size | Monthly Cost |
|---|---|---|
| Event Hubs | Basic, 2 TU | $50 |
| Databricks | Standard, 2 workers | $500 |
| ADLS Gen2 | 100 GB | $5 |
| Total | | ~$600 |

Medium Deployment (Production)

| Component | Size | Monthly Cost |
|---|---|---|
| Event Hubs | Premium, 4 PU | $2,000 |
| Databricks | Premium, 10 workers avg | $5,000 |
| ADLS Gen2 | 10 TB | $500 |
| Power BI | P1 capacity | $5,000 |
| Total | | ~$13,000 |

Large Deployment (Enterprise)

| Component | Size | Monthly Cost |
|---|---|---|
| Event Hubs | Premium, 16 PU | $10,000 |
| Databricks | Premium, 50 workers avg | $25,000 |
| ADLS Gen2 | 100 TB | $2,500 |
| Power BI | P3 capacity | $20,000 |
| Total | | ~$60,000 |


Last Updated: January 2025 | Version: 1.0.0 | Status: Production Ready