# Streaming Data Fundamentals



Master the fundamentals of streaming data and real-time analytics. Learn core concepts, patterns, and when to use streaming vs batch processing.

## Learning Objectives

After completing this tutorial, you will understand:

  • Difference between batch and streaming processing
  • Core streaming concepts (events, windows, watermarks)
  • Common streaming patterns and use cases
  • Azure streaming services and when to use each
  • Trade-offs between different approaches

## Batch vs Streaming Processing

### Batch Processing

Process large volumes of data at scheduled intervals.

Characteristics:

  • Processes historical data
  • Scheduled execution (hourly, daily, weekly)
  • Optimized for throughput
  • Higher latency acceptable

Example: Daily sales reports, monthly financial statements, yearly trend analysis

```python
# Batch Processing Example
def daily_sales_report():
    # Process all of yesterday's transactions
    yesterday_data = load_data(date="2025-01-08")
    summary = aggregate(yesterday_data)
    save_report(summary)

# Runs once per day at midnight
schedule_job(daily_sales_report, cron="0 0 * * *")
```

### Streaming Processing

Process data continuously as it arrives, in real-time or near-real-time.

Characteristics:

  • Processes data immediately upon arrival
  • Continuous execution
  • Optimized for low latency
  • Handles unbounded data streams

Example: Fraud detection, IoT monitoring, real-time dashboards

```python
# Streaming Processing Example
def process_transaction(transaction):
    # Process each transaction immediately
    if is_fraud(transaction):
        alert_security(transaction)
        block_transaction(transaction)
    save_to_database(transaction)

# Processes events continuously
event_stream.subscribe(process_transaction)
```

### Comparison Table

| Aspect | Batch Processing | Streaming Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Large datasets | Continuous small events |
| Use Cases | Reports, analytics | Monitoring, alerts, real-time |
| Complexity | Lower | Higher |
| Cost | Generally lower | Can be higher |
| Resource Usage | Periodic spikes | Constant utilization |

## Core Streaming Concepts

### __1. Events__

An event is a record of something that happened.

Event Structure:

```json
{
  "event_id": "evt_12345",
  "event_type": "purchase",
  "timestamp": "2025-01-09T10:30:45.123Z",
  "payload": {
    "user_id": "user_789",
    "product_id": "prod_456",
    "amount": 99.99,
    "currency": "USD"
  }
}
```

Event Properties:

  • Immutable: Once created, never changes
  • Timestamped: When it occurred
  • Ordered: Sequence matters (within partition)
  • Append-only: New events added, old ones never modified
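Immutability can be made concrete in code. A minimal sketch (the `Event` type here is ours, purely illustrative): a frozen dataclass refuses in-place modification, mirroring the append-only nature of an event stream.

```python
from dataclasses import dataclass

# Hypothetical event type: frozen=True makes instances immutable,
# so a created event can never be changed, only superseded by a new one.
@dataclass(frozen=True)
class Event:
    event_id: str
    event_type: str
    timestamp: str
    payload: dict

evt = Event("evt_12345", "purchase", "2025-01-09T10:30:45.123Z",
            {"user_id": "user_789", "amount": 99.99})

# Attempting to mutate a frozen dataclass raises FrozenInstanceError
try:
    evt.event_type = "refund"
except Exception as e:
    print(type(e).__name__)  # FrozenInstanceError
```

To "correct" an immutable event, systems append a new compensating event (e.g. a refund) rather than editing the original.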

### __2. Event Time vs Processing Time__

Understanding time is critical in streaming:

Event Time: When the event actually occurred

```python
event_time = transaction["timestamp"]  # 2025-01-09 10:30:45
```

Processing Time: When the event is processed by your system

```python
processing_time = datetime.utcnow()  # 2025-01-09 10:31:12
```

Why It Matters:

  • Network delays can cause events to arrive late
  • Devices might be offline and send batched events later
  • Time zones and clock skew issues
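The gap between the two clocks is simply their difference; a minimal sketch using the timestamps from above (fixed values for illustration):

```python
from datetime import datetime, timezone

# Event time: when it happened (carried inside the event itself)
event_time = datetime(2025, 1, 9, 10, 30, 45, tzinfo=timezone.utc)
# Processing time: when our system saw it (illustrative fixed value)
processing_time = datetime(2025, 1, 9, 10, 31, 12, tzinfo=timezone.utc)

# Lateness = how far behind the wall clock this event arrived
lateness = processing_time - event_time
print(lateness.total_seconds())  # 27.0
```

Windowed aggregations should group by `event_time`, not `processing_time`, or late arrivals will land in the wrong window.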

Example Scenario:

```text
IoT sensor records temperature at 10:00 AM (event time)
Network outage until 11:00 AM
Event arrives at server at 11:15 AM (processing time)
We want to analyze temperature at 10:00 AM, not 11:15 AM!
```

### __3. Windows__

Windows divide continuous streams into bounded chunks for aggregation.

#### __Tumbling Windows__ (Non-Overlapping)

Fixed-size, non-overlapping time intervals.

```text
Time:     0    5    10   15   20   25   30
Windows:  [----][----][----][----][----][----]
          Win1  Win2  Win3  Win4  Win5  Win6
```

__Use Case:__ Calculate average every 5 minutes

```sql
-- Tumbling window example
SELECT
    SYSTEM_TIMESTAMP() AS window_end,
    AVG(temperature) AS avg_temp
FROM sensor_stream
GROUP BY TumblingWindow(minute, 5)
```
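Under the hood, assigning an event to a tumbling window is just integer arithmetic on the event timestamp; a minimal Python sketch (ours, not Stream Analytics internals):

```python
WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows

def tumbling_window_start(epoch_seconds: int) -> int:
    # Round the event timestamp down to its window boundary
    return (epoch_seconds // WINDOW_SECONDS) * WINDOW_SECONDS

# Events at 10:02:10 and 10:04:59 share one window; 10:05:00 starts the next
assert tumbling_window_start(36130) == tumbling_window_start(36299) == 36000
assert tumbling_window_start(36300) == 36300
```

Because each timestamp maps to exactly one window start, tumbling windows count every event exactly once.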

#### __Hopping Windows__ (Overlapping)

Fixed-size windows that advance by smaller intervals.

```text
Time:     0    5    10   15   20   25   30
Windows:  [--------]
              [--------]
                   [--------]
                        [--------]
```

__Use Case:__ Moving averages, trend detection

```sql
-- Hopping window: 10-min window, 5-min hop
SELECT
    SYSTEM_TIMESTAMP() AS window_end,
    AVG(temperature) AS avg_temp
FROM sensor_stream
GROUP BY HoppingWindow(minute, 10, 5)
```
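Because hops overlap, each event belongs to several windows (window size ÷ hop size of them). A sketch of that assignment logic, assuming epoch-second timestamps (illustrative helper, not a real API):

```python
def hopping_windows(ts: int, window: int, hop: int) -> list[int]:
    """Return the start of every window whose [start, start + window) covers ts."""
    start = (ts // hop) * hop  # latest hop boundary at or before ts
    starts = []
    while start > ts - window:
        starts.append(start)
        start -= hop
    return sorted(starts)

# 10-min window, 5-min hop: each event falls into 2 overlapping windows
print(hopping_windows(ts=1260, window=600, hop=300))  # [900, 1200]
```

This duplication is the price of overlap: an event contributes to every window it falls in, which is exactly what a moving average needs.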

#### __Sliding Windows__ (Event-Triggered)

Window advances with each new event.

```text
Events:   E1    E2    E3    E4
Windows:  [---E1,E2---]
                [---E2,E3---]
                      [---E3,E4---]
```

__Use Case:__ Detect 3 failed logins within 1 minute

```sql
-- Sliding window example
SELECT
    user_id,
    COUNT(*) AS failed_attempts
FROM login_events
WHERE status = 'failed'
GROUP BY user_id, SlidingWindow(minute, 1)
HAVING COUNT(*) >= 3
```
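The same failed-login rule can be sketched in plain Python with a per-user deque of recent event times (illustrative, not Stream Analytics code): each new event slides the window forward by evicting anything older than 60 seconds.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 3

recent_failures = defaultdict(deque)  # user_id -> timestamps of recent failures

def on_failed_login(user_id: str, ts: float) -> bool:
    """Return True if this failure is the THRESHOLD-th within the last minute."""
    q = recent_failures[user_id]
    q.append(ts)
    # Slide the window: drop failures older than 60 seconds
    while q and ts - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) >= THRESHOLD

print(on_failed_login("user_789", 10.0))  # False
print(on_failed_login("user_789", 30.0))  # False
print(on_failed_login("user_789", 55.0))  # True
```

Note the window is re-evaluated on every event, which is what distinguishes sliding windows from the fixed boundaries of tumbling and hopping windows.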

#### __Session Windows__ (Gap-Based)

Variable-size windows based on periods of inactivity.

```text
Events:    E1 E2   (gap)   E3 E4 E5   (gap)   E6
Sessions:  [--S1--]        [---S2---]         [S3]
```

__Use Case:__ User session analysis

```sql
-- Session window: timeout after 10 min inactivity
SELECT
    user_id,
    COUNT(*) AS events_in_session,
    MAX(event_time) - MIN(event_time) AS session_duration
FROM clickstream
GROUP BY user_id, SessionWindow(minute, 10)
```
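Session assignment just compares each event's gap to the previous event against the timeout; a minimal sketch assuming a time-sorted list of one user's epoch-second timestamps (helper names are ours):

```python
SESSION_GAP = 10 * 60  # close a session after 10 min of inactivity

def split_sessions(timestamps: list[int]) -> list[list[int]]:
    """Group sorted timestamps into sessions separated by inactivity gaps."""
    sessions = []
    for ts in timestamps:
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)  # gap small enough: same session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions

clicks = [0, 120, 300, 1500, 1560, 4000]
print(split_sessions(clicks))  # [[0, 120, 300], [1500, 1560], [4000]]
```

Unlike the fixed-size windows above, session windows are variable-length: their boundaries are determined entirely by the data.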

### __4. Watermarks__

Watermarks track progress in event time and handle late data.

Problem: Events don't always arrive in order

```text
Event Time:   10:00  10:01  10:02  10:03  10:04
Arrival Time: 10:01  10:02  10:04  10:03  10:05
                ^      ^      ^      ^      ^
                OK     OK    Late!   OK     OK
```

__Watermark Solution:__

```text
Watermark: "I've processed all events up to time T"

Example:
- Watermark at 10:03 means:
  "All events with event_time <= 10:03 have been processed"

- If event with time 10:02 arrives after watermark:
  It's late data - handle specially!
```

Handling Late Data:

  1. Drop Late Events: Ignore events past watermark
  2. Side Output: Send late events to separate stream
  3. Allowed Lateness: Accept events up to X minutes late
  4. Update Results: Recalculate when late data arrives
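Strategies 1-3 reduce to a comparison against the watermark plus a grace period. A toy illustration of the decision (all names are ours, not a real streaming API):

```python
ALLOWED_LATENESS = 120  # accept events up to 2 minutes past the watermark

def classify(event_time: int, watermark: int) -> str:
    """Decide how to treat an event relative to the current watermark."""
    if event_time >= watermark:
        return "on-time"
    if watermark - event_time <= ALLOWED_LATENESS:
        return "late-but-accepted"   # update the window's result
    return "dropped"                 # or route to a side-output stream

print(classify(event_time=600, watermark=500))  # on-time
print(classify(event_time=450, watermark=500))  # late-but-accepted
print(classify(event_time=100, watermark=500))  # dropped
```

The trade-off: a longer allowed lateness catches more stragglers but delays finalizing window results and holds state longer.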

## Azure Streaming Services

### Azure Event Hubs


What It Does: Receives and buffers streaming events

Best For:

  • High-throughput event ingestion (millions of events/sec)
  • IoT telemetry collection
  • Application logging at scale
  • Clickstream data capture

Example:

```python
# Send events to Event Hub (synchronous client)
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str, eventhub_name=eventhub_name
)
batch = producer.create_batch()
batch.add(EventData("Temperature: 75F"))
producer.send_batch(batch)
producer.close()
```

### Azure Stream Analytics


What It Does: Processes and analyzes streaming data using SQL

Best For:

  • Real-time analytics with SQL queries
  • Time-windowed aggregations
  • Anomaly detection
  • Routing/filtering events

Example:

```sql
-- Stream Analytics query
SELECT
    System.Timestamp() AS WindowEnd,
    DeviceId,
    AVG(Temperature) AS AvgTemp,
    MAX(Temperature) AS MaxTemp
INTO
    [output-power-bi]
FROM
    [input-eventhub]
GROUP BY
    DeviceId,
    TumblingWindow(minute, 5)
```

## Best Practices

### 1. Design for Idempotency

Assume events might be processed multiple times:

```python
# Bad: Not idempotent
def process_purchase(event):
    inventory_count -= event["quantity"]  # Double processing = wrong count!

# Good: Idempotent
def process_purchase(event):
    if not already_processed(event["event_id"]):
        inventory_count -= event["quantity"]
        mark_processed(event["event_id"])
```

### 2. Handle Late Data

```python
# Configure allowed lateness
stream.window(
    TumblingWindow(minutes=5),
    allowed_lateness=minutes(2)  # Accept events up to 2 min late
)
```

## Next Steps

### Practice Exercises

  1. Build Temperature Monitoring
     • Ingest sensor data to Event Hubs
     • Calculate rolling averages with windows
     • Alert when temperature exceeds threshold

  2. Clickstream Analytics
     • Process website clicks
     • Count page views by window
     • Detect user sessions

### Continue Learning


Ready for hands-on practice? Try the Event Hubs Quickstart!


Last Updated: January 2025 | Tutorial Version: 1.0