# Streaming Data Fundamentals
Master the fundamentals of streaming data and real-time analytics. Learn core concepts, patterns, and when to use streaming vs batch processing.
## Learning Objectives
After completing this tutorial, you will understand:
- Difference between batch and streaming processing
- Core streaming concepts (events, windows, watermarks)
- Common streaming patterns and use cases
- Azure streaming services and when to use each
- Trade-offs between different approaches
## Batch vs Streaming Processing
### Batch Processing
Process large volumes of data at scheduled intervals.
__Characteristics:__
- Processes historical data
- Scheduled execution (hourly, daily, weekly)
- Optimized for throughput
- Higher latency acceptable
__Example:__ Daily sales reports, monthly financial statements, yearly trend analysis
```python
# Batch Processing Example
def daily_sales_report():
    # Process all of yesterday's transactions
    yesterday_data = load_data(date="2025-01-08")
    summary = aggregate(yesterday_data)
    save_report(summary)

# Runs once per day at midnight
schedule_job(daily_sales_report, cron="0 0 * * *")
```
### Streaming Processing
Process data continuously as it arrives, in real-time or near-real-time.
__Characteristics:__
- Processes data immediately upon arrival
- Continuous execution
- Optimized for low latency
- Handles unbounded data streams
__Example:__ Fraud detection, IoT monitoring, real-time dashboards
```python
# Streaming Processing Example
def process_transaction(transaction):
    # Process each transaction immediately
    if is_fraud(transaction):
        alert_security(transaction)
        block_transaction(transaction)
    save_to_database(transaction)

# Processes events continuously
event_stream.subscribe(process_transaction)
```
### Comparison Table
| Aspect | Batch Processing | Streaming Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Large datasets | Continuous small events |
| Use Cases | Reports, analytics | Monitoring, alerts, real-time |
| Complexity | Lower | Higher |
| Cost | Generally lower | Can be higher |
| Resource Usage | Periodic spikes | Constant utilization |
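To make the table concrete, the same aggregation can be written both ways: a batch pass over a complete dataset versus a streaming consumer that updates a running total per event. A minimal Python sketch (the event list and names are illustrative):

```python
# Batch: process the whole dataset in one scheduled pass
def batch_total(events):
    return sum(e["amount"] for e in events)

# Streaming: update state incrementally as each event arrives
class StreamingTotal:
    def __init__(self):
        self.total = 0.0

    def on_event(self, event):
        # Low latency: the result is current after every event
        self.total += event["amount"]
        return self.total

events = [{"amount": 10.0}, {"amount": 5.5}, {"amount": 4.5}]

streaming = StreamingTotal()
for e in events:
    streaming.on_event(e)

# Both approaches agree on the final answer; they differ in *when* it is available.
print(batch_total(events), streaming.total)  # 20.0 20.0
```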
## Core Streaming Concepts
### __1. Events__
An event is a record of something that happened.
__Event Structure:__
```json
{
  "event_id": "evt_12345",
  "event_type": "purchase",
  "timestamp": "2025-01-09T10:30:45.123Z",
  "payload": {
    "user_id": "user_789",
    "product_id": "prod_456",
    "amount": 99.99,
    "currency": "USD"
  }
}
```
__Event Properties:__
- Immutable: Once created, never changes
- Timestamped: When it occurred
- Ordered: Sequence matters (within partition)
- Append-only: New events added, old ones never modified
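These properties can be modeled directly in code. A minimal Python sketch using a frozen dataclass for immutability and a plain list as an append-only log (field names follow the example above):

```python
from dataclasses import dataclass, field, FrozenInstanceError

# Immutable, timestamped event record (a sketch, not a real event library)
@dataclass(frozen=True)
class Event:
    event_id: str
    event_type: str
    timestamp: str
    payload: dict = field(default_factory=dict)

# Append-only log: new events are added, existing ones never modified
log = []
log.append(Event("evt_1", "purchase", "2025-01-09T10:30:45.123Z", {"amount": 99.99}))

try:
    log[0].event_id = "evt_2"  # mutation is rejected by the frozen dataclass
except FrozenInstanceError:
    print("events are immutable")
```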
### __2. Event Time vs Processing Time__
Understanding time is critical in streaming:
__Event Time:__ When the event actually occurred
__Processing Time:__ When the event is processed by your system
__Why It Matters:__
- Network delays can cause events to arrive late
- Devices might be offline and send batched events later
- Time zones and clock skew issues
__Example Scenario:__

```text
IoT sensor records temperature at 10:00 AM (event time)
Network outage until 11:00 AM
Event arrives at server at 11:15 AM (processing time)
We want to analyze temperature at 10:00 AM, not 11:15 AM!
```
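In code, the fix is to bucket by the event's own timestamp rather than by the wall clock at processing time. A small illustrative Python sketch (times taken from the scenario above):

```python
from datetime import datetime, timezone

def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (the window it belongs to)."""
    return ts.replace(minute=0, second=0, microsecond=0)

# The sensor reading from the scenario above
event_time = datetime(2025, 1, 9, 10, 0, tzinfo=timezone.utc)        # measured at 10:00
processing_time = datetime(2025, 1, 9, 11, 15, tzinfo=timezone.utc)  # arrived at 11:15

# Grouping by event time puts the reading in the 10:00 window,
# even though it was processed during the 11:00 hour.
print(hour_bucket(event_time).hour, hour_bucket(processing_time).hour)  # 10 11
```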
### __3. Windows__
Windows divide continuous streams into bounded chunks for aggregation.
#### __Tumbling Windows__ (Non-Overlapping)
Fixed-size, non-overlapping time intervals.
```text
Time:    0     5     10    15    20    25    30
Windows: [----][----][----][----][----][----]
          Win1  Win2  Win3  Win4  Win5  Win6
```
__Use Case:__ Calculate average every 5 minutes
```sql
-- Tumbling window example
SELECT
    System.Timestamp() AS window_end,
    AVG(temperature) AS avg_temp
FROM sensor_stream
GROUP BY TumblingWindow(minute, 5)
```
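The same tumbling aggregation can be sketched in plain Python to show the mechanics: each event maps to exactly one window, keyed by the window's start time (timestamps and readings are illustrative):

```python
from collections import defaultdict

def tumbling_avg(events, window_seconds=300):
    """Average readings per fixed, non-overlapping window, keyed by window start."""
    buckets = defaultdict(list)
    for ts, temp in events:
        # Each event falls in exactly one window: the floor of its timestamp
        window_start = ts - (ts % window_seconds)
        buckets[window_start].append(temp)
    return {start: sum(v) / len(v) for start, v in buckets.items()}

# Timestamps in seconds; two events in [0, 300), one in [300, 600)
readings = [(10, 70.0), (200, 72.0), (310, 80.0)]
print(tumbling_avg(readings))  # {0: 71.0, 300: 80.0}
```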
#### __Hopping Windows__ (Overlapping)
Fixed-size windows that advance by smaller intervals.
```text
Time:    0    5    10   15   20   25   30
Windows: [--------]
              [--------]
                   [--------]
                        [--------]
```
__Use Case:__ Moving averages, trend detection
```sql
-- Hopping window: 10-min window, 5-min hop
SELECT
    System.Timestamp() AS window_end,
    AVG(temperature) AS avg_temp
FROM sensor_stream
GROUP BY HoppingWindow(minute, 10, 5)
```
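Because hopping windows overlap, a single event belongs to several windows (window size divided by hop). A small Python sketch of that assignment logic (sizes in seconds, values illustrative):

```python
def hopping_windows(ts, window_size=600, hop=300):
    """Return the start times of every hopping window that contains `ts`."""
    # The latest candidate window starts at the hop-aligned floor of ts
    start = ts - (ts % hop)
    starts = []
    # Walk backwards while the window [start, start + window_size) still covers ts
    while start > ts - window_size:
        starts.append(start)
        start -= hop
    return sorted(starts)

# With a 10-minute window and 5-minute hop, every event lands in two windows
print(hopping_windows(700))  # [300, 600]
```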
#### __Sliding Windows__ (Event-Triggered)
Window advances with each new event.
```text
Events:  E1    E2    E3    E4
Windows: [---E1,E2---]
               [---E2,E3---]
                     [---E3,E4---]
```
__Use Case:__ Detect 3 failed logins within 1 minute
```sql
-- Sliding window example
SELECT
    user_id,
    COUNT(*) AS failed_attempts
FROM login_events
WHERE status = 'failed'
GROUP BY user_id, SlidingWindow(minute, 1)
HAVING COUNT(*) >= 3
```
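The same detection can be sketched in Python with a per-user deque of failure timestamps, evicting entries older than the window on each new event (class name and thresholds are illustrative):

```python
from collections import defaultdict, deque

class FailedLoginDetector:
    """Flag a user with >= `threshold` failures inside a sliding window."""

    def __init__(self, window_seconds=60, threshold=3):
        self.window = window_seconds
        self.threshold = threshold
        self.attempts = defaultdict(deque)  # user_id -> timestamps of recent failures

    def on_failed_login(self, user_id, ts):
        q = self.attempts[user_id]
        q.append(ts)
        # Evict failures that fell out of the window ending at `ts`
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q) >= self.threshold  # True -> raise an alert

detector = FailedLoginDetector()
print(detector.on_failed_login("user_789", 0))   # False
print(detector.on_failed_login("user_789", 20))  # False
print(detector.on_failed_login("user_789", 45))  # True: 3 failures within 60s
```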
#### __Session Windows__ (Gap-Based)
Variable-size windows based on periods of inactivity.
```text
Events:   E1 E2   (gap)   E3 E4 E5   (gap)   E6
Sessions: [--S1--]        [---S2---]         [S3]
```
__Use Case:__ User session analysis
```sql
-- Session window: timeout after 10 min inactivity
SELECT
    user_id,
    COUNT(*) AS events_in_session,
    DATEDIFF(second, MIN(event_time), MAX(event_time)) AS session_duration_seconds
FROM clickstream
GROUP BY user_id, SessionWindow(minute, 10)
```
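In plain Python, sessionizing a sorted stream is just splitting wherever the gap between consecutive events exceeds the timeout. An illustrative sketch:

```python
def sessionize(timestamps, gap_seconds=600):
    """Group sorted event timestamps into sessions split by gaps of inactivity."""
    sessions = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] > gap_seconds:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Clicks at 0s and 60s, a long gap, then 2000s and 2100s -> two sessions
print(sessionize([0, 60, 2000, 2100]))  # [[0, 60], [2000, 2100]]
```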
### __4. Watermarks__
Watermarks track progress in event time and handle late data.
__Problem:__ Events don't always arrive in order
```text
Event Time:   10:00  10:01  10:02  10:03  10:04
Arrival Time: 10:01  10:02  10:04  10:03  10:05
                ^      ^      ^      ^      ^
                OK     OK    Late!   OK     OK
```
__Watermark Solution:__
```text
Watermark: "I've processed all events up to time T"

Example:
- Watermark at 10:03 means:
  "All events with event_time <= 10:03 have been processed"
- If an event with time 10:02 arrives after the watermark:
  It's late data - handle it specially!
```
__Handling Late Data:__
- __Drop Late Events:__ Ignore events past the watermark
- __Side Output:__ Send late events to a separate stream
- __Allowed Lateness:__ Accept events up to X minutes late
- __Update Results:__ Recalculate when late data arrives
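A toy Python sketch of the watermark idea, combining allowed lateness with a side-output decision (the class and thresholds are illustrative, not a real engine's API):

```python
class WatermarkTracker:
    """Track a watermark over event time and classify arriving events."""

    def __init__(self, allowed_lateness=120):
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")  # "all events up to this event time are in"

    def on_event(self, event_time):
        # Advance the watermark, holding it back by the allowed lateness
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        if event_time >= self.watermark:
            return "on_time"  # include in the main aggregation
        return "late"         # side output / drop / trigger result update

tracker = WatermarkTracker(allowed_lateness=120)
print(tracker.on_event(600))  # on_time (watermark advances to 480)
print(tracker.on_event(500))  # on_time (within allowed lateness)
print(tracker.on_event(300))  # late (before the watermark)
```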
## Azure Streaming Services
### Azure Event Hubs
__What It Does:__ Receives and buffers streaming events
__Best For:__
- High-throughput event ingestion (millions of events/sec)
- IoT telemetry collection
- Application logging at scale
- Clickstream data capture
__Example:__
```python
# Send events to Event Hubs (use azure.eventhub.aio for the async client)
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str, eventhub_name=eventhub_name
)
batch = producer.create_batch()
batch.add(EventData("Temperature: 75F"))
producer.send_batch(batch)
```
### Azure Stream Analytics
__What It Does:__ Processes and analyzes streaming data using SQL
__Best For:__
- Real-time analytics with SQL queries
- Time-windowed aggregations
- Anomaly detection
- Routing/filtering events
__Example:__
```sql
-- Stream Analytics query
SELECT
    System.Timestamp() AS WindowEnd,
    DeviceId,
    AVG(Temperature) AS AvgTemp,
    MAX(Temperature) AS MaxTemp
INTO
    [output-power-bi]
FROM
    [input-eventhub]
GROUP BY
    DeviceId,
    TumblingWindow(minute, 5)
```
## Best Practices
### 1. Design for Idempotency
Assume events might be processed multiple times:
```python
# Bad: Not idempotent
def process_purchase(event):
    inventory_count -= event["quantity"]  # Double processing = wrong count!

# Good: Idempotent
def process_purchase(event):
    if not already_processed(event["event_id"]):
        inventory_count -= event["quantity"]
        mark_processed(event["event_id"])
```
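A runnable sketch of the idempotent version, with the dedup state held in memory (in production it would live in a durable store; all names are illustrative):

```python
class InventoryConsumer:
    """Consumer that tolerates duplicate event delivery."""

    def __init__(self, initial_count):
        self.count = initial_count
        self.processed_ids = set()  # dedup state; durable storage in production

    def process_purchase(self, event):
        if event["event_id"] in self.processed_ids:
            return  # duplicate delivery: safely ignored
        self.count -= event["quantity"]
        self.processed_ids.add(event["event_id"])

consumer = InventoryConsumer(initial_count=100)
event = {"event_id": "evt_12345", "quantity": 3}
consumer.process_purchase(event)
consumer.process_purchase(event)  # redelivered by the broker
print(consumer.count)  # 97, not 94
```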
### 2. Handle Late Data
```python
# Configure allowed lateness
stream.window(
    TumblingWindow(minutes=5),
    allowed_lateness=minutes(2)  # Accept events up to 2 min late
)
```
## Next Steps
### Practice Exercises
1. __Build Temperature Monitoring__
   - Ingest sensor data to Event Hubs
   - Calculate rolling averages with windows
   - Alert when temperature exceeds threshold
2. __Clickstream Analytics__
   - Process website clicks
   - Count page views by window
   - Detect user sessions
### Continue Learning
- __Event Hubs Quickstart__ - Hands-on with Event Hubs
- __Stream Analytics Tutorial__ - Build streaming queries
- __Real-Time Analytics Solution__ - End-to-end architecture
Ready for hands-on practice? Try the Event Hubs Quickstart!
Last Updated: January 2025 | Tutorial Version: 1.0