🔥 Azure Databricks Quickstart¶
Get started with Azure Databricks in under an hour. Learn to create a workspace, run Spark notebooks, and process data at scale.
🎯 Learning Objectives¶
After completing this quickstart, you will be able to:
- Understand what Azure Databricks is and its capabilities
- Create a Databricks workspace
- Launch and configure a Spark cluster
- Create and run notebooks with PySpark
- Load and analyze data from ADLS Gen2
- Visualize results with built-in charts
📋 Prerequisites¶
Before starting, ensure you have:
- Azure subscription - Create free account
- Azure Portal access - portal.azure.com
- Basic Python knowledge - Understanding of variables, loops, functions
- ADLS Gen2 account (optional) - Create one or use sample data
🔍 What is Azure Databricks?¶
Azure Databricks is an Apache Spark-based analytics platform optimized for Azure, providing:
- Unified analytics - Data engineering, data science, and ML
- Collaborative notebooks - Interactive development environment
- Auto-scaling clusters - Elastic compute resources
- Delta Lake - Reliable data lakes with ACID transactions
- Integration - Seamless Azure service connectivity
Key Concepts¶
- Workspace: Environment for notebooks, clusters, and data
- Cluster: Set of computation resources (VMs)
- Notebook: Interactive document with code and visualizations
- Delta Lake: Storage layer with reliability and performance
- Job: Scheduled execution of notebooks or JARs
When to Use Databricks¶
✅ Good For:
- Big data processing (TB-PB scale)
- ETL/ELT pipelines
- Machine learning workflows
- Real-time streaming analytics
- Collaborative data science
❌ Not Ideal For:
- Small datasets (<1GB)
- Simple SQL queries (use Synapse Serverless)
- Real-time low-latency apps (use Event Hubs)
🚀 Step 1: Create Databricks Workspace¶
Using Azure Portal¶
1. Navigate to Azure Portal
   - Go to portal.azure.com
   - Click "Create a resource"
   - Search for "Azure Databricks"
   - Click "Create"
2. Configure Basics
   - Subscription: Select your subscription
   - Resource Group: Create new "rg-databricks-quickstart"
   - Workspace Name: "databricks-quickstart-[yourname]"
   - Region: Select the nearest region
   - Pricing Tier: Trial (Premium - 14 days free) or Standard
3. Networking (Optional)
   - Keep the default public network access for the quickstart
4. Review and Create
   - Click "Review + create"
   - Click "Create"
   - Wait 3-5 minutes for deployment
Using Azure CLI¶
```bash
# Set variables
RESOURCE_GROUP="rg-databricks-quickstart"
LOCATION="eastus"
WORKSPACE_NAME="databricks-quickstart-$RANDOM"

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

# Create Databricks workspace
az databricks workspace create \
  --resource-group $RESOURCE_GROUP \
  --name $WORKSPACE_NAME \
  --location $LOCATION \
  --sku trial

echo "Workspace: $WORKSPACE_NAME"
```
🖥️ Step 2: Launch Workspace¶
1. Navigate to Workspace
   - Go to your Databricks workspace in the Azure Portal
   - Click "Launch Workspace"
   - You'll be redirected to the Databricks portal
2. First-Time Setup
   - You may need to sign in with your Azure credentials
   - Accept the terms if prompted
⚙️ Step 3: Create Cluster¶
Create Compute Cluster¶
1. Navigate to Compute
   - Click the "Compute" icon in the left sidebar
   - Click "Create Cluster"
2. Configure Cluster
   - Cluster Name: "quickstart-cluster"
   - Cluster Mode: Single Node (for quickstart/development)
   - Databricks Runtime: Latest LTS version (e.g., 12.2 LTS)
   - Node Type: Standard_DS3_v2 (or the smallest available)
   - Terminate after: 30 minutes of inactivity
3. Create Cluster
   - Click "Create Cluster"
   - Wait 3-5 minutes for the cluster to start
   - Status changes to "Running" when ready
💡 Tip: Single Node mode is cheaper for learning. Use Standard mode for production workloads.
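If you prefer scripting to the UI, the same cluster can be created through the Clusters REST API. A minimal sketch, assuming you have generated a personal access token; the workspace URL and token below are placeholders, and the spec mirrors the UI choices above:

```python
# Sketch: create the quickstart cluster via the Clusters API 2.0.
# DATABRICKS_HOST and DATABRICKS_TOKEN are placeholders for your
# workspace URL and personal access token (User Settings > Access tokens).
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
DATABRICKS_TOKEN = "dapi..."

cluster_spec = {
    "cluster_name": "quickstart-cluster",
    "spark_version": "12.2.x-scala2.12",   # 12.2 LTS
    "node_type_id": "Standard_DS3_v2",
    "autotermination_minutes": 30,
    "num_workers": 0,                       # single-node cluster
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
)
print(resp.json())  # {"cluster_id": "..."} on success
```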
📓 Step 4: Create Your First Notebook¶
Create Notebook¶
1. Navigate to Workspace
   - Click the "Workspace" icon in the left sidebar
   - Click your username folder
   - Click the dropdown arrow > "Create" > "Notebook"
2. Configure Notebook
   - Name: "Quickstart Tutorial"
   - Default Language: Python
   - Cluster: Select "quickstart-cluster"
   - Click "Create"
Run Your First Code¶
```python
# Cell 1: Hello Databricks!
print("🔥 Hello from Databricks!")
print(f"Spark version: {spark.version}")
print(f"Running on cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")
```
Click "Run Cell" or press Shift+Enter
📊 Step 5: Work with DataFrames¶
Create Sample Data¶
```python
# Cell 2: Create sample sales data
from pyspark.sql import Row

# Generate sample data
sales_data = [
    Row(order_id=1001, customer_id="C101", product="Laptop", amount=1299.99, order_date="2024-01-15"),
    Row(order_id=1002, customer_id="C102", product="Chair", amount=249.99, order_date="2024-01-15"),
    Row(order_id=1003, customer_id="C101", product="Monitor", amount=399.99, order_date="2024-01-16"),
    Row(order_id=1004, customer_id="C103", product="Desk", amount=549.99, order_date="2024-01-16"),
    Row(order_id=1005, customer_id="C102", product="Keyboard", amount=89.99, order_date="2024-01-17"),
    Row(order_id=1006, customer_id="C104", product="Mouse", amount=39.99, order_date="2024-01-17"),
    Row(order_id=1007, customer_id="C101", product="Lamp", amount=79.99, order_date="2024-01-18"),
    Row(order_id=1008, customer_id="C105", product="Tablet", amount=599.99, order_date="2024-01-18"),
]

# Create DataFrame
df = spark.createDataFrame(sales_data)

# Display DataFrame
display(df)
```
Basic DataFrame Operations¶
```python
# Cell 3: Explore the data
print(f"Total orders: {df.count()}")
print(f"\nColumns: {df.columns}")
print("\nSchema:")
df.printSchema()
```
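Spark inferred the column types from the `Row` objects above. For anything beyond a quickstart you would normally declare the schema explicitly; a minimal sketch reusing `sales_data`:

```python
# Declare an explicit schema instead of relying on inference
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

sales_schema = StructType([
    StructField("order_id", IntegerType(), False),
    StructField("customer_id", StringType(), False),
    StructField("product", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", StringType(), True),
])

df_typed = spark.createDataFrame(sales_data, schema=sales_schema)
df_typed.printSchema()
```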
```python
# Cell 4: Filter and aggregate
# Note: these imports shadow Python's built-in sum() in this notebook
from pyspark.sql.functions import sum, count

# Calculate total sales
total_sales = df.select(sum("amount")).collect()[0][0]
print(f"Total sales: ${total_sales:,.2f}")

# Sales by customer
customer_sales = df.groupBy("customer_id") \
    .agg(
        count("order_id").alias("order_count"),
        sum("amount").alias("total_spent")
    ) \
    .orderBy("total_spent", ascending=False)

display(customer_sales)
```
📈 Step 6: Visualizations¶
Create Charts¶
```python
# Cell 5: Sales by customer
customer_sales = df.groupBy("customer_id") \
    .agg(sum("amount").alias("total_sales")) \
    .orderBy("total_sales", ascending=False)

display(customer_sales)
```
After running the cell:
1. Click the chart icon below the results
2. Select "Bar Chart"
3. Configure:
   - Keys: customer_id
   - Values: total_sales
4. Click "Apply"
💾 Step 7: Write Data to Delta Lake¶
Save as Delta Table¶
```python
# Cell 6: Write to Delta Lake
# Delta Lake provides ACID transactions and time travel

# Write DataFrame to a Delta table
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales_data")

print("✅ Data saved to Delta table: sales_data")
```
Read from Delta Table¶
```python
# Cell 7: Read from the Delta table
sales_df = spark.table("sales_data")

print(f"Loaded {sales_df.count()} rows from Delta table")
display(sales_df)
```
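Because Delta records every write in a transaction log, you can inspect the table's history and query earlier versions (time travel):

```python
# Show the table's transaction history
display(spark.sql("DESCRIBE HISTORY sales_data"))

# Time travel: read the table as of its first version
first_version = spark.sql("SELECT * FROM sales_data VERSION AS OF 0")
display(first_version)
```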
🔗 Step 8: Connect to ADLS Gen2 (Optional)¶
If you have an ADLS Gen2 account:
```python
# Cell 8: Configure ADLS Gen2 access
storage_account = "your-storage-account"
container = "data"

# Set Spark configuration for storage access
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    "your-storage-account-key"  # Get from Azure Portal > Access Keys
)

# Read CSV from ADLS Gen2
adls_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales.csv"

df_adls = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(adls_path)

display(df_adls)
```
Better: Use Service Principal¶
```python
# Cell 9: Service principal authentication (recommended for production)
service_principal_id = "your-service-principal-id"
service_principal_secret = "your-service-principal-secret"
tenant_id = "your-tenant-id"

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    service_principal_id
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    service_principal_secret
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)
```
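Hardcoding credentials in a notebook is risky. A better pattern is a Databricks secret scope (optionally backed by Azure Key Vault); the scope and key names below are assumptions you would create first with the Databricks CLI:

```python
# Read credentials from a secret scope instead of hardcoding them.
# "quickstart-scope" and the key names are placeholders.
service_principal_id = dbutils.secrets.get(scope="quickstart-scope", key="sp-client-id")
service_principal_secret = dbutils.secrets.get(scope="quickstart-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="quickstart-scope", key="tenant-id")
```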
🎯 Useful Databricks Features¶
Magic Commands¶
```python
# %python - Run Python code (default)
# %sql    - Run SQL queries
# %scala  - Run Scala code
# %r      - Run R code
# %sh     - Run shell commands
# %fs     - File system commands
# %md     - Markdown for documentation
```
File System Commands¶
Explore DBFS (the Databricks File System). Each `%fs` magic must be the first line of its own cell:

```python
%fs ls /
```

List managed (Delta) tables:

```python
%fs ls /user/hive/warehouse/
```

Create a directory:

```python
%fs mkdirs /tmp/mydata
```
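The same operations are available from plain Python via `dbutils.fs`, which can be mixed freely with other code in a cell:

```python
# dbutils.fs is the programmatic equivalent of the %fs magic
for entry in dbutils.fs.ls("/"):
    print(entry.path)

dbutils.fs.mkdirs("/tmp/mydata")
```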
SQL Queries¶
```sql
%sql
-- Cell 11: Query with SQL (the %sql magic must be the first line of the cell)
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
FROM sales_data
GROUP BY customer_id
ORDER BY total_spent DESC
```
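The same query can run from a Python cell with `spark.sql()`, which returns a DataFrame you can keep transforming:

```python
# The same aggregation from Python via spark.sql()
result = spark.sql("""
    SELECT
        customer_id,
        COUNT(*)    AS order_count,
        SUM(amount) AS total_spent
    FROM sales_data
    GROUP BY customer_id
    ORDER BY total_spent DESC
""")
display(result)
```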
💡 Best Practices¶
Cluster Management¶
- Use auto-termination - Set 30-60 minutes for dev clusters
- Right-size clusters - Start small, scale up if needed
- Use spot instances - 60-80% cost savings for fault-tolerant workloads
- Pools - Pre-warmed VMs for faster cluster starts
Notebook Organization¶
- Use markdown - Document your analysis
- Cell structure - One logical operation per cell
- Parameters - Use widgets for parameterized notebooks (see the sketch after this list)
- Version control - Integrate with Git repos
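For the Parameters tip above, a minimal widget sketch; the widget name and default value are illustrative:

```python
# Define a text widget and read its value
dbutils.widgets.text("customer_id", "C101", "Customer ID")

selected = dbutils.widgets.get("customer_id")
display(df.filter(df.customer_id == selected))
```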
Performance Tips¶
- Cache DataFrames - `df.cache()` for repeated operations (see the sketch below)
- Partition data - use `partitionBy()` for large datasets
- Optimize file sizes - aim for 128 MB-1 GB per file
- Use Delta Lake - better performance than plain Parquet
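A minimal sketch of the caching and partitioning tips, reusing the `df` from earlier cells; the partitioned table name is illustrative:

```python
# Cache a DataFrame you will reuse across several actions
df.cache()
df.count()  # the first action materializes the cache

# Partition on write so downstream reads can prune by order_date
df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .saveAsTable("sales_data_partitioned")
```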
🔧 Troubleshooting¶
Common Issues¶
Cluster won't start
- ✅ Check quota in subscription
- ✅ Verify VM type available in region
- ✅ Check networking/firewall rules
Out of memory errors
- ✅ Increase worker node size
- ✅ Add more workers
- ✅ Optimize DataFrame operations
- ✅ Use broadcast joins for small tables
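For the broadcast-join tip, a small sketch; `dim_products` is a hypothetical lookup table:

```python
from pyspark.sql.functions import broadcast

# Hypothetical small dimension table
dim_products = spark.createDataFrame(
    [("Laptop", "Electronics"), ("Chair", "Furniture")],
    ["product", "category"],
)

# Broadcasting ships the small table to every executor,
# avoiding a shuffle of the large side of the join
joined = df.join(broadcast(dim_products), on="product", how="left")
display(joined)
```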
Slow performance
- ✅ Check data skew
- ✅ Optimize partition count
- ✅ Use Delta Lake optimizations
- ✅ Enable adaptive query execution
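Adaptive query execution is enabled by default on recent runtimes, but the settings can be toggled explicitly; these are standard Spark SQL configs:

```python
# Enable adaptive query execution and automatic partition coalescing
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# For skewed keys, explicit repartitioning can also help
balanced = df.repartition(8, "customer_id")
print(balanced.rdd.getNumPartitions())
```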
Cannot access storage
- ✅ Verify credentials
- ✅ Check firewall rules
- ✅ Ensure managed identity has permissions
🎓 Next Steps¶
Beginner Practice¶
- Load your own CSV data
- Create more complex SQL queries
- Build visualizations
- Schedule a notebook as a job
Intermediate Challenges¶
- Implement ETL pipeline
- Use Delta Lake time travel
- Create parameterized widgets
- Integrate with Azure Data Factory
Advanced Topics¶
- Build ML models with MLflow
- Implement streaming with Structured Streaming
- Use Auto Loader for incremental data
- Set up CI/CD with Azure DevOps
📚 Additional Resources¶
Documentation¶
- Azure Databricks documentation (Microsoft Learn)
- Apache Spark documentation
- Delta Lake documentation
Next Tutorials¶
- Databricks Notebooks - Deep dive into notebooks
- Delta Lake Basics - Learn Delta Lake features
- Data Engineer Path
Training¶
- Databricks Academy - Free courses
- Databricks Community Edition - Free tier
- Microsoft Learn Databricks
🧹 Cleanup¶
To avoid Azure charges:
Delete Cluster¶
- Navigate to "Compute"
- Click cluster name
- Click "Terminate"
- Click "Confirm"
Delete Workspace¶
Deleting the resource group removes the workspace and everything in it. In the Azure Portal:
1. Navigate to Resource Groups
2. Select "rg-databricks-quickstart"
3. Click "Delete resource group"
4. Type the resource group name to confirm
5. Click "Delete"
🎉 Congratulations!¶
You've successfully:
✅ Created an Azure Databricks workspace
✅ Launched and configured a Spark cluster
✅ Created and ran interactive notebooks
✅ Processed data with PySpark
✅ Saved data to Delta Lake
✅ Built visualizations
You're ready to build big data solutions with Azure Databricks!
Next Recommended Tutorial: Databricks Notebooks for advanced notebook techniques
Last Updated: January 2025
Tutorial Version: 1.0
Tested with: Databricks Runtime 12.2 LTS