
🔥 Azure Databricks Quickstart


Get started with Azure Databricks in under an hour. Learn to create a workspace, run Spark notebooks, and process data at scale.

🎯 Learning Objectives

After completing this quickstart, you will be able to:

  • Understand what Azure Databricks is and its capabilities
  • Create a Databricks workspace
  • Launch and configure a Spark cluster
  • Create and run notebooks with PySpark
  • Load and analyze data from ADLS Gen2
  • Visualize results with built-in charts

📋 Prerequisites

Before starting, ensure you have:

  • An active Azure subscription with permission to create resource groups and resources

🔍 What is Azure Databricks?

Azure Databricks is an Apache Spark-based analytics platform optimized for Azure, providing:

  • Unified analytics - Data engineering, data science, and ML
  • Collaborative notebooks - Interactive development environment
  • Auto-scaling clusters - Elastic compute resources
  • Delta Lake - Reliable data lakes with ACID transactions
  • Integration - Seamless Azure service connectivity

Key Concepts

  • Workspace: Environment for notebooks, clusters, and data
  • Cluster: Set of computation resources (VMs)
  • Notebook: Interactive document with code and visualizations
  • Delta Lake: Storage layer with reliability and performance
  • Job: Scheduled execution of notebooks or JARs

When to Use Databricks

Good For:

  • Big data processing (TB-PB scale)
  • ETL/ELT pipelines
  • Machine learning workflows
  • Real-time streaming analytics
  • Collaborative data science

Not Ideal For:

  • Small datasets (<1GB)
  • Simple SQL queries (use Synapse Serverless)
  • Real-time low-latency apps (use Event Hubs)

🚀 Step 1: Create Databricks Workspace

Using Azure Portal

  1. Navigate to Azure Portal
     • Go to portal.azure.com
     • Click "Create a resource"
     • Search for "Azure Databricks"
     • Click "Create"

  2. Configure Basics
     • Subscription: Select your subscription
     • Resource Group: Create new "rg-databricks-quickstart"
     • Workspace Name: "databricks-quickstart-[yourname]"
     • Region: Select the nearest region
     • Pricing Tier: Trial (Premium - 14 days free) or Standard

  3. Networking (Optional)
     • Keep default public network access for the quickstart

  4. Review and Create
     • Click "Review + create"
     • Click "Create"
     • Wait 3-5 minutes for deployment

Using Azure CLI

# Set variables
RESOURCE_GROUP="rg-databricks-quickstart"
LOCATION="eastus"
WORKSPACE_NAME="databricks-quickstart-$RANDOM"

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

# Create Databricks workspace
az databricks workspace create \
  --resource-group $RESOURCE_GROUP \
  --name $WORKSPACE_NAME \
  --location $LOCATION \
  --sku trial

echo "Workspace: $WORKSPACE_NAME"

🖥️ Step 2: Launch Workspace

  1. Navigate to Workspace
     • Go to your Databricks workspace in the Azure Portal
     • Click "Launch Workspace"
     • You'll be redirected to the Databricks portal

  2. First-Time Setup
     • You may need to sign in with your Azure credentials
     • Accept the terms if prompted

⚙️ Step 3: Create Cluster

Create Compute Cluster

  1. Navigate to Compute
     • Click the "Compute" icon in the left sidebar
     • Click "Create Cluster"

  2. Configure Cluster
     • Cluster Name: "quickstart-cluster"
     • Cluster Mode: Single Node (for quickstart/development)
     • Databricks Runtime: Latest LTS version (e.g., 12.2 LTS)
     • Node Type: Standard_DS3_v2 (or smallest available)
     • Terminate after: 30 minutes of inactivity

  3. Create Cluster
     • Click "Create Cluster"
     • Wait 3-5 minutes for the cluster to start
     • Status changes to "Running" when ready

💡 Tip: Single Node mode is cheaper for learning. Use Standard mode for production workloads.
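
If you prefer to script cluster creation, the same single-node cluster can be created through the Databricks Clusters REST API. The sketch below is illustrative only: the workspace URL and personal access token are placeholders, and the runtime version string should match whatever LTS runtime you picked above.

# Hypothetical sketch: create a single-node cluster via the Clusters API 2.0
import requests

workspace_url = "https://adb-0000000000000000.0.azuredatabricks.net"  # placeholder
token = "dapi-your-personal-access-token"                             # placeholder

cluster_spec = {
    "cluster_name": "quickstart-cluster",
    "spark_version": "12.2.x-scala2.12",   # match the chosen Databricks Runtime
    "node_type_id": "Standard_DS3_v2",
    "autotermination_minutes": 30,
    "num_workers": 0,                       # 0 workers + settings below = single node
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

response = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(response.json())  # returns the new cluster_id on success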

📓 Step 4: Create Your First Notebook

Create Notebook

  1. Navigate to Workspace
     • Click the "Workspace" icon in the left sidebar
     • Click your username folder
     • Click the dropdown arrow > "Create" > "Notebook"

  2. Configure Notebook
     • Name: "Quickstart Tutorial"
     • Default Language: Python
     • Cluster: Select "quickstart-cluster"
     • Click "Create"

Run Your First Code

# Cell 1: Hello Databricks!
print("🔥 Hello from Databricks!")
print(f"Spark version: {spark.version}")
print(f"Running on cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")

Click "Run Cell" or press Shift+Enter

📊 Step 5: Work with DataFrames

Create Sample Data

# Cell 2: Create sample sales data
from pyspark.sql import Row

# Generate sample data
sales_data = [
    Row(order_id=1001, customer_id="C101", product="Laptop", amount=1299.99, order_date="2024-01-15"),
    Row(order_id=1002, customer_id="C102", product="Chair", amount=249.99, order_date="2024-01-15"),
    Row(order_id=1003, customer_id="C101", product="Monitor", amount=399.99, order_date="2024-01-16"),
    Row(order_id=1004, customer_id="C103", product="Desk", amount=549.99, order_date="2024-01-16"),
    Row(order_id=1005, customer_id="C102", product="Keyboard", amount=89.99, order_date="2024-01-17"),
    Row(order_id=1006, customer_id="C104", product="Mouse", amount=39.99, order_date="2024-01-17"),
    Row(order_id=1007, customer_id="C101", product="Lamp", amount=79.99, order_date="2024-01-18"),
    Row(order_id=1008, customer_id="C105", product="Tablet", amount=599.99, order_date="2024-01-18"),
]

# Create DataFrame
df = spark.createDataFrame(sales_data)

# Display DataFrame
display(df)

Basic DataFrame Operations

# Cell 3: Explore the data
print(f"Total orders: {df.count()}")
print(f"\nColumns: {df.columns}")
print(f"\nSchema:")
df.printSchema()

# Cell 4: Filter and aggregate
from pyspark.sql.functions import sum, count

# Calculate total sales
total_sales = df.select(sum("amount")).collect()[0][0]
print(f"Total sales: ${total_sales:,.2f}")

# Sales by customer
customer_sales = df.groupBy("customer_id") \
    .agg(
        count("order_id").alias("order_count"),
        sum("amount").alias("total_spent")
    ) \
    .orderBy("total_spent", ascending=False)

display(customer_sales)
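
Cell 4 aggregates every order. You can also filter rows before (or instead of) aggregating; the 500-dollar threshold below is just an arbitrary example value:

# Filter high-value orders (illustrative threshold)
from pyspark.sql.functions import col

high_value = df.filter(col("amount") > 500)
print(f"Orders over $500: {high_value.count()}")
display(high_value)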

📈 Step 6: Visualizations

Create Charts

# Cell 5: Sales by customer
customer_sales = df.groupBy("customer_id") \
    .agg(sum("amount").alias("total_sales")) \
    .orderBy("total_sales", ascending=False)

display(customer_sales)

After running the cell:

  1. Click the chart icon below the results
  2. Select "Bar Chart"
  3. Configure:
     • Keys: customer_id
     • Values: total_sales
  4. Click "Apply"
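
The built-in chart lives only in the notebook UI. If you want a chart you can control from code, one option is to convert the small aggregated result to pandas and plot it with matplotlib (included in standard Databricks runtimes); this is a sketch, and only safe because customer_sales is tiny:

# Plot the aggregated result with pandas/matplotlib
# Avoid toPandas() on large DataFrames - it collects everything to the driver
import matplotlib.pyplot as plt

pdf = customer_sales.toPandas()
pdf.plot(kind="bar", x="customer_id", y="total_sales", legend=False)
plt.ylabel("Total sales ($)")
plt.title("Sales by customer")
plt.show()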

💾 Step 7: Write Data to Delta Lake

Save as Delta Table

# Cell 6: Write to Delta Lake
# Delta Lake provides ACID transactions and time travel

# Write DataFrame to Delta table
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales_data")

print("✅ Data saved to Delta table: sales_data")

Read from Delta Table

# Cell 7: Read from Delta table
sales_df = spark.table("sales_data")

print(f"Loaded {sales_df.count()} rows from Delta table")
display(sales_df)
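
Because the table is stored in Delta format, every write is versioned. A quick way to see this is to inspect the table history and read an earlier version with time travel:

# Cell: Delta history and time travel
# Show the commit history of the table
display(spark.sql("DESCRIBE HISTORY sales_data"))

# Read the table as it looked at version 0 (the initial write)
v0 = spark.sql("SELECT * FROM sales_data VERSION AS OF 0")
display(v0)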

🔗 Step 8: Connect to ADLS Gen2 (Optional)

If you have an ADLS Gen2 account:

# Cell 8: Configure ADLS Gen2 access
storage_account = "your-storage-account"
container = "data"

# Set Spark configuration for storage access
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    "your-storage-account-key"  # Get from Azure Portal > Access Keys
)

# Read CSV from ADLS Gen2
adls_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales.csv"

df_adls = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(adls_path)

display(df_adls)

Better: Use Service Principal

# Cell 9: Service Principal authentication (recommended for production)
service_principal_id = "your-service-principal-id"
service_principal_secret = "your-service-principal-secret"
tenant_id = "your-tenant-id"

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    service_principal_id
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    service_principal_secret
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)
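
Hardcoding the client secret in a notebook is still not ideal. The usual pattern on Databricks is to store credentials in a secret scope and read them with dbutils.secrets before running the configuration above; the scope and key names below are example names you would have created yourself, not defaults:

# Sketch: read the client secret from a Databricks secret scope instead of hardcoding it
service_principal_secret = dbutils.secrets.get(
    scope="quickstart-scope",   # example scope name
    key="sp-client-secret"      # example key name
)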

🎯 Useful Databricks Features

Magic Commands

# %python - Run Python code (default)
# %sql - Run SQL queries
# %scala - Run Scala code
# %r - Run R code
# %sh - Run shell commands
# %fs - File system commands
# %md - Markdown for documentation

File System Commands

# Cell 10: Explore DBFS (Databricks File System)
# Note: each %fs command must be the first line of its own cell
%fs ls /

# List managed tables (including the sales_data Delta table)
%fs ls /user/hive/warehouse/

# Create a directory
%fs mkdirs /tmp/mydata
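
The same file-system operations are available from Python through dbutils, which is convenient when you want to combine them with other code in a single cell:

# DBFS from Python via dbutils (equivalent to the %fs commands above)
display(dbutils.fs.ls("/"))
display(dbutils.fs.ls("/user/hive/warehouse/"))
dbutils.fs.mkdirs("/tmp/mydata")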

SQL Queries

%sql
-- Cell 11: Query with SQL
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM sales_data
GROUP BY customer_id
ORDER BY total_spent DESC
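
The same query can also be run from Python with spark.sql(), which returns a DataFrame you can keep processing:

# Run the query from Python and capture the result as a DataFrame
top_customers = spark.sql("""
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_spent
    FROM sales_data
    GROUP BY customer_id
    ORDER BY total_spent DESC
""")
display(top_customers)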

💡 Best Practices

Cluster Management

  1. Use auto-termination - Set 30-60 minutes for dev clusters
  2. Right-size clusters - Start small, scale up if needed
  3. Use spot instances - 60-80% cost savings for fault-tolerant workloads
  4. Pools - Pre-warmed VMs for faster cluster starts

Notebook Organization

  1. Use markdown - Document your analysis
  2. Cell structure - One logical operation per cell
  3. Parameters - Use widgets for parameterized notebooks
  4. Version control - Integrate with Git repos

Performance Tips

  1. Cache DataFrames - df.cache() for repeated operations
  2. Partition data - Use partitionBy() for large datasets
  3. Optimize file sizes - 128MB-1GB per file
  4. Use Delta Lake - Better performance than Parquet
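
A minimal sketch applying a few of these tips to the quickstart table (the partition column and the sales_data_partitioned table name are just examples; OPTIMIZE is a Delta Lake command available on Databricks):

# Cache a DataFrame that is reused several times
sales_df = spark.table("sales_data").cache()
sales_df.count()  # first action materializes the cache

# Write a partitioned copy (order_date chosen as an example partition column)
sales_df.write.format("delta").mode("overwrite") \
    .partitionBy("order_date") \
    .saveAsTable("sales_data_partitioned")

# Compact small files in the Delta table
spark.sql("OPTIMIZE sales_data_partitioned")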

🔧 Troubleshooting

Common Issues

Cluster won't start

  • ✅ Check quota in subscription
  • ✅ Verify VM type available in region
  • ✅ Check networking/firewall rules

Out of memory errors

  • ✅ Increase worker node size
  • ✅ Add more workers
  • ✅ Optimize DataFrame operations
  • ✅ Use broadcast joins for small tables
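
For the last point, a broadcast join ships the small table to every executor instead of shuffling the large one; the products lookup table below is made up for illustration:

# Sketch: broadcast a small lookup table in a join
from pyspark.sql.functions import broadcast

products = spark.createDataFrame(
    [("Laptop", "Electronics"), ("Chair", "Furniture")],
    ["product", "category"],
)
orders = spark.table("sales_data")
enriched = orders.join(broadcast(products), on="product", how="left")
display(enriched)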

Slow performance

  • ✅ Check data skew
  • ✅ Optimize partition count
  • ✅ Use Delta Lake optimizations
  • ✅ Enable adaptive query execution
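
Adaptive query execution is enabled by default on recent Databricks runtimes, but it can be toggled explicitly, and its skew-join handling helps with the data-skew point above:

# Enable adaptive query execution and its skew-join mitigation
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")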

Cannot access storage

  • ✅ Verify credentials
  • ✅ Check firewall rules
  • ✅ Ensure managed identity has permissions

🎓 Next Steps

Beginner Practice

  • Load your own CSV data
  • Create more complex SQL queries
  • Build visualizations
  • Schedule a notebook as a job

Intermediate Challenges

  • Implement ETL pipeline
  • Use Delta Lake time travel
  • Create parameterized widgets
  • Integrate with Azure Data Factory

Advanced Topics

  • Build ML models with MLflow
  • Implement streaming with Structured Streaming
  • Use Auto Loader for incremental data
  • Set up CI/CD with Azure DevOps

📚 Additional Resources

Documentation

Next Tutorials

Training

🧹 Cleanup

To avoid Azure charges:

Delete Cluster

  1. Navigate to "Compute"
  2. Click cluster name
  3. Click "Terminate"
  4. Click "Confirm"

Delete Workspace

# Delete resource group
az group delete --name rg-databricks-quickstart --yes --no-wait

Or use Azure Portal:

  1. Navigate to Resource Groups
  2. Select "rg-databricks-quickstart"
  3. Click "Delete resource group"
  4. Type resource group name to confirm
  5. Click "Delete"

🎉 Congratulations!

You've successfully:

✅ Created an Azure Databricks workspace
✅ Launched and configured a Spark cluster
✅ Created and ran interactive notebooks
✅ Processed data with PySpark
✅ Saved data to Delta Lake
✅ Built visualizations

You're ready to build big data solutions with Azure Databricks!


Next Recommended Tutorial: Databricks Notebooks for advanced notebook techniques


Last Updated: January 2025
Tutorial Version: 1.0
Tested with: Databricks Runtime 12.2 LTS