🔥 Azure Databricks Quickstart¶
Get started with Azure Databricks in under an hour. Learn to create a workspace, run Spark notebooks, and process data at scale.
🎯 Learning Objectives¶
After completing this quickstart, you will be able to:
- Understand what Azure Databricks is and its capabilities
- Create a Databricks workspace
- Launch and configure a Spark cluster
- Create and run notebooks with PySpark
- Load and analyze data from ADLS Gen2
- Visualize results with built-in charts
📋 Prerequisites¶
Before starting, ensure you have:
- Azure subscription - Create free account
- Azure Portal access - portal.azure.com
- Basic Python knowledge - Understanding of variables, loops, functions
- ADLS Gen2 account (optional) - Create one or use sample data
🔍 What is Azure Databricks?¶
Azure Databricks is an Apache Spark-based analytics platform optimized for Azure, providing:
- Unified analytics - Data engineering, data science, and ML
- Collaborative notebooks - Interactive development environment
- Auto-scaling clusters - Elastic compute resources
- Delta Lake - Reliable data lakes with ACID transactions
- Integration - Seamless Azure service connectivity
Key Concepts¶
- Workspace: Environment for notebooks, clusters, and data
- Cluster: Set of computation resources (VMs)
- Notebook: Interactive document with code and visualizations
- Delta Lake: Storage layer with reliability and performance
- Job: Scheduled execution of notebooks or JARs
When to Use Databricks¶
✅ Good For:
- Big data processing (TB-PB scale)
- ETL/ELT pipelines
- Machine learning workflows
- Real-time streaming analytics
- Collaborative data science
❌ Not Ideal For:
- Small datasets (<1GB)
- Simple SQL queries (use Synapse Serverless)
- Real-time low-latency apps (use Event Hubs)
🚀 Step 1: Create Databricks Workspace¶
Using Azure Portal¶
1. Navigate to Azure Portal
   - Go to portal.azure.com
   - Click "Create a resource"
   - Search for "Azure Databricks"
   - Click "Create"
2. Configure Basics
   - Subscription: Select your subscription
   - Resource Group: Create new "rg-databricks-quickstart"
   - Workspace Name: "databricks-quickstart-[yourname]"
   - Region: Select the nearest region
   - Pricing Tier: Trial (Premium - 14 days free) or Standard
3. Networking (Optional)
   - Keep the default public network access for the quickstart
4. Review and Create
   - Click "Review + create"
   - Click "Create"
   - Wait 3-5 minutes for deployment
Using Azure CLI¶
```bash
# Set variables
RESOURCE_GROUP="rg-databricks-quickstart"
LOCATION="eastus"
WORKSPACE_NAME="databricks-quickstart-$RANDOM"

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

# Create Databricks workspace
az databricks workspace create \
  --resource-group $RESOURCE_GROUP \
  --name $WORKSPACE_NAME \
  --location $LOCATION \
  --sku trial

echo "Workspace: $WORKSPACE_NAME"
```
🖥️ Step 2: Launch Workspace¶
1. Navigate to Workspace
   - Go to your Databricks workspace in the Azure Portal
   - Click "Launch Workspace"
   - You'll be redirected to the Databricks portal
2. First-Time Setup
   - You may need to sign in with your Azure credentials
   - Accept the terms if prompted
⚙️ Step 3: Create Cluster¶
Create Compute Cluster¶
1. Navigate to Compute
   - Click the "Compute" icon in the left sidebar
   - Click "Create Cluster"
2. Configure Cluster
   - Cluster Name: "quickstart-cluster"
   - Cluster Mode: Single Node (for quickstart/development)
   - Databricks Runtime: Latest LTS version (e.g., 12.2 LTS)
   - Node Type: Standard_DS3_v2 (or the smallest available)
   - Terminate after: 30 minutes of inactivity
3. Create Cluster
   - Click "Create Cluster"
   - Wait 3-5 minutes for the cluster to start
   - Status changes to "Running" when ready
💡 Tip: Single Node mode is cheaper for learning. Use Standard mode for production workloads.
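If you prefer scripting to the UI, the same cluster can be created through the Clusters REST API. A minimal sketch, assuming you have generated a personal access token; the workspace URL and token below are placeholders, and the spec mirrors the UI choices above:

```python
# Sketch: create the quickstart cluster via the Clusters API 2.0.
# DATABRICKS_HOST and DATABRICKS_TOKEN are placeholders for your
# workspace URL and personal access token (User Settings > Access tokens).
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
DATABRICKS_TOKEN = "dapi..."

cluster_spec = {
    "cluster_name": "quickstart-cluster",
    "spark_version": "12.2.x-scala2.12",   # 12.2 LTS
    "node_type_id": "Standard_DS3_v2",
    "autotermination_minutes": 30,
    "num_workers": 0,                       # single-node cluster
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
)
print(resp.json())  # {"cluster_id": "..."} on success
```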
📓 Step 4: Create Your First Notebook¶
Create Notebook¶
1. Navigate to Workspace
   - Click the "Workspace" icon in the left sidebar
   - Click your username folder
   - Click the dropdown arrow > "Create" > "Notebook"
2. Configure Notebook
   - Name: "Quickstart Tutorial"
   - Default Language: Python
   - Cluster: Select "quickstart-cluster"
   - Click "Create"
Run Your First Code¶
```python
# Cell 1: Hello Databricks!
print("🔥 Hello from Databricks!")
print(f"Spark version: {spark.version}")
print(f"Running on cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")
```
Click "Run Cell" or press Shift+Enter
📊 Step 5: Work with DataFrames¶
Create Sample Data¶
```python
# Cell 2: Create sample sales data
from pyspark.sql import Row

# Generate sample data
sales_data = [
    Row(order_id=1001, customer_id="C101", product="Laptop", amount=1299.99, order_date="2024-01-15"),
    Row(order_id=1002, customer_id="C102", product="Chair", amount=249.99, order_date="2024-01-15"),
    Row(order_id=1003, customer_id="C101", product="Monitor", amount=399.99, order_date="2024-01-16"),
    Row(order_id=1004, customer_id="C103", product="Desk", amount=549.99, order_date="2024-01-16"),
    Row(order_id=1005, customer_id="C102", product="Keyboard", amount=89.99, order_date="2024-01-17"),
    Row(order_id=1006, customer_id="C104", product="Mouse", amount=39.99, order_date="2024-01-17"),
    Row(order_id=1007, customer_id="C101", product="Lamp", amount=79.99, order_date="2024-01-18"),
    Row(order_id=1008, customer_id="C105", product="Tablet", amount=599.99, order_date="2024-01-18"),
]

# Create DataFrame
df = spark.createDataFrame(sales_data)

# Display DataFrame
display(df)
```
Basic DataFrame Operations¶
```python
# Cell 3: Explore the data
print(f"Total orders: {df.count()}")
print(f"\nColumns: {df.columns}")
print("\nSchema:")
df.printSchema()
```
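Spark inferred the column types from the `Row` objects above. For anything beyond a quickstart you would normally declare the schema explicitly; a minimal sketch reusing `sales_data`:

```python
# Declare an explicit schema instead of relying on inference
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

sales_schema = StructType([
    StructField("order_id", IntegerType(), False),
    StructField("customer_id", StringType(), False),
    StructField("product", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", StringType(), True),
])

df_typed = spark.createDataFrame(sales_data, schema=sales_schema)
df_typed.printSchema()
```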
```python
# Cell 4: Filter and aggregate
# Note: these imports shadow Python's built-in sum() in this notebook
from pyspark.sql.functions import sum, count

# Calculate total sales
total_sales = df.select(sum("amount")).collect()[0][0]
print(f"Total sales: ${total_sales:,.2f}")

# Sales by customer
customer_sales = df.groupBy("customer_id") \
    .agg(
        count("order_id").alias("order_count"),
        sum("amount").alias("total_spent")
    ) \
    .orderBy("total_spent", ascending=False)

display(customer_sales)
```
📈 Step 6: Visualizations¶
Create Charts¶
```python
# Cell 5: Sales by customer
customer_sales = df.groupBy("customer_id") \
    .agg(sum("amount").alias("total_sales")) \
    .orderBy("total_sales", ascending=False)

display(customer_sales)
```
After running the cell:
1. Click the chart icon below the results
2. Select "Bar Chart"
3. Configure:
   - Keys: customer_id
   - Values: total_sales
4. Click "Apply"
💾 Step 7: Write Data to Delta Lake¶
Save as Delta Table¶
```python
# Cell 6: Write to Delta Lake
# Delta Lake provides ACID transactions and time travel

# Write DataFrame to a Delta table
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales_data")

print("✅ Data saved to Delta table: sales_data")
```
Read from Delta Table¶
```python
# Cell 7: Read from the Delta table
sales_df = spark.table("sales_data")

print(f"Loaded {sales_df.count()} rows from Delta table")
display(sales_df)
```
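Because Delta records every write in a transaction log, you can inspect the table's history and query earlier versions (time travel):

```python
# Show the table's transaction history
display(spark.sql("DESCRIBE HISTORY sales_data"))

# Time travel: read the table as of its first version
first_version = spark.sql("SELECT * FROM sales_data VERSION AS OF 0")
display(first_version)
```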
🔗 Step 8: Connect to ADLS Gen2 (Optional)¶
If you have an ADLS Gen2 account:
```python
# Cell 8: Configure ADLS Gen2 access
storage_account = "your-storage-account"
container = "data"

# Set Spark configuration for storage access
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    "your-storage-account-key"  # Get from Azure Portal > Access Keys
)

# Read CSV from ADLS Gen2
adls_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales.csv"

df_adls = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(adls_path)

display(df_adls)
```
Better: Use Service Principal¶
```python
# Cell 9: Service principal authentication (recommended for production)
service_principal_id = "your-service-principal-id"
service_principal_secret = "your-service-principal-secret"
tenant_id = "your-tenant-id"

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    service_principal_id
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    service_principal_secret
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)
```
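Hardcoding credentials in a notebook is risky. A better pattern is a Databricks secret scope (optionally backed by Azure Key Vault); the scope and key names below are assumptions you would create first with the Databricks CLI:

```python
# Read credentials from a secret scope instead of hardcoding them.
# "quickstart-scope" and the key names are placeholders.
service_principal_id = dbutils.secrets.get(scope="quickstart-scope", key="sp-client-id")
service_principal_secret = dbutils.secrets.get(scope="quickstart-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="quickstart-scope", key="tenant-id")
```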
🎯 Useful Databricks Features¶
Magic Commands¶
```python
# %python - Run Python code (default)
# %sql    - Run SQL queries
# %scala  - Run Scala code
# %r      - Run R code
# %sh     - Run shell commands
# %fs     - File system commands
# %md     - Markdown for documentation
```
File System Commands¶
Explore DBFS (the Databricks File System). Each `%fs` magic must be the first line of its own cell:

```python
%fs ls /
```

List managed (Delta) tables:

```python
%fs ls /user/hive/warehouse/
```

Create a directory:

```python
%fs mkdirs /tmp/mydata
```
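The same operations are available from plain Python via `dbutils.fs`, which can be mixed freely with other code in a cell:

```python
# dbutils.fs is the programmatic equivalent of the %fs magic
for entry in dbutils.fs.ls("/"):
    print(entry.path)

dbutils.fs.mkdirs("/tmp/mydata")
```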
SQL Queries¶
```sql
%sql
-- Cell 11: Query with SQL (the %sql magic must be the first line of the cell)
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
FROM sales_data
GROUP BY customer_id
ORDER BY total_spent DESC
```
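The same query can run from a Python cell with `spark.sql()`, which returns a DataFrame you can keep transforming:

```python
# The same aggregation from Python via spark.sql()
result = spark.sql("""
    SELECT
        customer_id,
        COUNT(*)    AS order_count,
        SUM(amount) AS total_spent
    FROM sales_data
    GROUP BY customer_id
    ORDER BY total_spent DESC
""")
display(result)
```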
💡 Best Practices¶
Cluster Management¶
- Use auto-termination - Set 30-60 minutes for dev clusters
- Right-size clusters - Start small, scale up if needed
- Use spot instances - 60-80% cost savings for fault-tolerant workloads
- Pools - Pre-warmed VMs for faster cluster starts
Notebook Organization¶
- Use markdown - Document your analysis
- Cell structure - One logical operation per cell
- Parameters - Use widgets for parameterized notebooks (see the sketch after this list)
- Version control - Integrate with Git repos
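For the Parameters tip above, a minimal widget sketch; the widget name and default value are illustrative:

```python
# Define a text widget and read its value
dbutils.widgets.text("customer_id", "C101", "Customer ID")

selected = dbutils.widgets.get("customer_id")
display(df.filter(df.customer_id == selected))
```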
Performance Tips¶
- Cache DataFrames - `df.cache()` for repeated operations (see the sketch below)
- Partition data - use `partitionBy()` for large datasets
- Optimize file sizes - aim for 128 MB-1 GB per file
- Use Delta Lake - better performance than plain Parquet
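A minimal sketch of the caching and partitioning tips, reusing the `df` from earlier cells; the partitioned table name is illustrative:

```python
# Cache a DataFrame you will reuse across several actions
df.cache()
df.count()  # the first action materializes the cache

# Partition on write so downstream reads can prune by order_date
df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .saveAsTable("sales_data_partitioned")
```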
🔧 Troubleshooting¶
Common Issues¶
Cluster won't start
- ✅ Check quota in subscription
- ✅ Verify VM type available in region
- ✅ Check networking/firewall rules
Out of memory errors
- ✅ Increase worker node size
- ✅ Add more workers
- ✅ Optimize DataFrame operations
- ✅ Use broadcast joins for small tables
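For the broadcast-join tip, a small sketch; `dim_products` is a hypothetical lookup table:

```python
from pyspark.sql.functions import broadcast

# Hypothetical small dimension table
dim_products = spark.createDataFrame(
    [("Laptop", "Electronics"), ("Chair", "Furniture")],
    ["product", "category"],
)

# Broadcasting ships the small table to every executor,
# avoiding a shuffle of the large side of the join
joined = df.join(broadcast(dim_products), on="product", how="left")
display(joined)
```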
Slow performance
- ✅ Check data skew
- ✅ Optimize partition count
- ✅ Use Delta Lake optimizations
- ✅ Enable adaptive query execution
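Adaptive query execution is enabled by default on recent runtimes, but the settings can be toggled explicitly; these are standard Spark SQL configs:

```python
# Enable adaptive query execution and automatic partition coalescing
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# For skewed keys, explicit repartitioning can also help
balanced = df.repartition(8, "customer_id")
print(balanced.rdd.getNumPartitions())
```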
Cannot access storage
- ✅ Verify credentials
- ✅ Check firewall rules
- ✅ Ensure managed identity has permissions
🎓 Next Steps¶
Beginner Practice¶
- Load your own CSV data
- Create more complex SQL queries
- Build visualizations
- Schedule a notebook as a job
Intermediate Challenges¶
- Implement ETL pipeline
- Use Delta Lake time travel
- Create parameterized widgets
- Integrate with Azure Data Factory
Advanced Topics¶
- Build ML models with MLflow
- Implement streaming with Structured Streaming
- Use Auto Loader for incremental data
- Set up CI/CD with Azure DevOps
📚 Additional Resources¶
Documentation¶
- Azure Databricks documentation (Microsoft Learn)
- Apache Spark documentation
- Delta Lake documentation
Next Tutorials¶
- Databricks Notebooks - Deep dive into notebooks
- Delta Lake Basics - Learn Delta Lake features
- Data Engineer Path
Training¶
- Databricks Academy - Free courses
- Databricks Community Edition - Free tier
- Microsoft Learn Databricks
🧹 Cleanup¶
To avoid Azure charges:
Delete Cluster¶
- Navigate to "Compute"
- Click cluster name
- Click "Terminate"
- Click "Confirm"
Delete Workspace¶
Deleting the resource group removes the workspace and everything in it. In the Azure Portal:
1. Navigate to Resource Groups
2. Select "rg-databricks-quickstart"
3. Click "Delete resource group"
4. Type the resource group name to confirm
5. Click "Delete"
🎉 Congratulations!¶
You've successfully:
✅ Created an Azure Databricks workspace
✅ Launched and configured a Spark cluster
✅ Created and ran interactive notebooks
✅ Processed data with PySpark
✅ Saved data to Delta Lake
✅ Built visualizations
You're ready to build big data solutions with Azure Databricks!
Next Recommended Tutorial: Databricks Notebooks for advanced notebook techniques
Last Updated: January 2025
Tutorial Version: 1.0
Tested with: Databricks Runtime 12.2 LTS