📓 Azure Databricks Notebooks Introduction¶
Master Databricks notebooks for interactive data analysis. Learn notebook features, best practices, and collaboration techniques.
🎯 Learning Objectives¶
After completing this tutorial, you will be able to:
- Understand notebook structure and capabilities
- Use multiple languages in one notebook
- Create interactive visualizations
- Implement notebook widgets for parameters
- Share and collaborate on notebooks
- Use notebook utilities and magic commands
📋 Prerequisites¶
- Databricks workspace - Create one
- Active cluster - Running Spark cluster
- Basic Python/SQL knowledge
📘 What Are Databricks Notebooks?¶
Notebooks are interactive documents that combine:
- Code - Python, Scala, SQL, R
- Visualizations - Charts, graphs, maps
- Markdown - Documentation and explanations
- Results - Output from code execution
Key Features¶
✅ Multi-language support ✅ Collaborative editing ✅ Version control integration ✅ Scheduled execution ✅ Interactive widgets ✅ Export capabilities (HTML, DBC, IPython)
🎨 Notebook Structure¶
Create a New Notebook¶
- Click "Workspace" → Your user folder
- Click dropdown → "Create" → "Notebook"
- Name: "Notebook Tutorial"
- Language: Python
- Cluster: Select your cluster
Cell Types¶
%md
# Markdown Cell
Use markdown for documentation:
- **Bold text**
- *Italic text*
- [Links](https://databricks.com)
- `Code` formatting
## Headers and formatting
🔤 Multi-Language Support¶
Language Magic Commands¶
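Each cell can override the notebook's default language with a magic command on its first line:
- `%python` - run the cell as Python
- `%sql` - run the cell as SQL
- `%scala` - run the cell as Scala
- `%r` - run the cell as R
- `%md` - render the cell as Markdown
- `%sh` - run shell commands on the driver node
- `%fs` - shorthand for `dbutils.fs` file system commands
- `%pip` - install Python libraries scoped to the notebook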
Mixing Languages Example¶
%python
# Cell 5: Create data in Python
sales_data = [
    ("2024-01", 10000),
    ("2024-02", 15000),
    ("2024-03", 12000)
]
df = spark.createDataFrame(sales_data, ["month", "revenue"])
df.createOrReplaceTempView("sales")
%sql
-- Cell 6: Query with SQL
SELECT
    month,
    revenue,
    SUM(revenue) OVER (ORDER BY month) AS running_total
FROM sales
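Because the temporary view lives in the shared Spark session, the same query can be pulled back into Python as a DataFrame; a small follow-up cell (the `sales` view comes from Cell 5):

```python
# Read the SQL result back into a Python DataFrame
running_totals = spark.sql("""
    SELECT month,
           revenue,
           SUM(revenue) OVER (ORDER BY month) AS running_total
    FROM sales
""")
display(running_totals)
```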
📊 Visualizations¶
Display Function¶
# Cell 7: Create sample data
from pyspark.sql.functions import col, rand
# Generate data
df = spark.range(0, 100) \
    .withColumn("category", (col("id") % 5).cast("string")) \
    .withColumn("value", (rand() * 100).cast("int"))
# Display with built-in viz
display(df)
Chart Types¶
After running display():
- Click chart icon
- Select chart type:
  - Bar Chart
  - Line Chart
  - Pie Chart
  - Scatter Plot
  - Map
- Configure axes and aggregations
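The built-in charts are easiest to configure when the data is pre-aggregated; a quick sketch using the `df` from Cell 7:

```python
# Aggregate first, then pick "Bar Chart" from the chart menu below the output
summary = df.groupBy("category").sum("value")
display(summary)
```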
Custom Visualizations¶
%python
# Cell 8: Use matplotlib
import matplotlib.pyplot as plt
import pandas as pd
# Convert to Pandas for plotting
pdf = df.groupBy("category").count().toPandas()
plt.figure(figsize=(10, 6))
plt.bar(pdf['category'], pdf['count'])
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Items by Category')
plt.show()
%python
# Cell 9: Use plotly for interactive charts
import plotly.express as px
# Convert to Pandas
pdf = df.toPandas()
fig = px.scatter(pdf, x="id", y="value", color="category",
                 title="Interactive Scatter Plot")
fig.show()
🎛️ Notebook Widgets¶
Widgets create interactive parameters for notebooks.
Text Widget¶
# Cell 10: Create text widget
dbutils.widgets.text("customer_id", "C101", "Customer ID")
# Get widget value
customer_id = dbutils.widgets.get("customer_id")
print(f"Selected customer: {customer_id}")
Dropdown Widget¶
# Cell 11: Create dropdown
dbutils.widgets.dropdown("category", "Electronics",
                         ["Electronics", "Furniture", "Clothing"],
                         "Product Category")
category = dbutils.widgets.get("category")
print(f"Selected category: {category}")
Multiselect Widget¶
# Cell 12: Create multiselect
dbutils.widgets.multiselect("regions", "US",
                            ["US", "EU", "APAC", "LATAM"],
                            "Regions")
regions = dbutils.widgets.get("regions").split(",")
print(f"Selected regions: {regions}")
Remove Widgets¶
# Cell 13: Remove specific or all widgets
dbutils.widgets.remove("customer_id") # Remove one
# dbutils.widgets.removeAll() # Remove all
🛠️ Notebook Utilities (dbutils)¶
File System Operations¶
%fs ls /

# Cell 14: the same listing with dbutils.fs from Python (a %fs magic needs its own cell)
files = dbutils.fs.ls("/")
for file in files:
    print(f"{file.name} - {file.size} bytes")
Secrets Management¶
# Cell 15: Access secrets
# First, create secret scope (via CLI or API)
# Retrieve secret
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-key")
# Use secret (value hidden in logs)
print(f"Secret retrieved: {'*' * 10}")
Notebook Workflows¶
# Cell 16: Run another notebook
result = dbutils.notebook.run(
"/Users/user@example.com/data-processing",
timeout_seconds=300,
arguments={"date": "2024-01-01"}
)
print(f"Result: {result}")
📝 Best Practices¶
1. Document Your Code¶
%md
# Data Processing Pipeline
## Overview
This notebook processes sales data from ADLS Gen2.
## Steps
1. Load raw data
2. Clean and transform
3. Save to Delta Lake
## Author: Your Name
## Last Updated: 2024-01-09
2. Organize with Sections¶
# === Configuration ===
from pyspark.sql.functions import col

storage_account = "mystorageaccount"
container = "data"

# === Data Loading ===
df = spark.read.option("header", "true").csv(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales.csv"
)
# === Data Processing ===
df_clean = df.filter(col("amount") > 0)
# === Save Results ===
df_clean.write.format("delta").save("/mnt/processed/sales")
3. Use Functions for Reusability¶
# Cell 17: Define reusable functions
def load_data(path, format="csv"):
    """Load data from storage"""
    return spark.read.format(format) \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(path)

def save_to_delta(df, table_name):
    """Save DataFrame to Delta table"""
    df.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable(table_name)
    print(f"✅ Saved to {table_name}")
4. Error Handling¶
# Cell 18: Implement error handling
try:
    df = spark.read.csv("/path/to/file.csv")
    print(f"✅ Loaded {df.count()} rows")
except Exception as e:
    print(f"❌ Error loading data: {str(e)}")
    # Handle the error or log it
🤝 Collaboration Features¶
Comments¶
# Cell 19: Add comments to cells
# Click the comment icon in the cell menu
# Team members can reply to comments
# Use @mention to notify specific users
print("Code with comments for team review")
Version Control¶
- Git Integration
  - Workspace → Settings → Git Integration
  - Connect to Azure DevOps or GitHub
  - Commit, push, pull directly from notebooks
- Revision History
  - File → Revision History
  - See all changes
  - Restore previous versions
Sharing¶
- Share Notebook
  - Click "Share" button
  - Set permissions:
    - Can View
    - Can Run
    - Can Edit
    - Can Manage
- Export Notebook
  - File → Export
  - Formats: DBC, HTML, IPython (.ipynb), Source File
⚡ Advanced Features¶
Parameterized Notebooks¶
# Cell 20: Define parameters
dbutils.widgets.text("date", "2024-01-01")
dbutils.widgets.text("environment", "prod")
# Use in code
process_date = dbutils.widgets.get("date")
env = dbutils.widgets.get("environment")
print(f"Processing data for {process_date} in {env}")
Databricks Connect¶
Run notebooks from local IDE:
# Local development (requires the databricks-connect package)
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="<workspace-url>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>"
).getOrCreate()
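After installing the `databricks-connect` package locally, a one-line check confirms the session is talking to the remote cluster:

```python
# Executes on the remote cluster; the result is returned to the local machine
print(spark.range(10).count())  # expected output: 10
```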
Notebook Jobs¶
Schedule notebooks:
- Workflows → Create Job
- Add notebook task
- Set schedule (cron)
- Configure cluster
- Set parameters
- Enable notifications
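The same job can also be created programmatically. A sketch against the Jobs REST API 2.1, where the workspace URL, token, and cluster ID are placeholders and the request body is assembled from documented fields (treat it as a starting point, not a drop-in script):

```python
import requests

workspace_url = "https://<workspace-url>"   # placeholder
token = "<personal-access-token>"           # placeholder

job_spec = {
    "name": "nightly-notebook-run",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "process",
            "notebook_task": {
                "notebook_path": "/Users/user@example.com/data-processing",
                "base_parameters": {"date": "2024-01-01"},
            },
            "existing_cluster_id": "<cluster-id>",  # placeholder
        }
    ],
}

response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(response.json())
```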
🔧 Troubleshooting¶
Common Issues¶
Cell Won't Execute
- ✅ Check cluster status (running?)
- ✅ Verify cluster attached to notebook
- ✅ Check for syntax errors
Cannot See Widget
- ✅ Run widget creation cell
- ✅ Check if widgets hidden (View → Show Widgets)
- ✅ Clear output and re-run
Import Errors
- ✅ Install library on cluster
- ✅ Use `%pip install package-name` in a notebook cell
- ✅ Restart the Python kernel
Notebook Too Slow
- ✅ Clear unused DataFrames
- ✅ Unpersist cached data
- ✅ Restart cluster
- ✅ Use larger cluster
🎓 Practice Exercises¶
Exercise 1: Interactive Dashboard¶
Create a notebook with:
- Dropdown for date range selection
- Load data based on selection
- Display summary statistics
- Show 3 different visualizations
Exercise 2: Reusable Functions¶
Build a notebook library with:
- Data loading functions
- Transformation utilities
- Validation checks
- Error handling
Exercise 3: Multi-Language Pipeline¶
Create pipeline using:
- Python for data loading
- SQL for aggregations
- Python for ML model
- SQL for results
📚 Additional Resources¶
Next Tutorials¶
- Delta Lake Basics - Work with Delta tables
- Spark SQL Tutorial
- ML on Databricks
🎉 Summary¶
You've learned:
✅ Notebook structure and features ✅ Multi-language support ✅ Visualization techniques ✅ Interactive widgets ✅ Collaboration features ✅ Best practices
Ready to build interactive data applications!
Last Updated: January 2025
Tutorial Version: 1.0