📓 Azure Databricks Notebooks Introduction

Master Databricks notebooks for interactive data analysis. Learn notebook features, best practices, and collaboration techniques.

🎯 Learning Objectives

After completing this tutorial, you will be able to:

  • Understand notebook structure and capabilities
  • Use multiple languages in one notebook
  • Create interactive visualizations
  • Implement notebook widgets for parameters
  • Share and collaborate on notebooks
  • Use notebook utilities and magic commands

📋 Prerequisites

  • Databricks workspace - Create one if you don't already have one
  • Active cluster - A running Spark cluster attached to the notebook
  • Basic Python/SQL knowledge

📘 What Are Databricks Notebooks?

Notebooks are interactive documents that combine:

  • Code - Python, Scala, SQL, R
  • Visualizations - Charts, graphs, maps
  • Markdown - Documentation and explanations
  • Results - Output from code execution

Key Features

  • ✅ Multi-language support
  • ✅ Collaborative editing
  • ✅ Version control integration
  • ✅ Scheduled execution
  • ✅ Interactive widgets
  • ✅ Export capabilities (HTML, DBC, IPython)

🎨 Notebook Structure

Create a New Notebook

  1. Click "Workspace" → Your user folder
  2. Click dropdown → "Create" → "Notebook"
  3. Name: "Notebook Tutorial"
  4. Language: Python
  5. Cluster: Select your cluster

Cell Types

# Code Cell (Default)
print("This is a code cell")
result = 2 + 2
print(f"Result: {result}")
%md
# Markdown Cell

Use markdown for documentation:
- **Bold text**
- *Italic text*
- [Links](https://databricks.com)
- `Code` formatting

## Headers and formatting

🔤 Multi-Language Support

Language Magic Commands

# Cell 1: Python (default)
%python
data = [1, 2, 3, 4, 5]
print(f"Python: Sum = {sum(data)}")

-- Cell 2: SQL
%sql
SELECT 'Hello from SQL' AS message

// Cell 3: Scala
%scala
val numbers = List(1, 2, 3, 4, 5)
println(s"Scala: Sum = ${numbers.sum}")

# Cell 4: R
%r
numbers <- c(1, 2, 3, 4, 5)
cat(sprintf("R: Sum = %d", sum(numbers)))

Mixing Languages Example

# Cell 5: Create data in Python
%python
sales_data = [
    ("2024-01", 10000),
    ("2024-02", 15000),
    ("2024-03", 12000)
]

df = spark.createDataFrame(sales_data, ["month", "revenue"])
df.createOrReplaceTempView("sales")
-- Cell 6: Query with SQL
%sql
SELECT
    month,
    revenue,
    SUM(revenue) OVER (ORDER BY month) as running_total
FROM sales

📊 Visualizations

Display Function

# Cell 7: Create sample data
from pyspark.sql.functions import col, rand

# Generate data
df = spark.range(0, 100) \
    .withColumn("category", (col("id") % 5).cast("string")) \
    .withColumn("value", (rand() * 100).cast("int"))

# Display with built-in viz
display(df)

Chart Types

After running display():

  1. Click the chart icon below the results
  2. Select a chart type:
     • Bar Chart
     • Line Chart
     • Pie Chart
     • Scatter Plot
     • Map
  3. Configure axes and aggregations

Custom Visualizations

# Cell 8: Use matplotlib
%python
import matplotlib.pyplot as plt
import pandas as pd

# Convert to Pandas for plotting
pdf = df.groupBy("category").count().toPandas()

plt.figure(figsize=(10, 6))
plt.bar(pdf['category'], pdf['count'])
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Items by Category')
plt.show()

# Cell 9: Use plotly for interactive charts
%python
import plotly.express as px

# Convert to Pandas
pdf = df.toPandas()

fig = px.scatter(pdf, x="id", y="value", color="category",
                 title="Interactive Scatter Plot")
fig.show()

🎛️ Notebook Widgets

Widgets create interactive parameters for notebooks.

Text Widget

# Cell 10: Create text widget
dbutils.widgets.text("customer_id", "C101", "Customer ID")

# Get widget value
customer_id = dbutils.widgets.get("customer_id")
print(f"Selected customer: {customer_id}")
# Cell 11: Create dropdown
dbutils.widgets.dropdown("category", "Electronics",
                         ["Electronics", "Furniture", "Clothing"],
                         "Product Category")

category = dbutils.widgets.get("category")
print(f"Selected category: {category}")

Multiselect Widget

# Cell 12: Create multiselect
dbutils.widgets.multiselect("regions", "US",
                            ["US", "EU", "APAC", "LATAM"],
                            "Regions")

regions = dbutils.widgets.get("regions").split(",")
print(f"Selected regions: {regions}")

Remove Widgets

# Cell 13: Remove specific or all widgets
dbutils.widgets.remove("customer_id")  # Remove one
# dbutils.widgets.removeAll()  # Remove all

🛠️ Notebook Utilities (dbutils)

File System Operations

%fs ls /

# Note: %fs must be the first line of its own cell.
# Cell 14: the same listing via dbutils in a regular Python cell
files = dbutils.fs.ls("/")
for file in files:
    print(f"{file.name} - {file.size} bytes")

Secrets Management

# Cell 15: Access secrets
# First, create secret scope (via CLI or API)

# Retrieve secret
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-key")

# Databricks redacts secret values in notebook output, so just confirm retrieval
print(f"Secret retrieved: {'*' * 10}")

Notebook Workflows

# Cell 16: Run another notebook
result = dbutils.notebook.run(
    "/Users/user@example.com/data-processing",  # notebook path
    300,                                        # timeout in seconds
    {"date": "2024-01-01"}                      # parameters, passed as widget values
)
print(f"Result: {result}")

📝 Best Practices

1. Document Your Code

%md
# Data Processing Pipeline

## Overview
This notebook processes sales data from ADLS Gen2.

## Steps
1. Load raw data
2. Clean and transform
3. Save to Delta Lake

## Author: Your Name
## Last Updated: 2024-01-09

2. Organize with Sections

# === Configuration ===
storage_account = "mystorageaccount"
container = "data"

# === Data Loading ===
from pyspark.sql.functions import col

df = spark.read.csv(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales.csv",
    header=True, inferSchema=True
)

# === Data Processing ===
df_clean = df.filter(col("amount") > 0)

# === Save Results ===
df_clean.write.format("delta").save("/mnt/processed/sales")

3. Use Functions for Reusability

# Cell 17: Define reusable functions
def load_data(path, format="csv"):
    """Load data from storage"""
    return spark.read.format(format) \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(path)

def save_to_delta(df, table_name):
    """Save DataFrame to Delta table"""
    df.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable(table_name)
    print(f"✅ Saved to {table_name}")

4. Error Handling

# Cell 18: Implement error handling
try:
    df = spark.read.csv("/path/to/file.csv")
    print(f"✅ Loaded {df.count()} rows")
except Exception as e:
    print(f"❌ Error loading data: {str(e)}")
    # Handle error or log

🤝 Collaboration Features

Comments

# Cell 19: Add comments to cells
# Click the comment icon in the cell menu
# Team members can reply to comments
# Use @mention to notify specific users

print("Code with comments for team review")

Version Control

  1. Git Integration
     • Workspace → Settings → Git Integration
     • Connect to Azure DevOps or GitHub
     • Commit, push, and pull directly from notebooks

  2. Revision History
     • File → Revision History
     • See all changes
     • Restore previous versions

Sharing

  1. Share Notebook
     • Click the "Share" button
     • Set permissions: Can View, Can Run, Can Edit, Can Manage

  2. Export Notebook
     • File → Export
     • Formats: DBC, HTML, IPython notebook, Source File

⚡ Advanced Features

Parameterized Notebooks

# Cell 20: Define parameters
dbutils.widgets.text("date", "2024-01-01")
dbutils.widgets.text("environment", "prod")

# Use in code
process_date = dbutils.widgets.get("date")
env = dbutils.widgets.get("environment")

print(f"Processing data for {process_date} in {env}")

Databricks Connect

Run notebooks from local IDE:

# Local development
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder \
    .remote(
        host="<workspace-url>",
        token="<personal-access-token>",
        cluster_id="<cluster-id>"
    ) \
    .getOrCreate()
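A note on setup: this requires the databricks-connect package (`pip install databricks-connect`), and the package version should match your cluster's Databricks Runtime version.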

Notebook Jobs

Schedule notebooks:

  1. Workflows → Create Job
  2. Add notebook task
  3. Set schedule (cron)
  4. Configure cluster
  5. Set parameters
  6. Enable notifications
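The same job can also be created programmatically. A minimal sketch using the databricks-sdk Python package; the job name, notebook path, cluster ID, and cron expression are placeholder assumptions:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Reads host and token from the environment or ~/.databrickscfg
w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-notebook",  # placeholder name
    tasks=[
        jobs.Task(
            task_key="main",
            existing_cluster_id="<cluster-id>",  # placeholder
            notebook_task=jobs.NotebookTask(
                notebook_path="/Users/user@example.com/data-processing",
                base_parameters={"date": "2024-01-01"},
            ),
        )
    ],
    # Quartz cron expression: every day at 06:00 UTC
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")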

🔧 Troubleshooting

Common Issues

Cell Won't Execute

  • ✅ Check cluster status (running?)
  • ✅ Verify cluster attached to notebook
  • ✅ Check for syntax errors

Cannot See Widget

  • ✅ Run widget creation cell
  • ✅ Check if widgets hidden (View → Show Widgets)
  • ✅ Clear output and re-run

Import Errors

  • ✅ Install library on cluster
  • ✅ Use %pip install package-name (see the sketch below)
  • ✅ Restart Python kernel
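For example, installing a missing package (plotly here is just an example name) takes two cells:

# Cell A: %pip must be the first line of its own cell
%pip install plotly

# Cell B: restart the Python process so the new package is importable
dbutils.library.restartPython()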

Notebook Too Slow

  • ✅ Clear unused DataFrames
  • ✅ Unpersist cached data (see the sketch below)
  • ✅ Restart cluster
  • ✅ Use larger cluster
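To release memory explicitly, drop cached data from the running session; a small sketch:

# Release one cached DataFrame
df.unpersist()

# Or clear everything cached in the current Spark session
spark.catalog.clearCache()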

🎓 Practice Exercises

Exercise 1: Interactive Dashboard

Create a notebook with:

  • Dropdown for date range selection
  • Load data based on selection
  • Display summary statistics
  • Show 3 different visualizations

Exercise 2: Reusable Functions

Build a notebook library with:

  • Data loading functions
  • Transformation utilities
  • Validation checks
  • Error handling

Exercise 3: Multi-Language Pipeline

Create a pipeline using:

  • Python for data loading
  • SQL for aggregations
  • Python for ML model
  • SQL for results


🎉 Summary

You've learned:

  • ✅ Notebook structure and features
  • ✅ Multi-language support
  • ✅ Visualization techniques
  • ✅ Interactive widgets
  • ✅ Collaboration features
  • ✅ Best practices

Ready to build interactive data applications!


Last Updated: January 2025 · Tutorial Version: 1.0