Databricks Best Practices for CSA-in-a-Box¶
See also: generic Azure reference
For service-agnostic deep-dive content on Azure Databricks — architecture, feature reference, code samples, and patterns independent of CSA-in-a-Box — see Azure Databricks in the reference library.
Overview¶
Azure Databricks is the primary heavy-compute engine for CSA-in-a-Box (ADR-0002). It powers medallion transformations, large-scale enrichment, ML feature engineering, and dbt-driven analytics across Azure Commercial and Azure Government tenants.
This guide distills operational best practices into actionable patterns so that every workspace, cluster, notebook, and pipeline follows the same hardened baseline. It is meant to be read alongside the Databricks Guide (setup/operations) and the Performance Tuning reference (deep Delta/Spark tuning).
flowchart LR
subgraph Governance
UC[Unity Catalog]
KV[Key Vault]
end
subgraph Compute
IC[Interactive Cluster]
JC[Job Cluster]
SW[SQL Warehouse]
end
subgraph Storage
B[Bronze]
S[Silver]
G[Gold]
end
UC --> Compute
KV --> Compute
Compute --> Storage
Workspace Organization¶
Folder Structure¶
Organize each Databricks workspace with a standard folder tree so that any engineer can orient themselves in seconds.
/Workspace
/Shared
/notebooks
/bronze # Raw-to-cleansed ingestion
/silver # Conformance, dedup, SCD
/gold # Business aggregations, star schemas
/utilities # Reusable helpers (logging, audit, config)
/orchestration # Master job notebooks, dbt runners
/libs # Shared .whl / .jar / init scripts
/config # Environment-specific JSON/YAML configs
/Repos # Git-backed repos (recommended over Shared)
/Users
/<user@domain> # Personal scratch notebooks — never promoted to prod
Naming Conventions¶
| Artifact | Pattern | Example |
|---|---|---|
| ETL notebook | <source>_to_<target>_spark.py | erp_to_silver_spark.py |
| Utility notebook | <action>_<subject>.py | optimize_delta_tables.py |
| Domain notebook | <domain>_<action>_<subject>.py | sales_daily_transform.py |
| Job name | [env]-[domain]-[layer]-[action] | prod-sales-gold-aggregate |
| Cluster policy | policy-[env]-[tier] | policy-prod-standard |
| Secret scope | scope-[env]-[purpose] | scope-prod-adls |
Workspace-Level vs. Cluster-Level Config¶
| Setting | Where to Set | Why |
|---|---|---|
| Unity Catalog metastore | Workspace | One metastore per workspace, set by admin |
| Default catalog / schema | Cluster Spark config | Varies by environment and job |
| ADLS OAuth credentials | Cluster Spark config | Scoped to the cluster's managed identity |
| Delta auto-optimize flags | Cluster Spark config | Consistent across all notebooks on cluster |
| Notebook-specific widgets | Notebook | Per-run parameterization |
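A minimal sketch of the session-level equivalents, runnable from a notebook. The storage account name and the secret keys (tenant-id, sp-client-id, sp-client-secret) are illustrative assumptions; on shared clusters these values belong in the cluster policy's Spark config rather than notebook code.
# Sketch only — storage account name and secret keys below are assumptions.
storage_account = "csainaboxadls"
tenant_id = dbutils.secrets.get(scope="scope-prod-adls", key="tenant-id")
client_id = dbutils.secrets.get(scope="scope-prod-adls", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="scope-prod-adls", key="sp-client-secret")
# Default catalog / schema for this session
spark.sql("USE CATALOG unity_dev")
spark.sql("USE SCHEMA silver")
# ADLS OAuth via service principal — standard ABFS OAuth configuration keys
base = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{base}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{base}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{base}", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{base}",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)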
Multi-Environment Setup¶
One workspace per environment
Use separate workspaces for dev, staging, and prod. This gives hard network and IAM boundaries, prevents accidental writes to production storage, and aligns with Unity Catalog workspace bindings.
graph LR
DEV[Dev Workspace] -->|promote via CI/CD| STG[Staging Workspace]
STG -->|release gate| PROD[Prod Workspace]
UC[Unity Catalog Metastore] --- DEV
UC --- STG
UC --- PROD
| Environment | Workspace SKU | Purpose | Who has access |
|---|---|---|---|
| Dev | Premium | Exploration, prototyping | All engineers |
| Staging | Premium | Integration testing | Engineers + CI bot |
| Prod | Premium | Production workloads | CI bot + on-call only |
Cluster Configuration¶
Job Clusters vs. All-Purpose Clusters¶
| Criterion | Job Cluster | All-Purpose Cluster |
|---|---|---|
| Use case | Scheduled / CI pipelines | Ad-hoc exploration, dev |
| Lifecycle | Spins up per run, dies | Long-running, user-managed |
| Cost model | Pay only during run | Pay while idle too |
| Recommended for | All production work | Dev / debugging only |
Never use all-purpose clusters in production
All-purpose clusters remain running between jobs, waste DBUs, and lack the deterministic environment guarantees that job clusters provide.
Autoscaling Policies¶
| Workload Type | Min Workers | Max Workers | Rationale |
|---|---|---|---|
| Light ETL | 1 | 4 | Small data, fast finish |
| Standard ETL | 2 | 8 | Balanced cost vs. throughput |
| Heavy transform | 4 | 16 | Large shuffles, complex joins |
| ML training | 4 | 32 | GPU-intensive or wide hyperparameter sweeps |
| Streaming | 2 | 8 | Steady state with burst headroom |
Set spark.databricks.aggressiveWindowDownS to 120
This tells the autoscaler to wait 2 minutes of low utilization before removing workers, preventing thrashing on bursty workloads.
Spot Instances (Azure Spot VMs)¶
| Do | Don't |
|---|---|
| Use spot for worker nodes (up to 80% savings) | Use spot for the driver node |
| Set first_on_demand: 1 so the driver is always on-demand | Rely on 100% spot for latency-sensitive jobs |
| Set spot bid to max_price: -1 (pay market rate) | Set a fixed bid that causes constant evictions |
| Target 50-70% spot ratio for production jobs | Exceed 80% spot — revocation risk rises sharply |
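A minimal sketch of how the autoscaling band, the scale-down window, and the spot guidance above come together in a Jobs API new_cluster spec; the node type and exact values are illustrative assumptions, not a mandated baseline.
# Sketch of a job-cluster spec (Python dict for the Jobs API); values are examples.
standard_etl_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},   # Standard ETL band
    "spark_conf": {
        # Wait 120 s of low utilization before removing workers (avoids thrashing)
        "spark.databricks.aggressiveWindowDownS": "120",
    },
    "azure_attributes": {
        "first_on_demand": 1,                      # driver always on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,                  # pay current market rate
    },
}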
Photon Acceleration¶
Photon is a vectorized C++ query engine that accelerates SQL and DataFrame workloads on Delta Lake.
When to use Photon:
- SQL-heavy Silver-to-Gold transformations
- Large JOIN, GROUP BY, WINDOW operations
- Aggregation-heavy Gold tables
- Delta MERGE / UPDATE / DELETE at scale
When to skip Photon:
- Pure Python / pandas UDF workloads (Photon cannot accelerate UDFs)
- ML training notebooks (use ML Runtime instead)
- Small data (< 10 GB) where startup overhead exceeds savings
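Photon is switched on per cluster. A minimal sketch, assuming the same job-cluster spec pattern as above, where the Clusters API runtime_engine field selects the engine:
# Sketch: enable Photon on a job cluster via the runtime_engine field.
photon_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "runtime_engine": "PHOTON",   # "STANDARD" to opt out
}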
DBR Version Pinning¶
Always pin to an LTS release
Never use latest or non-LTS runtimes in production. Pin the exact version in your cluster policy and job definitions.
{
"dbus_per_hour": { "type": "unlimited" },
"spark_version": {
"type": "fixed",
"value": "14.3.x-scala2.12"
}
}
LTS cadence: Databricks LTS versions are supported for ~2 years. Upgrade during a planned maintenance window, not reactively.
Init Scripts¶
Use init scripts sparingly. Prefer cluster-scoped scripts stored in Unity Catalog Volumes over workspace or DBFS scripts.
#!/bin/bash
# /Volumes/main/shared/init-scripts/install_deps.sh
pip install --quiet great-expectations==0.18.* pyarrow==15.*
echo "Init script completed at $(date)"
Never put secrets in init scripts
Init scripts are stored in plain text. Use Databricks secret scopes backed by Azure Key Vault instead (see Secret Management).
Cluster Policy Example¶
{
"name": "policy-prod-standard",
"definition": {
"spark_version": {
"type": "fixed",
"value": "14.3.x-scala2.12"
},
"node_type_id": {
"type": "allowlist",
"values": ["Standard_DS3_v2", "Standard_DS4_v2", "Standard_DS5_v2"]
},
"driver_node_type_id": {
"type": "fixed",
"value": "Standard_DS4_v2"
},
"autoscale.min_workers": {
"type": "range",
"minValue": 1,
"maxValue": 4
},
"autoscale.max_workers": {
"type": "range",
"minValue": 2,
"maxValue": 16
},
"autotermination_minutes": {
"type": "range",
"minValue": 10,
"maxValue": 60,
"defaultValue": 30
},
"azure_attributes.first_on_demand": {
"type": "fixed",
"value": 1
},
"azure_attributes.spot_bid_max_price": {
"type": "fixed",
"value": -1
},
"custom_tags.Environment": {
"type": "fixed",
"value": "production"
},
"custom_tags.CostCenter": {
"type": "required"
},
"spark_conf.spark.databricks.delta.optimizeWrite.enabled": {
"type": "fixed",
"value": "true"
},
"spark_conf.spark.databricks.delta.autoCompact.enabled": {
"type": "fixed",
"value": "true"
}
}
}
Notebook Best Practices¶
Parameterized Notebooks with dbutils.widgets¶
Use widgets to make notebooks reusable across environments and domains.
# Define widgets at the top of every notebook
dbutils.widgets.text("environment", "dev", "Environment")
dbutils.widgets.text("domain", "sales", "Domain")
dbutils.widgets.dropdown("mode", "incremental", ["full", "incremental"], "Mode")
dbutils.widgets.text("start_date", "", "Start Date (optional)")
# Read values
env = dbutils.widgets.get("environment")
domain = dbutils.widgets.get("domain")
mode = dbutils.widgets.get("mode")
%run vs. dbutils.notebook.run¶
| Feature | %run | dbutils.notebook.run |
|---|---|---|
| Execution context | Same context (shares variables) | Separate context (isolated) |
| Return value | No | Yes (string) |
| Timeout control | No | Yes (timeout_seconds) |
| Error isolation | Failure crashes caller | Failure returns error, caller lives |
| Best for | Loading shared utility code | Orchestrating independent steps |
Use %run for imports, dbutils.notebook.run for orchestration
Load shared functions with %run ./utilities/common_functions at the top of your notebook. Use dbutils.notebook.run to chain independent ETL steps so that a failure in step 2 does not corrupt state from step 1.
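A minimal sketch of that orchestration pattern; the notebook paths, arguments, and the "success" return value (set by the child via dbutils.notebook.exit) are assumptions.
# Sketch — each step runs in its own context; a failure in step 2 cannot
# corrupt state from step 1. Paths and arguments are illustrative.
result = dbutils.notebook.run(
    "./bronze/erp_to_bronze_spark",                     # path
    3600,                                               # timeout_seconds
    {"environment": "prod", "mode": "incremental"},     # arguments
)
if result != "success":   # assumes the child calls dbutils.notebook.exit("success")
    raise RuntimeError(f"Bronze step failed: {result}")
dbutils.notebook.run(
    "./silver/erp_to_silver_spark",
    3600,
    {"environment": "prod", "mode": "incremental"},
)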
Notebook Workflows vs. Jobs API¶
| Approach | When to Use |
|---|---|
| Single-notebook Job | Simple ETL with one stage |
| Multi-task Job (DAG) | Production pipelines with dependencies |
| dbutils.notebook.run | Quick prototyping, small fan-out |
| ADF / Airflow | Cross-service orchestration (storage + Databricks + Synapse) |
Modular Code Organization¶
# Recommended project layout in Repos
/my-domain-repo
/src
/transforms
bronze.py # Bronze-layer logic
silver.py # Silver-layer logic
gold.py # Gold-layer logic
/utils
config.py # Config loader (reads widgets + env)
logging.py # Structured logging helpers
delta_helpers.py # OPTIMIZE, VACUUM, merge wrappers
/notebooks
main_pipeline.py # Orchestration notebook — calls src/
adhoc_exploration.py # Scratch — never promoted
/tests
test_transforms.py # Unit tests runnable via nutter or pytest
pyproject.toml
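A minimal sketch of how main_pipeline.py might call into src/ from a Repos-backed notebook; the load_config helper and silver.run entry point are hypothetical names implied by the layout above.
# Sketch — module and function names below are hypothetical.
import sys
# Notebooks in Repos usually see the repo root on sys.path already; append the
# src/ directory explicitly if imports fail (path is an example).
sys.path.append("/Workspace/Repos/<user@domain>/my-domain-repo/src")
from utils.config import load_config      # hypothetical config loader
from transforms import silver             # hypothetical silver-layer module
cfg = load_config(env=dbutils.widgets.get("environment"))
silver.run(spark, cfg)                    # hypothetical entry point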
Delta Lake Optimization¶
For deep-dive tuning, see Performance Tuning.
OPTIMIZE + Z-ORDER Scheduling¶
-- Run nightly after ingestion completes
OPTIMIZE unity_catalog.sales.transactions
ZORDER BY (tenant_id, transaction_date);
-- Check optimization history
DESCRIBE HISTORY unity_catalog.sales.transactions;
| Do | Don't |
|---|---|
| Schedule OPTIMIZE nightly or after large batches | Run OPTIMIZE after every micro-batch |
| Z-ORDER on 2-3 high-cardinality filter columns | Z-ORDER on > 4 columns (dilutes benefit) |
| Z-ORDER on columns used in WHERE and JOIN | Z-ORDER on columns only used in SELECT |
Liquid Clustering (DBR 13.3+)¶
Liquid clustering replaces both partitioning and Z-ORDER with adaptive, incremental clustering. Prefer it for new tables on DBR 13.3 LTS or later.
-- New table with liquid clustering
CREATE TABLE catalog.schema.events (
event_id BIGINT,
tenant_id STRING,
event_date DATE,
payload STRING
)
CLUSTER BY (tenant_id, event_date);
-- Migrate existing partitioned table
ALTER TABLE catalog.schema.events
CLUSTER BY (tenant_id, event_date);
-- Trigger incremental clustering
OPTIMIZE catalog.schema.events;
Liquid clustering vs. Z-ORDER
Liquid clustering is incremental (only rearranges new/changed data), whereas Z-ORDER rewrites all files. For append-heavy Bronze/Silver tables, liquid clustering avoids expensive full rewrites.
VACUUM Scheduling¶
-- Default retention: 7 days (168 hours)
VACUUM catalog.schema.transactions;
-- Explicit retention for compliance (30 days)
VACUUM catalog.schema.transactions RETAIN 720 HOURS;
Never set retention below 7 days
Concurrent readers may still reference old files. Setting retention below 7 days (168 HOURS) risks breaking time-travel queries and active reads.
Auto-Optimize Settings¶
Enable these at the cluster level (via cluster policy) so every write benefits automatically.
| Setting | Spark Config Key | Effect |
|---|---|---|
| Optimized Writes | spark.databricks.delta.optimizeWrite.enabled = true | Coalesces small partitions on write |
| Auto Compaction | spark.databricks.delta.autoCompact.enabled = true | Runs mini-OPTIMIZE after writes |
Delta Table Properties Checklist¶
- delta.autoOptimize.optimizeWrite = true
- delta.autoOptimize.autoCompact = true
- delta.logRetentionDuration = 30 days (or per compliance)
- delta.deletedFileRetentionDuration = 7 days (minimum)
- delta.enableChangeDataFeed = true (if downstream CDC consumers exist)
- delta.columnMapping.mode = name (for schema evolution)
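A minimal sketch of applying the checklist above to an existing table from a notebook; the table name is an example, and retention values should follow your compliance requirements.
# Sketch — table name and retention values are examples.
spark.sql("""
    ALTER TABLE catalog.schema.transactions SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite'   = 'true',
        'delta.autoOptimize.autoCompact'     = 'true',
        'delta.logRetentionDuration'         = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days',
        'delta.enableChangeDataFeed'         = 'true',
        'delta.columnMapping.mode'           = 'name',
        -- column mapping may require a protocol upgrade on older tables
        'delta.minReaderVersion'             = '2',
        'delta.minWriterVersion'             = '5'
    )
""")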
Change Data Feed¶
Enable Change Data Feed (CDF) when downstream consumers need row-level change events (inserts, updates, deletes) without scanning full tables.
# Read changes since a specific version
changes_df = (
spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 42)
.table("catalog.schema.customers")
)
# Filter to updates only
updates = changes_df.filter("_change_type = 'update_postimage'")
dbt on Databricks¶
Adapter Configuration¶
# profiles.yml
csa_inabox:
target: dev
outputs:
dev:
type: databricks
catalog: unity_dev
schema: "{{ env_var('DBT_SCHEMA', 'dbt_dev') }}"
host: "{{ env_var('DATABRICKS_HOST') }}"
http_path: "{{ env_var('DATABRICKS_HTTP_PATH') }}"
token: "{{ env_var('DATABRICKS_TOKEN') }}"
threads: 4
connect_retries: 3
prod:
type: databricks
catalog: unity_prod
schema: gold
host: "{{ env_var('DATABRICKS_HOST') }}"
http_path: "{{ env_var('DATABRICKS_HTTP_PATH') }}"
token: "{{ env_var('DATABRICKS_TOKEN') }}"
threads: 8
connect_retries: 5
Incremental Strategies¶
| Strategy | When to Use | Config |
|---|---|---|
| merge | SCD Type 1, upserts, most use cases | unique_key required |
| insert_overwrite | Full partition replacement (append-only) | partition_by required |
| append | Event/log tables, no dedup needed | Fastest, no merge overhead |
-- dbt model: models/gold/fct_orders.sql
{{
config(
materialized='incremental',
incremental_strategy='merge',
unique_key='order_id',
file_format='delta',
on_schema_change='append_new_columns'
)
}}
SELECT
order_id,
customer_id,
order_total,
_loaded_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
Unity Catalog Integration¶
# dbt_project.yml
models:
csa_inabox:
+persist_docs:
relation: true
columns: true
bronze:
+schema: bronze
+catalog: "{{ env_var('DBT_CATALOG', 'unity_dev') }}"
silver:
+schema: silver
+catalog: "{{ env_var('DBT_CATALOG', 'unity_dev') }}"
gold:
+schema: gold
+catalog: "{{ env_var('DBT_CATALOG', 'unity_dev') }}"
Model Grants¶
# models/gold/fct_orders.yml
models:
- name: fct_orders
config:
grants:
select: ["analysts", "data_scientists"]
Spark Performance¶
Adaptive Query Execution (AQE)¶
AQE is enabled by default on DBR 12.2+ and dynamically adjusts query plans at runtime. Verify it is not disabled.
Key AQE behaviors:
- Coalesces shuffle partitions — reduces 200 default partitions to match actual data volume
- Converts sort-merge joins to broadcast joins at runtime when one side is small
- Optimizes skew joins — splits skewed partitions automatically
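A quick way to confirm those flags are still on for the current cluster (a sketch; the skew-join key is included for completeness):
# Sketch — print the effective AQE settings for this cluster/session.
for key in (
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
):
    print(key, "=", spark.conf.get(key, "unset"))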
Key Spark Tuning Settings¶
| Setting | Default | Recommended | Why |
|---|---|---|---|
| spark.sql.adaptive.enabled | true | true | Dynamic plan optimization |
| spark.sql.adaptive.coalescePartitions.enabled | true | true | Auto-right-size shuffle partitions |
| spark.sql.shuffle.partitions | 200 | auto (AQE) or tune | 200 is too many for small data, too few for large |
| spark.sql.autoBroadcastJoinThreshold | 10MB | 30MB for large clusters | Broadcast avoids shuffle entirely |
| spark.databricks.io.cache.enabled | true | true | SSD cache for hot Delta data |
Broadcast Join Thresholds¶
# Force broadcast for a known-small dimension table
from pyspark.sql.functions import broadcast
result = (
fact_df
.join(broadcast(dim_df), "customer_id")
)
| Do | Don't |
|---|---|
| Broadcast dimension tables < 100 MB | Broadcast tables that grow unpredictably |
| Let AQE auto-convert when threshold is met | Disable AQE to "control" join strategies |
Spill Management¶
Spill occurs when Spark runs out of execution memory and writes intermediate data to disk. It severely degrades performance.
Signs of spill: Check the Spark UI for "Spill (Memory)" and "Spill (Disk)" columns in the Stage detail view.
Fixes:
- Increase spark.executor.memory or use a larger node type
- Increase spark.sql.shuffle.partitions to decrease per-partition size (see the sketch below)
- Filter data earlier in the pipeline (predicate pushdown)
- Avoid collect() or toPandas() on large DataFrames
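A minimal sketch of the first two fixes; the table names, filter, and join key are illustrative assumptions.
# Sketch — raise shuffle parallelism so each partition is smaller, and filter
# before the wide join so less data is shuffled. Names are examples.
spark.conf.set("spark.sql.shuffle.partitions", "800")   # AQE can still coalesce down
orders = spark.table("catalog.schema.orders").filter("order_date >= '2024-01-01'")
customers = spark.table("catalog.schema.customers")
joined = orders.join(customers, "customer_id")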
UDF Avoidance¶
Avoid Python UDFs in production pipelines
Python UDFs serialize data row-by-row between the JVM and Python, creating a massive performance bottleneck. Use built-in Spark SQL functions or Pandas UDFs (vectorized) instead.
| Approach | Relative Speed | When to Use |
|---|---|---|
| Built-in SQL/DF | 1x (fastest) | Always prefer |
| Pandas UDF | 2-5x slower | Complex Python logic, batch-vectorized |
| Python UDF | 10-100x slower | Last resort, simple row transforms only |
# Bad: Python UDF — every row crosses the JVM/Python boundary
from pyspark.sql.functions import udf

@udf("string")
def clean_name(name):
    return name.strip().title()

# Good: built-in functions stay inside the JVM / Photon engine
from pyspark.sql.functions import col, initcap, trim

df = df.withColumn("clean_name", initcap(trim(col("name"))))
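When no built-in function covers the logic, a vectorized pandas UDF is the middle option from the table above. A minimal sketch:
# Acceptable fallback: pandas UDF — processes whole batches, not single rows
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("string")
def clean_name_vectorized(names: pd.Series) -> pd.Series:
    return names.str.strip().str.title()

df = df.withColumn("clean_name", clean_name_vectorized(col("name")))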
Secret Management¶
Databricks Secret Scopes Backed by Azure Key Vault¶
# Create a Key-Vault-backed secret scope (Databricks CLI)
databricks secrets create-scope \
--scope scope-prod-adls \
--scope-backend-type AZURE_KEYVAULT \
--resource-id /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault> \
--dns-name https://<vault>.vault.azure.net/
# Access secrets in notebooks
storage_key = dbutils.secrets.get(scope="scope-prod-adls", key="adls-access-key")
# Secrets are redacted in notebook output — [REDACTED] appears in logs
print(storage_key) # prints: [REDACTED]
Secret ACLs¶
# Grant read access to a group
databricks secrets put-acl \
--scope scope-prod-adls \
--principal data-engineers \
--permission READ
| Do | Don't |
|---|---|
| Use Key-Vault-backed scopes in production | Use Databricks-backed scopes (no rotation, no audit trail) |
| Grant READ to groups, never individuals | Grant MANAGE broadly |
| Rotate secrets via Key Vault policies | Hardcode tokens in notebooks or init scripts |
| Reference secrets via dbutils.secrets.get | Pass secrets as widget parameters |
Cost Optimization¶
Auto-Termination¶
| Cluster Type | Auto-Termination | Rationale |
|---|---|---|
| Dev / adhoc | 30 minutes | Engineers forget to shut down |
| CI / staging | 15 minutes | Short-lived test runs |
| Production job | N/A (job cluster) | Cluster dies automatically post-job |
Right-Sizing Guide¶
| Workload | Recommended VM | vCPUs | RAM (GB) | Notes |
|---|---|---|---|---|
| Light dev / notebook | Standard_DS3_v2 | 4 | 14 | Single-user exploration |
| Standard ETL | Standard_DS4_v2 | 8 | 28 | Balanced cost/performance |
| Memory-intensive | Standard_E8s_v3 | 8 | 64 | Large joins, wide tables |
| Compute-intensive | Standard_F8s_v2 | 8 | 16 | CPU-bound transforms |
| GPU / ML training | Standard_NC6s_v3 | 6 | 112 | CUDA workloads |
DBU Consumption Monitoring¶
# Query Databricks system tables for DBU usage (Unity Catalog required)
dbu_usage = spark.sql("""
SELECT
usage_date,
workspace_id,
sku_name,
SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY usage_date, workspace_id, sku_name
ORDER BY usage_date DESC
""")
dbu_usage.display()
Cost Control Checklist¶
- Cluster policies enforced on all workspaces (prevents over-provisioning)
- Auto-termination set on all interactive clusters
- Job clusters used for all production workloads
- Spot instances at 50-70% ratio on worker nodes
- CostCenter tag required by cluster policy
- DBU usage reviewed weekly via system tables or Azure Cost Management
- Unused clusters and warehouses decommissioned monthly
Monitoring¶
Spark UI¶
The Spark UI is your first stop for diagnosing slow stages, skew, and spill.
| Tab | What to Look For |
|---|---|
| Stages | Spill (Memory/Disk), shuffle read/write sizes |
| SQL | Physical plan, scan stats, broadcast vs. sort-merge |
| Storage | Cached RDDs/DataFrames, memory usage |
| Executors | GC time, failed tasks, blacklisted nodes |
Azure Monitor Integration¶
Route cluster logs and metrics to Log Analytics for centralized observability.
{
"log_analytics": {
"workspace_id": "<LOG_ANALYTICS_WORKSPACE_ID>",
"workspace_key": "{{secrets/scope-prod-monitoring/la-key}}"
},
"log_delivery": {
"cluster_logs": true,
"spark_driver_logs": true,
"spark_executor_logs": true,
"init_script_logs": true
}
}
Alert Configuration¶
Set up alerts for:
| Condition | Threshold | Action |
|---|---|---|
| Job run duration > 2x baseline | Per-job baseline | Notify + investigate |
| Cluster idle > auto-termination | Policy value | Auto-terminate |
| DBU spend > daily budget | Per-workspace | Alert team + freeze |
| Failed job runs in sequence | 3 consecutive | Page on-call |
| Disk spill exceeding 10 GB | Per-stage | Right-size cluster |
Security¶
Network Isolation (VNet Injection)¶
Deploy Databricks workspaces into a customer-managed VNet so that all cluster traffic stays within the private network.
graph TD
subgraph Customer VNet
subgraph Private Subnet
W[Worker Nodes]
end
subgraph Public Subnet
CP[Control Plane NAT]
end
PE[Private Endpoint — ADLS, Key Vault, SQL]
end
CP -.->|Secure tunnel| DB[Databricks Control Plane]
W --> PE
Bicep reference: deploy/bicep/DMLZ/modules/Databricks/databricks.bicep
Security Hardening Checklist¶
- VNet injection enabled for all workspaces
- Private Link for control plane connectivity (no public IP)
- NSG rules restrict egress to Azure backbone + required endpoints only
- IP access lists restrict workspace UI/API access to corporate CIDR ranges
- Conditional access via Entra ID — require MFA, compliant device
- Unity Catalog for table-level and column-level ACLs (replaces legacy table ACLs)
- Credential passthrough disabled; use service principals or managed identities
- Cluster policies prevent use of unrestricted node types or public subnets
- Audit logs shipped to Log Analytics (NIST 800-53 AU-12)
- Customer-managed keys (CMK) for DBFS encryption at rest
Table ACLs vs. Unity Catalog¶
| Feature | Legacy Table ACLs | Unity Catalog |
|---|---|---|
| Scope | Per-workspace | Cross-workspace, cross-cloud |
| Column masking | No | Yes (row filters + column masks) |
| Lineage | No | Yes (automatic, queryable) |
| External locations | Hive metastore | Managed via Unity Catalog |
| Recommendation | Migrate away | Use for all new workloads |
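A minimal sketch of Unity Catalog grants issued from a notebook; the group and object names are examples aligned with the least-privilege guidance in this guide.
# Sketch — group and object names are examples.
spark.sql("GRANT USE CATALOG ON CATALOG unity_prod TO `data-engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA unity_prod.gold TO `analysts`")
spark.sql("GRANT SELECT ON TABLE unity_prod.gold.fct_orders TO `analysts`")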
Anti-Patterns¶
Long-running all-purpose clusters for production
All-purpose clusters stay alive between jobs, waste DBUs during idle windows, and lack the clean-environment guarantees of job clusters. Always use job clusters for scheduled workloads.
No cluster policies
Without policies, any user can spin up a 64-node GPU cluster. Cluster policies are the single most effective cost-control lever. Enforce them from day one.
Running everything as workspace admin
Workspace admins bypass Unity Catalog ACLs. Create granular groups (data-engineers, analysts, ml-engineers) and assign least-privilege permissions. Reserve admin for infrastructure automation only.
Skipping OPTIMIZE on append-heavy tables
Streaming and micro-batch ingestion creates thousands of small files. Without OPTIMIZE (or auto-compaction), query performance degrades exponentially as file count grows. See Delta Lake Optimization.
Over-provisioned driver nodes
The driver coordinates work but does not process data. A driver 4x the size of workers wastes money. Match the driver to worker size or one tier above.
Hardcoding secrets in notebooks or repos
Tokens, keys, and passwords in source code are visible in version history forever, even after deletion. Use Key-Vault-backed secret scopes exclusively.
Disabling AQE to 'control' performance
AQE dynamically optimizes shuffle partitions, join strategies, and skew handling. Disabling it forces static plans that are worse in nearly every scenario. Leave it enabled and tune thresholds instead.
Cross-References¶
| Topic | Document |
|---|---|
| Why Databricks over OSS Spark | ADR-0002 |
| Setup and operations guide | Databricks Guide |
| Delta Lake and Spark deep tuning | Performance Tuning |
| Medallion layer design | Medallion Architecture |
| Fabric vs. Databricks vs. Synapse | Decision Matrix |
| Fabric migration path | Databricks to Fabric |
| Reference architecture overview | Fabric vs. Synapse vs. Databricks |
| OSS Spark alternative | OSS Ecosystem Guide |
| Bicep deployment modules | deploy/bicep/DMLZ/modules/Databricks/databricks.bicep |
| NIST 800-53 controls | governance/compliance/nist-800-53-rev5.yaml (AC-3, AU-12, SC-8) |