# Compute Migration: BigQuery and Dataproc to Databricks and Fabric
A hands-on guide for data engineers migrating BigQuery SQL workloads, BigQuery ML, scheduled queries, and Dataproc Spark jobs to Databricks SQL, Fabric, and dbt.
## Scope
This guide covers:
- BigQuery SQL to SparkSQL / T-SQL dialect conversion
- Slots to DBU/CU capacity mapping
- Materialized views to dbt materializations
- Scheduled queries to dbt + ADF triggers
- BigQuery BI Engine to Direct Lake mode
- Clustering and partitioning to Delta table optimization
- BigQuery Omni to OneLake shortcuts
- Dataproc to Databricks cluster migration
For storage migration, see Storage Migration. For Looker migration, see Analytics Migration.
## BigQuery SQL to Databricks SQL / Fabric

### Dialect conversion reference
The existing playbook (Section 4.3) documents the critical dialect differences. This guide expands on those with additional patterns.
#### Core syntax differences

| BigQuery Standard SQL | Databricks SQL | Notes |
|---|---|---|
| `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` | `DATE_SUB(CURRENT_DATE(), 3)` | Argument form differs |
| `DATE_ADD(d, INTERVAL 7 DAY)` | `DATE_ADD(d, 7)` | Same pattern |
| `TIMESTAMP_DIFF(a, b, HOUR)` | `TIMESTAMPDIFF(HOUR, b, a)` | Argument order reversed |
| `SAFE_CAST(x AS INT64)` | `TRY_CAST(x AS BIGINT)` | Naming and type |
| `SAFE_DIVIDE(a, b)` | `TRY_DIVIDE(a, b)` or `a / NULLIF(b, 0)` | `TRY_DIVIDE` available in DBR 13+ |
| `INT64` | `BIGINT` | Type name |
| `FLOAT64` | `DOUBLE` | Type name |
| `BOOL` | `BOOLEAN` | Type name |
| `BYTES` | `BINARY` | Type name |
| `STRING` | `STRING` | Same |
| `STRUCT<a INT64, b STRING>` | `STRUCT<a: BIGINT, b: STRING>` | Colon syntax in struct fields |
| `ARRAY<STRING>` | `ARRAY<STRING>` | Same |
| `UNNEST(arr)` | `explode(arr)` or `LATERAL VIEW explode(arr)` | Different keyword |
| `GENERATE_ARRAY(1, 10)` | `sequence(1, 10)` | Function name |
| `FORMAT_DATE('%Y-%m', d)` | `DATE_FORMAT(d, 'yyyy-MM')` | Format strings differ: strftime-style (BigQuery) vs Java SimpleDateFormat (Databricks) |
| `PARSE_DATE('%Y-%m-%d', s)` | `TO_DATE(s, 'yyyy-MM-dd')` | Parse function name |
| `IF(cond, a, b)` | `IF(cond, a, b)` | Same |
| `IFNULL(a, b)` | `COALESCE(a, b)` or `IFNULL(a, b)` | Both work in Databricks |
| `STARTS_WITH(s, prefix)` | `s LIKE 'prefix%'` or `startswith(s, prefix)` | Function available in DBR 13+ |
| `REGEXP_CONTAINS(s, r'pattern')` | `s RLIKE 'pattern'` | Different operator |
| `REGEXP_EXTRACT(s, r'pattern')` | `REGEXP_EXTRACT(s, 'pattern')` | Same function, different literal syntax |
| `@@project_id` session var | `current_catalog()` | Session context |
| `@@dataset_id` session var | `current_schema()` | Session context |
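Several of these rules typically combine in a single statement. A hand-converted before/after pair, using a hypothetical `sales.orders` table with an `items` array column:

```sql
-- BigQuery original:
--   SELECT o.order_id,
--          SAFE_CAST(o.amount AS FLOAT64) AS amount,
--          item
--   FROM `acme-gov.sales.orders` o, UNNEST(o.items) AS item
--   WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);

-- Databricks SQL equivalent:
SELECT o.order_id,
       TRY_CAST(o.amount AS DOUBLE) AS amount,
       item
FROM sales.orders o
LATERAL VIEW explode(o.items) items_t AS item
WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), 3);
```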
#### DDL differences

| BigQuery | Databricks | Notes |
|---|---|---|
| `CREATE TABLE ... PARTITION BY date_col` | `CREATE TABLE ... PARTITIONED BY (date_col)` | Keyword is plural and takes parentheses |
| `CLUSTER BY col1, col2` | `OPTIMIZE table ZORDER BY (col1, col2)` | Separate command, run after load |
| `OPTIONS(partition_expiration_days=400)` | Scheduled `DELETE` + `VACUUM` | No auto-expiration; expire rows with a scheduled `DELETE`, then `VACUUM` (sketch below) |
| `CREATE OR REPLACE TABLE` | `CREATE OR REPLACE TABLE` | Same |
| `CREATE TEMP TABLE` | `CREATE TEMPORARY VIEW` or temp table | Different semantics |
| `EXPORT DATA OPTIONS(...)` | `INSERT OVERWRITE DIRECTORY` or Spark write API | Export idiom differs |
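A minimal sketch of the expiration row above, assuming the `finance.fact_sales_daily` table from the Z-ordering example later in this guide and a 400-day policy:

```sql
-- Expire rows past the 400-day policy (schedule as a daily Workflow task)
DELETE FROM finance.fact_sales_daily
WHERE sales_date < DATE_SUB(CURRENT_DATE(), 400);

-- Reclaim the underlying files once Delta's retention window
-- (7 days by default) has passed
VACUUM finance.fact_sales_daily;
```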
#### Window function differences

| BigQuery | Databricks | Notes |
|---|---|---|
| `QUALIFY ROW_NUMBER() OVER(...) = 1` | `QUALIFY ROW_NUMBER() OVER(...) = 1` | Supported in DBR 10.0+; wrap in a subquery with `WHERE rn = 1` on older runtimes (examples below) |
| `FIRST_VALUE(x IGNORE NULLS)` | `FIRST_VALUE(x) IGNORE NULLS` | Placement differs |
| `LAST_VALUE(x IGNORE NULLS)` | `LAST_VALUE(x) IGNORE NULLS` | Placement differs |
| `PERCENTILE_CONT(x, 0.5)` | `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)` | SQL standard syntax |
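The first and last patterns side by side, with hypothetical table and column names:

```sql
-- Latest row per key: QUALIFY works in both dialects on DBR 10.0+
SELECT *
FROM sales.orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1;

-- Subquery form for older runtimes
SELECT * FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM sales.orders
) t
WHERE rn = 1;

-- Median: BigQuery's PERCENTILE_CONT(latency_ms, 0.5) becomes an aggregate
SELECT region,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50_latency
FROM ops.requests
GROUP BY region;
```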
#### Automated dialect conversion

For large codebases, manual conversion is impractical. Use a systematic approach:

- Regex-based find-and-replace for type names (`INT64` to `BIGINT`, `FLOAT64` to `DOUBLE`)
- AST-based tools (sqlglot) for complex syntax transformations
- Manual review for `QUALIFY`, `UNNEST`, struct literals, and cross joins
```python
# Using sqlglot for automated conversion
import sqlglot

bq_sql = """
SELECT
  DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) AS start_date,
  SAFE_CAST(revenue AS FLOAT64) AS revenue,
  ARRAY_AGG(STRUCT(product_id, quantity)) AS line_items
FROM `acme-gov.sales.orders`
WHERE REGEXP_CONTAINS(region, r'^US-')
"""

databricks_sql = sqlglot.transpile(bq_sql, read="bigquery", write="databricks")[0]
print(databricks_sql)
```
## Slots to DBU/CU mapping

### Conceptual mapping
| BigQuery concept | Azure equivalent | Notes |
|---|---|---|
| Slot | Databricks DBU or Fabric CU | Not 1:1; depends on workload shape |
| On-demand slots | Databricks Serverless SQL | Auto-scaling, no pre-provisioning |
| Standard Edition slots | Databricks SQL Classic (small) | Entry-level committed compute |
| Enterprise Edition slots | Databricks SQL Classic/Serverless | Mid-tier with governance features |
| Enterprise Plus slots | Databricks SQL + UC Premium | Advanced security and compliance |
| Flex slots (deprecated) | N/A | Use serverless or auto-scaling instead |
| Reservation | Reserved capacity (Databricks or Fabric) | 1-3 year commitment for discounts |
| Assignment | Workspace allocation | Capacity assigned to specific workspaces |
### Sizing guidance
There is no direct slot-to-DBU conversion formula because the architectures differ. Use this heuristic:
| BigQuery slot count | Databricks SQL Warehouse size | Fabric capacity | Notes |
|---|---|---|---|
| 100 slots | Small (8-16 DBU/hour) | F32 | Light workloads |
| 500 slots | Medium (32-64 DBU/hour) | F64 | Mid-sized analytics |
| 1,000 slots | Large (64-128 DBU/hour) | F128 | Heavy analytics |
| 2,000+ slots | X-Large + multiple warehouses | F256 | Enterprise scale |
Recommendation: Start with the estimated size, run representative workloads for 2 weeks, then right-size based on actual DBU consumption.
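For the right-sizing step, daily DBU consumption can be pulled from Databricks system tables. A sketch, assuming system tables are enabled and using the `system.billing.usage` schema (column names may evolve; verify against your workspace):

```sql
-- Daily DBU consumption by SKU over the 2-week evaluation window
SELECT usage_date,
       sku_name,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 14)
  AND usage_unit = 'DBU'
GROUP BY usage_date, sku_name
ORDER BY usage_date, sku_name;
```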
## Materialized views to dbt materializations
BigQuery materialized views auto-refresh when base tables change. The dbt equivalent depends on the refresh pattern:
| BigQuery MV pattern | dbt equivalent | When to use |
|---|---|---|
| Auto-refresh on write | Delta Live Tables (DLT) | Real-time or near-real-time refresh |
| Scheduled refresh | dbt incremental model + scheduled job | Batch refresh on schedule |
| Query-time refresh | dbt ephemeral model | Computed on each query |
| Complex aggregation MV | dbt incremental with merge strategy | Aggregation over fact tables |
### Worked example: BigQuery MV to dbt incremental
BigQuery materialized view:
```sql
CREATE MATERIALIZED VIEW `acme-gov.finance.mv_daily_revenue`
PARTITION BY sales_date
CLUSTER BY region
AS
SELECT
  DATE(order_ts) AS sales_date,
  region,
  SUM(gross_amount) AS daily_revenue,
  COUNT(*) AS order_count
FROM `acme-gov.sales.orders`
GROUP BY 1, 2;
```
dbt incremental model:
```sql
-- models/gold/daily_revenue.sql
{{ config(
    materialized='incremental',
    unique_key=['sales_date', 'region'],
    incremental_strategy='merge',
    partition_by=['sales_date']
) }}

SELECT
  DATE(order_ts) AS sales_date,
  region,
  SUM(gross_amount) AS daily_revenue,
  COUNT(*) AS order_count
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE DATE(order_ts) >= DATE_SUB(CURRENT_DATE(), 3)
{% endif %}
GROUP BY 1, 2
```
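The 3-day lookback inside `is_incremental()` is a tuning choice, not a fixed rule: it should cover the window in which `stg_orders` rows can still arrive or change while keeping each merge scan small. Widen it if upstream data arrives later than three days; the merge on `['sales_date', 'region']` keeps reprocessed days idempotent.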
## Scheduled queries to dbt + ADF triggers
| BigQuery scheduled query pattern | Azure equivalent | Implementation |
|---|---|---|
| Simple daily/hourly query | Databricks Workflow schedule | Cron-based schedule on a SQL task |
| Query with downstream dependencies | dbt job with model dependencies | dbt handles DAG ordering automatically |
| Cross-system orchestration | ADF pipeline with triggers | ADF orchestrates across Databricks, Fabric, external systems |
| Event-driven (on table update) | Databricks Auto Loader + DLT | File-arrival triggers processing |
### Migration steps

- Inventory all scheduled queries from the BigQuery console or `INFORMATION_SCHEMA.SCHEDULED_QUERIES`
- Classify each query as simple (single table refresh), dependent (DAG chain), or cross-system
- Convert simple queries to dbt models with Databricks Workflow schedules
- Convert dependent chains to dbt jobs (dbt manages the DAG)
- Convert cross-system queries to ADF pipelines calling dbt and other activities
## BigQuery BI Engine to Direct Lake
BigQuery BI Engine is an in-memory acceleration layer that speeds up BI queries over BigQuery tables. The Azure equivalent is Power BI Direct Lake mode.
| BigQuery BI Engine | Power BI Direct Lake | Notes |
|---|---|---|
| In-memory cache of BigQuery data | Direct read from Delta Lake in OneLake | No data copy -- reads Delta files directly |
| Automatic refresh | Automatic refresh on Delta changes | Near-real-time without scheduled imports |
| Reservation-based (GB of memory) | Included in Fabric capacity | No separate reservation needed |
| Optimized for Looker/BI queries | Optimized for Power BI queries | Native integration |
Migration: Once data is in Delta Lake on OneLake, create a Direct Lake semantic model in Power BI that points to the Delta tables. No BI Engine configuration is needed -- Direct Lake is the default mode for Fabric lakehouses.
## Clustering and partitioning to Delta optimization

### Partitioning
| BigQuery partition type | Delta equivalent | Migration notes |
|---|---|---|
| Date/timestamp column | PARTITIONED BY (date_col) | Direct mapping |
| Integer range partition | PARTITIONED BY (int_col) | May need bucketing for equivalent performance |
| Ingestion-time (_PARTITIONTIME) | PARTITIONED BY (_ingest_date) | Add explicit ingest date column |
| No partitioning | No partitioning needed for small tables | Delta stats-based pruning often sufficient |
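For the ingestion-time row, a Delta generated column can populate the ingest-date partition automatically, so writers only need to set a timestamp. A sketch with hypothetical table and column names:

```sql
CREATE TABLE raw.events (
  event_id STRING,
  payload STRING,
  _ingest_ts TIMESTAMP,
  -- derived automatically from the ingest timestamp on every write
  _ingest_date DATE GENERATED ALWAYS AS (CAST(_ingest_ts AS DATE))
)
USING DELTA
PARTITIONED BY (_ingest_date);
```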
### Clustering to Z-ordering
BigQuery clustering is automatic and maintenance-free. Delta Z-ordering requires explicit OPTIMIZE commands.
```sql
-- Run after data load or on a schedule
OPTIMIZE finance.fact_sales_daily ZORDER BY (region, product_id);

-- Enable auto-compaction for ongoing optimization
ALTER TABLE finance.fact_sales_daily SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
```
Best practice: Schedule OPTIMIZE as a post-load step in dbt or as a Databricks Workflow task.
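One way to wire the dbt option is a `post_hook` on the model config, sketched here against the worked example above:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    post_hook='OPTIMIZE {{ this }} ZORDER BY (region, product_id)'
) }}
```

This keeps the Z-order maintenance co-located with the model that populates the table, so it runs on every successful build.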
## BigQuery Omni to OneLake shortcuts
BigQuery Omni allows querying data in S3 or Azure Storage from BigQuery. During migration, use OneLake shortcuts for the reverse: querying GCS data from Azure.
| BigQuery Omni feature | Azure equivalent | Notes |
|---|---|---|
| External connection to S3 | OneLake shortcut to S3 | Zero-copy read |
| External connection to Azure Storage | OneLake shortcut to ADLS | Zero-copy read |
| Cross-cloud query | Lakehouse Federation | Query external sources from Databricks SQL |
| Bi-directional cross-cloud | No single equivalent | Azure can read GCS and BigQuery can read Azure Storage, but not from one unified console |
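For the Lakehouse Federation row, a minimal sketch of querying BigQuery from Databricks SQL during the transition. The option names here are assumptions based on the Lakehouse Federation docs; verify them against your DBR version:

```sql
-- Connection to the GCP project (service-account key JSON elided)
CREATE CONNECTION bq_conn TYPE bigquery
OPTIONS (GoogleServiceAccountKeyJson '<service-account-key-json>');

-- Expose the BigQuery project as a catalog, then query it directly
CREATE FOREIGN CATALOG bq_catalog USING CONNECTION bq_conn;

SELECT COUNT(*) FROM bq_catalog.sales.orders;
```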
## Dataproc to Databricks

### Cluster migration
| Dataproc concept | Databricks equivalent | Notes |
|---|---|---|
| Cluster (master + workers) | All-purpose cluster | Interactive workloads |
| Autoscaling cluster | Auto-scaling cluster + serverless | Serverless eliminates cluster management |
| Serverless Spark | Serverless SQL + Jobs | No cluster management |
| Init actions | Cluster init scripts + policies | Libraries via cluster policy |
| Component gateway | Workspace web terminal | Browser-based access |
| Jupyter on Dataproc | Databricks Notebooks | Richer collaboration features |
| Job submission (gcloud dataproc jobs) | Databricks Jobs API / Workflows | REST API + CLI |
### Spark compatibility
Dataproc uses open-source Apache Spark. Databricks uses the Databricks Runtime, which is a superset of Apache Spark with performance optimizations (Photon engine).
Compatibility notes:
- PySpark code runs on Databricks with minimal changes
- Spark SQL is compatible; some GCP-specific UDFs need porting
- Spark JAR jobs run on Databricks with the same JAR
- Hive metastore tables bridge via Unity Catalog external metastore
- GCS paths (`gs://`) need translation to ADLS paths (`abfss://`); see the sketch after this list
- GCP-specific libraries (e.g., the BigQuery connector) are replaced by Databricks native connectors
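Path translation for a migrated table, as a sketch with hypothetical storage account, container, and table names (assumes the data was already copied per the Storage Migration guide):

```sql
-- Before (Dataproc / Hive metastore):
--   CREATE TABLE analytics.events USING PARQUET LOCATION 'gs://acme-raw/events';

-- After: register the copied Parquet data in Unity Catalog on ADLS
CREATE TABLE analytics.events
USING PARQUET
LOCATION 'abfss://raw@acmestorage.dfs.core.windows.net/events';

-- Optionally upgrade in place to Delta
CONVERT TO DELTA analytics.events;
```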
### Migration steps
- Inventory Dataproc clusters, jobs, notebooks, and init actions
- Map cluster configurations to Databricks cluster policies
- Port Spark jobs -- change storage paths, remove GCP-specific operators
- Port notebooks -- change `gs://` to `abfss://`, update library imports
- Test on Databricks with representative data
- Schedule using Databricks Workflows
## Validation checklist
After migrating compute workloads:
- All BigQuery scheduled queries have equivalent dbt models or Databricks jobs
- SQL dialect conversion tested with representative queries
- Partition and Z-order strategies applied to Delta tables
- Query performance is comparable or better (benchmark key queries)
- dbt tests pass for all migrated models
- Databricks Workflow schedules match original BigQuery schedules
- Dataproc Spark jobs run successfully on Databricks
- MLflow models registered for any BigQuery ML conversions
Last updated: 2026-04-30
Maintainers: CSA-in-a-Box core team
Related: Storage Migration | ETL Migration | Analytics Migration | Migration Playbook