Tutorial: Migrate BigQuery Tables to Fabric / Databricks¶
A hands-on, step-by-step walkthrough for data engineers migrating BigQuery datasets to Microsoft Fabric and Azure Databricks, using the csa-inabox medallion architecture.
Estimated time: 4-6 hours (full pipeline); 1-2 hours (single table) · Difficulty: Intermediate · GCP experience assumed: BigQuery SQL, bq CLI, GCS basics
Prerequisites¶
Before starting this tutorial, ensure you have the following:
| Requirement | Details |
|---|---|
| GCP project | With BigQuery datasets you intend to migrate; roles/bigquery.dataViewer and roles/bigquery.jobUser on the source project |
| GCS bucket | For staging Parquet exports; roles/storage.objectAdmin on the bucket |
| gcloud CLI | Authenticated with gcloud auth login and project set via gcloud config set project PROJECT_ID |
| bq CLI | Bundled with gcloud; verify with bq version |
| Azure subscription | With permissions to create Storage Account, Databricks workspace, and Fabric capacity |
| ADLS Gen2 storage account | Provisioned with hierarchical namespace enabled |
| Azure Databricks workspace | With Unity Catalog enabled; CREATE CATALOG / CREATE SCHEMA privileges |
| AzCopy | Installed locally or available in Cloud Shell; azcopy --version |
| dbt Core | pip install dbt-databricks (v1.7+) |
| Git | For version-controlling dbt models |
GCP comparison: In BigQuery, storage and compute are billed separately but managed as one service. On Azure, ADLS Gen2 provides the storage layer and Databricks SQL provides the compute layer. The separation is explicit, which gives you more control over cost and scaling.
Scenario¶
You are migrating a BigQuery project acme-gov containing:
- A `sales` dataset with `order_lines` (500 GB, partitioned by `order_date`, clustered by `region`)
- A `finance` dataset with `fact_sales_daily` (materialized via a scheduled query) and dimension tables
- Scheduled queries that refresh `fact_sales_daily` at 02:00 UTC daily
- A Looker explore consuming the finance dataset (covered in the companion Looker tutorial)
By the end of this tutorial you will have these tables in Delta Lake on ADLS Gen2, queryable through Databricks SQL, with dbt models replacing BigQuery scheduled queries.
Step 1: Inventory BigQuery datasets¶
Before exporting anything, build a complete inventory of what you are migrating.
1.1 List all datasets and tables¶
# List all datasets in the project
bq ls --project_id=acme-gov
# List all tables in a dataset with details
bq ls --format=prettyjson acme-gov:sales
bq ls --format=prettyjson acme-gov:finance
1.2 Collect table metadata¶
For each table, record the schema, row count, size, partitioning, and clustering:
# Get table schema and metadata
bq show --format=prettyjson acme-gov:sales.order_lines
# Get row count and size
bq query --nouse_legacy_sql \
"SELECT table_id, row_count, size_bytes,
ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM \`acme-gov.sales.__TABLES__\`
ORDER BY size_bytes DESC"
1.3 Identify partitioning and clustering¶
# Check partition-related table options
bq query --nouse_legacy_sql \
  "SELECT table_name, option_name, option_value
   FROM \`acme-gov.sales.INFORMATION_SCHEMA.TABLE_OPTIONS\`
   WHERE option_name IN ('partition_expiration_days',
                         'require_partition_filter')"
# Partition and clustering columns per table
bq query --nouse_legacy_sql \
  "SELECT table_name,
          column_name,
          is_partitioning_column,
          clustering_ordinal_position
   FROM \`acme-gov.sales.INFORMATION_SCHEMA.COLUMNS\`
   WHERE is_partitioning_column = 'YES'
      OR clustering_ordinal_position IS NOT NULL"
1.4 Build inventory spreadsheet¶
| Dataset | Table | Rows | Size (GB) | Partition Column | Cluster Columns | Scheduled Query | Priority |
|---|---|---|---|---|---|---|---|
| sales | order_lines | 2.1B | 498 | order_date | region, product_id | No | High |
| finance | fact_sales_daily | 45M | 12 | sales_date | region, product_id | Yes (daily 02:00) | High |
| finance | dim_region | 250 | 0.001 | -- | -- | No | Medium |
| finance | dim_product | 12K | 0.01 | -- | -- | No | Medium |
| finance | dim_date | 36K | 0.002 | -- | -- | No | Medium |
GCP comparison: BigQuery `INFORMATION_SCHEMA` views provide this metadata natively. On Azure, Unity Catalog `information_schema` provides equivalent catalog metadata once the tables land in Databricks.
Step 2: Export BigQuery tables to GCS as Parquet¶
2.1 Create a staging GCS bucket¶
# Create a regional bucket near your BigQuery dataset
gsutil mb -l us-east4 -p acme-gov gs://acme-gov-migration-staging/
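Staging data is transient, so a lifecycle rule keeps the bucket from accruing storage cost after the migration. A sketch; the 30-day window is an arbitrary choice:

```bash
# Auto-delete staged exports after 30 days
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://acme-gov-migration-staging/
```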
2.2 Export small tables (< 1 GB) with bq extract¶
# Export dimension tables as single Parquet files
bq extract \
--destination_format=PARQUET \
--compression=SNAPPY \
acme-gov:finance.dim_region \
gs://acme-gov-migration-staging/finance/dim_region/dim_region.parquet
bq extract \
--destination_format=PARQUET \
--compression=SNAPPY \
acme-gov:finance.dim_product \
gs://acme-gov-migration-staging/finance/dim_product/dim_product.parquet
bq extract \
--destination_format=PARQUET \
--compression=SNAPPY \
acme-gov:finance.dim_date \
gs://acme-gov-migration-staging/finance/dim_date/dim_date.parquet
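Since the three extracts follow an identical pattern, a loop form is equivalent and terser, using the table names above:

```bash
# Loop form of the three dimension-table extracts
for T in dim_region dim_product dim_date; do
  bq extract \
    --destination_format=PARQUET \
    --compression=SNAPPY \
    "acme-gov:finance.${T}" \
    "gs://acme-gov-migration-staging/finance/${T}/${T}.parquet"
done
```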
2.3 Export large tables with EXPORT DATA (partition-aware)¶
For tables over 1 GB, use EXPORT DATA to write sharded Parquet files. Note that a single export writes a flat set of shards; the date-batched variant further below recreates a partition-style directory layout:
-- Export order_lines partitioned by order_date
EXPORT DATA OPTIONS (
uri = 'gs://acme-gov-migration-staging/sales/order_lines/*.parquet',
format = 'PARQUET',
compression = 'SNAPPY',
overwrite = true
) AS
SELECT *
FROM `acme-gov.sales.order_lines`
WHERE order_date BETWEEN '2022-01-01' AND '2024-12-31';
For very large tables, export in date-range batches so each export job stays manageable, failed batches can be retried individually, and the transfer to Azure can proceed incrementally:
# Export year-by-year for 500 GB table
for YEAR in 2022 2023 2024; do
bq query --nouse_legacy_sql \
"EXPORT DATA OPTIONS (
uri = 'gs://acme-gov-migration-staging/sales/order_lines/${YEAR}/*.parquet',
format = 'PARQUET',
compression = 'SNAPPY',
overwrite = true
) AS
SELECT *
FROM \`acme-gov.sales.order_lines\`
WHERE order_date >= '${YEAR}-01-01'
AND order_date < '$((YEAR+1))-01-01'"
done
2.4 Verify exports¶
# Check file count and total size
gsutil du -sh gs://acme-gov-migration-staging/sales/order_lines/
gsutil ls -l gs://acme-gov-migration-staging/finance/dim_region/
GCP comparison: BigQuery's `EXPORT DATA` writes Parquet natively. This is the cleanest exit path because Parquet is already the base format for Delta Lake. There is no proprietary format conversion.
Step 3: Transfer GCS to ADLS Gen2¶
You have three options. Choose based on volume and network constraints.
Option A: AzCopy with GCS interop (recommended for < 5 TB)¶
# Authenticate AzCopy to Azure
azcopy login --tenant-id <TENANT_ID>
# Provide a GCS service account key; AzCopy reads GOOGLE_APPLICATION_CREDENTIALS for GCS sources
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Copy from GCS to ADLS Gen2
azcopy copy \
"https://storage.googleapis.com/acme-gov-migration-staging/sales/" \
"https://<STORAGE_ACCOUNT>.dfs.core.windows.net/bronze/sales/" \
--recursive \
--s2s-preserve-properties=false
Option B: ADF Copy Activity (recommended for automation and > 5 TB)¶
Create an ADF pipeline with a Copy Activity:
- Source: Google Cloud Storage linked service (service account key)
- Sink: ADLS Gen2 linked service (managed identity)
- Format: Parquet (source) to Parquet (sink)
- Parallelism: Set DIU to 16-64 depending on volume
{
  "name": "copy_gcs_to_adls",
  "type": "Copy",
  "typeProperties": {
    "dataIntegrationUnits": 32,
    "source": {
      "type": "ParquetSource",
      "storeSettings": {
        "type": "GoogleCloudStorageReadSettings",
        "recursive": true
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
      }
    }
  }
}
Option C: Offline transfer with Azure Data Box (for > 50 TB)¶
For very large datasets where network transfer is impractical:
- Stage the data from GCS onto local storage (Google Transfer Appliance only moves data into Google Cloud, so the export leg uses gsutil or Storage Transfer Service)
- Copy the staged data onto an Azure Data Box
- Ship the Data Box back to Microsoft for ingestion into ADLS Gen2
- Estimated timeline: 2-4 weeks for physical transfer
3.1 Verify transfer¶
# Check file count and sizes on ADLS Gen2
az storage fs file list \
--account-name <STORAGE_ACCOUNT> \
--file-system bronze \
--path sales/order_lines \
--query "[].{name:name, size:contentLength}" \
--output table
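To compare total bytes on both sides in one pass, something like the following works. A sketch; assumes `paste` and `bc` are available and uses the paths from the steps above:

```bash
# GCS side: total bytes under the order_lines prefix
gsutil du -s gs://acme-gov-migration-staging/sales/order_lines/

# ADLS side: sum contentLength across files (directories excluded) under the same prefix
az storage fs file list \
  --account-name <STORAGE_ACCOUNT> \
  --file-system bronze \
  --path sales/order_lines \
  --query '[?isDirectory==`false`].contentLength' \
  --output tsv | paste -sd+ - | bc
```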
GCP comparison: GCS egress charges apply during transfer. Budget approximately $0.12/GB for standard egress from GCS to Azure. OneLake shortcuts can defer the bulk copy by letting you query GCS data in place from Azure, though reads through a shortcut still incur egress per access.
Step 4: Create Delta tables in Databricks from Parquet files¶
4.1 Create Unity Catalog structure¶
-- Create a catalog mirroring the BigQuery project
CREATE CATALOG IF NOT EXISTS acme_gov;
-- Create schemas mirroring BigQuery datasets
CREATE SCHEMA IF NOT EXISTS acme_gov.sales;
CREATE SCHEMA IF NOT EXISTS acme_gov.finance;
4.2 Create Delta tables from Parquet (small tables)¶
-- Dimension tables: direct CTAS from Parquet
CREATE TABLE acme_gov.finance.dim_region
USING DELTA
AS SELECT * FROM parquet.`abfss://bronze@<STORAGE>.dfs.core.windows.net/finance/dim_region/`;
CREATE TABLE acme_gov.finance.dim_product
USING DELTA
AS SELECT * FROM parquet.`abfss://bronze@<STORAGE>.dfs.core.windows.net/finance/dim_product/`;
CREATE TABLE acme_gov.finance.dim_date
USING DELTA
AS SELECT * FROM parquet.`abfss://bronze@<STORAGE>.dfs.core.windows.net/finance/dim_date/`;
4.3 Create partitioned Delta table (large tables)¶
-- Create order_lines with partition matching BigQuery
CREATE TABLE acme_gov.sales.order_lines
USING DELTA
PARTITIONED BY (order_date)
AS SELECT * FROM parquet.`abfss://bronze@<STORAGE>.dfs.core.windows.net/sales/order_lines/`;
-- Apply Z-ordering (BigQuery clustering equivalent)
OPTIMIZE acme_gov.sales.order_lines
ZORDER BY (region, product_id);
4.4 Validate row counts¶
-- Compare against BigQuery inventory from Step 1
SELECT 'order_lines' AS table_name, COUNT(*) AS row_count
FROM acme_gov.sales.order_lines
UNION ALL
SELECT 'dim_region', COUNT(*) FROM acme_gov.finance.dim_region
UNION ALL
SELECT 'dim_product', COUNT(*) FROM acme_gov.finance.dim_product
UNION ALL
SELECT 'dim_date', COUNT(*) FROM acme_gov.finance.dim_date;
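Beyond row counts, column-level checksums catch precision or type drift introduced by the Parquet round-trip. This is the BigQuery side (column names from the scenario schema); run the same SELECT against `acme_gov.sales.order_lines` in the Databricks SQL editor and compare the two outputs:

```bash
# BigQuery-side checksum; mirror this query on Databricks SQL and diff
bq query --nouse_legacy_sql \
  "SELECT COUNT(*) AS row_count,
          SUM(quantity) AS sum_quantity,
          SUM(gross_amount) AS sum_gross_amount,
          MIN(order_date) AS min_order_date,
          MAX(order_date) AS max_order_date
   FROM \`acme-gov.sales.order_lines\`"
```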
GCP comparison: BigQuery partitions by date columns natively in DDL. In Databricks, `PARTITIONED BY` handles the partition column, while `OPTIMIZE ... ZORDER BY` replaces BigQuery's `CLUSTER BY` for query acceleration. Z-ordering is a post-write optimization rather than a DDL declaration.
Step 5: Convert BigQuery SQL to dbt models¶
5.1 Initialize dbt project¶
Configure profiles.yml for Databricks:
acme_gov_analytics:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: acme_gov
      schema: finance
      host: <DATABRICKS_HOST>
      http_path: /sql/1.0/warehouses/<WAREHOUSE_ID>
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      threads: 4
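Before writing models, confirm the profile can reach the warehouse (assumes `DBT_DATABRICKS_TOKEN` is exported in your shell):

```bash
# Validates profiles.yml, adapter installation, and warehouse connectivity
dbt debug
```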
5.2 Port the scheduled query to an incremental model¶
Port the BigQuery scheduled query to a dbt incremental model. The `{{ source() }}` reference below assumes a `sources.yml` that declares the `sales` source pointing at `acme_gov.sales`:
models/gold/fact_sales_daily.sql
{{ config(
materialized='incremental',
unique_key=['sales_date', 'region', 'product_id'],
incremental_strategy='merge',
partition_by=['sales_date'],
tblproperties={
'delta.autoOptimize.autoCompact': 'true',
'delta.autoOptimize.optimizeWrite': 'true'
}
) }}
SELECT
DATE(order_ts) AS sales_date,
region,
product_id,
SUM(quantity) AS units_sold,
SUM(gross_amount) AS gross_amount
FROM {{ source('sales', 'order_lines') }}
{% if is_incremental() %}
WHERE DATE(order_ts) >= DATE_SUB(CURRENT_DATE(), 3)
{% endif %}
GROUP BY 1, 2, 3
5.3 Add schema tests¶
models/gold/schema.yml
version: 2
models:
  - name: fact_sales_daily
    description: "Daily sales rollup by region and product"
    columns:
      - name: sales_date
        tests:
          - not_null
      - name: region
        tests:
          - not_null
      - name: product_id
        tests:
          - not_null
      - name: units_sold
        tests:
          - not_null
      - name: gross_amount
        tests:
          - not_null
5.4 Run and test¶
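A typical first run builds the table from scratch; subsequent runs merge the trailing three-day window defined in the model:

```bash
# Initial build, then an incremental run
dbt run --select fact_sales_daily --full-refresh
dbt run --select fact_sales_daily

# Execute the not_null tests from schema.yml
dbt test --select fact_sales_daily
```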
Step 6: Set up scheduling¶
6.1 Replace BigQuery scheduled queries with Databricks Workflows¶
BigQuery scheduled queries are configured in the BigQuery UI with a schedule expression (backed by the Data Transfer Service). The Databricks equivalent is a Workflow job with a schedule trigger.
Databricks Workflow JSON (via REST API or Terraform):
{
"name": "daily_sales_rollup",
"tasks": [
{
"task_key": "dbt_run",
"description": "Run dbt fact_sales_daily model",
"new_cluster": {
"spark_version": "14.3.x-scala2.12",
"node_type_id": "Standard_D4s_v3",
"num_workers": 2
},
"libraries": [{ "pypi": { "package": "dbt-databricks==1.7.0" } }],
"python_wheel_task": {
"package_name": "dbt",
"entry_point": "main",
"parameters": ["run", "--select", "fact_sales_daily"]
}
}
],
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "UTC"
}
}
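To smoke-test the job outside its 02:00 schedule, the Jobs API `run-now` endpoint triggers it immediately. A sketch; assumes a `DATABRICKS_TOKEN` environment variable and the `<JOB_ID>` returned when the job was created:

```bash
# Trigger the workflow on demand via the Jobs API
curl -X POST "https://<DATABRICKS_HOST>/api/2.1/jobs/run-now" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"job_id": <JOB_ID>}'
```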
6.2 Alternative: ADF trigger for cross-system orchestration¶
If you need to coordinate dbt runs with Power BI refresh or Fabric pipelines:
{
"name": "tr_daily_sales_pipeline",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2025-01-01T02:00:00Z",
"timeZone": "UTC"
}
},
"pipelines": [
{
"pipelineReference": {
"referenceName": "pl_daily_sales_refresh",
"type": "PipelineReference"
}
}
]
}
}
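With the trigger JSON saved locally, the Azure CLI `datafactory` extension can register and start it. A sketch; the resource group and factory names are placeholders:

```bash
# Requires: az extension add --name datafactory
az datafactory trigger create \
  --resource-group <RESOURCE_GROUP> \
  --factory-name <DATA_FACTORY> \
  --name tr_daily_sales_pipeline \
  --properties @tr_daily_sales_pipeline.json

# Triggers are created in a stopped state; start it to activate the schedule
az datafactory trigger start \
  --resource-group <RESOURCE_GROUP> \
  --factory-name <DATA_FACTORY> \
  --name tr_daily_sales_pipeline
```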
| BigQuery scheduling | Azure equivalent | Notes |
|---|---|---|
| Scheduled query (schedule expression) | Databricks Workflow schedule | Near 1:1; translate BigQuery's schedule syntax (e.g. every day 02:00) to Quartz cron |
| Scheduled query (interval) | ADF Schedule Trigger | For cross-service orchestration |
| On-demand refresh | Databricks Jobs API run-now / ADF trigger now | REST API or portal |
Step 7: Create Power BI report with Direct Lake¶
7.1 Create a Fabric lakehouse shortcut to Delta tables¶
In the Fabric portal:
- Open your Fabric workspace
- Create a new Lakehouse (or use existing)
- Navigate to Get Data > New shortcut > Azure Data Lake Storage Gen2
- Point to `abfss://gold@<STORAGE>.dfs.core.windows.net/finance/`
- The Delta tables appear as tables in the lakehouse
7.2 Create a Direct Lake semantic model¶
- From the lakehouse, select New semantic model
- Select the fact and dimension tables: `fact_sales_daily`, `dim_region`, `dim_product`, `dim_date`
- Define relationships:
    - `fact_sales_daily[region]` to `dim_region[region_id]`
    - `fact_sales_daily[product_id]` to `dim_product[product_id]`
    - `fact_sales_daily[sales_date]` to `dim_date[date_key]`
7.3 Add DAX measures¶
Units Sold = SUM(fact_sales_daily[units_sold])
Gross Revenue = SUM(fact_sales_daily[gross_amount])
Avg Revenue Per Unit = DIVIDE([Gross Revenue], [Units Sold], 0)
YoY Revenue Growth =
VAR CurrentYear = [Gross Revenue]
VAR PriorYear = CALCULATE([Gross Revenue], SAMEPERIODLASTYEAR(dim_date[date_key]))
RETURN DIVIDE(CurrentYear - PriorYear, PriorYear, 0)
GCP comparison: Direct Lake reads Delta files directly from OneLake without importing data into a separate cache. This is conceptually similar to BigQuery BI Engine, which provides an in-memory acceleration layer. The key difference is that Direct Lake works over open Delta format, whereas BI Engine is tightly coupled to BigQuery storage.
Step 8: Validate query results match¶
8.1 Run identical aggregation queries on both platforms¶
BigQuery:
SELECT
region,
SUM(units_sold) AS total_units,
SUM(gross_amount) AS total_revenue,
COUNT(*) AS row_count
FROM `acme-gov.finance.fact_sales_daily`
WHERE sales_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY region
ORDER BY region;
Databricks SQL:
SELECT
region,
SUM(units_sold) AS total_units,
SUM(gross_amount) AS total_revenue,
COUNT(*) AS row_count
FROM acme_gov.finance.fact_sales_daily
WHERE sales_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY region
ORDER BY region;
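To make the comparison mechanical, capture the BigQuery side as CSV and diff it against the result exported from the Databricks SQL editor. A sketch; `dbx_region_totals.csv` is the hypothetical Databricks export, and headers may need normalizing before the diff:

```bash
# BigQuery result as CSV for diffing against the Databricks export
bq query --nouse_legacy_sql --format=csv \
  "SELECT region, SUM(units_sold) AS total_units,
          SUM(gross_amount) AS total_revenue, COUNT(*) AS row_count
   FROM \`acme-gov.finance.fact_sales_daily\`
   WHERE sales_date BETWEEN '2024-01-01' AND '2024-12-31'
   GROUP BY region ORDER BY region" > bq_region_totals.csv

diff bq_region_totals.csv dbx_region_totals.csv && echo "results match"
```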
8.2 Reconciliation checklist¶
- Row counts match per table (tolerance: 0%)
- Aggregate sums match (tolerance: < 0.01% for floating-point rounding)
- Partition pruning works (check the query plan with `EXPLAIN`)
- Z-order accelerates filtered queries (compare scan sizes)
- dbt incremental runs produce correct delta
- Power BI report matches Looker dashboard numbers
BigQuery SQL to Databricks SQL conversion reference¶
This table covers the 20+ most common patterns that need conversion during migration.
Data types¶
| BigQuery | Databricks SQL | Notes |
|---|---|---|
| `INT64` | `BIGINT` | Direct rename |
| `FLOAT64` | `DOUBLE` | Direct rename |
| `NUMERIC` / `BIGNUMERIC` | `DECIMAL(38, 18)` | Specify precision explicitly |
| `BOOL` | `BOOLEAN` | Direct rename |
| `STRING` | `STRING` | Identical |
| `BYTES` | `BINARY` | Direct rename |
| `DATE` | `DATE` | Identical |
| `DATETIME` | `TIMESTAMP_NTZ` | BigQuery `DATETIME` has no timezone; use `TIMESTAMP_NTZ` |
| `TIMESTAMP` | `TIMESTAMP` | Both are UTC-aware |
| `TIME` | `STRING` (workaround) | Databricks lacks a native `TIME` type; store as string or interval |
| `GEOGRAPHY` | `STRING` (WKT) | Use Mosaic or H3 libraries for geospatial in Databricks |
| `STRUCT<a INT64, b STRING>` | `STRUCT<a: BIGINT, b: STRING>` | Colon syntax differs |
| `ARRAY<STRING>` | `ARRAY<STRING>` | Identical |
| `JSON` | `STRING` with `from_json()` / `schema_of_json()` | Parse at read time |
Functions¶
| BigQuery Standard SQL | Databricks SQL | Notes |
|---|---|---|
| `DATE_SUB(d, INTERVAL 3 DAY)` | `DATE_SUB(d, 3)` | Databricks `DATE_SUB` takes an integer day count; interval arithmetic (`d - INTERVAL 3 DAYS`) also works |
| `DATE_ADD(d, INTERVAL 3 DAY)` | `DATE_ADD(d, 3)` | Same pattern |
| `DATE_DIFF(d1, d2, DAY)` | `DATEDIFF(d1, d2)` | Different name; argument order carries over directly for day differences |
| `TIMESTAMP_DIFF(t1, t2, SECOND)` | `TIMESTAMPDIFF(SECOND, t2, t1)` | Arg order reversed |
| `FORMAT_DATE('%Y-%m', d)` | `DATE_FORMAT(d, 'yyyy-MM')` | Java-style format codes |
| `FORMAT_TIMESTAMP(...)` | `DATE_FORMAT(CAST(t AS TIMESTAMP), ...)` | Same pattern |
| `PARSE_DATE('%Y%m%d', s)` | `TO_DATE(s, 'yyyyMMdd')` | Function name differs |
| `PARSE_TIMESTAMP(...)` | `TO_TIMESTAMP(s, fmt)` | Function name differs |
| `SAFE_CAST(x AS INT64)` | `TRY_CAST(x AS BIGINT)` | `SAFE_` prefix becomes `TRY_` |
| `SAFE_DIVIDE(a, b)` | `TRY_DIVIDE(a, b)` or `a / NULLIF(b, 0)` | `SAFE_` becomes `TRY_` |
| `IFNULL(a, b)` | `COALESCE(a, b)` or `IFNULL(a, b)` | Both work in Databricks |
| `ARRAY_AGG(x)` | `COLLECT_LIST(x)` | Different name |
| `ARRAY_AGG(DISTINCT x)` | `COLLECT_SET(x)` | Set-based for distinct |
| `UNNEST(arr)` | `EXPLODE(arr)` | `LATERAL VIEW EXPLODE()` or inline `EXPLODE()` |
| `GENERATE_DATE_ARRAY(...)` | `SEQUENCE(start, stop, INTERVAL 1 DAY)` | Different name |
| `STRING_AGG(x, ',')` | `CONCAT_WS(',', COLLECT_LIST(x))` | Two-step in Databricks |
| `REGEXP_EXTRACT(s, r)` | `REGEXP_EXTRACT(s, r, 0)` | Add the group index explicitly |
| `REGEXP_CONTAINS(s, r)` | `s RLIKE r` | Operator syntax in Databricks |
| `STARTS_WITH(s, prefix)` | `s LIKE 'prefix%'` or `STARTSWITH(s, prefix)` | DBR 13.3+ has `STARTSWITH` |
| `ENDS_WITH(s, suffix)` | `s LIKE '%suffix'` or `ENDSWITH(s, suffix)` | DBR 13.3+ has `ENDSWITH` |
DDL and DML patterns¶
| BigQuery | Databricks SQL | Notes |
|---|---|---|
| `CREATE OR REPLACE TABLE ... PARTITION BY col CLUSTER BY c1, c2` | `CREATE OR REPLACE TABLE ... PARTITIONED BY (col)` + `OPTIMIZE ... ZORDER BY (c1, c2)` | Clustering is a post-write operation |
| `MERGE INTO t USING s ON ...` | `MERGE INTO t USING s ON ...` | Identical syntax |
| `INSERT INTO t SELECT ...` | `INSERT INTO t SELECT ...` | Identical |
| `EXPORT DATA OPTIONS(...)` | `COPY INTO` or DataFrame API | Different export idiom |
| `CREATE MATERIALIZED VIEW` | `CREATE MATERIALIZED VIEW` (Databricks) or dbt incremental | Available in Databricks SQL |
| `@@project_id` | `current_catalog()` | Session context |
| `@@dataset_id` | `current_schema()` | Session context |
| Implicit comma cross-join `FROM a, b` | Explicit `CROSS JOIN` | Comma joins parse in Spark SQL, but rewriting them explicitly is safer |
Next steps¶
After completing this tutorial:
- Migrate remaining datasets. Apply the same pattern to additional BigQuery datasets, working through your inventory from Step 1.
- Set up Auto Loader for ongoing ingestion. If GCS still receives new data during the bridge period, configure Databricks Auto Loader to continuously ingest from OneLake shortcuts.
- Port Looker explores. See Tutorial: Convert Looker/LookML to Power BI for the semantic-model migration.
- Implement data contracts. Ship a `contract.yaml` per data product following the pattern in `domains/finance/data-products/invoices/contract.yaml`.
- Review the migration playbook. See GCP to Azure Migration Playbook for the full phased project plan.
Last updated: 2026-04-30 · Maintainers: CSA-in-a-Box core team · Related: Migration Playbook | Looker to Power BI Tutorial | Benchmarks | Best Practices