Home > Tutorials > Google BigQuery to Fabric Migration
🔵 Tutorial 44: Google BigQuery → Microsoft Fabric Migration¶
Last Updated: 2026-04-27 | Phase: 14 (Wave 4) | Multi-Cloud Migration Track Status: ✅ Final | Maintainer: Platform Team
Third-party references — publicly sourced, good-faith comparison
This page references non-Microsoft products and services. That information is drawn from each vendor's publicly available documentation and is offered for honest, good-faith comparison only. This is a personal project written from a Microsoft Fabric and Azure perspective; it does not claim expertise in, or authority over, any third-party product, and nothing here is an official statement by, or endorsed by, those vendors. Capabilities, pricing, and features change often — always verify against the vendor's current official documentation. Where a third-party offering is the stronger choice, we say so plainly.
🔵 Tutorial 44: Google BigQuery → Microsoft Fabric Migration¶
| Difficulty | ⭐⭐⭐ Advanced |
| Time | ⏱️ 210-300 minutes (longer than Synapse/Databricks due to cross-cloud egress) |
| Focus | Multi-cloud migration: BigQuery dataset, GCS, Dataflow (Apache Beam), Cloud Composer, Looker → Fabric Warehouse, OneLake (GCS shortcut), Fabric Pipelines, Fabric notebooks, Power BI |
📊 Progress Tracker¶
| Tutorial | Status |
|---|---|
| 00 — Environment Setup → 38 — DOJ Justice | ✅ Complete (Phases 1-13) |
| 41 — Synapse → Fabric | ✅ Complete |
| 42 — Databricks → Fabric | ✅ Complete |
| 43 — Redshift → Fabric | ✅ Complete |
| 44 — BigQuery → Fabric (this tutorial) | 🔵 YOU ARE HERE |
| 45 — On-Prem SSAS/SSIS/SSRS → Fabric | ⬜ |
| Navigation | |
|---|---|
| ⬅️ Previous | 43 — Redshift → Fabric |
| ➡️ Next | 45 — On-Prem SSAS/SSIS/SSRS → Fabric |
📖 Overview¶
Google BigQuery is a high-performance serverless data warehouse with notable strengths — fast ad-hoc scans, separation of storage and compute, and tight integration with the wider Google Cloud ecosystem. Some enterprises nonetheless choose to consolidate analytics on Microsoft Fabric to reduce multi-cloud sprawl, simplify BI delivery, or unify governance with the rest of their Microsoft estate. This tutorial is a practical guide for teams that have decided to make that move — not a claim that either platform is universally superior. It walks through migrating each BigQuery workload type into its Fabric equivalent — including the dialect translation, partitioning/clustering conversion, and the cross-cloud egress strategy that makes or breaks the project economics.
Comparing BigQuery and Fabric¶
The table below summarizes differences based on each vendor's publicly available documentation (as of this doc's date). It is descriptive rather than a verdict; BigQuery's serverless, per-query elasticity, for instance, is a genuine strength for spiky ad-hoc workloads.
| Concern | BigQuery Behavior | Fabric Behavior |
|---|---|---|
| Cloud surface | GCP-native (project, IAM, VPC-SC, billing) | Azure-native; integrates with M365, Entra, Purview |
| Compute model | Slots (on-demand or reserved), per-query billing — elastic for spiky workloads | CU pool — predictable fixed capacity, shared across workloads |
| Storage | BigQuery managed storage + GCS | OneLake unified — one copy, shortcut to GCS during transition |
| BI integration | Looker (separate license/tool) or other BI tools | Power BI Direct Lake — no copy, no refresh, included |
| Real-time | Pub/Sub + Dataflow + BigQuery streaming inserts | Eventstream + Eventhouse native |
| ML lifecycle | BigQuery ML inline + Vertex AI | Fabric AutoML + notebooks + Azure ML interop |
| Governance | Data Catalog + IAM + DLP (separate services) | Microsoft Purview unified across the estate |
| Egress economics | Cross-cloud reads incur egress fees from GCP | Once data is in OneLake, no further cross-cloud egress |
What This Tutorial Covers¶
- Inventory and complexity scoring of your BigQuery project(s)
- BigQuery Standard SQL → T-SQL dialect translation
- BigQuery EXPORT DATA → GCS → OneLake shortcut (egress-aware)
- Partitioning + clustering → Delta partitioning + Z-Order/V-Order
- Authorized views / row-level security → Fabric Warehouse RLS
- Standard SQL UDFs and JS UDFs → T-SQL functions or notebook UDFs
- Materialized views → Materialized Lake Views
- Dataflow (Apache Beam) → Fabric Pipelines or notebooks
- Cloud Composer (Airflow) → Fabric Pipelines or Apache Airflow Job (preview)
- BigQuery ML → Fabric AutoML / notebooks
- Looker → Power BI semantic models
- Slot consumption → CU sizing
- Three-tier validation (row count, hash, query parity)
- Coexistence with both engines reading the same OneLake/GCS data
- Cutover and decommission with egress minimization
🎯 Learning Objectives¶
By the end of this tutorial, you will be able to:
- Run a BigQuery
INFORMATION_SCHEMA+JOBS_BY_PROJECTinventory and produce a complexity score - Translate BigQuery Standard SQL to T-SQL (dialect parity table included)
- Use
EXPORT DATA OPTIONS(...)to land BigQuery contents into GCS as Parquet - Create OneLake shortcuts to GCS to avoid duplicating petabyte-scale data
- Convert BigQuery date / integer / ingestion-time partitioning to Delta partitioning
- Convert clustering keys to Delta Z-Order or V-Order layout
- Migrate Standard SQL UDFs to T-SQL functions; JS UDFs to notebook Python
- Recreate authorized views as Fabric Warehouse row-level security policies
- Convert Dataflow pipelines to Fabric Pipelines or PySpark notebooks
- Convert BigQuery ML models to Fabric AutoML models
- Repoint Looker semantic layer to Power BI Direct Lake
- Size F-SKU capacity from historical slot consumption
- Minimize GCP egress charges during the cutover window
🏗️ Reference Architecture¶
flowchart LR
subgraph BigQuery["🔵 Source: Google Cloud (GCP)"]
BQ[(BigQuery Dataset)]
GCS[(Google Cloud Storage)]
DF[Dataflow / Apache Beam]
CC[Cloud Composer Airflow]
LK[Looker]
BQML[BigQuery ML]
end
subgraph Fabric["⚡ Target: Microsoft Fabric"]
FW[(Fabric Warehouse)]
FL[(Fabric Lakehouse)]
OL[(OneLake)]
FPIPE[Fabric Pipelines]
FNB[Fabric Notebooks]
AML[Fabric AutoML]
PBI[Power BI Direct Lake]
end
subgraph Migration["🔄 Migration Layer"]
Assess[Assessment + Cost Model]
Translate[SQL Dialect Translator]
Export[EXPORT DATA → GCS]
Shortcut[OneLake → GCS Shortcut]
Validate[Validation Suite]
end
BQ --> Assess
DF --> Assess
CC --> Assess
Assess --> Translate
Translate --> FW
BQ -->|EXPORT DATA Parquet| Export
Export --> GCS
GCS -->|shortcut, no copy| Shortcut
Shortcut --> OL
OL --> FL
DF --> FPIPE
DF --> FNB
CC --> FPIPE
BQML --> AML
LK --> PBI
FW --> Validate
FL --> Validate
OL --> Validate Component Mapping (the canonical reference)¶
| BigQuery Component | Fabric Equivalent | Migration Approach |
|---|---|---|
| BigQuery dataset | Fabric Warehouse / Lakehouse | DDL conversion + EXPORT/COPY |
| Slot model (on-demand or reserved) | CU model (F-SKU) | Empirical sizing from query history |
| Date / integer / ingestion-time partitioning | Delta partitioning | Convert to PARTITIONED BY (col) |
| Clustering keys | Delta Z-Order / V-Order | Z-Order on Lakehouse; V-Order auto in Warehouse |
| Authorized views | Row-level security in Fabric | T-SQL CREATE SECURITY POLICY |
| Standard SQL | T-SQL | Dialect translator (see table below) |
| Standard SQL UDF | T-SQL UDF (CREATE FUNCTION) | Direct port for most |
| JavaScript UDF | Notebook Python UDF | Re-author logic in PySpark |
| Materialized views | Materialized Lake Views | Recreate in Lakehouse |
| Scripts / procedures | T-SQL procedures | Dialect translation + BEGIN/END blocks |
| Google Cloud Storage (GCS) | OneLake shortcut to GCS | No data movement during coexistence |
| BigQuery ML | Fabric AutoML / notebooks | Re-train; export coefficients where needed |
| Dataform | Fabric Notebooks + dbt | See dbt feature doc |
| Cloud Composer (Airflow) | Fabric Pipelines OR Apache Airflow Job (preview) | Pipelines for most; Airflow Job to keep DAGs |
| Looker | Power BI semantic models | LookML → Tabular model |
| Looker Studio dashboards | Power BI reports | Manual recreation |
| GCP IAM | Microsoft Entra ID | Map principals; reissue role assignments |
| Pub/Sub + streaming inserts | Eventstream + Eventhouse | Re-author streams |
| Sensitive data discovery (DLP) | Microsoft Purview | Recreate classifications |
📋 Prerequisites¶
- Tutorial 00 (Environment Setup) complete
- Tutorials 01-03 (Bronze/Silver/Gold medallion) understood
- Tutorial 12 (CI/CD with fabric-cicd)
- Tutorial 13 (Migration Planning) for governance overview
- Tutorial 41 (Synapse → Fabric) read once — this tutorial mirrors that flow
- Source: Existing GCP project with BigQuery and GCS read access
- Target: Fabric F64+ capacity provisioned in a region near your GCS region (to minimize cross-cloud latency and egress)
- Service Account on GCP with
roles/bigquery.dataViewer,roles/bigquery.jobUser,roles/storage.objectViewer - Microsoft Entra Service Principal with
Workspace Adminon the Fabric workspace -
gcloudCLI ≥ 470,bqCLI included - Azure CLI ≥ 2.60 with
fabricextension - PowerShell 7.x with
Az.Fabricmodule - Egress budget approved by FinOps — see GCP Egress section below
- Estimated 3-5 hours active time + multi-day coexistence period
⚠️ Gotcha: Cross-cloud egress charges from GCP to Azure can dwarf Fabric capacity costs if you do this wrong. Read the egress section before starting any data movement.
🚀 Step-by-Step¶
Step 1 — Assess Your BigQuery Project¶
Run the assessment script to inventory datasets, tables, query patterns, and slot consumption.
cd tutorials/44-bigquery-to-fabric/
python 01_assessment.py \
--gcp-project "my-analytics-prod" \
--regions "US,EU" \
--output-dir "./assessment-output"
The script issues these underlying queries (you can also run them in the BigQuery console):
-- Dataset and table inventory with size
SELECT
table_catalog AS project_id,
table_schema AS dataset_id,
table_name,
row_count,
size_bytes / POW(1024, 3) AS size_gb,
type AS table_type
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema NOT IN ('INFORMATION_SCHEMA');
-- Last 30 days of slot consumption per query
SELECT
user_email,
job_type,
statement_type,
TIMESTAMP_TRUNC(creation_time, DAY) AS day,
SUM(total_slot_ms) / 1000 / 60 / 60 AS slot_hours,
COUNT(*) AS query_count,
SUM(total_bytes_processed) / POW(1024, 4) AS tb_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND state = 'DONE'
GROUP BY 1, 2, 3, 4
ORDER BY day DESC, slot_hours DESC;
-- Routines (UDFs and procedures)
SELECT routine_catalog, routine_schema, routine_name, routine_type, language
FROM `my-analytics-prod`.region-us.INFORMATION_SCHEMA.ROUTINES;
-- Materialized views
SELECT * FROM `my-analytics-prod`.region-us.INFORMATION_SCHEMA.MATERIALIZED_VIEWS;
✅ Verification: Output directory contains: -
inventory.csv— every dataset, table, view, routine with size + last-modified -slot_consumption.csv— daily slot-hours per workload (used for F-SKU sizing) -complexity_scores.csv— per-table classification: Easy / Medium / Hard / Very Hard -udfs.json— all routines with body + language (SQL or JS) -unsupported_features.md— BigQuery features without a 1:1 Fabric equivalent💡 Tip: If you have multiple regions (US + EU), run the script per region. BigQuery
INFORMATION_SCHEMAis regional.
Step 2 — Plan Migration Waves¶
Use the assessment to plan migration waves. Migrate small reference tables first, fact tables last.
python 01_assessment.py --command wave-plan \
--inventory ./assessment-output/inventory.csv \
--max-wave-effort-days 10 \
--max-wave-egress-tb 5
The --max-wave-egress-tb flag is unique to BigQuery: each wave is also capped on egress so the GCP bill is predictable per pay period.
The tool produces migration-waves.md with: - Wave 1: Reference dimensions (small, no consumers) - Wave 2: Bronze tables (raw event tables, partition-aligned) - Wave 3: Silver / curated tables - Wave 4: Gold / fact tables (largest egress) - Wave 5: BI semantic models, Looker → Power BI rebinding - Wave 6: BigQuery ML retraining - Wave 7: Decommission BigQuery datasets
⚠️ Gotcha: BigQuery cross-region (e.g.,
UStoEU) charges its own egress within GCP. If you have multi-region datasets, copy them to a single region before exporting to Fabric.
Step 3 — Convert BigQuery Standard SQL → T-SQL¶
This is the core engineering effort. The translator handles ~80% mechanically; the remaining 20% needs manual review.
python 02_sql_translator.py \
--source-sql-folder ./bigquery-sql/ \
--output-tsql-folder ./fabric-tsql/ \
--dialect bigquery \
--target fabric-warehouse
See the BigQuery Standard SQL → T-SQL dialect table below for the full reference.
✅ Verification:
# Each *.sql file should have a matching *.sql in the output folder diff <(ls ./bigquery-sql/) <(ls ./fabric-tsql/)💡 Tip: After translation, run a syntax-only parse against Fabric Warehouse using
sqlcmdwith-Y 0. Catches the obvious errors before you waste compute on execution.
Step 4 — Export BigQuery → GCS¶
Use BigQuery's native EXPORT DATA to land Parquet files in GCS. Parquet is the optimal format for downstream Delta conversion.
-- Run in BigQuery console for each in-scope table
EXPORT DATA OPTIONS(
uri = 'gs://my-fabric-export-bucket/customers/part-*.parquet',
format = 'PARQUET',
compression = 'SNAPPY',
overwrite = true
) AS
SELECT * FROM `my-analytics-prod.warehouse.customers`;
For partitioned tables, export per partition to preserve the partitioning scheme:
-- Partitioned export (one file set per date)
EXPORT DATA OPTIONS(
uri = 'gs://my-fabric-export-bucket/events/dt=2026-04-01/part-*.parquet',
format = 'PARQUET',
compression = 'SNAPPY',
overwrite = true
) AS
SELECT * EXCEPT(_PARTITIONDATE)
FROM `my-analytics-prod.warehouse.events`
WHERE _PARTITIONDATE = DATE '2026-04-01';
A driver script automates this for all in-scope tables and date ranges:
python 03_export_to_gcs.py \
--gcp-project "my-analytics-prod" \
--bucket "my-fabric-export-bucket" \
--tables-yaml ./assessment-output/in-scope-tables.yaml \
--parallelism 16
⚠️ Gotcha: BigQuery
EXPORT DATAis a query job. It consumes slots and runtime. Schedule big exports during off-peak hours so you don't queue against business workloads.💡 Tip: Snappy Parquet compresses well and is the format Delta Lake uses natively. Don't export as CSV unless you have a hard reason — it bloats egress and loses types.
Step 5 — Bring GCS Data into OneLake (Shortcut, Don't Copy)¶
This is the single biggest egress-saving move in the entire migration. Shortcuts read in place — no data is copied across clouds during coexistence.
# Create a OneLake shortcut to a GCS bucket
az fabric onelake-shortcut create \
--workspace-id "$WS_ID" \
--lakehouse-id "$LH_ID" \
--name "bigquery-export" \
--path "Files/bigquery-export" \
--target-gcs-uri "gs://my-fabric-export-bucket/" \
--gcs-access-key-id "$GCS_KEY_ID" \
--gcs-secret-access-key "$GCS_SECRET"
Or via Bicep:
resource shortcutGcs 'Microsoft.Fabric/workspaces/items/shortcuts@2024-01-01' = {
name: 'bigquery-export'
properties: {
target: {
googleCloudStorage: {
bucket: 'my-fabric-export-bucket'
subpath: '/'
connectionId: connectionId
}
}
}
}
What this gives you: - Petabyte-scale data is not copied - GCS retains the source of truth during coexistence - Fabric reads the shortcut as if it were native OneLake - Bidirectional during coexistence: BigQuery and Fabric both read the same data - You only pay GCS egress when Fabric reads the data — and Fabric caches aggressively
✅ Verification: From a Fabric notebook:
df = spark.read.parquet("Files/bigquery-export/customers/") print(df.count()) # Should match BigQuery row count df.printSchema() # Should match BigQuery schema⚠️ Gotcha: GCS shortcuts require a connection with HMAC keys (or workload identity federation in preview). Plan a 1-hour spike to set this up before you script anything.
Step 6 — Load Fabric Warehouse via CTAS / COPY INTO¶
For Warehouse workloads (T-SQL analytics), load the GCS-backed shortcut data into managed Warehouse tables. This is the form most Looker → Power BI replacements need.
-- Option A: COPY INTO from OneLake shortcut
COPY INTO dbo.customers
FROM 'https://onelake.dfs.fabric.microsoft.com/<workspace>/<lakehouse>.Lakehouse/Files/bigquery-export/customers/'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Workspace Identity')
);
-- Option B: CTAS for transformation during load
CREATE TABLE dbo.customers_curated
AS
SELECT
customer_id,
TRY_CAST(signup_ts AS datetime2(6)) AS signup_ts,
UPPER(TRIM(country_code)) AS country_code,
-- BigQuery returned numeric as STRING; cast to DECIMAL
TRY_CAST(lifetime_value AS decimal(18,2)) AS lifetime_value
FROM OPENROWSET(
BULK 'https://onelake.dfs.fabric.microsoft.com/<workspace>/<lakehouse>.Lakehouse/Files/bigquery-export/customers/',
FORMAT = 'PARQUET'
) AS src;
💡 Tip: Lean on CTAS rather than INSERT for first-load performance. CTAS is metadata-bulk; INSERT is row-bulk.
Step 7 — Migrate UDFs (Standard SQL UDFs → T-SQL; JS UDFs → Notebook Functions)¶
BigQuery supports two UDF flavors. Each has a different Fabric path.
Standard SQL UDFs → T-SQL CREATE FUNCTION¶
-- BigQuery
CREATE FUNCTION analytics.fmt_phone(s STRING) AS (
REGEXP_REPLACE(s, r'[^0-9]', '')
);
-- Fabric Warehouse T-SQL
CREATE FUNCTION analytics.fmt_phone(@s VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
RETURN (
SELECT STRING_AGG(c, '') WITHIN GROUP (ORDER BY n)
FROM (
SELECT n, SUBSTRING(@s, n, 1) AS c
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) AS v(n)
) x
WHERE c LIKE '[0-9]'
);
END;
💡 Tip: Use the simpler T-SQL
TRANSLATEor string manipulation when feasible — avoid scalar UDFs in hot paths because they're row-iterative.
JavaScript UDFs → Notebook Python UDFs¶
JS UDFs in BigQuery do not translate to T-SQL. Re-author them as PySpark UDFs in a Fabric notebook.
# notebooks/silver/migrated_js_udfs.py
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import re
@udf(returnType=StringType())
def normalize_email(email: str) -> str:
if not email:
return None
e = email.strip().lower()
if not re.match(r'^[\w\.-]+@[\w\.-]+\.\w+$', e):
return None
return e
silver_df = bronze_df.withColumn("email_normalized", normalize_email("email_raw"))
silver_df.write.format("delta").mode("overwrite").save("Tables/silver_customers")
⚠️ Gotcha: BigQuery JS UDFs run in a sandboxed V8 with limited libraries. PySpark UDFs run in Python with the full ecosystem. Don't blindly port — many BigQuery JS UDFs exist because SQL was clumsy. In Spark, you may not need the UDF at all.
Step 8 — Convert Partitioning + Clustering → Delta Partitioning + Z-Order¶
BigQuery's partitioning + clustering map cleanly to Delta partitioning + Z-Order.
| BigQuery Feature | Fabric Equivalent | Notes |
|---|---|---|
Date partitioning (PARTITION BY DATE(ts)) | Delta PARTITIONED BY (dt DATE) | Add a derived dt column at ingest |
| Integer-range partitioning | Delta PARTITIONED BY (bucket INT) | Compute the bucket at ingest |
Ingestion-time partitioning (_PARTITIONDATE) | Delta PARTITIONED BY (ingestion_date) | Capture current_date() at ingest |
| Clustering keys (1-4 columns) | Z-Order on Lakehouse / V-Order on Warehouse | Different mechanisms — see below |
# PySpark: convert partitioning + clustering at write time
events_df.write \
.format("delta") \
.partitionBy("dt") \
.save("Tables/silver_events")
# Optimize with Z-Order on what BigQuery had as cluster keys
spark.sql("""
OPTIMIZE silver_events
ZORDER BY (customer_id, product_category)
""")
💡 Tip: V-Order is automatic in Fabric Warehouse and Lakehouse — you don't run
OPTIMIZE V-ORDER. Z-Order, however, is explicit. Use Z-Order on the columns BigQuery had as cluster keys.⚠️ Gotcha: BigQuery allows up to 4,000 partitions per table; Delta is far more permissive but avoid over-partitioning (rule of thumb: each partition should hold at least 1 GB of data).
Step 9 — Convert Dataflow Pipelines → Fabric Pipelines or Notebook¶
Dataflow pipelines (Apache Beam Java/Python) translate to one of two Fabric patterns:
| Dataflow Pattern | Fabric Pattern | When to Use |
|---|---|---|
| Batch Beam pipeline (Parquet/CSV → BQ) | Fabric Pipeline (Copy + Notebook) | Standard ELT |
| Streaming Beam (Pub/Sub → BQ) | Eventstream + Eventhouse | Real-time |
| Complex transforms (windowing, side inputs) | PySpark notebook | Logic-heavy |
| Cloud Composer (Airflow DAG) | Fabric Pipeline OR Apache Airflow Job (preview) | Keep DAG syntax, swap engine |
# Run the Dataflow → Fabric Pipeline converter
python 04_dataflow_to_fabric.py \
--source-dag-folder ./airflow-dags/ \
--target-workspace "$FABRIC_WS_ID" \
--output-dir ./fabric-pipelines/
💡 Tip: For Cloud Composer DAGs you'd rather not rewrite, the Apache Airflow Job (preview) lets you import DAGs almost as-is. Good for big DAG portfolios.
Step 10 — Convert BigQuery ML Models → Fabric AutoML or Notebooks¶
BigQuery ML models live inside the warehouse. They don't migrate directly — you re-train on the migrated dataset.
| BQML Model | Fabric Path |
|---|---|
| Linear / logistic regression | Fabric AutoML or MLlib |
Boosted trees (BOOSTED_TREE_*) | Fabric AutoML or XGBoost in notebook |
DNN_* | Notebook with PyTorch/TensorFlow |
TIME_SERIES_* (ARIMA_PLUS) | Notebook with prophet/statsforecast |
KMEANS | pyspark.ml.clustering.KMeans |
MATRIX_FACTORIZATION | pyspark.ml.recommendation.ALS |
AUTOML_* | Fabric AutoML (closest equivalent) |
| Imported TensorFlow/ONNX models | Notebook + ONNX runtime |
# Example: BQML logistic regression → Fabric AutoML
from synapse.ml.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression
train_df = spark.read.format("delta").load("Tables/silver_churn_features")
model = TrainClassifier(model=LogisticRegression(), labelCol="churned").fit(train_df)
model.write().overwrite().save("Models/churn_model_v1")
Step 11 — Convert Looker → Power BI Semantic Models¶
LookML is conceptually similar to a Tabular semantic model: views ≈ tables, measures ≈ measures, explores ≈ relationships. The conversion is largely manual.
| LookML Construct | Power BI Equivalent |
|---|---|
view: customers { sql_table_name: ... } | Direct Lake table over Fabric Warehouse |
dimension: signup_date { type: date } | Column with date data type |
measure: lifetime_value { type: sum sql: ${TABLE}.ltv } | DAX measure: SUM(...) |
explore: orders { join: customers { ... } } | Model relationships |
| LookML access filters | RLS roles in Power BI |
| Looker dashboards | Power BI reports |
| Looker scheduled emails | Power BI subscriptions |
💡 Tip: If you have many Looker dashboards, prioritize converting the top-20 by usage first. Use Looker's
i__looker.historyto find them.
Step 12 — Size Fabric Capacity from Slot Consumption¶
python 01_assessment.py --command capacity-recommendation \
--slot-history ./assessment-output/slot_consumption.csv \
--reservation-slots 2000 \
--target-utilization 0.65
See the Slot → CU sizing section below for the rationale.
Step 13 — Validate Migration¶
Same three-tier framework used in Tutorial 41 (Synapse → Fabric).
python 05_validation_suite.py \
--source-bq-project "my-analytics-prod" \
--target-warehouse "$FAB_WH_ID" \
--tables-yaml ./tables-to-validate.yaml \
--output-report ./validation-report.html
Tier 1: ROW COUNT ─────▶ count(*) match per table
Tier 2: CONTENT HASH ─────▶ sha256 of canonicalized rows match
Tier 3: QUERY PARITY ─────▶ representative aggregations match within 0.001%
⚠️ Gotcha: BigQuery hashes (
FARM_FINGERPRINT,MD5) are not portable to T-SQL. Use canonicalized concat-and-SHA256 in both engines for parity.✅ Verification: All three tiers pass for ≥ 95% of in-scope tables before declaring migration complete.
Step 14 — Coexistence (Run Both for 2-4 Weeks)¶
During coexistence: - Both BigQuery and Fabric read the same GCS data via OneLake shortcut (zero duplication) - Writes can target both, but BigQuery is typically read-only during this phase - BI consumers move to Power BI in waves, with Looker as a fallback - Daily validation report confirms parity - An egress dashboard tracks GCS → Azure data movement against budget
💡 Tip: If your egress budget is tight, configure Fabric Spark to cache hot tables aggressively (
spark.databricks.io.cache.enabled = true-equivalent in Fabric) — once a table is read once it stays in capacity-local cache.
Step 15 — Cutover and Decommission¶
When validation has been green for 7 consecutive days:
- Freeze writes to BigQuery — pause Dataflow jobs, mark datasets read-only
- Final delta sync — capture last incremental changes via
EXPORT DATAof records changed in last 24 hours - Materialize OneLake — replace shortcuts with managed Delta tables (one-time copy, but ends GCS egress permanently)
- Switch BI — repoint Power BI workspace; archive Looker
- Switch consumers — apps, APIs, integrations
- Monitor for 48 hours — close on-call bridge after no incidents
- Decommission BigQuery — see below
# Materialize a shortcut into managed OneLake (ends GCS egress)
python 06_materialize_shortcut.py \
--shortcut-path "Files/bigquery-export/customers" \
--target-table "Tables/customers" \
--partition-by "dt" \
--zorder "customer_id"
# After 30-day grace period with all validation green:
# 1. Pause BigQuery dataset (no compute charges; storage only)
bq update --description "Decommissioned 2026-04-30" my-analytics-prod:warehouse
# 2. Backup metadata for audit
python 07_backup_bq_metadata.py --project my-analytics-prod --output ./bq-backup/
# 3. After 90 more days, delete dataset (data is in OneLake, not lost)
bq rm -r -f my-analytics-prod:warehouse
# 4. After 180 days total, retire the GCS bucket if no other workloads use it
gsutil rm -r gs://my-fabric-export-bucket/
⚠️ Gotcha: Don't delete the GCS bucket until after you've materialized any OneLake shortcuts that point to it. A live shortcut to a deleted bucket fails reads silently.
🌐 Special Section: BigQuery Standard SQL → T-SQL Dialect Translation¶
This is the most useful single table in the tutorial. It's the dialect cheat-sheet you'll come back to dozens of times during translation.
Casting & Conversion¶
| BigQuery | T-SQL (Fabric Warehouse) | Notes |
|---|---|---|
CAST(x AS INT64) | CAST(x AS BIGINT) | INT64 = BIGINT |
SAFE_CAST(x AS INT64) | TRY_CAST(x AS BIGINT) | Returns NULL on failure |
CAST(x AS FLOAT64) | CAST(x AS FLOAT) | |
CAST(x AS NUMERIC) | CAST(x AS DECIMAL(38,9)) | NUMERIC = DECIMAL(38,9) |
CAST(x AS BIGNUMERIC) | CAST(x AS DECIMAL(38,38)) | DECIMAL precision capped at 38 |
CAST(x AS STRING) | CAST(x AS VARCHAR(MAX)) | UTF-8 native in Fabric |
CAST(x AS BYTES) | CAST(x AS VARBINARY(MAX)) | |
CAST(x AS BOOL) | CAST(x AS BIT) | |
PARSE_DATE('%Y-%m-%d', s) | TRY_PARSE(s AS DATE USING 'en-US') or CONVERT(DATE, s, 23) | Use CONVERT for ISO dates |
PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', s) | TRY_CAST(s AS DATETIME2(6)) | ISO formats parse natively |
FORMAT_DATE('%Y-%m', d) | FORMAT(d, 'yyyy-MM') | |
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', t) | FORMAT(t, 'yyyy-MM-dd HH:mm:ss') |
Date / Time Functions¶
| BigQuery | T-SQL | Notes |
|---|---|---|
CURRENT_DATE() | CAST(GETDATE() AS DATE) or CAST(SYSUTCDATETIME() AS DATE) | Fabric is UTC by default |
CURRENT_TIMESTAMP() | SYSUTCDATETIME() | |
DATE_ADD(d, INTERVAL 7 DAY) | DATEADD(DAY, 7, d) | |
DATE_DIFF(d1, d2, DAY) | DATEDIFF(DAY, d2, d1) | Argument order flipped |
DATE_TRUNC(d, MONTH) | DATEFROMPARTS(YEAR(d), MONTH(d), 1) | No direct DATE_TRUNC; use DATETRUNC in Fabric |
TIMESTAMP_TRUNC(t, HOUR) | DATETRUNC(HOUR, t) | Fabric supports DATETRUNC |
EXTRACT(DAYOFWEEK FROM d) | DATEPART(WEEKDAY, d) | DOW numbering may differ |
LAST_DAY(d, MONTH) | EOMONTH(d) | |
UNIX_SECONDS(t) | DATEDIFF_BIG(SECOND, '1970-01-01', t) |
String Functions¶
| BigQuery | T-SQL | Notes |
|---|---|---|
CONCAT(a, b, c) | CONCAT(a, b, c) | Direct |
STRING_AGG(x, ',') | STRING_AGG(x, ',') | Direct |
ARRAY_TO_STRING(arr, ',') | STRING_AGG(value, ',') over OPENJSON | T-SQL has no native array |
SPLIT(s, ',') | STRING_SPLIT(s, ',') | Returns table |
REGEXP_CONTAINS(s, r'^abc') | s LIKE 'abc%' or use regex CLR | T-SQL has limited regex |
REGEXP_EXTRACT(s, r'(\d+)') | Custom CLR or notebook UDF | No native regex extract |
REGEXP_REPLACE(s, r'\d+', 'X') | Custom CLR or notebook UDF | No native; preview function exists |
LENGTH(s) | LEN(s) | LEN trims trailing spaces — gotcha! |
LOWER(s) / UPPER(s) | LOWER(s) / UPPER(s) | Direct |
TRIM(s) | TRIM(s) | Direct |
LPAD(s, 10, '0') | RIGHT(REPLICATE('0', 10) + s, 10) | No LPAD; emulate |
STARTS_WITH(s, 'foo') | s LIKE 'foo%' | |
ENDS_WITH(s, 'bar') | s LIKE '%bar' |
Numeric / Aggregate Functions¶
| BigQuery | T-SQL | Notes |
|---|---|---|
IF(cond, a, b) | IIF(cond, a, b) | |
IFNULL(x, y) | ISNULL(x, y) or COALESCE(x, y) | |
NULLIF(x, y) | NULLIF(x, y) | Direct |
COALESCE(...) | COALESCE(...) | Direct |
SAFE_DIVIDE(a, b) | CASE WHEN b = 0 THEN NULL ELSE a / b END | No SAFE_ namespace |
MOD(x, y) | x % y | |
APPROX_COUNT_DISTINCT(x) | APPROX_COUNT_DISTINCT(x) | T-SQL has it (HyperLogLog) |
APPROX_QUANTILES(x, 100) | PERCENTILE_CONT(...) | Approx not exact-equivalent |
ANY_VALUE(x) | MIN(x) or FIRST_VALUE(x) | |
COUNTIF(cond) | SUM(CASE WHEN cond THEN 1 ELSE 0 END) |
STRUCT and ARRAY (the hardest part)¶
T-SQL has no native STRUCT or ARRAY type. You have three options.
| BigQuery Pattern | Fabric Option |
|---|---|
STRUCT(a, b, c) (composite column) | (1) Flatten to scalar columns, or (2) Store as JSON VARCHAR(MAX) and use JSON_VALUE/OPENJSON |
ARRAY<INT64> column | Store as JSON array string and use OPENJSON to unnest |
UNNEST(arr) | CROSS APPLY OPENJSON(arr_json) WITH (value INT '$') |
| Nested repeated record | Move to a separate child table (relational normalization) |
-- BigQuery
SELECT customer_id, item.product, item.qty
FROM orders, UNNEST(items) AS item;
-- T-SQL (assuming items stored as JSON)
SELECT o.customer_id, item.product, item.qty
FROM orders AS o
CROSS APPLY OPENJSON(o.items) WITH (
product VARCHAR(100) '$.product',
qty INT '$.qty'
) AS item;
💡 Tip: For very nested data, use the Lakehouse path (PySpark + Delta) instead of Warehouse. Spark handles complex types natively. Warehouse is best for already-flattened relational data.
Window Functions¶
Most window functions translate directly. Watch for:
| BigQuery | T-SQL | Notes |
|---|---|---|
OVER (PARTITION BY x ORDER BY y ROWS BETWEEN ...) | OVER (PARTITION BY x ORDER BY y ROWS BETWEEN ...) | Direct |
LAG(x, 1, 0) OVER (...) | LAG(x, 1, 0) OVER (...) | Direct |
LEAD(x, 1, 0) OVER (...) | LEAD(x, 1, 0) OVER (...) | Direct |
QUALIFY row_number() OVER (...) = 1 | Wrap in CTE + WHERE | T-SQL has no QUALIFY |
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW | RANGE BETWEEN ... ROWS only | T-SQL window has limited RANGE |
-- BigQuery with QUALIFY
SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) = 1;
-- T-SQL equivalent
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM events
)
SELECT * FROM ranked WHERE rn = 1;
💸 Special Section: GCP Network Egress Costs¶
Egress fees from GCS to Azure are real money. A naïve migration on a multi-TB warehouse will spike your GCP bill by tens of thousands of dollars. Plan this carefully.
Egress Pricing (GCS → Internet, as of 2026)¶
| Tier | Price |
|---|---|
| First 1 TB / month | ~\(0.12/GB (\)120/TB) |
| 1-10 TB / month | ~$0.11/GB |
| 10-100 TB / month | ~$0.08/GB |
| Egress to non-GCP, same continent | typically lower if Azure region is in same continent |
💡 GCP publishes a Network Service Tiers page; FinOps should pull current prices for your region pair before signing off.
Minimization Strategies (in priority order)¶
| Strategy | Egress Saved | Effort |
|---|---|---|
| Use OneLake shortcuts during coexistence | ~70% — no copy until cutover; reads pull what's needed | Low |
| Compress with Snappy Parquet | ~5-10x vs CSV | Low |
Filter columns at export (SELECT only-needed) | proportional to width | Low |
| Filter rows at export (date range, dedupe) | proportional to depth | Medium |
| Co-locate Azure region with GCP region | ~20-40% (egress tier) | Medium |
| Schedule big exports during GCP discount windows | varies | Low |
| Use a Cloud Storage Transfer Service to Azure Blob | sometimes 0 egress for committed-use | High |
| Don't materialize shortcut data — read-in-place forever | continuous savings | Low (but ties you to GCP indefinitely) |
Egress Budget Worksheet¶
Run this against the assessment data before kicking off any export:
python 01_assessment.py --command egress-estimate \
--inventory ./assessment-output/inventory.csv \
--target-region eastus \
--strategy compressed-parquet
Output:
=== Egress Estimate ===
In-scope dataset size: 18.4 TB (Snappy Parquet)
Estimated egress on full copy: $1,840 - $2,210 (one-time)
With shortcuts during coexist: $185 (read-on-demand only)
Recommended budget: $2,500 (includes 20% buffer)
⚠️ Gotcha: Egress is per project. If your data spans multiple GCP projects, sum across all of them.
📏 Special Section: Slot Consumption → CU Sizing¶
BigQuery uses slots; Fabric uses Capacity Units (CUs). They're fundamentally different — but you can use historical slot consumption to make a defensible starting recommendation, then validate empirically during coexistence.
Conceptual Mapping¶
| BigQuery Concept | Fabric Concept |
|---|---|
| Slot (a unit of CPU + memory) | CU (a unit of capacity across all workloads) |
| 100 slots ≈ 1 vCPU equivalent (rough) | F-SKU = ~2 CUs per F-unit |
| On-demand pricing (per-byte processed) | Capacity-fixed pricing (predictable) |
| Reservation (committed slots) | F-SKU reservation |
| Slots are warehouse-only | CUs are pooled across Spark, Warehouse, BI, RTI |
Starting Recommendation (Validate Empirically)¶
| BigQuery Slot Reservation | Approximate Fabric F-SKU | Rationale |
|---|---|---|
| 100 slots | F8 | Light analytical workload |
| 500 slots | F16-F32 | Mid-tier; depends on BI footprint |
| 1,000 slots | F32-F64 | Standard enterprise warehouse |
| 2,000 slots | F64 | Default starting point for most BQ migrations |
| 5,000 slots | F128 | Large enterprise |
| 10,000 slots | F256 | Multi-team analytics platform |
| 20,000 slots | F512 | Hyperscale |
⚠️ Gotcha: Unlike BigQuery, Fabric CUs are shared across all workloads (Spark, Warehouse, Power BI, Eventhouse, RTI). If your migration brings BI consolidation onto Fabric, size higher than the slot equivalent suggests.
Empirical Sizing — The Right Way¶
# During coexistence, monitor CU usage
python 08_capacity_telemetry.py \
--workspace "$FABRIC_WS_ID" \
--window-days 7 \
--output ./capacity-telemetry.csv
Read the Capacity Planning best practices doc for the full framework.
💡 Tip: Fabric's capacity metrics app is free and shows hour-by-hour CU usage. After the first week, you'll know whether your initial F-SKU pick was right.
✅ Verification (Final Checklist)¶
- All in-scope BigQuery tables have Fabric Warehouse / Lakehouse equivalents
- Three-tier validation (count, hash, query parity) green for 7+ consecutive days
- OneLake shortcuts to GCS working at performance parity
- All Standard SQL UDFs converted to T-SQL functions
- All JS UDFs re-authored as PySpark UDFs in notebooks
- All authorized views recreated as RLS policies
- Materialized views recreated as Materialized Lake Views
- Dataflow pipelines converted to Fabric Pipelines or notebooks
- BigQuery ML models retrained on Fabric AutoML
- Looker dashboards rebuilt as Power BI reports
- Power BI semantic models on Direct Lake
- Apps, APIs, integrations cut over
- F-SKU sized correctly (CU usage in steady-state band)
- Egress spend matches FinOps budget
- BigQuery datasets paused (or deleted post-grace-period)
- GCS bucket retained for grace period or retired with retention copy
- Cutover postmortem published
🧹 Cleanup¶
This tutorial creates assessment artifacts in ./assessment-output/, translation artifacts in ./fabric-tsql/, and pipeline artifacts in ./fabric-pipelines/. Retain for audit; archive to long-term storage.
If your assessment ran against a real BigQuery project, the read-only INFORMATION_SCHEMA queries leave no trace. No rollback needed for assessment.
If EXPORT DATA was run and you want to abandon the migration, delete the GCS export bucket (gsutil rm -r gs://my-fabric-export-bucket/). No source data is altered by export.
🚦 Next Steps¶
- Tutorial 45 — On-Prem SSAS/SSIS/SSRS → Fabric — for legacy SQL Server BI stacks
- Tutorial 41 — Synapse → Fabric — sister migration playbook (Microsoft-to-Microsoft)
- Tutorial 42 — Databricks → Fabric — for shops with Databricks alongside BigQuery
- Tutorial 43 — Redshift → Fabric — sister AWS-to-Azure playbook
- Migration Patterns Best Practices — cross-source patterns
- fabric-cicd Deployment — automate the deployment of migrated artifacts
- Capacity Planning — validate F-SKU choice with telemetry
- dbt Fabric Integration — replacement for Dataform
🛠️ Troubleshooting¶
| Issue | Cause | Resolution |
|---|---|---|
| Egress spend exceeds budget by 5x | Cross-region GCS reads or full materialization | Switch to shortcuts; co-locate region; filter columns |
OPENROWSET fails on GCS shortcut | Connection IAM lacks storage.objects.get | Grant roles/storage.objectViewer to the connection SA |
| Row counts off by exactly partition size | Partition not exported (_PARTITIONDATE filter dropped a day) | Re-export missing partition; verify EXPORT DATA predicate |
STRUCT column lost on import | Parquet logical type not mapped | Read into Lakehouse with PySpark; flatten before Warehouse load |
ARRAY column unreadable in T-SQL | T-SQL has no native array | Land as JSON VARCHAR(MAX); use OPENJSON |
| Timestamps off by 4-8 hours | BigQuery TIMESTAMP is UTC; T-SQL DATETIME2 has no TZ | Cast at ingest; or use DATETIMEOFFSET consistently |
PARSE_DATE translations failing | Locale-dependent format strings | Use CONVERT(DATE, s, <style>) with explicit ISO style |
QUALIFY translations missed | T-SQL has no QUALIFY clause | Wrap in CTE + WHERE rn = 1 |
| Shortcut returns 0 rows after working yesterday | GCS HMAC key rotated or expired | Refresh credentials; recreate connection |
| BigQuery ML model metrics differ from Fabric AutoML | Different algorithms, different feature engineering defaults | Validate against business KPIs, not ML metrics |
| F-SKU throttling after migration | Concurrent BI + Spark + Warehouse contention | Scale up; review capacity throttling runbook |
| Looker dashboard expressions don't translate | LookML PDTs not in Fabric | Recreate PDT logic as Lakehouse table or DAX measure |
| Cross-cloud query latency unacceptable | Shortcut reads pulling fresh from GCS each query | Materialize hot tables to managed Delta; rely on V-Order |
| Time zone columns off by hours | BigQuery TIMESTAMP is always UTC; T-SQL DATETIME2 is naive | Standardize on DATETIMEOFFSET or always-UTC convention |
📁 Key Files Referenced¶
| Step | File |
|---|---|
| 1, 2, 12 | tutorials/44-bigquery-to-fabric/01_assessment.py |
| 3 | tutorials/44-bigquery-to-fabric/02_sql_translator.py |
| 4 | tutorials/44-bigquery-to-fabric/03_export_to_gcs.py |
| 9 | tutorials/44-bigquery-to-fabric/04_dataflow_to_fabric.py |
| 13 | tutorials/44-bigquery-to-fabric/05_validation_suite.py |
| 15 | tutorials/44-bigquery-to-fabric/06_materialize_shortcut.py |
| 15 | tutorials/44-bigquery-to-fabric/07_backup_bq_metadata.py |
| 12 | tutorials/44-bigquery-to-fabric/08_capacity_telemetry.py |
📚 References¶
Google Cloud Documentation¶
- BigQuery
INFORMATION_SCHEMAreference - BigQuery
EXPORT DATAstatement - BigQuery Standard SQL reference
- GCS HMAC keys for cross-cloud access
- GCP Network Service Tiers
Microsoft Fabric Documentation¶
- BigQuery to Fabric migration overview
- Fabric Warehouse T-SQL surface area
- OneLake shortcuts (Google Cloud Storage)
- Fabric Pipelines
- Apache Airflow Job (preview)
- Materialized Lake Views
- Fabric AutoML
Related Tutorials & Docs¶
- Tutorial 41 — Synapse → Fabric (Wave 4 anchor)
- Tutorial 42 — Databricks → Fabric (Wave 4)
- Tutorial 43 — Redshift → Fabric (Wave 4)
- Tutorial 24 — Snowflake → Fabric (existing)
- Migration Patterns
- fabric-cicd Deployment
- Capacity Planning
- dbt Fabric Integration