Storage Migration: GCS to ADLS Gen2 and OneLake¶
A hands-on guide for data engineers migrating Google Cloud Storage buckets and BigQuery managed storage to Azure Data Lake Storage Gen2 and OneLake.
Scope¶
This guide covers:
- GCS bucket migration to ADLS Gen2 containers
- BigQuery managed storage export to Delta Lake
- Object lifecycle policy translation
- Bridge patterns using OneLake shortcuts
- Worked examples with CLI commands
For BigQuery compute migration (SQL, ML, scheduling), see Compute Migration.
Architecture overview¶
flowchart LR
subgraph GCP["GCP Storage"]
GCS[GCS Buckets]
BQMS[BigQuery Managed Storage]
end
subgraph Bridge["Migration Bridge"]
OLS[OneLake Shortcuts to GCS]
ADF[ADF Copy Activity]
AZC[AzCopy]
end
subgraph Azure["Azure Storage"]
ADLS[ADLS Gen2 Containers]
OL[OneLake / Lakehouse]
DL[Delta Lake Tables]
end
GCS --> OLS --> OL
GCS --> AZC --> ADLS
GCS --> ADF --> ADLS
BQMS --> ADF --> DL
ADLS --> OL
GCS buckets to ADLS Gen2 containers¶
Conceptual mapping¶
| GCS concept | ADLS Gen2 equivalent | Notes |
|---|---|---|
| Project | Subscription + Resource Group | Organizational container |
| Bucket | Storage Account + Container | ADLS uses hierarchical namespace |
| Object | Blob (with directory structure) | ADLS supports true directories |
| Object prefix (pseudo-folder) | Directory | True directory operations on ADLS |
| Storage class (Standard) | Hot tier | Default access tier |
| Storage class (Nearline) | Cool tier | 30-day minimum |
| Storage class (Coldline) | Cold tier | 90-day minimum (Cold is a newer Azure tier than Cool) |
| Storage class (Archive) | Archive tier | Offline retrieval |
| Signed URL | SAS token | Time-limited authenticated access |
| IAM binding | Azure RBAC assignment | Storage Blob Data Reader/Contributor |
| Service account | Managed Identity | No credential management needed |
Naming conventions¶
GCS bucket names are globally unique. ADLS container names are scoped to the storage account. Translate as follows:
GCS: gs://acme-gov-analytics-raw/
ADLS: https://stacmegov.dfs.core.windows.net/raw/
GCS: gs://acme-gov-analytics-curated/
ADLS: https://stacmegov.dfs.core.windows.net/curated/
GCS: gs://acme-gov-analytics-archive/
ADLS: https://stacmegov.dfs.core.windows.net/archive/
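When rewriting pipeline configuration, a small helper makes the URI translation mechanical. The sketch below is illustrative: BUCKET_MAP and the account/container names simply mirror the examples above and should be adjusted to your environment.
# Minimal sketch of a gs:// to abfss:// translation helper (hypothetical mapping).
BUCKET_MAP = {
    "acme-gov-analytics-raw": ("stacmegov", "raw"),
    "acme-gov-analytics-curated": ("stacmegov", "curated"),
    "acme-gov-analytics-archive": ("stacmegov", "archive"),
}

def gcs_to_abfss(gcs_uri: str) -> str:
    """Translate gs://bucket/path to abfss://container@account.dfs.core.windows.net/path."""
    bucket, _, path = gcs_uri.removeprefix("gs://").partition("/")
    account, container = BUCKET_MAP[bucket]
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

print(gcs_to_abfss("gs://acme-gov-analytics-raw/finance/2024/sales.parquet"))
# abfss://raw@stacmegov.dfs.core.windows.net/finance/2024/sales.parquet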
Migration method 1: AzCopy (recommended for bulk transfer)¶
AzCopy supports direct GCS-to-ADLS copy. It authenticates to GCS with a service account key file and to the ADLS destination with a SAS token (the HMAC interoperability keys used by ADF are covered in method 2).
Step 1: Create a service account key
# Create and download a JSON key for the migration service account
gcloud iam service-accounts keys create /path/to/service-account.json \
  --iam-account=sa-migration@acme-gov.iam.gserviceaccount.com
Step 2: Set environment variables
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
export GOOGLE_CLOUD_PROJECT=acme-gov
Step 3: Run AzCopy
# Copy entire bucket to ADLS container
azcopy copy \
"https://storage.cloud.google.com/acme-gov-analytics-raw/" \
"https://stacmegov.blob.core.windows.net/raw/?<SAS-TOKEN>" \
--recursive=true \
--s2s-preserve-properties=false \
--include-pattern "*.parquet;*.csv;*.json" \
--log-level=INFO
# Verify transfer
azcopy list "https://stacmegov.blob.core.windows.net/raw/?<SAS-TOKEN>" --machine-readable
Step 4: Validate file counts and sizes
# GCS side
gsutil du -s gs://acme-gov-analytics-raw/
gsutil ls -l gs://acme-gov-analytics-raw/ | wc -l
# ADLS side
az storage blob list \
--account-name stacmegov \
--container-name raw \
--auth-mode login \
--query "[].{name:name, size:properties.contentLength}" \
--output table | wc -l
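If you prefer a scripted comparison, the sketch below tallies object counts and total bytes on both sides with the google-cloud-storage and azure-storage-blob SDKs. The bucket, account, and container names follow the examples above; it assumes Application Default Credentials on the GCS side and an Azure identity with Storage Blob Data Reader on the ADLS side.
# Compare object count and total size between a GCS bucket and an ADLS container.
# Assumes google-cloud-storage, azure-storage-blob, and azure-identity are installed.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient
from google.cloud import storage

gcs_client = storage.Client(project="acme-gov")
gcs_count = gcs_size = 0
for blob in gcs_client.list_blobs("acme-gov-analytics-raw"):
    gcs_count += 1
    gcs_size += blob.size or 0

adls = ContainerClient(
    account_url="https://stacmegov.blob.core.windows.net",
    container_name="raw",
    credential=DefaultAzureCredential(),
)
adls_count = adls_size = 0
for blob in adls.list_blobs():
    adls_count += 1
    adls_size += blob.size or 0

print(f"GCS : {gcs_count} objects, {gcs_size} bytes")
print(f"ADLS: {adls_count} objects, {adls_size} bytes")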
Migration method 2: ADF Copy Activity (recommended for ongoing sync)¶
ADF provides a managed copy pipeline with monitoring, retry, and scheduling. Its GCS connector authenticates with GCS interoperability (HMAC) keys rather than a service account key file.
Step 1: Generate GCS HMAC keys
# In GCP Console or via gcloud
gcloud storage hmac create sa-migration@acme-gov.iam.gserviceaccount.com
# Note the Access Key and Secret; store the secret in Key Vault as gcs-hmac-secret
Step 2: Create a GCS linked service in ADF
{
"name": "ls_gcs_source",
"type": "GoogleCloudStorage",
"typeProperties": {
"accessKeyId": "<HMAC-ACCESS-KEY>",
"secretAccessKey": {
"type": "AzureKeyVaultSecret",
"store": { "referenceName": "ls_keyvault" },
"secretName": "gcs-hmac-secret"
}
}
}
Step 3: Create copy pipeline
{
"name": "pl_gcs_to_adls",
"activities": [
{
"name": "CopyFromGCS",
"type": "Copy",
"inputs": [{ "referenceName": "ds_gcs_parquet" }],
"outputs": [{ "referenceName": "ds_adls_parquet" }],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings": {
"type": "GoogleCloudStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "*",
"wildcardFileName": "*.parquet"
}
},
"sink": {
"type": "ParquetSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
}
}
}
]
}
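Once the pipeline is published, it can be triggered and polled programmatically. A minimal sketch with the azure-mgmt-datafactory SDK follows; the subscription ID, resource group, and factory names are placeholders.
# Trigger pl_gcs_to_adls and poll until it finishes (placeholder names).
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run = adf.pipelines.create_run("rg-acme-gov", "adf-acme-gov", "pl_gcs_to_adls")

status = "Queued"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = adf.pipeline_runs.get("rg-acme-gov", "adf-acme-gov", run.run_id).status
    print("Pipeline run status:", status)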
Migration method 3: OneLake shortcuts (bridge pattern)¶
OneLake shortcuts provide zero-copy read access to GCS during the migration bridge phase: no data is duplicated in Azure, and GCS egress applies only to the data that is actually read (OneLake shortcut caching can reduce repeated reads).
This lets Azure workloads query GCS data without migrating it first; a notebook read sketch follows the list. Use shortcuts for:
- Parallel validation during migration
- Low-priority datasets that migrate last
- Datasets that may stay on GCS indefinitely
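Once a shortcut to the GCS bucket exists in a lakehouse (created through the Fabric UI or shortcuts API), notebooks read it through the lakehouse Files path like any other OneLake data. In the sketch below, gcs_raw is a hypothetical shortcut name pointing at the raw bucket, and the finance/ subpath is illustrative.
# In a Fabric notebook attached to the lakehouse that contains the shortcut.
# "gcs_raw" is a hypothetical shortcut pointing at gs://acme-gov-analytics-raw/.
df = spark.read.parquet("Files/gcs_raw/finance/")
df.printSchema()
print(df.count(), "rows read through the shortcut")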
BigQuery managed storage export to Delta Lake¶
BigQuery stores table data in its proprietary Capacitor format, which cannot be read directly by external engines, so you must export it to an open format before migrating.
Method 1: BigQuery EXPORT DATA to GCS, then to ADLS¶
Step 1: Export from BigQuery to GCS as Parquet
EXPORT DATA OPTIONS (
uri = 'gs://acme-gov-exports/finance/fact_sales_daily/*.parquet',
format = 'PARQUET',
overwrite = true,
compression = 'SNAPPY'
) AS
SELECT * FROM `acme-gov.finance.fact_sales_daily`;
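For more than a handful of tables, the export can be scripted. The sketch below submits one EXPORT DATA job per table with the google-cloud-bigquery client; the table list is illustrative and the export bucket mirrors the example above.
# Export each BigQuery table to GCS as Snappy-compressed Parquet.
from google.cloud import bigquery

client = bigquery.Client(project="acme-gov")
tables = ["finance.fact_sales_daily", "finance.dim_product"]  # illustrative list

for table in tables:
    dataset, name = table.split(".")
    sql = f"""
    EXPORT DATA OPTIONS (
      uri = 'gs://acme-gov-exports/{dataset}/{name}/*.parquet',
      format = 'PARQUET',
      overwrite = true,
      compression = 'SNAPPY'
    ) AS
    SELECT * FROM `acme-gov.{dataset}.{name}`;
    """
    client.query(sql).result()  # blocks until the export job finishes
    print(f"Exported {table}")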
Step 2: Copy from GCS to ADLS using AzCopy
azcopy copy \
"https://storage.cloud.google.com/acme-gov-exports/finance/fact_sales_daily/" \
"https://stacmegov.blob.core.windows.net/bronze/finance/fact_sales_daily/?<SAS>" \
--recursive=true
Step 3: Convert Parquet to Delta using Databricks
-- In Databricks SQL: convert the exported Parquet files to Delta in place
CONVERT TO DELTA parquet.`abfss://bronze@stacmegov.dfs.core.windows.net/finance/fact_sales_daily/`;
-- Register the Delta table against that location
CREATE TABLE finance.fact_sales_daily
USING DELTA
LOCATION 'abfss://bronze@stacmegov.dfs.core.windows.net/finance/fact_sales_daily/';
-- Enable auto-optimize and Z-order the table
ALTER TABLE finance.fact_sales_daily SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
OPTIMIZE finance.fact_sales_daily ZORDER BY (region, product_id);
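The same conversion can be driven from PySpark with the Delta Lake API, which is convenient inside an existing notebook job. This is a sketch under the same path assumptions as above; declare a partition schema only if the Parquet layout is actually partitioned.
# PySpark equivalent of CONVERT TO DELTA for the exported Parquet directory.
from delta.tables import DeltaTable

path = "abfss://bronze@stacmegov.dfs.core.windows.net/finance/fact_sales_daily/"
DeltaTable.convertToDelta(spark, f"parquet.`{path}`")

# Register the table so it is queryable by name.
spark.sql(f"CREATE TABLE IF NOT EXISTS finance.fact_sales_daily USING DELTA LOCATION '{path}'")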
Method 2: ADF BigQuery connector (direct)¶
ADF can read BigQuery tables directly and land them in ADLS as Parquet, ready for Delta conversion.
{
"name": "pl_bigquery_to_delta",
"activities": [
{
"name": "CopyBigQueryTable",
"type": "Copy",
"inputs": [{ "referenceName": "ds_bigquery_table" }],
"outputs": [{ "referenceName": "ds_adls_delta" }],
"typeProperties": {
"source": {
"type": "GoogleBigQueryV2Source",
"query": "SELECT * FROM `acme-gov.finance.fact_sales_daily`"
},
"sink": {
"type": "ParquetSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
}
}
}
]
}
Follow with the Databricks conversion step to create Delta tables.
Method 3: Databricks GCS connector (read in place)¶
Databricks can read GCS directly using the GCS connector, allowing you to create Delta tables without an intermediate ADF step.
# In a Databricks notebook
# Configure GCS access on the Hadoop configuration
# (alternatively, set these as cluster Spark config with the "spark.hadoop." prefix)
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.gs.project.id", "acme-gov")
hconf.set("fs.gs.auth.service.account.enable", "true")
hconf.set("fs.gs.auth.service.account.json.keyfile", "/dbfs/secrets/gcs-sa.json")
# Read from GCS, write as Delta
df = spark.read.parquet("gs://acme-gov-exports/finance/fact_sales_daily/")
df.write.format("delta") \
.mode("overwrite") \
.partitionBy("sales_date") \
.save("abfss://gold@stacmegov.dfs.core.windows.net/finance/fact_sales_daily/")
Object lifecycle policy translation¶
GCS lifecycle rule to ADLS lifecycle management¶
GCS lifecycle rule (JSON):
{
"lifecycle": {
"rule": [
{
"action": {
"type": "SetStorageClass",
"storageClass": "NEARLINE"
},
"condition": { "age": 30 }
},
{
"action": {
"type": "SetStorageClass",
"storageClass": "COLDLINE"
},
"condition": { "age": 90 }
},
{
"action": {
"type": "SetStorageClass",
"storageClass": "ARCHIVE"
},
"condition": { "age": 365 }
},
{
"action": { "type": "Delete" },
"condition": { "age": 2555 }
}
]
}
}
Equivalent ADLS lifecycle policy (Bicep):
resource lifecyclePolicy 'Microsoft.Storage/storageAccounts/managementPolicies@2023-01-01' = {
name: 'default'
parent: storageAccount
properties: {
policy: {
rules: [
{
name: 'tierToCool'
type: 'Lifecycle'
definition: {
actions: {
baseBlob: { tierToCool: { daysAfterModificationGreaterThan: 30 } }
}
filters: { blobTypes: [ 'blockBlob' ] }
}
      }
      {
        name: 'tierToCold'
        type: 'Lifecycle'
        definition: {
          actions: {
            // Maps the GCS Coldline-at-90-days rule; requires an API version that supports the cold tier
            baseBlob: { tierToCold: { daysAfterModificationGreaterThan: 90 } }
          }
          filters: { blobTypes: [ 'blockBlob' ] }
        }
      }
      {
name: 'tierToArchive'
type: 'Lifecycle'
definition: {
actions: {
baseBlob: { tierToArchive: { daysAfterModificationGreaterThan: 365 } }
}
filters: { blobTypes: [ 'blockBlob' ] }
}
}
{
name: 'deleteOld'
type: 'Lifecycle'
definition: {
actions: {
baseBlob: { delete: { daysAfterModificationGreaterThan: 2555 } }
}
filters: { blobTypes: [ 'blockBlob' ] }
}
}
]
}
}
}
Per-bucket migration decisions¶
Not every GCS bucket needs to move to ADLS. Use this decision tree:
| Bucket profile | Recommendation | Rationale |
|---|---|---|
| Active analytics data | Migrate to ADLS Gen2 | Needed for Delta Lake + Databricks queries |
| Archive-only (cold storage) | Keep on GCS with Archive class | Avoid egress cost; access via OneLake shortcut if needed |
| Shared with other GCP workloads | OneLake shortcut (bridge) | Zero-copy read access from Azure |
| Large cold volume (100+ TB) | Azure Data Box | Physical transfer avoids egress cost |
| Small active dataset (< 1 TB) | AzCopy direct | Simple, fast, low cost |
| Running pipeline output | Migrate pipeline first, then storage follows | Pipeline determines where data lands |
Validation checklist¶
After each bucket migration, validate:
- File count matches between GCS and ADLS
- Total size matches (accounting for compression differences)
- Sample file content matches (checksums on a random sample; a sketch follows this list)
- ADLS lifecycle policies are configured
- RBAC assignments are applied (Storage Blob Data Reader/Contributor)
- Private endpoint is configured (if required)
- Delta tables are created and queryable from Databricks
- Purview scan discovers the new assets
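The checksum spot-check can be scripted by hashing the bytes of a random sample of objects on both sides, which works even when Content-MD5 was not preserved by the copy. The sketch below reuses the bucket, account, and container names from earlier examples and assumes object names are identical on both sides (adjust the prefix if AzCopy nested objects under the bucket name); the sample size is illustrative.
# Spot-check content equality on a random sample of objects by hashing bytes from both sides.
import hashlib
import random

from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient
from google.cloud import storage

gcs_bucket = storage.Client(project="acme-gov").bucket("acme-gov-analytics-raw")
adls = ContainerClient(
    account_url="https://stacmegov.blob.core.windows.net",
    container_name="raw",
    credential=DefaultAzureCredential(),
)

names = [b.name for b in gcs_bucket.list_blobs()]
for name in random.sample(names, k=min(20, len(names))):
    gcs_hash = hashlib.sha256(gcs_bucket.blob(name).download_as_bytes()).hexdigest()
    adls_hash = hashlib.sha256(adls.download_blob(name).readall()).hexdigest()
    status = "OK" if gcs_hash == adls_hash else "MISMATCH"
    print(f"{status}  {name}")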
Last updated: 2026-04-30
Maintainers: CSA-in-a-Box core team
Related: Compute Migration | Complete Feature Mapping | Migration Playbook