Hive to dbt / SparkSQL Migration¶
A comprehensive guide for migrating Apache Hive workloads to dbt models and SparkSQL on Databricks or Microsoft Fabric, covering metastore migration, HiveQL conversion, UDF porting, and worked before/after examples.
Overview¶
Apache Hive is the most common SQL interface in Hadoop environments. Migrating Hive involves three distinct concerns:
- Metastore migration — Moving table definitions, schemas, partitions, and statistics from the Hive Metastore to Unity Catalog or Purview
- SQL migration — Converting HiveQL to SparkSQL (minor differences) and modernizing to dbt models
- Operational migration — Replacing HiveServer2, Hive LLAP, Tez, and scheduled Hive scripts with Databricks SQL, Fabric SQL endpoints, and dbt-core
This guide addresses all three in detail.
1. Metastore migration: Hive Metastore to Unity Catalog¶
Hive Metastore architecture¶
The Hive Metastore Service (HMS) stores metadata in a relational database (MySQL, PostgreSQL, or Derby):
- Database definitions
- Table schemas (columns, types, SerDe info)
- Partition metadata (partition values, locations, statistics)
- Table properties and parameters
- UDF registrations
Unity Catalog architecture¶
Databricks Unity Catalog provides a three-level namespace that maps directly to Hive's model:
| Hive concept | Unity Catalog equivalent |
|---|---|
| Database | Schema (within a catalog) |
| Table | Table (managed or external) |
| Partition | Partition (Delta handles automatically) |
| View | View |
| UDF | UDF (registered per catalog) |
| Table properties | Table properties (TBLPROPERTIES) |
Export Hive DDL¶
# Export all DDL from Hive Metastore
hive -e "SHOW DATABASES;" | while read db; do
echo "-- Database: $db"
hive -e "USE $db; SHOW TABLES;" | while read tbl; do
hive -e "USE $db; SHOW CREATE TABLE $tbl;"
echo ";"
done
done > hive_ddl_export.sql
Convert DDL to Unity Catalog¶
# Python script to convert Hive DDL to Unity Catalog DDL
import re
def convert_hive_to_unity(hive_ddl, catalog="main", adls_base="abfss://silver@storage.dfs.core.windows.net"):
"""Convert Hive CREATE TABLE to Unity Catalog Delta table."""
# Replace STORED AS ORC/PARQUET with USING DELTA
ddl = re.sub(
r"STORED AS (ORC|PARQUET|AVRO|TEXTFILE|SEQUENCEFILE|RCFILE)",
"USING DELTA",
hive_ddl,
flags=re.IGNORECASE
)
# Replace ROW FORMAT SERDE ... WITH SERDEPROPERTIES (...)
ddl = re.sub(
r"ROW FORMAT SERDE\s+'[^']+'\s*(WITH SERDEPROPERTIES\s*\([^)]+\))?",
"",
ddl,
flags=re.IGNORECASE
)
# Replace ROW FORMAT DELIMITED ...
ddl = re.sub(
r"ROW FORMAT DELIMITED\s+FIELDS TERMINATED BY\s+'[^']+'\s*"
r"(COLLECTION ITEMS TERMINATED BY\s+'[^']+')?\s*"
r"(MAP KEYS TERMINATED BY\s+'[^']+')?\s*"
r"(LINES TERMINATED BY\s+'[^']+')?\s*",
"",
ddl,
flags=re.IGNORECASE
)
# Replace HDFS location with ADLS
ddl = re.sub(
r"LOCATION\s+'hdfs://[^']+/([^']+)'",
f"LOCATION '{adls_base}/\\1'",
ddl,
flags=re.IGNORECASE
)
# Add catalog prefix to database.table references
ddl = re.sub(
r"CREATE (EXTERNAL )?TABLE\s+(\w+)\.(\w+)",
f"CREATE TABLE {catalog}.\\2.\\3",
ddl,
flags=re.IGNORECASE
)
# Remove EXTERNAL keyword (Unity Catalog manages this differently)
ddl = ddl.replace("CREATE EXTERNAL TABLE", "CREATE TABLE")
return ddl
Migrate partitions¶
For tables with thousands of partitions, use the Hive Metastore API or MSCK REPAIR after data migration:
-- After data migration to ADLS, repair partition metadata
-- In Databricks (Unity Catalog)
MSCK REPAIR TABLE silver.orders;
-- Or for Delta tables, partitions are discovered automatically
-- No MSCK REPAIR needed if using CONVERT TO DELTA
2. HiveQL to SparkSQL: key differences¶
SparkSQL is highly compatible with HiveQL. Most queries run without changes. The differences that matter are listed below.
Syntax differences¶
| HiveQL | SparkSQL | Notes |
|---|---|---|
LATERAL VIEW explode(arr) t AS val | LATERAL VIEW explode(arr) t AS val | Identical syntax |
DISTRIBUTE BY col | DISTRIBUTE BY col | Identical syntax |
SORT BY col | SORT BY col | Identical (within-partition sort) |
CLUSTER BY col | CLUSTER BY col | Identical |
INSERT OVERWRITE TABLE t SELECT ... | INSERT OVERWRITE TABLE t SELECT ... | Identical |
TABLESAMPLE (10 PERCENT) | TABLESAMPLE (10 PERCENT) | Identical |
ADD JAR hdfs:///lib/my-udf.jar | N/A — use cluster libraries | See UDF section |
SET hive.exec.parallel=true | N/A — Spark parallelizes by default | Remove SET statements |
SET hive.exec.dynamic.partition.mode=nonstrict | N/A — Spark allows dynamic partitions by default | Remove |
SET mapreduce.job.reduces=10 | spark.conf.set("spark.sql.shuffle.partitions", "10") | Different config key |
Functions with different behavior¶
| Function | HiveQL behavior | SparkSQL behavior | Migration action |
|---|---|---|---|
CAST(x AS INT) | Returns NULL on failure | Returns NULL on failure | No change |
unix_timestamp() | Current time | Current time | No change |
from_unixtime(ts, fmt) | Java SimpleDateFormat | Java SimpleDateFormat | No change |
regexp_extract(s, p, i) | Java regex | Java regex | No change |
percentile(col, 0.5) | Exact percentile (UDAF) | Use percentile_approx or percentile | Verify precision needs |
collect_set(col) | Returns array | Returns array | No change |
NVL(a, b) | Null-safe value | Use COALESCE(a, b) | Replace NVL with COALESCE |
IF(cond, true_val, false_val) | Conditional | Identical | No change |
Worked before/after examples¶
Example 1: Daily aggregation job
-- BEFORE: HiveQL
SET hive.exec.parallel=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=10000;
INSERT OVERWRITE TABLE analytics.daily_orders
PARTITION (order_date)
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount,
order_date
FROM raw.orders
WHERE order_date >= '${hiveconf:start_date}'
GROUP BY customer_id, order_date;
-- AFTER: SparkSQL (Databricks)
INSERT OVERWRITE TABLE silver.daily_orders
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount,
order_date
FROM bronze.orders
WHERE order_date >= :start_date
GROUP BY customer_id, order_date;
Example 2: Complex ETL with multiple joins
-- BEFORE: HiveQL
CREATE TABLE IF NOT EXISTS analytics.customer_360
STORED AS ORC
AS
SELECT
c.customer_id,
c.name,
c.segment,
o.order_count,
o.total_revenue,
p.last_payment_date,
COALESCE(s.support_tickets, 0) AS support_tickets,
CASE
WHEN o.total_revenue > 10000 THEN 'high'
WHEN o.total_revenue > 1000 THEN 'medium'
ELSE 'low'
END AS value_tier
FROM raw.customers c
LEFT JOIN (
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_revenue
FROM raw.orders
GROUP BY customer_id
) o ON c.customer_id = o.customer_id
LEFT JOIN (
SELECT customer_id, MAX(payment_date) AS last_payment_date
FROM raw.payments
GROUP BY customer_id
) p ON c.customer_id = p.customer_id
LEFT JOIN (
SELECT customer_id, COUNT(*) AS support_tickets
FROM raw.support
GROUP BY customer_id
) s ON c.customer_id = s.customer_id;
-- AFTER: SparkSQL / Delta Lake
CREATE OR REPLACE TABLE gold.customer_360
USING DELTA
AS
SELECT
c.customer_id,
c.name,
c.segment,
o.order_count,
o.total_revenue,
p.last_payment_date,
COALESCE(s.support_tickets, 0) AS support_tickets,
CASE
WHEN o.total_revenue > 10000 THEN 'high'
WHEN o.total_revenue > 1000 THEN 'medium'
ELSE 'low'
END AS value_tier
FROM silver.customers c
LEFT JOIN (
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_revenue
FROM silver.orders
GROUP BY customer_id
) o ON c.customer_id = o.customer_id
LEFT JOIN (
SELECT customer_id, MAX(payment_date) AS last_payment_date
FROM silver.payments
GROUP BY customer_id
) p ON c.customer_id = p.customer_id
LEFT JOIN (
SELECT customer_id, COUNT(*) AS support_tickets
FROM silver.support
GROUP BY customer_id
) s ON c.customer_id = s.customer_id;
Note: The SQL is nearly identical. Changes are: STORED AS ORC becomes USING DELTA, CREATE TABLE IF NOT EXISTS becomes CREATE OR REPLACE TABLE, and table references move from raw.* to the medallion architecture (silver.*, gold.*).
3. Managed vs external tables: migration to Delta¶
Hive table types¶
| Type | Data lifecycle | Metadata lifecycle | HDFS path |
|---|---|---|---|
| Managed (internal) | Deleted when table dropped | Deleted when table dropped | /user/hive/warehouse/db/table/ |
| External | Survives table drop | Deleted when table dropped | Custom location |
Delta Lake table types (Databricks)¶
| Type | Data lifecycle | Catalog | Best for |
|---|---|---|---|
| Managed (Unity Catalog) | Deleted when table dropped | Unity Catalog manages location | New tables, governed data |
| External | Survives table drop | Unity Catalog tracks location | Migrated data, shared storage |
Recommendation: Migrate Hive external tables as Delta external tables pointing to ADLS Gen2. Migrate Hive managed tables as Delta managed tables in Unity Catalog. Use external tables for the initial migration (data stays at known ADLS paths), then convert to managed tables when governance is established.
4. Bucketing to Z-ORDER and liquid clustering¶
Hive bucketing¶
-- Hive bucketed table
CREATE TABLE orders_bucketed (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
order_date DATE
)
CLUSTERED BY (customer_id) INTO 256 BUCKETS
STORED AS ORC;
Delta Z-ORDER (replacement)¶
-- Delta table with Z-ORDER optimization
CREATE TABLE silver.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
order_date DATE
)
USING DELTA;
-- Run Z-ORDER after data load
OPTIMIZE silver.orders ZORDER BY (customer_id);
Delta liquid clustering (modern replacement)¶
-- Liquid clustering (auto-managed, no manual OPTIMIZE needed)
CREATE TABLE silver.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
order_date DATE
)
USING DELTA
CLUSTER BY (customer_id, order_date);
Liquid clustering replaces both Hive bucketing and manual Z-ORDER operations. It continuously optimizes data layout based on the clustering columns.
5. Hive UDFs to Spark UDFs¶
Identifying UDFs in your Hive environment¶
# List all registered UDFs
hive -e "SHOW FUNCTIONS;" | grep -v "^[a-z_]*$" # Filter out built-in functions
# Show UDF details
hive -e "DESCRIBE FUNCTION EXTENDED my_custom_udf;"
Common Hive UDF patterns and Spark equivalents¶
| Hive UDF pattern | Spark equivalent |
|---|---|
GenericUDF (Java) | Python UDF or Spark built-in function |
GenericUDAF (Java aggregation) | PySpark UDAF or pandas UDF |
GenericUDTF (table-generating) | explode(), posexplode(), or Python UDTF |
| Simple string manipulation | Spark built-in regexp_replace, substring, trim, etc. |
| Date manipulation | Spark built-in date_add, datediff, date_format, etc. |
| Custom hashing | Spark built-in sha2, md5, xxhash64 |
Migration example: Java UDF to Python UDF¶
// BEFORE: Hive Java UDF
package com.example.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class MaskSSN extends UDF {
public Text evaluate(Text input) {
if (input == null) return null;
String ssn = input.toString();
if (ssn.length() == 9) {
return new Text("***-**-" + ssn.substring(5));
}
return new Text("INVALID");
}
}
# AFTER: Spark Python UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def mask_ssn(ssn):
if ssn is None:
return None
if len(ssn) == 9:
return f"***-**-{ssn[5:]}"
return "INVALID"
# Usage
df = df.withColumn("masked_ssn", mask_ssn(df["ssn"]))
# BETTER: Use Spark built-in functions (no UDF overhead)
from pyspark.sql.functions import when, length, concat, lit, substring
df = df.withColumn(
"masked_ssn",
when(
length("ssn") == 9,
concat(lit("***-**-"), substring("ssn", 6, 4))
).otherwise(lit("INVALID"))
)
Best practice: Always check if a Spark built-in function can replace your UDF. Built-in functions use Catalyst optimization and Photon acceleration. Python UDFs run in a separate Python process with serialization overhead.
6. SerDe to Delta schema evolution¶
Hive SerDe pattern¶
-- Hive table with custom SerDe for JSON data
CREATE EXTERNAL TABLE raw.events (
event_id STRING,
event_type STRING,
payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs:///data/events/';
Delta Lake replacement¶
-- Delta table with native JSON support
-- Step 1: Read JSON and write as Delta
-- (in a Databricks notebook)
-- Read JSON files natively
CREATE TABLE bronze.events
USING DELTA
AS SELECT * FROM json.`abfss://raw@storage.dfs.core.windows.net/events/`;
-- Step 2: Schema evolution is automatic
-- New fields in JSON files are added automatically with:
ALTER TABLE bronze.events SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name');
Schema evolution comparison¶
| Feature | Hive SerDe | Delta Lake |
|---|---|---|
| Add column | ALTER TABLE ADD COLUMNS | ALTER TABLE ADD COLUMNS or automatic with mergeSchema |
| Rename column | Not supported (recreate table) | ALTER TABLE RENAME COLUMN (with column mapping) |
| Reorder columns | Not supported | ALTER TABLE ALTER COLUMN ... FIRST/AFTER |
| Drop column | Not supported (recreate table) | ALTER TABLE DROP COLUMN (with column mapping) |
| Type widening | Manual ALTER TABLE CHANGE | Automatic with delta.enableTypeWidening |
7. Hive partitioning to Delta partitioning¶
Hive static and dynamic partitioning¶
-- Static partition insert
INSERT INTO TABLE orders PARTITION (year=2025, month=04)
SELECT order_id, customer_id, amount FROM staging_orders;
-- Dynamic partition insert
INSERT INTO TABLE orders PARTITION (year, month)
SELECT order_id, customer_id, amount, year, month FROM staging_orders;
Delta partitioning (preserved)¶
-- Delta supports the same partition semantics
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, year, month FROM bronze.orders;
-- Delta automatically writes to partition directories based on partitionBy columns
When to change partitioning strategy¶
| Hive pattern | Delta recommendation | Reason |
|---|---|---|
PARTITIONED BY (year, month, day) | Keep if > 1 GB per partition | Works well at scale |
PARTITIONED BY (year, month, day, hour) | Reduce to (year, month) + Z-ORDER on day | Too many small partitions |
PARTITIONED BY (customer_id) | Use liquid clustering instead | High-cardinality partitions |
PARTITIONED BY (region) | Keep if < 20 distinct values | Low cardinality is fine |
| No partitioning (small table) | No partitioning | Tables < 1 GB need no partitioning |
8. Modernizing to dbt¶
Why dbt instead of raw SparkSQL scripts?¶
| Capability | Hive scripts | Raw SparkSQL | dbt |
|---|---|---|---|
| Version control | Manual file management | Manual | Built-in (dbt project = Git repo) |
| Testing | None | Manual assertions | Built-in schema.yml tests |
| Documentation | External wiki | External | Auto-generated from YAML + SQL comments |
| Lineage | Atlas (if configured) | None | Built-in DAG visualization |
| Incremental loads | Custom logic per script | Custom logic | incremental materialization |
| Environments | Different configs per cluster | Different configs | profiles.yml (dev/staging/prod) |
dbt model example (converted from Hive)¶
-- models/silver/daily_orders.sql
{{ config(
materialized='incremental',
unique_key='customer_id || order_date',
partition_by=['order_date'],
file_format='delta'
) }}
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount,
order_date
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY customer_id, order_date
# models/silver/schema.yml
models:
- name: daily_orders
description: "Daily order aggregates by customer"
columns:
- name: customer_id
description: "Unique customer identifier"
tests:
- not_null
- name: order_count
description: "Number of orders on this date"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 1
- name: total_amount
description: "Sum of order amounts"
tests:
- not_null
- name: order_date
description: "Date of the orders"
tests:
- not_null
Common pitfalls¶
| Pitfall | Mitigation |
|---|---|
| Assuming HiveQL = SparkSQL | Test every query; pay attention to NULL handling, type coercion |
| Migrating UDFs without checking built-ins | Audit all UDFs; 60-80% can be replaced by Spark built-in functions |
| Keeping static partition inserts | Use Delta's automatic partitioning; simpler and less error-prone |
| Not testing metastore migration | Validate table count, column count, and partition count after migration |
| Ignoring Hive views | Views must be recreated manually; they do not export via DistCp |
| Skipping data quality checks | Run dbt tests on migrated data before decommissioning Hive |
Related¶
- Tutorial: Hive to dbt — step-by-step tutorial
- HDFS Migration — storage migration (prerequisite)
- Feature Mapping — all component mappings
- Migration Hub — full migration center
Last updated: 2026-04-30 Maintainers: CSA-in-a-Box core team Related: Tutorial: Hive to dbt | HDFS Migration | Migration Hub