Impala Migration: Impala to Databricks SQL¶
A detailed guide for migrating Apache Impala workloads to Databricks SQL, including SQL dialect conversion, Kudu-to-Delta migration, performance tuning, and worked examples.
Overview¶
Impala is Cloudera's interactive SQL engine, designed for low-latency analytical queries on HDFS and Kudu data. It serves BI dashboards, ad-hoc analysis, and scheduled reporting workloads. On Azure, these workloads migrate to Databricks SQL Warehouses, which provide comparable or better interactive query performance through the Photon engine, adaptive query execution, and result caching.
The good news: Impala SQL is very close to Spark SQL / ANSI SQL. The migration is primarily a dialect conversion exercise with a few structural changes around metadata management and storage format.
SQL dialect comparison¶
Syntax that works identically¶
The following Impala SQL constructs work without modification on Databricks SQL:
- SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
- JOIN (INNER, LEFT, RIGHT, FULL OUTER, CROSS)
- UNION, UNION ALL, INTERSECT, EXCEPT
- CASE WHEN ... THEN ... ELSE ... END
- WITH (Common Table Expressions)
- Window functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, etc.)
- LATERAL VIEW EXPLODE
- IN, EXISTS, subqueries
- CAST, COALESCE, NULLIF, NVL
- Most string functions: CONCAT, SUBSTR, TRIM, UPPER, LOWER, LENGTH, REGEXP_REPLACE
- Most date functions: YEAR, MONTH, DAY, DATE_ADD, DATE_SUB, DATEDIFF
- Most aggregate functions: COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT ...)
Syntax that requires conversion¶
| Impala SQL | Databricks SQL | Notes |
|---|---|---|
| COMPUTE STATS table_name | ANALYZE TABLE table_name COMPUTE STATISTICS | Required for optimizer. Delta Lake also collects stats automatically. |
| COMPUTE INCREMENTAL STATS | ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS | Delta stats are column-level by default. |
| SHOW TABLE STATS table_name | DESCRIBE DETAIL table_name | Delta table metadata. |
| SHOW COLUMN STATS table_name | DESCRIBE EXTENDED table_name column_name | Column-level statistics. |
| INVALIDATE METADATA | Not needed | Delta and Unity Catalog maintain metadata automatically. |
| REFRESH table_name | REFRESH TABLE table_name | Identical syntax; rarely needed with Delta. |
| INSERT OVERWRITE ... PARTITION (col=val) | INSERT OVERWRITE ... PARTITION (col) | Dynamic partition overwrite mode. Or use MERGE INTO for incremental. |
| CREATE TABLE ... STORED AS PARQUET | CREATE TABLE ... USING DELTA | Default to Delta for all new tables. |
| CREATE TABLE ... STORED AS KUDU | CREATE TABLE ... USING DELTA | See Kudu migration section below. |
| ALTER TABLE ... ADD PARTITION | Not needed for Delta | Delta manages partitions automatically. |
| ALTER TABLE ... DROP PARTITION | DELETE FROM table WHERE partition_col = val | Delta supports row-level deletes. |
| UPSERT INTO (Kudu) | MERGE INTO ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT | Delta MERGE is the equivalent. |
| [SHUFFLE] / [NOSHUFFLE] hints | Not applicable | Databricks adaptive query execution handles this. |
| STRAIGHT_JOIN hint | Not applicable | Optimizer handles join ordering. |
| /* +NOCLUSTERED */ hint | Not applicable | Photon engine optimizes automatically. |
| APPX_MEDIAN() | PERCENTILE_APPROX(col, 0.5) | Approximate median function. |
| GROUP_CONCAT() | COLLECT_LIST() + CONCAT_WS() | Or ARRAY_JOIN(COLLECT_LIST(...), ','). |
| NDV() (approximate distinct) | APPROX_COUNT_DISTINCT() | HyperLogLog approximate distinct count. |
| EXTRACT(epoch FROM ts) | UNIX_TIMESTAMP(ts) | Epoch extraction. |
| FROM_UNIXTIME(ts, 'yyyy-MM-dd') | FROM_UNIXTIME(ts, 'yyyy-MM-dd') | Identical. |
| STRLEFT(s, n) / STRRIGHT(s, n) | LEFT(s, n) / RIGHT(s, n) | Standard SQL function names. |
| TRUNC(ts, 'MONTH') | TRUNC(ts, 'MONTH') | Identical. |
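For the INSERT OVERWRITE conversion above, the partition value moves out of the PARTITION clause and into the selected data. A minimal sketch, using an illustrative staging table (staging.daily_orders) and the sales.orders schema used later in this guide:
-- Impala: static partition spec names the target partition explicitly
INSERT OVERWRITE TABLE sales.orders PARTITION (order_date='2025-01-01')
SELECT order_id, customer_id, amount, status
FROM staging.daily_orders;
-- Databricks SQL: the partition column is selected with the data; with dynamic
-- partition overwrite enabled (spark.sql.sources.partitionOverwriteMode = dynamic,
-- depending on runtime defaults) only the partitions present in the result are replaced
INSERT OVERWRITE silver.sales.orders PARTITION (order_date)
SELECT order_id, customer_id, amount, status, order_date
FROM staging.daily_orders
WHERE order_date = '2025-01-01';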
Kudu to Delta Lake migration¶
Kudu is the mutable storage engine commonly paired with Impala, designed for fast inserts, updates, and deletes on analytical data. Delta Lake is its direct replacement.
Feature comparison¶
| Kudu feature | Delta Lake equivalent | Notes |
|---|---|---|
| Primary key | Not enforced; use MERGE INTO with key columns | Delta does not enforce primary keys but supports merge-on-key patterns. |
| INSERT / UPDATE / DELETE | INSERT / UPDATE / DELETE / MERGE INTO | Full DML support on Delta. |
| UPSERT | MERGE INTO ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT | Standard SQL MERGE. |
| Hash partitioning | Delta partitioning + Z-ordering | Z-ORDER BY provides data skipping for high-cardinality columns. |
| Range partitioning | Delta partitioning (directory-based) | Use for low-cardinality columns (date, region). |
| Auto-compaction | Delta auto-optimize (optimizeWrite, autoCompact) | Enable on table properties. |
| Tablet servers | Managed by Databricks | No user-managed tablet infrastructure. |
| Replication factor | ADLS Gen2 redundancy (LRS/ZRS/GRS) | Storage-level redundancy; no application-level replication. |
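Kudu UPSERT statements map directly onto Delta MERGE. A minimal sketch, where the source relation (bronze.my_database.my_table_updates) and the key column (id) are illustrative stand-ins for the former Kudu primary key:
-- Kudu: UPSERT INTO my_database.my_table SELECT * FROM updates;
-- Delta equivalent:
MERGE INTO silver.my_database.my_table AS t
USING bronze.my_database.my_table_updates AS s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;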
Kudu table migration script¶
# Step 1: Read from Kudu (on CDH cluster, export to Parquet on HDFS)
# Run this on the CDH cluster
spark.read \
.format("org.apache.kudu.spark.kudu") \
.option("kudu.master", "kudu-master:7051") \
.option("kudu.table", "my_database.my_table") \
.load() \
.write.format("parquet") \
.mode("overwrite") \
.save("hdfs:///export/kudu/my_database/my_table")
# Step 2: Transfer the exported Parquet files from HDFS to ADLS Gen2
# (e.g. hadoop distcp straight to abfss://, or azcopy/ADF from a staging location)
# hadoop distcp hdfs:///export/kudu/my_database/my_table abfss://bronze@storage.dfs.core.windows.net/kudu/my_table
# Step 3: Create Delta table on Databricks
spark.read.parquet("abfss://bronze@storage.dfs.core.windows.net/kudu/my_table") \
.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.partitionBy("date_col") \
.saveAsTable("silver.my_database.my_table")
# Step 4: Enable auto-optimization
spark.sql("""
ALTER TABLE silver.my_database.my_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
)
""")
# Step 5: Apply Z-ordering for high-cardinality query columns
spark.sql("""
OPTIMIZE silver.my_database.my_table
ZORDER BY (customer_id, order_date)
""")
Impala partitioning to Delta partitioning¶
Decision framework¶
| Impala partition type | Cardinality | Delta recommendation |
|---|---|---|
| Date column (daily) | ~365/year | Partition by date column. Matches Impala behavior. |
| Date column (hourly) | ~8,760/year | Partition by date (daily), not hourly. Z-ORDER by hour if needed. |
| Region / country | 10-200 | Partition if < 100 values; otherwise Z-ORDER. |
| Customer ID | 10K-10M | Z-ORDER; do not partition. Too many partitions degrade performance. |
| Transaction ID | Very high | Z-ORDER. Never partition by high-cardinality columns. |
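When the right choice is unclear, measuring cardinality directly is cheap. A quick sketch against the sales.orders table from the example below (run it on the Impala source before deciding the Delta layout):
-- Check candidate column cardinality before choosing PARTITIONED BY vs ZORDER
SELECT
  COUNT(DISTINCT order_date)  AS order_date_values,  -- hundreds per year: good partition column
  COUNT(DISTINCT customer_id) AS customer_id_values  -- tens of thousands or more: Z-ORDER instead
FROM sales.orders;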
Example: convert Impala partitioned table¶
-- Impala (before)
CREATE TABLE sales.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(18,2),
status STRING
)
PARTITIONED BY (order_date DATE)
STORED AS PARQUET;
-- Databricks SQL (after)
CREATE TABLE silver.sales.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(18,2),
status STRING,
order_date DATE
)
USING DELTA
PARTITIONED BY (order_date);
-- Apply Z-ordering for customer_id queries
OPTIMIZE silver.sales.orders ZORDER BY (customer_id);
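Once the Delta table exists, it can be backfilled from the Parquet files exported earlier; the abfss path below is illustrative and should match wherever the export landed.
-- Load the exported Parquet data (columns listed in table order), then re-run OPTIMIZE ... ZORDER BY
INSERT OVERWRITE silver.sales.orders
SELECT order_id, customer_id, amount, status, order_date
FROM parquet.`abfss://bronze@storage.dfs.core.windows.net/impala/sales_orders`;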
Performance tuning comparison¶
Impala performance knobs vs Databricks equivalents¶
| Impala tuning technique | Databricks equivalent | Notes |
|---|---|---|
| COMPUTE STATS for all tables | ANALYZE TABLE ... COMPUTE STATISTICS | Delta also collects stats automatically on write. |
| Impala admission control | Databricks SQL Warehouse sizing (T-shirt sizes) | Choose warehouse size based on concurrency needs. |
| Impala memory limits per query | Databricks SQL Warehouse auto-manages memory | Photon engine manages memory allocation. |
| Impala partition pruning | Delta partition pruning + data skipping | Delta uses file-level min/max stats in addition to partition pruning. |
| Impala runtime filters | Databricks adaptive query execution (AQE) | AQE dynamically optimizes joins and shuffles. |
| SET MEM_LIMIT=4g per query | Warehouse sizing | No per-query memory configuration. |
| Impala catalog caching | Unity Catalog + result caching | Result caching is on by default (use_cached_result = true). |
| Impala HDFS short-circuit reads | Not applicable | ADLS Gen2 reads are network-based; Photon compensates. |
| Impala codegen | Photon native vectorized engine | Photon is ~2-8x faster than non-vectorized execution. |
| Impala resource pools | SQL Warehouse scaling (min/max clusters) | Multi-cluster auto-scaling for concurrency. |
Common performance issues after migration¶
| Issue | Cause | Solution |
|---|---|---|
| Queries slower than Impala | Missing statistics | Run ANALYZE TABLE on migrated tables. |
| High shuffle in joins | Shuffle partition count not tuned for the data volume | Increase spark.sql.shuffle.partitions above the default of 200 for large joins, or rely on AQE to coalesce partitions. |
| Full table scans on partitioned data | Partition predicate not recognized | Use literal values in WHERE clause, not functions on partition columns. |
| Slow small queries | Warehouse startup time | Use Serverless SQL Warehouse (instant startup). |
| Concurrent query throttling | Single-cluster warehouse | Enable multi-cluster scaling (2-8 clusters). |
| OOM on large aggregations | Insufficient warehouse size | Scale up to larger warehouse size (Medium, Large, X-Large). |
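The partition-pruning row above is the most common regression. A minimal illustration on the silver.sales.orders table from the earlier example:
-- Wrapping the partition column in a function can defeat pruning and scan every partition
SELECT COUNT(*) FROM silver.sales.orders WHERE year(order_date) = 2025;
-- A literal range on the partition column lets Delta prune partitions and files
SELECT COUNT(*) FROM silver.sales.orders WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31';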
Worked example: full query migration¶
Original Impala query¶
-- Impala: Daily revenue report with customer segmentation
COMPUTE INCREMENTAL STATS sales.orders;
COMPUTE INCREMENTAL STATS sales.customers;
SELECT
o.order_date,
c.segment,
COUNT(*) AS order_count,
SUM(o.amount) AS total_revenue,
NDV(o.customer_id) AS unique_customers,
APPX_MEDIAN(o.amount) AS median_order_value,
GROUP_CONCAT(DISTINCT o.status, ', ') AS statuses_seen
FROM sales.orders o
JOIN sales.customers c ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN '2025-01-01' AND '2025-12-31'
AND o.status != 'cancelled'
GROUP BY o.order_date, c.segment
HAVING SUM(o.amount) > 1000
ORDER BY o.order_date, total_revenue DESC;
Migrated Databricks SQL query¶
-- Databricks SQL: Daily revenue report with customer segmentation
-- Statistics are collected automatically on Delta tables, but you can force:
ANALYZE TABLE silver.sales.orders COMPUTE STATISTICS FOR ALL COLUMNS;
ANALYZE TABLE silver.sales.customers COMPUTE STATISTICS FOR ALL COLUMNS;
SELECT
o.order_date,
c.segment,
COUNT(*) AS order_count,
SUM(o.amount) AS total_revenue,
APPROX_COUNT_DISTINCT(o.customer_id) AS unique_customers,
PERCENTILE_APPROX(o.amount, 0.5) AS median_order_value,
CONCAT_WS(', ', COLLECT_SET(o.status)) AS statuses_seen
FROM silver.sales.orders o
JOIN silver.sales.customers c ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN '2025-01-01' AND '2025-12-31'
AND o.status != 'cancelled'
GROUP BY o.order_date, c.segment
HAVING SUM(o.amount) > 1000
ORDER BY o.order_date, total_revenue DESC;
Key changes annotated¶
| Impala (before) | Databricks SQL (after) | Reason |
|---|---|---|
| COMPUTE INCREMENTAL STATS | ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS | Syntax difference; Delta auto-collects basic stats. |
| sales.orders | silver.sales.orders | Three-level namespace: catalog.schema.table. |
| NDV() | APPROX_COUNT_DISTINCT() | Spark SQL function name for HyperLogLog. |
| APPX_MEDIAN() | PERCENTILE_APPROX(col, 0.5) | Spark SQL approximate median. |
| GROUP_CONCAT(DISTINCT ...) | CONCAT_WS(', ', COLLECT_SET(...)) | Spark uses COLLECT_SET for distinct values plus CONCAT_WS for joining. |
Impala metadata caching vs Direct Lake / Databricks SQL caching¶
Impala relies on metadata caching in the Impala Catalog daemon (catalogd) to avoid repeated HMS lookups. On Databricks SQL, two caching mechanisms replace this:
Databricks SQL result caching¶
- Automatically caches query results for identical queries
- Cache invalidated when underlying Delta table changes
- No configuration required; enabled by default on SQL Warehouses
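When benchmarking migrated queries against Impala (see the checklist below), the result cache can skew timings; it can be switched off for the session:
-- Measure execution time rather than cache hits during benchmarks
SET use_cached_result = false;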
Databricks SQL disk caching¶
- Caches remote ADLS Gen2 data on local SSDs of warehouse nodes
- Reduces repeated reads from storage
- Enabled by default on SQL Warehouse worker nodes
Fabric Direct Lake (alternative target)¶
For organizations also using Microsoft Fabric, Direct Lake mode provides:
- Power BI reads Delta tables directly from OneLake (no data import)
- Sub-second query response for dashboards
- No separate caching layer to manage
Migration checklist for Impala workloads¶
- Inventory all Impala queries (scheduled reports, BI connections, ad-hoc)
- Export Kudu table schemas and data to Parquet
- Transfer Parquet data to ADLS Gen2
- Create Delta tables with appropriate partitioning
- Convert Impala SQL to Databricks SQL (use dialect table above)
- Replace COMPUTE STATS with ANALYZE TABLE
- Replace Impala-specific functions (NDV, APPX_MEDIAN, GROUP_CONCAT)
- Update JDBC/ODBC connection strings for BI tools
- Run OPTIMIZE ... ZORDER BY on frequently queried columns
- Enable auto-optimization on all Delta tables
- Benchmark query latency: Impala vs Databricks SQL
- Validate row counts and checksums between source and target
- Update monitoring dashboards to use Azure Monitor
- Train BI users on Databricks SQL Editor (minimal change from Impala shell)
Next steps¶
- Walk through the Impala to Databricks Tutorial for a hands-on migration exercise
- Review the Benchmarks for Impala vs Databricks SQL performance data
- See the Complete Feature Mapping for the full component comparison
Last updated: 2026-04-30. Maintainers: CSA-in-a-Box core team.