🏢 Warehouse Setup & Configuration¶
Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team
📑 Table of Contents¶
- Overview
- Warehouse vs Lakehouse SQL Endpoint
- Schema Design
- Data Type Optimization
- Statistics Management
- Query Optimization
- Ingestion Best Practices
- V-Order in Warehouse
- Performance Monitoring
- Concurrency and Locking
- Warehouse Design Patterns
- Checklist
- Related Documents
📋 Overview¶
The Fabric Data Warehouse provides a full T-SQL experience with ACID transactions, ideal for traditional SQL workloads and enterprise analytics. This guide covers schema design, statistics management, query optimization, and ingestion best practices.
Creating a Warehouse¶
To create a new warehouse in Microsoft Fabric:
1. Navigate to your workspace
2. Click + New > Warehouse
3. Enter a name following the naming conventions below
4. Click Create

Source: Create a Warehouse in Microsoft Fabric
⚖️ Warehouse vs Lakehouse SQL Endpoint¶
| Capability | Warehouse | Lakehouse SQL Endpoint |
|---|---|---|
| DML Support | Full (INSERT, UPDATE, DELETE) | Read-only |
| DDL Support | Full | Limited (Views, TVFs) |
| Transactions | Yes | No |
| Storage | Delta (managed) | Delta (shared with Lakehouse) |
| Primary Use | Data warehousing | Analytics on Lakehouse data |
| Developer Persona | SQL Developer | Data Engineer/Analyst |
Recommendation: Use Warehouse for traditional SQL workloads that need full DML and transactions. Use the Lakehouse SQL Endpoint for read-only analytical queries over data already managed in a Lakehouse.
🏗️ Schema Design¶
Star Schema Pattern¶
erDiagram
fact_daily_slot_performance ||--o{ dim_machine : machine_id
fact_daily_slot_performance ||--o{ dim_date : date_key
fact_daily_slot_performance ||--o{ dim_player : player_id
fact_daily_slot_performance ||--o{ dim_location : location_id
fact_daily_slot_performance {
bigint performance_id PK
int date_key FK
int machine_id FK
int player_id FK
int location_id FK
decimal total_bets
decimal total_wins
int session_count
decimal theoretical_hold
}
dim_machine {
int machine_id PK
string machine_serial
string machine_type
string manufacturer
date install_date
}
dim_date {
int date_key PK
date full_date
int year
int quarter
int month
int day_of_week
}
Naming Conventions¶
| Object Type | Pattern | Example |
|---|---|---|
| Fact tables | fact_{subject} | fact_daily_slot_performance |
| Dimension tables | dim_{entity} | dim_machine, dim_player |
| Staging tables | stg_{source}_{entity} | stg_bronze_players |
| Views | vw_{purpose} | vw_active_players |
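Putting the star schema and naming conventions together, the fact table from the diagram above might be declared as follows. This is a minimal DDL sketch: column types follow the data type guide in the next section, and the DECIMAL precisions are illustrative, not prescribed.
-- Fact table following the fact_{subject} convention
CREATE TABLE dbo.fact_daily_slot_performance (
    performance_id BIGINT NOT NULL,
    date_key INT NOT NULL,        -- FK to dim_date
    machine_id INT NOT NULL,      -- FK to dim_machine
    player_id INT NOT NULL,       -- FK to dim_player
    location_id INT NOT NULL,     -- FK to dim_location
    total_bets DECIMAL(18, 2),
    total_wins DECIMAL(18, 2),
    session_count INT,
    theoretical_hold DECIMAL(5, 4)
);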
🔢 Data Type Optimization¶
Best Practices¶
- Use smallest appropriate data type
- Integer types for whole numbers
- Match types for joins
- Appropriate precision for decimals
Data Type Selection Guide¶
| Data Type | Use For | Avoid For |
|---|---|---|
| INT | IDs, counts < 2.1B | Large IDs |
| BIGINT | Large IDs, large counts | Small numbers |
| DECIMAL(p,s) | Currency, precise numbers | Simple integers |
| VARCHAR(n) | Variable text | Fixed-length codes |
| DATE | Dates without time | Timestamps |
| DATETIME2(n) | Timestamps | Date-only values |
Example Optimizations¶
-- Bad: Over-provisioned types
CREATE TABLE players_bad (
player_id VARCHAR(500), -- Too large
balance DECIMAL(38, 10), -- Excessive precision
last_login VARCHAR(100) -- Should be datetime
);
-- Good: Right-sized types
CREATE TABLE players_good (
player_id INT, -- Appropriate for ID
balance DECIMAL(18, 2), -- Currency precision
last_login DATETIME2(0) -- Proper type, second precision
);
📊 Statistics Management¶
Types of Statistics¶
| Type | Name Pattern | Purpose |
|---|---|---|
| Histogram | _WA_Sys_* | Column distribution |
| Average Column Length | ACE-AverageColumnLength_* | VARCHAR sizing |
| Table Cardinality | ACE-Cardinality | Row count estimates |
Automatic Statistics¶
Statistics are automatically created for:
- Columns in GROUP BY
- JOIN predicates
- WHERE clause filters
- ORDER BY columns
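For example, running an aggregate query and then inspecting sys.stats shows what was auto-created. A quick sketch (the generated statistic names will differ in your warehouse):
-- A GROUP BY on machine_id can trigger automatic statistics on that column
SELECT machine_id, COUNT(*) AS row_count
FROM fact_daily_slot_performance
GROUP BY machine_id;

-- Check which statistics were auto-created
SELECT name, auto_created
FROM sys.stats
WHERE object_id = OBJECT_ID('fact_daily_slot_performance');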
Manual Statistics¶
-- Create statistics on specific column
CREATE STATISTICS stat_player_id
ON fact_daily_slot_performance (player_id);
-- Create with full scan for accuracy
CREATE STATISTICS stat_machine_id
ON fact_daily_slot_performance (machine_id)
WITH FULLSCAN;
-- Update statistics after large data loads
UPDATE STATISTICS fact_daily_slot_performance;
-- Update specific statistics
UPDATE STATISTICS fact_daily_slot_performance (stat_player_id)
WITH FULLSCAN;
Statistics Best Practices¶
- Update after large loads - Statistics become stale
- Use FULLSCAN for critical columns - More accurate than sampling
- Monitor auto-generated statistics - Check for accuracy
- Create statistics proactively - Before heavy query workloads
Viewing Statistics¶
-- List all statistics on a table
SELECT
s.name AS statistics_name,
c.name AS column_name,
s.auto_created,
STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats s
JOIN sys.stats_columns sc ON s.stats_id = sc.stats_id AND s.object_id = sc.object_id
JOIN sys.columns c ON sc.column_id = c.column_id AND sc.object_id = c.object_id
WHERE s.object_id = OBJECT_ID('fact_daily_slot_performance');
-- View statistics details
DBCC SHOW_STATISTICS ('fact_daily_slot_performance', '_WA_Sys_00000001');
🔍 Query Optimization¶
Filter Early¶
-- Good: Filter before join
SELECT f.*, d.full_date
FROM fact_daily_slot_performance f
JOIN dim_date d ON f.date_key = d.date_key
WHERE f.date_key >= 20240101 -- Filter on fact table
AND f.total_bets > 1000;
-- Less optimal: Filter after join
SELECT f.*, d.full_date
FROM fact_daily_slot_performance f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.full_date >= '2024-01-01'; -- Filter on dimension
Select Only Needed Columns¶
-- Good: Specific columns
SELECT machine_id, SUM(total_bets) AS total_bets, SUM(total_wins) AS total_wins
FROM fact_daily_slot_performance
GROUP BY machine_id;
-- Bad: SELECT *
SELECT *
FROM fact_daily_slot_performance; -- Avoid in production
Use Appropriate JOINs¶
-- Ensure data type parity in joins
SELECT f.*, m.machine_type
FROM fact_daily_slot_performance f
JOIN dim_machine m ON f.machine_id = m.machine_id -- Both INT
-- Avoid: ON CAST(f.machine_id AS VARCHAR) = m.machine_id
📥 Ingestion Best Practices¶
Loading Data¶
-- Use COPY INTO for bulk loads
COPY INTO dbo.fact_daily_slot_performance
FROM 'https://storage.blob.core.windows.net/data/*.parquet'
WITH (
FILE_TYPE = 'PARQUET',
CREDENTIAL = (IDENTITY = 'Managed Identity')
);
Transaction Management¶
-- Keep transactions short
BEGIN TRANSACTION;
-- Perform quick operations
INSERT INTO fact_daily_slot_performance
SELECT * FROM stg_daily_performance
WHERE load_date = CAST(GETDATE() AS DATE);
COMMIT TRANSACTION;
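For multi-statement loads, it can help to pair the transaction with error handling so a failure in any step rolls back the whole load. A sketch using standard T-SQL TRY/CATCH (adapt to your own error-handling conventions):
-- Roll back the whole load if any statement fails
BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO fact_daily_slot_performance
    SELECT * FROM stg_daily_performance
    WHERE load_date = CAST(GETDATE() AS DATE);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF XACT_STATE() <> 0
        ROLLBACK TRANSACTION;
    THROW; -- Re-raise the original error to the caller
END CATCH;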
Post-Load Statistics Update¶
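Statistics become stale after bulk loads, so refresh them on the affected tables as the final step of the load (see Statistics Management above). A minimal example:
-- Refresh statistics on tables touched by the load
UPDATE STATISTICS fact_daily_slot_performance;

-- For critical join/filter columns, use FULLSCAN for accuracy
UPDATE STATISTICS fact_daily_slot_performance (stat_player_id)
WITH FULLSCAN;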
🚀 V-Order in Warehouse¶
Automatic V-Order¶
All data ingested into Warehouse is automatically V-Order optimized.
Benefits:
- Faster reads across Fabric engines
- Optimized for Power BI Direct Lake
- Better compression
V-Order Compatibility¶
| Operation | V-Order Applied |
|---|---|
| INSERT | Yes |
| COPY INTO | Yes |
| Pipeline Copy | Yes |
| Dataflow | Yes |
📈 Performance Monitoring¶
Query Monitoring Views¶
-- Active queries
SELECT
session_id,
command,
status,
total_elapsed_time,
cpu_time,
logical_reads
FROM sys.dm_exec_requests
WHERE status = 'running';
-- Query history (use DMVs)
SELECT TOP 100
query_hash,
total_elapsed_time,
execution_count,
total_worker_time,
total_logical_reads
FROM sys.dm_exec_query_stats
ORDER BY total_elapsed_time DESC;
Lock Monitoring¶
-- Current locks
SELECT
request_session_id,
resource_type,
resource_description,
request_mode,
request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();
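To see which session is blocking which, sys.dm_exec_requests exposes a blocking_session_id column. A minimal blocking-chain query:
-- Sessions currently blocked, and who is blocking them
SELECT
    session_id,
    blocking_session_id,
    wait_type,
    wait_time,
    command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;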
🔒 Concurrency and Locking¶
Best Practices¶
- Keep transactions short - Minimize lock duration
- Avoid long-running reads during writes - Can cause blocking
- Use retry logic - Handle transient conflicts
Retry Pattern¶
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
def execute_query(connection, query):
    """Execute a query, retrying with exponential backoff on transient failures."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()
🧩 Warehouse Design Patterns¶
Slowly Changing Dimensions (SCD)¶
Type 1 - Overwrite:
UPDATE d
SET email = source.email,
    phone = source.phone
FROM dim_player d
JOIN stg_player_updates source ON d.player_id = source.player_id;
Type 2 - Historical:
-- End current record
UPDATE dim_player
SET is_current = 0, end_date = GETDATE()
WHERE player_id IN (SELECT player_id FROM stg_player_changes)
AND is_current = 1;
-- Insert new record
INSERT INTO dim_player (player_id, email, phone, start_date, is_current)
SELECT player_id, email, phone, GETDATE(), 1
FROM stg_player_changes;
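The two Type 2 statements should succeed or fail together; a failure between them would leave a player with no current record. One way to make them atomic, reusing the transaction pattern from Ingestion Best Practices:
-- Expire the old record and insert the new one as a single unit
BEGIN TRANSACTION;

UPDATE dim_player
SET is_current = 0, end_date = GETDATE()
WHERE player_id IN (SELECT player_id FROM stg_player_changes)
AND is_current = 1;

INSERT INTO dim_player (player_id, email, phone, start_date, is_current)
SELECT player_id, email, phone, GETDATE(), 1
FROM stg_player_changes;

COMMIT TRANSACTION;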
Incremental Loading¶
-- MERGE for upsert
MERGE INTO fact_daily_slot_performance AS target
USING stg_daily_performance AS source
ON target.performance_id = source.performance_id
WHEN MATCHED THEN
UPDATE SET
total_bets = source.total_bets,
total_wins = source.total_wins,
updated_at = GETDATE()
WHEN NOT MATCHED THEN
INSERT (performance_id, date_key, machine_id, total_bets, total_wins)
VALUES (source.performance_id, source.date_key, source.machine_id,
source.total_bets, source.total_wins);
✅ Checklist¶
Warehouse UI Reference¶
Use the Fabric workspace to manage your warehouse. Access settings via the workspace settings cog:

Source: Microsoft Fabric Workspaces Documentation
Initial Setup¶
- Design star schema for analytics
- Define naming conventions
- Plan data type standards
- Identify key statistics columns
- Set up monitoring queries
Ongoing Operations¶
- Update statistics after large loads
- Monitor query performance
- Review lock contention
- Optimize slow queries
- Archive historical data
Performance¶
- Use appropriate data types
- Create statistics on filter/join columns
- Select only needed columns
- Filter early in queries
- Keep transactions short
🔗 Related Documents¶
| Document | Description |
|---|---|
| Lakehouse Setup & Organization | Lakehouse architecture and comparison |
| Spark & Notebooks | Spark-based data processing |
| Decision Guide | Warehouse vs Lakehouse decision criteria |
| Fabric SQL Database | Fabric SQL Database feature documentation |
| Direct Lake | Direct Lake connectivity for Power BI |