Azure Analytics Glossary¶

Comparative positioning note

This document is written from the perspective of Microsoft Azure, Cloud Scale Analytics, and CSA Loom. Any description of third-party or competing products, services, pricing, or capabilities is derived from publicly available documentation and sources believed accurate at the time of writing, and is provided for general comparison only. We do not claim expertise in, or authority over, any non-Microsoft product or service; the respective vendor's official documentation is the authoritative source for their offerings, which may change over time. Nothing here is intended to disparage any vendor — where a competing product has genuine advantages, we aim to note them honestly. Verify all third-party details against the vendor's current official documentation before making decisions.


📚 Terminology Reference: comprehensive glossary of Azure analytics terms, acronyms, and concepts.

A | B | C | D | E | F | G | H | I | J | K | L | M
N | O | P | Q | R | S | T | U | V | W | X | Y | Z

A¶

ACID¶

Atomicity, Consistency, Isolation, Durability: Properties that guarantee database transactions are processed reliably. Delta Lake provides ACID guarantees for data lakes.

Related: Delta Lake Guide

ADF¶

Azure Data Factory: Cloud-based data integration service for creating, scheduling, and orchestrating data workflows.

Related: Azure Data Factory Integration

ADLS¶

Azure Data Lake Storage: Scalable and secure data lake for high-performance analytics workloads. ADLS Gen2 combines the capabilities of ADLS Gen1 and Azure Blob Storage.

Related: Architecture Overview

Apache Spark¶

Open-source distributed computing system for big data processing. Synapse Spark pools run Apache Spark workloads.

Related: Spark Performance

Auto Loader¶

Azure Databricks feature (the cloudFiles Structured Streaming source) for incrementally and efficiently processing new data files as they arrive in cloud storage. Commonly used to load Delta tables.

Related: Auto Loader Tutorial

Azure Active Directory (Azure AD)¶

Microsoft's cloud-based identity and access management service. Now known as Microsoft Entra ID.

Related: Security Best Practices

Azure Purview¶

Unified data governance service that helps manage and govern on-premises, multi-cloud, and SaaS data. Now part of Microsoft Purview.

Related: Azure Purview Integration

Azure Synapse Analytics¶

Unified analytics service that brings together enterprise data warehousing and big data analytics.

Related: Platform Overview

B¶

Batch Processing¶

Processing large volumes of data collected over a period of time. Contrasts with stream processing.

Related: Pipeline Optimization

Broadcast Join¶

Spark optimization technique in which the smaller dataset in a join is broadcast to all executors, so the larger dataset does not need to be shuffled.

Related: Spark Performance
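
A minimal pure-Python sketch of the idea, not Spark code: the small table becomes a lookup dictionary that every partition of the large table can read, so each partition is joined locally. The broadcast_join helper and the sample data are hypothetical.

```python
# Illustrative only: plain Python lists stand in for Spark partitions.
def broadcast_join(large_partitions, small_rows):
    """Join each partition of the large table against a shared copy of the small one."""
    # "Broadcast": build one lookup table that every partition can read.
    lookup = {key: value for key, value in small_rows}
    joined = []
    for partition in large_partitions:
        # Each partition is joined in place; its rows are never redistributed.
        joined.append(
            [(key, amount, lookup[key]) for key, amount in partition if key in lookup]
        )
    return joined

# Hypothetical data: orders in two partitions, plus a small country table.
orders = [[("US", 10), ("DE", 7)], [("US", 3), ("FR", 5)]]
countries = [("US", "United States"), ("DE", "Germany")]
result = broadcast_join(orders, countries)
```

In PySpark, the corresponding hint is pyspark.sql.functions.broadcast, applied to the smaller DataFrame in a join.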

Built-in Serverless Pool¶

Pre-configured serverless SQL pool included with every Synapse workspace. It has no provisioning cost; queries are billed by the amount of data processed.

Related: Serverless SQL Overview

C¶

CDC¶

Change Data Capture: Process of identifying and capturing changes made to data in a database, typically for replication or synchronization.

Related: CDC Tutorial
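
As a rough illustration, the sketch below derives insert, update, and delete events by comparing two snapshots keyed by primary key. This snapshot comparison is an assumption made for brevity; production CDC usually reads the database's change log instead. The capture_changes helper and the data are hypothetical.

```python
def capture_changes(before, after):
    """Compare two snapshots keyed by primary key and list the changes between them."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(("insert", key, after[key]))
        elif key not in after:
            changes.append(("delete", key, before[key]))
        elif before[key] != after[key]:
            changes.append(("update", key, after[key]))
    return changes

# Hypothetical snapshots of the same table at two points in time.
before = {1: "Ada", 2: "Bob", 3: "Cy"}
after = {1: "Ada", 2: "Robert", 4: "Dee"}
changes = capture_changes(before, after)
```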

CETAS¶

CREATE EXTERNAL TABLE AS SELECT: T-SQL statement, available in both dedicated and serverless SQL pools, that creates an external table and exports the results of a query to storage.

Related: Serverless SQL Best Practices

Columnar Storage¶

Data storage format that stores data tables by column rather than by row. Examples: Parquet, ORC.

Related: Performance Optimization
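
A small sketch of why the layout matters, using plain Python structures rather than a real file format: once rows are pivoted into columns, an aggregate over one column reads only that column's values. The to_columns helper and the data are hypothetical.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 75},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}

columns = to_columns(rows)
# An aggregate over one column touches only that column's values.
total_sales = sum(columns["sales"])
```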

Compute Node¶

Individual server in a distributed computing cluster that performs data processing tasks.

Related: Spark Configuration

Concurrency¶

Number of simultaneous operations or queries that can run at the same time.

Related: Performance Optimization

Copy Activity¶

Azure Data Factory activity used to copy data from a source to a destination, with optional schema mapping and format conversion.

Related: Azure Data Factory Integration

D¶

Data Distribution¶

Strategy for spreading data across compute nodes in a distributed system. Types include hash, round-robin, and replicated.

Related: SQL Performance
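
The three strategies can be sketched in plain Python as follows. This is a simplified model, not how a dedicated SQL pool assigns rows internally; the function names, the CRC32 hash, and the node count are assumptions for illustration.

```python
import zlib

def hash_distribute(rows, key, node_count):
    """Send each row to the node chosen by a deterministic hash of its key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[index].append(row)
    return nodes

def round_robin_distribute(rows, node_count):
    """Deal rows to nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(node_count)]
    for position, row in enumerate(rows):
        nodes[position % node_count].append(row)
    return nodes

def replicate(rows, node_count):
    """Give every node a full copy of a (small) table."""
    return [list(rows) for _ in range(node_count)]

rows = [{"customer": name} for name in ["a", "b", "a", "c", "b", "a"]]
hashed = hash_distribute(rows, "customer", 3)
dealt = round_robin_distribute(rows, 3)
copies = replicate(rows, 3)
```

Hash distribution keeps all rows with the same key on one node, which helps joins and aggregations on that key; round-robin gives an even spread with no such guarantee.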

Data Flow¶

Visual data transformation tool in Azure Data Factory and Synapse for building ETL logic without coding.

Related: Integration Guide

Data Lake¶

Storage repository that holds vast amounts of raw data in its native format until needed.

Related: Delta Lakehouse Architecture

Data Lakehouse¶

Architecture that combines the low-cost, open-format storage of a data lake with the transactional guarantees, schema enforcement, and query performance of a data warehouse.

Related: Delta Lakehouse Overview

Data Partitioning¶

Dividing large datasets into smaller, manageable pieces based on specific criteria (e.g., date, region).

Related: Delta Lake Optimization
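
A minimal sketch of partitioning and the pruning it enables, assuming Hive-style column=value partition names. The helpers and the sample events are hypothetical.

```python
from collections import defaultdict

def partition_by(rows, column):
    """Group rows under a Hive-style 'column=value' partition name."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return dict(partitions)

def read_partition(partitions, column, value):
    """Partition pruning: read only the partition that matches the filter."""
    return partitions.get(f"{column}={value}", [])

events = [
    {"date": "2024-01-01", "region": "east"},
    {"date": "2024-01-01", "region": "west"},
    {"date": "2024-01-02", "region": "east"},
]
by_date = partition_by(events, "date")
jan_first = read_partition(by_date, "date", "2024-01-01")
```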

Data Skew¶

Uneven distribution of data across partitions, causing some nodes to process more data than others.

Related: Spark Performance
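
One simple way to quantify skew, offered here as an assumption rather than a standard metric, is the size of the largest partition divided by the mean partition size. A value near 1 means the partitions are balanced.

```python
def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the mean partition size."""
    mean_size = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean_size

# Hypothetical row counts per partition.
balanced = skew_ratio([100, 100, 100, 100])
skewed = skew_ratio([10, 10, 10, 370])
```

In the skewed case, one task processes 3.7 times the average amount of data, so the stage finishes only when that task does.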

Data Warehouse Unit (DWU)¶

Measure of compute resources (CPU, memory, I/O) allocated to a dedicated SQL pool.

Related: Performance Optimization

Dedicated SQL Pool¶

Provisioned resource offering enterprise-scale data warehousing capabilities with guaranteed resources.

Related: Architecture Overview

Delta Lake¶

Open-source storage layer that brings ACID transactions to data lakes.

Related: Delta Lake Guide

Delta Table¶

Table format in Delta Lake that supports ACID transactions, schema enforcement, and time travel.

Related: Table Optimization

DIU¶

Data Integration Unit: Measure of compute power in Azure Data Factory representing a combination of CPU, memory, and network resources.

Related: Pipeline Optimization

Driver¶

Master process in Apache Spark that coordinates and schedules work across executors.

Related: Spark Configuration

DQS¶

Data Quality Services: Legacy SQL Server feature (introduced in SQL Server 2012) for cleansing, matching, and standardizing reference data using a knowledge base of business rules. DQS provides a Data Quality Server that hosts knowledge bases and a Data Quality Client used by data stewards to author cleansing and matching projects.

Why it matters here: DQS does not ship with Azure SQL Database, Azure SQL Managed Instance, Synapse, or Fabric. Customers migrating from on-prem SQL Server need a replacement path. In Cloud Scale Analytics, the canonical replacement is a combination of:

Microsoft Purview Data Quality (formerly Azure Purview DQ) for catalog-integrated quality rules, scoring, and lineage
dbt tests + Great Expectations for in-pipeline assertion-based quality gates (ADR-0013 — dbt as canonical transformation)
CSA Loom data-product editor for surface-level Purview-backed quality scores when running inside an Azure Government tenant where Fabric isn't available

Third-party options used by some federal customers include Informatica IDQ and Ataccama ONE.

DW Unit (DWU)¶

See Data Warehouse Unit.

E¶

ETL¶

Extract, Transform, Load: Traditional data integration process that extracts data from sources, transforms it, then loads it into the destination.

Related: Integration Guide

ELT¶

Extract, Load, Transform: Modern approach that loads raw data first, then transforms it in the destination system.

Related: Delta Lakehouse Architecture

Executor¶

Worker process in Apache Spark that runs tasks and stores data for the application.

Related: Spark Configuration

External Table¶

Table definition that references data stored outside the database, typically in a data lake.

Related: Serverless SQL Guide

F¶

Fault Tolerance¶

System's ability to continue operating properly in the event of failures.

Related: Best Practices

File Format¶

Structure in which data is stored. Common formats: Parquet, CSV, JSON, ORC, Avro.

Related: Serverless SQL Best Practices

Firewall Rule¶

Network security rule that controls incoming and outgoing traffic to Azure resources.

Related: Network Security

G¶

Graph Database¶

Database designed to treat relationships between data as equally important as the data itself.

Related: Architecture Patterns

H¶

Hive Metastore¶

Central repository of metadata for Hadoop, used by Spark to store table schemas and partition information.

Related: Shared Metadata

Hot Path¶

Real-time data processing path for immediate insights. Contrasts with cold path (batch processing).

Related: Real-time Analytics

I¶

Idempotent¶

Operation that produces the same result regardless of how many times it's executed.

Related: Pipeline Best Practices
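
The sketch below contrasts a non-idempotent load with an idempotent one when a pipeline retry replays the same batch. Both loaders and the batch are hypothetical.

```python
def append_load(target, batch):
    """Not idempotent: replaying the batch duplicates its rows."""
    target.extend(batch)

def keyed_load(target, batch):
    """Idempotent: rows are written by key, so a replay overwrites instead of duplicating."""
    for row in batch:
        target[row["id"]] = row

batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

appended, keyed = [], {}
for _ in range(2):  # simulate a retry that replays the same batch
    append_load(appended, batch)
    keyed_load(keyed, batch)
```

After the replay, the appended target holds four rows while the keyed target still holds two.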

Indexing¶

Database optimization technique that improves query performance by creating efficient data lookup structures.

Related: SQL Performance

Integration Runtime¶

Compute infrastructure used by Azure Data Factory to provide data integration across different network environments.

Related: Azure Data Factory Integration

J¶

JSON¶

JavaScript Object Notation: Lightweight data interchange format that is easy to read and write.

Related: Serverless SQL Guide

K¶

Key Vault¶

Azure service for securely storing and accessing secrets, keys, and certificates.

Related: Security Best Practices

L¶

Lakehouse¶

See Data Lakehouse.

Lazy Evaluation¶

Execution model where transformations are not executed until an action is called. Used in Apache Spark.

Related: Spark Performance
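
Python generators give a rough analogy, not Spark itself: building the generator plays the role of a transformation and does no work, and consuming it plays the role of an action. The lazy_map helper is hypothetical.

```python
executed = []  # records which values were actually processed

def lazy_map(function, rows):
    """A 'transformation': builds a generator and does no work yet."""
    return (function(row) for row in rows)

def double(value):
    executed.append(value)
    return value * 2

plan = lazy_map(double, [1, 2, 3])
work_before_action = list(executed)  # still empty: nothing has run

result = list(plan)  # the 'action' forces evaluation
```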

Lineage¶

Tracking of data's origin, transformations, and movement through systems.

Related: Azure Purview Integration

Linked Service¶

Connection definition to external data sources or compute resources in Azure Synapse or Data Factory.

Related: Integration Guide

M¶

Managed Identity¶

Microsoft Entra ID (formerly Azure AD) identity that Azure creates and manages for a resource, eliminating the need for credentials in code.

Related: Security Best Practices

Managed Private Endpoint¶

Private endpoint managed by Azure Synapse for secure connectivity to Azure services.

Related: Private Link Architecture

Mapping Data Flow¶

Code-free data transformation feature in Azure Data Factory and Synapse.

Related: Integration Guide

Medallion Architecture¶

Data architecture pattern with bronze (raw), silver (cleaned), and gold (aggregated) layers.

Related: Delta Lakehouse Architecture

Merge Operation¶

Upsert operation (update if exists, insert if not) supported by Delta Lake.

Related: CDC Tutorial
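
The matched and not-matched branches of a merge can be sketched on plain Python rows as follows. This is a conceptual model only; Delta Lake exposes the operation through the SQL MERGE INTO statement. The merge helper and the data are hypothetical.

```python
def merge(target, source, key):
    """Upsert source rows into target: update on key match, insert otherwise."""
    merged = {row[key]: dict(row) for row in target}  # copy, so target is untouched
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)   # matched: update the existing row
        else:
            merged[row[key]] = dict(row)   # not matched: insert a new row
    return sorted(merged.values(), key=lambda row: row[key])

target = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
source = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]
merged = merge(target, source, "id")
```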

Metadata¶

Data that provides information about other data (e.g., schema, statistics, lineage).

Related: Shared Metadata

MPP¶

Massively Parallel Processing: Architecture that uses many processors working in parallel to quickly execute large-scale data operations.

Related: Architecture Overview

N¶

Notebook¶

Interactive document combining code, visualizations, and narrative text. Synapse supports Spark notebooks.

Related: PySpark Fundamentals

NSG¶

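Network Security Group: Azure resource containing security rules that allow or deny inbound and outbound network traffic to resources in a virtual network. Distinct from the Azure Firewall service.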

Related: Network Security

O¶

OPENROWSET¶

T-SQL function in serverless SQL pool for querying files in data lakes without creating external tables.

Related: Serverless SQL Guide

Optimize¶

Delta Lake command to compact small files into larger ones for better query performance.

Related: Table Optimization

ORC¶

Optimized Row Columnar: Columnar storage file format optimized for Hadoop workloads.

Related: Performance Optimization

P¶

Parquet¶

Open-source columnar storage format designed for efficient data storage and retrieval.

Related: Serverless SQL Guide

Partition¶

Logical division of a large dataset for improved query performance and manageability.

Related: Delta Lake Optimization

Pipeline¶

Workflow that orchestrates data movement and transformation activities.

Related: Pipeline Optimization

PolyBase¶

Data virtualization feature for querying external data sources using T-SQL.

Related: SQL Performance

Private Endpoint¶

Network interface that connects privately and securely to Azure services using Azure Private Link.

Related: Private Link Architecture

PySpark¶

Python API for Apache Spark, enabling Spark programming using Python.

Related: PySpark Fundamentals

Q¶

Query Optimization¶

Process of improving query performance through various techniques like indexing, statistics, and query rewriting.

Related: Query Optimization

R¶

RBAC¶

Role-Based Access Control: Authorization system for managing who has access to Azure resources and what they can do.

Related: Security Best Practices

RDD¶

Resilient Distributed Dataset: Fundamental data structure in Apache Spark representing an immutable distributed collection.

Related: Spark Performance

Resource Group¶

Container that holds related resources for an Azure solution.

Related: Architecture Overview

S¶

Schema Evolution¶

Ability to handle changes in data schema over time without breaking existing queries.

Related: Delta Lake Guide

Schema on Read¶

Approach where data schema is applied when data is read, not when it's written. Used in data lakes.

Related: Serverless SQL Guide

Serverless SQL Pool¶

On-demand SQL query service with pay-per-query pricing model. No infrastructure to manage.

Related: Serverless SQL Overview

Service Principal¶

Identity created for use with applications, services, and automation tools to access Azure resources.

Related: Security Best Practices

Shuffle¶

Expensive operation in Spark where data is redistributed across partitions.

Related: Spark Performance

SLA¶

Service Level Agreement: Commitment between a service provider and a customer regarding performance and availability.

Related: Best Practices

Slowly Changing Dimension (SCD)¶

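Dimension whose attributes change slowly and unpredictably over time rather than on a regular schedule. Common types: Type 1 (overwrite the old value), Type 2 (add a new row and keep history), and Type 3 (add a column holding the previous value).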

Related: CDC Tutorial
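
A minimal sketch of Type 2 handling, assuming hypothetical valid_from, valid_to, and is_current tracking columns: when an attribute changes, the current row is closed and a new current row is appended, so history is preserved.

```python
def apply_scd2(dimension, key, new_attrs, effective_date):
    """Close the current row for `key` if its attributes changed, then append a new current row."""
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return dimension  # no change, nothing to record
            row["is_current"] = False
            row["valid_to"] = effective_date
    dimension.append({
        "key": key, "attrs": new_attrs,
        "valid_from": effective_date, "valid_to": None, "is_current": True,
    })
    return dimension

dim = []
apply_scd2(dim, "C1", {"city": "Austin"}, "2024-01-01")
apply_scd2(dim, "C1", {"city": "Austin"}, "2024-03-01")  # unchanged, ignored
apply_scd2(dim, "C1", {"city": "Denver"}, "2024-06-01")  # change creates a new version
```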

Spark Pool¶

Managed Apache Spark cluster in Azure Synapse Analytics.

Related: Spark Configuration

SQL Pool¶

Collective term for both dedicated SQL pools and serverless SQL pools in Synapse.

Related: Architecture Overview

Statistics¶

Metadata about data distribution that helps the query optimizer create efficient execution plans.

Related: SQL Performance

Storage Account¶

Azure resource that provides cloud storage for data objects including blobs, files, queues, and tables.

Related: Architecture Overview

Streaming¶

Continuous processing of data in real-time as it arrives.

Related: Real-time Analytics

Synapse Studio¶

Web-based integrated development environment for Azure Synapse Analytics.

Related: Environment Setup

Synapse Workspace¶

Collaborative environment for cloud-based enterprise analytics in Azure.

Related: Platform Overview

T¶

Table Distribution¶

Strategy for spreading table data across compute nodes. Types: hash, round-robin, replicated.

Related: SQL Performance

Time Travel¶

Delta Lake feature allowing queries of historical versions of data.

Related: Delta Lake Guide
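
Conceptually, time travel works because every commit produces a new, numbered table version and earlier versions remain readable. The sketch below models that with full snapshots for simplicity; Delta Lake itself records changes in a transaction log rather than copying the table. The VersionedTable class is hypothetical.

```python
class VersionedTable:
    """Keeps every committed snapshot so earlier versions stay queryable."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def commit(self, rows):
        """Append rows and record the result as a new version; return its number."""
        self._versions.append(self._versions[-1] + list(rows))
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest version, or a specific historical one."""
        index = len(self._versions) - 1 if version is None else version
        return list(self._versions[index])

table = VersionedTable()
v1 = table.commit([{"id": 1}])
v2 = table.commit([{"id": 2}])
```

Reading with version=1 returns the table as it was after the first commit, which is the behavior a VERSION AS OF query provides in Delta Lake SQL.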

Transformation¶

Operation that modifies data from source format to desired destination format.

Related: Integration Guide

Trigger¶

Automation that determines when a pipeline should run (scheduled, tumbling window, event-based).

Related: Pipeline Optimization

U¶

Upsert¶

Combination of update and insert operations. Updates existing records or inserts new ones if they don't exist.

Related: CDC Tutorial

V¶

Vacuum¶

Delta Lake command to remove old data files that are no longer referenced.

Related: Table Optimization

VNet¶

Virtual Network: Isolated network in Azure that enables Azure resources to securely communicate with each other.

Related: Network Security

VNet Integration¶

Connecting Azure services to a virtual network for enhanced security and isolation.

Related: Private Link Architecture

W¶

Watermark¶

Marker used in incremental data loading to track which data has been processed.

Related: Pipeline Optimization
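
A sketch of a high-watermark incremental load, assuming each source row carries a hypothetical modified timestamp in ISO 8601 form (which sorts correctly as text). Each run loads only rows newer than the stored watermark and then advances it.

```python
def incremental_load(source_rows, watermark):
    """Return rows modified after the watermark, plus the advanced watermark."""
    new_rows = [row for row in source_rows if row["modified"] > watermark]
    # Advance only if something was loaded; otherwise keep the old value.
    new_watermark = max((row["modified"] for row in new_rows), default=watermark)
    return new_rows, new_watermark

source = [
    {"id": 1, "modified": "2024-01-01T00:00:00"},
    {"id": 2, "modified": "2024-01-02T00:00:00"},
    {"id": 3, "modified": "2024-01-03T00:00:00"},
]
first_batch, watermark = incremental_load(source, "2024-01-01T00:00:00")
second_batch, watermark_after = incremental_load(source, watermark)
```

The second run finds nothing new, so it loads no rows and leaves the watermark unchanged.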

Workspace¶

See Synapse Workspace.

X¶

XML¶

Extensible Markup Language: Markup language for encoding documents in a format that is both human-readable and machine-readable.

Y¶

YARN¶

Yet Another Resource Negotiator: Resource management layer in the Hadoop ecosystem. Synapse users do not configure it directly, but it is relevant for understanding how Spark allocates cluster resources.

Z¶

Z-Order¶

Delta Lake optimization technique that co-locates related information in the same set of files for faster queries.

Related: Table Optimization
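
The general idea behind Z-ordering is the Z-order (Morton) curve: interleaving the bits of several column values yields one sort key that keeps rows close together when they are close in every dimension. The sketch below shows the classic two-column interleaving; it illustrates the curve itself and is not a description of Delta Lake's internal implementation. The z_value helper and the points are hypothetical.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    value = 0
    for bit in range(bits):
        value |= ((x >> bit) & 1) << (2 * bit)       # x fills the even bit positions
        value |= ((y >> bit) & 1) << (2 * bit + 1)   # y fills the odd bit positions
    return value

points = [(3, 3), (0, 0), (2, 3), (0, 1), (1, 0), (3, 2)]
ordered = sorted(points, key=lambda p: z_value(*p))
```

Sorting by the interleaved value groups the small coordinates together and the large ones together, so files written in this order can be skipped by filters on either column.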

Zone Redundancy¶

Azure storage redundancy option that replicates data across availability zones within a region.

Related: Best Practices

Acronym Quick Reference¶

Acronym	Full Term	Category
ACID	Atomicity, Consistency, Isolation, Durability	Database
ADF	Azure Data Factory	Service
ADLS	Azure Data Lake Storage	Service
CDC	Change Data Capture	Technique
CETAS	CREATE EXTERNAL TABLE AS SELECT	SQL
DIU	Data Integration Unit	Performance
DWU	Data Warehouse Unit	Performance
ELT	Extract, Load, Transform	Pattern
ETL	Extract, Transform, Load	Pattern
MPP	Massively Parallel Processing	Architecture
NSG	Network Security Group	Security
ORC	Optimized Row Columnar	File Format
RBAC	Role-Based Access Control	Security
RDD	Resilient Distributed Dataset	Spark
SCD	Slowly Changing Dimension	Data Warehouse
SLA	Service Level Agreement	Operations
VNet	Virtual Network	Networking
YARN	Yet Another Resource Negotiator	Hadoop

Resource	Description
Architecture Overview	Architectural concepts and patterns
Best Practices	Implementation best practices
Tutorials	Hands-on learning materials
Code Examples	Practical code samples
FAQ	Frequently asked questions

💡 Tip: Use Ctrl+F (or Cmd+F on Mac) to quickly search for specific terms on this page.

Last Updated: January 2025
