Azure Analytics Glossary

📚 Terminology Reference

Comprehensive glossary of Azure analytics terms, acronyms, and concepts.



A

ACID

Atomicity, Consistency, Isolation, Durability. Properties that guarantee database transactions are processed reliably. Delta Lake provides ACID guarantees for data lakes.

Related: Delta Lake Guide

ADF

Azure Data Factory. Cloud-based data integration service for creating, scheduling, and orchestrating data workflows.

Related: Azure Data Factory Integration

ADLS

Azure Data Lake Storage. Scalable and secure data lake for high-performance analytics workloads. ADLS Gen2 combines the capabilities of ADLS Gen1 and Azure Blob Storage.

Related: Architecture Overview

Apache Spark

Open-source distributed computing system for big data processing. Synapse Spark pools run Apache Spark workloads.

Related: Spark Performance

Auto Loader

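Azure Databricks feature (the `cloudFiles` Structured Streaming source) for incrementally and efficiently processing new data files as they arrive in cloud storage. It is commonly used to ingest files into Delta tables.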

Related: Auto Loader Tutorial

Azure Active Directory (Azure AD)

Microsoft's cloud-based identity and access management service. Now known as Microsoft Entra ID.

Related: Security Best Practices

Azure Purview

Unified data governance service that helps manage and govern on-premises, multi-cloud, and SaaS data. Now part of Microsoft Purview.

Related: Azure Purview Integration

Azure Synapse Analytics

Unified analytics service that brings together enterprise data warehousing and big data analytics.

Related: Platform Overview


B

Batch Processing

Processing large volumes of data collected over a period of time. Contrasts with stream processing.

Related: Pipeline Optimization

Broadcast Join

Spark optimization technique in which the smaller dataset in a join is broadcast to every executor, so the larger dataset can be joined in place without being shuffled.

Related: Spark Performance
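The idea can be sketched without Spark. The following is a minimal illustration in plain Python, not Spark's implementation: the function name `broadcast_join` and the sample data are hypothetical, and each inner list stands in for one partition held by one executor.

```python
from typing import Dict, List, Tuple


def broadcast_join(
    large_partitions: List[List[Tuple[int, str]]],
    small_table: Dict[int, str],
) -> List[List[Tuple[int, str, str]]]:
    """Inner-join every partition of the large side against a full copy of the small side."""
    joined = []
    for partition in large_partitions:
        # Each "executor" receives its own copy of the small table, so rows of
        # the large side never have to move between partitions (no shuffle).
        local_copy = dict(small_table)
        joined.append(
            [
                (key, value, local_copy[key])
                for key, value in partition
                if key in local_copy
            ]
        )
    return joined


# Hypothetical data: two partitions of orders, one small lookup table.
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
countries = {1: "DE", 2: "FR"}
result = broadcast_join(orders, countries)
```

In PySpark, the corresponding hint is `pyspark.sql.functions.broadcast`, applied to the smaller DataFrame in the join.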

Built-in Serverless Pool

Pre-configured serverless SQL pool included with every Synapse workspace. There is no charge to provision it; queries are billed by the amount of data they process.

Related: Serverless SQL Overview


C

CDC

Change Data Capture. Process of identifying and capturing changes made to data in a database, typically for replication or synchronization.

Related: CDC Tutorial
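As a rough illustration of what a change feed contains, the sketch below derives insert, update, and delete events by comparing two snapshots of a table keyed by primary key. This is a simplification: production CDC usually reads the database transaction log rather than diffing snapshots, and the name `diff_snapshots` is hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Change = Tuple[str, Any, Optional[Any]]


def diff_snapshots(before: Dict[Any, Any], after: Dict[Any, Any]) -> List[Change]:
    """Return (operation, key, new_value) events that turn `before` into `after`."""
    changes: List[Change] = []
    for key, row in after.items():
        if key not in before:
            changes.append(("insert", key, row))
        elif before[key] != row:
            changes.append(("update", key, row))
    for key in before:
        if key not in after:
            changes.append(("delete", key, None))
    return changes


# Hypothetical snapshots: key 2 changed, key 3 is new, key 4 was removed.
before = {1: "a", 2: "b", 4: "d"}
after = {1: "a", 2: "B", 3: "c"}
changes = diff_snapshots(before, after)
```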

CETAS

CREATE EXTERNAL TABLE AS SELECT. T-SQL statement, available in serverless and dedicated SQL pools, that creates an external table and exports the results of a query to storage.

Related: Serverless SQL Best Practices

Columnar Storage

Data storage format that stores data tables by column rather than by row. Examples: Parquet, ORC.

Related: Performance Optimization
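The difference between row and column layout can be shown with plain Python structures. This is only a conceptual sketch (real formats such as Parquet add encoding, compression, and row groups); `to_columns` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot a list of row records into one list of values per column."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 5.0},
]
columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# instead of scanning every field of every row.
total_amount = sum(columns["amount"])
```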

Compute Node

Individual server in a distributed computing cluster that performs data processing tasks.

Related: Spark Configuration

Concurrency

Number of operations or queries that can run at the same time.

Related: Performance Optimization

Copy Activity

Azure Data Factory activity used to copy data from a source to a destination, with optional schema mapping and format conversion along the way.

Related: Azure Data Factory Integration


D

Data Distribution

Strategy for spreading data across compute nodes in a distributed system. Types include hash, round-robin, and replicated.

Related: SQL Performance
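The three strategies differ only in how a row is assigned to a node. The sketch below shows that assignment in plain Python under simplifying assumptions: the function names are hypothetical, and `zlib.crc32` stands in for whatever hash function a real engine uses.

```python
import zlib
from typing import Dict, List


def hash_distribute(keys: List[str], node_count: int) -> Dict[int, List[str]]:
    """Equal keys always land on the same node."""
    nodes: Dict[int, List[str]] = {n: [] for n in range(node_count)}
    for key in keys:
        nodes[zlib.crc32(key.encode("utf-8")) % node_count].append(key)
    return nodes


def round_robin_distribute(keys: List[str], node_count: int) -> Dict[int, List[str]]:
    """Rows are dealt out evenly, regardless of their values."""
    nodes: Dict[int, List[str]] = {n: [] for n in range(node_count)}
    for position, key in enumerate(keys):
        nodes[position % node_count].append(key)
    return nodes


def replicate(keys: List[str], node_count: int) -> Dict[int, List[str]]:
    """Every node keeps a full copy of the data."""
    return {n: list(keys) for n in range(node_count)}


hashed = hash_distribute(["x", "y", "x", "z", "x"], 3)
dealt = round_robin_distribute(["a", "b", "c", "d", "e"], 2)
copied = replicate(["a", "b"], 3)
```

Hash distribution keeps matching keys together, which helps joins and aggregations on that key; round-robin gives even sizes but no co-location; replication trades storage for join speed on small tables.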

Data Flow

Visual data transformation tool in Azure Data Factory and Synapse for building ETL logic without coding.

Related: Integration Guide

Data Lake

Storage repository that holds vast amounts of raw data in its native format until needed.

Related: Delta Lakehouse Architecture

Data Lakehouse

Architecture that combines the low-cost, flexible storage of a data lake with the transactional guarantees and query performance of a data warehouse.

Related: Delta Lakehouse Overview

Data Partitioning

Dividing large datasets into smaller, manageable pieces based on specific criteria (e.g., date, region).

Related: Delta Lake Optimization

Data Skew

Uneven distribution of data across partitions, causing some nodes to process more data than others.

Related: Spark Performance
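Skew is easy to reproduce: hash-partition a dataset in which one key dominates and count the rows per partition. The sketch below does that in plain Python and also shows key salting, one common mitigation. The names `partition_sizes` and `salt_keys`, the data, and the use of `zlib.crc32` as the partitioning hash are all illustrative assumptions.

```python
import zlib
from typing import List


def partition_sizes(keys: List[str], partition_count: int) -> List[int]:
    """Count how many rows a hash partitioner would send to each partition."""
    sizes = [0] * partition_count
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % partition_count] += 1
    return sizes


def salt_keys(keys: List[str], salt_buckets: int) -> List[str]:
    """Append a rotating suffix so one hot key becomes several distinct keys."""
    return [f"{key}#{position % salt_buckets}" for position, key in enumerate(keys)]


# Hypothetical workload: one customer accounts for 90 of 100 rows.
keys = ["big-customer"] * 90 + [f"customer-{i}" for i in range(10)]

skewed = partition_sizes(keys, 4)                # one partition holds at least 90 rows
salted = partition_sizes(salt_keys(keys, 8), 4)  # hot key is split across up to 8 salted keys
```

Salting changes the join or grouping key, so the other side of a join (or a second aggregation step) has to account for the suffix.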

Data Warehouse Unit (DWU)

Measure of compute resources (CPU, memory, I/O) allocated to a dedicated SQL pool.

Related: Performance Optimization

Dedicated SQL Pool

Provisioned resource offering enterprise-scale data warehousing capabilities with guaranteed resources.

Related: Architecture Overview

Delta Lake

Open-source storage layer that brings ACID transactions to data lakes.

Related: Delta Lake Guide

Delta Table

Table format in Delta Lake that supports ACID transactions, schema enforcement, and time travel.

Related: Table Optimization

DIU

Data Integration Unit. Measure of compute power in Azure Data Factory representing a combination of CPU, memory, and network resources.

Related: Pipeline Optimization

Driver

Master process in Apache Spark that coordinates and schedules work across executors.

Related: Spark Configuration

DW Unit (DWU)

See Data Warehouse Unit.


E

ETL

Extract, Transform, Load. Traditional data integration process that extracts data from sources, transforms it, and then loads it into the destination.

Related: Integration Guide

ELT

Extract, Load, Transform. Modern approach that loads raw data first, then transforms it in the destination system.

Related: Delta Lakehouse Architecture

Executor

Worker process in Apache Spark that runs tasks and stores data for the application.

Related: Spark Configuration

External Table

Table definition that references data stored outside the database, typically in a data lake.

Related: Serverless SQL Guide


F

Fault Tolerance

System's ability to continue operating properly in the event of failures.

Related: Best Practices

File Format

Structure in which data is stored. Common formats: Parquet, CSV, JSON, ORC, Avro.

Related: Serverless SQL Best Practices

Firewall Rule

Network security rule that controls incoming and outgoing traffic to Azure resources.

Related: Network Security


G

Graph Database

Database designed to treat relationships between data as equally important as the data itself.

Related: Architecture Patterns


H

Hive Metastore

Central repository of metadata for Hadoop, used by Spark to store table schemas and partition information.

Related: Shared Metadata

Hot Path

Real-time data processing path for immediate insights. Contrasts with cold path (batch processing).

Related: Real-time Analytics


I

Idempotent

Operation that produces the same result regardless of how many times it's executed.

Related: Pipeline Best Practices
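A small contrast makes the property concrete. In the sketch below, which uses hypothetical function names and in-memory structures in place of real tables, rerunning the append-style load duplicates rows, while the keyed load leaves the target unchanged on a rerun.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def append_load(target: List[Row], batch: List[Row]) -> None:
    """Not idempotent: every rerun adds the batch again."""
    target.extend(batch)


def keyed_overwrite_load(target: Dict[Any, Row], batch: List[Row]) -> None:
    """Idempotent: a rerun rewrites the same keys with the same values."""
    for row in batch:
        target[row["id"]] = row


batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

appended: List[Row] = []
keyed: Dict[Any, Row] = {}
for _ in range(2):  # simulate a pipeline retry that runs the same batch twice
    append_load(appended, batch)
    keyed_overwrite_load(keyed, batch)
```

This is why retry-safe pipelines tend to prefer overwrite-by-partition or merge-by-key writes over blind appends.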

Indexing

Database optimization technique that improves query performance by creating efficient data lookup structures.

Related: SQL Performance

Integration Runtime

Compute infrastructure used by Azure Data Factory to provide data integration across different network environments.

Related: Azure Data Factory Integration


J

JSON

JavaScript Object Notation. Lightweight data interchange format that is easy to read and write.

Related: Serverless SQL Guide


K

Key Vault

Azure service for securely storing and accessing secrets, keys, and certificates.

Related: Security Best Practices


L

Lakehouse

See Data Lakehouse.

Lazy Evaluation

Execution model where transformations are not executed until an action is called. Used in Apache Spark.

Related: Spark Performance
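Python generators behave in a loosely similar way and make a convenient analogy. The sketch below is not Spark code: `lazy_map` is a hypothetical stand-in for a transformation, and `list(...)` plays the role of an action.

```python
from typing import Callable, Iterable, Iterator, List

executed: List[int] = []  # records which rows have actually been processed


def lazy_map(func: Callable[[int], int], rows: Iterable[int]) -> Iterator[int]:
    """A 'transformation': describes the work but performs none until consumed."""
    for row in rows:
        executed.append(row)
        yield func(row)


pipeline = lazy_map(lambda x: x * 2, [1, 2, 3])  # nothing has run yet
ran_before_action = list(executed)               # snapshot taken before the action

result = list(pipeline)                          # the 'action' triggers execution
```

Deferring execution lets the engine see the whole chain of transformations before running it, which is what allows Spark to optimize the plan.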

Lineage

Tracking of data's origin, transformations, and movement through systems.

Related: Azure Purview Integration

Linked Service

Connection definition to external data sources or compute resources in Azure Synapse or Data Factory.

Related: Integration Guide


M

Managed Identity

Microsoft Entra ID (formerly Azure AD) identity that Azure creates and manages automatically for a resource, eliminating the need to store credentials in code.

Related: Security Best Practices

Managed Private Endpoint

Private endpoint managed by Azure Synapse for secure connectivity to Azure services.

Related: Private Link Architecture

Mapping Data Flow

Code-free data transformation feature in Azure Data Factory and Synapse.

Related: Integration Guide

Medallion Architecture

Data architecture pattern with bronze (raw), silver (cleaned), and gold (aggregated) layers.

Related: Delta Lakehouse Architecture
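The three layers can be pictured as successive refinements of the same data. The sketch below uses in-memory lists in place of lake tables; the record shape and the functions `to_silver` and `to_gold` are hypothetical.

```python
from typing import Any, Dict, List

# Bronze: raw records exactly as they arrived, including bad values.
bronze: List[Dict[str, str]] = [
    {"region": " eu ", "amount": "10.5"},
    {"region": "US", "amount": "oops"},
    {"region": "eu", "amount": "4.5"},
]


def to_silver(rows: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Silver: typed, normalized records; rows that fail parsing are dropped."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would usually quarantine these rows
        cleaned.append({"region": row["region"].strip().upper(), "amount": amount})
    return cleaned


def to_gold(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Gold: business-level aggregate, here the total amount per region."""
    totals: Dict[str, float] = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
```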

Merge Operation

Upsert operation (update if exists, insert if not) supported by Delta Lake.

Related: CDC Tutorial
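The matched and not-matched branches of a merge can be sketched over in-memory rows. This is an illustration of the semantics only, not Delta Lake's API; the function `merge` and its `key` parameter are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge(target: List[Row], source: List[Row], key: str = "id") -> Dict[str, int]:
    """Update target rows whose key matches a source row; insert the rest."""
    index = {row[key]: row for row in target}
    stats = {"updated": 0, "inserted": 0}
    for row in source:
        if row[key] in index:
            index[row[key]].update(row)  # WHEN MATCHED THEN UPDATE
            stats["updated"] += 1
        else:
            new_row = dict(row)
            target.append(new_row)       # WHEN NOT MATCHED THEN INSERT
            index[row[key]] = new_row
            stats["inserted"] += 1
    return stats


target = [{"id": 1, "name": "old"}]
source = [{"id": 1, "name": "new"}, {"id": 2, "name": "added"}]
stats = merge(target, source)
```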

Metadata

Data that provides information about other data (e.g., schema, statistics, lineage).

Related: Shared Metadata

MPP

Massively Parallel Processing. Architecture that uses many processors working in parallel to quickly execute large-scale data operations.

Related: Architecture Overview


N

Notebook

Interactive document combining code, visualizations, and narrative text. Synapse supports Spark notebooks.

Related: PySpark Fundamentals

NSG

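Network Security Group. Azure resource containing security rules that allow or deny inbound and outbound network traffic to resources in a virtual network. Not to be confused with Azure Firewall, which is a separate managed service.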

Related: Network Security


O

OPENROWSET

T-SQL function in serverless SQL pool for querying files in data lakes without creating external tables.

Related: Serverless SQL Guide

Optimize

Delta Lake command to compact small files into larger ones for better query performance.

Related: Table Optimization

ORC

Optimized Row Columnar. Columnar storage file format optimized for Hadoop workloads.

Related: Performance Optimization


P

Parquet

Open-source columnar storage format designed for efficient data storage and retrieval.

Related: Serverless SQL Guide

Partition

Logical division of a large dataset for improved query performance and manageability.

Related: Delta Lake Optimization

Pipeline

Workflow that orchestrates data movement and transformation activities.

Related: Pipeline Optimization

PolyBase

Data virtualization feature for querying external data sources using T-SQL.

Related: SQL Performance

Private Endpoint

Network interface that connects privately and securely to Azure services using Azure Private Link.

Related: Private Link Architecture

PySpark

Python API for Apache Spark, enabling Spark programming using Python.

Related: PySpark Fundamentals


Q

Query Optimization

Process of improving query performance through various techniques like indexing, statistics, and query rewriting.

Related: Query Optimization


R

RBAC

Role-Based Access Control. Authorization system for managing who has access to Azure resources and what they can do.

Related: Security Best Practices

RDD

Resilient Distributed Dataset. Fundamental data structure in Apache Spark representing an immutable distributed collection.

Related: Spark Performance

Resource Group

Container that holds related resources for an Azure solution.

Related: Architecture Overview


S

Schema Evolution

Ability to handle changes in data schema over time without breaking existing queries.

Related: Delta Lake Guide

Schema on Read

Approach where data schema is applied when data is read, not when it's written. Used in data lakes.

Related: Serverless SQL Guide

Serverless SQL Pool

On-demand SQL query service billed by the amount of data each query processes. No infrastructure to manage.

Related: Serverless SQL Overview

Service Principal

Identity created for use with applications, services, and automation tools to access Azure resources.

Related: Security Best Practices

Shuffle

Expensive operation in Spark where data is redistributed across partitions.

Related: Spark Performance

SLA

Service Level Agreement. Commitment between service provider and customer regarding performance and availability.

Related: Best Practices

Slowly Changing Dimension (SCD)

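Dimension whose attributes change slowly and unpredictably over time rather than on a regular schedule. Common handling strategies: Type 1 (overwrite the old value), Type 2 (add a new row and keep the full history), and Type 3 (add a column that holds the previous value).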

Related: CDC Tutorial
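Type 2 is the variant that most often needs code. A minimal sketch, assuming an in-memory list as the dimension table and hypothetical column names (`valid_from`, `valid_to`, `is_current`):

```python
from datetime import date
from typing import Any, Dict, List


def apply_scd2(
    dimension: List[Dict[str, Any]],
    key: str,
    new_attrs: Dict[str, Any],
    effective: date,
) -> None:
    """Close the current row for `key` if its attributes changed, then add a new current row."""
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return  # nothing changed, keep the existing current row
            row["is_current"] = False
            row["valid_to"] = effective
    dimension.append(
        {
            "key": key,
            "attrs": new_attrs,
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
    )


dim: List[Dict[str, Any]] = []
apply_scd2(dim, "c1", {"city": "Berlin"}, date(2024, 1, 1))
apply_scd2(dim, "c1", {"city": "Munich"}, date(2025, 1, 1))  # change: new version
apply_scd2(dim, "c1", {"city": "Munich"}, date(2025, 6, 1))  # no change: no new row
```

After these calls the dimension holds two rows for `c1`: the closed Berlin row and the current Munich row.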

Spark Pool

Managed Apache Spark cluster in Azure Synapse Analytics.

Related: Spark Configuration

SQL Pool

Collective term for both dedicated SQL pools and serverless SQL pools in Synapse.

Related: Architecture Overview

Statistics

Metadata about data distribution that helps query optimizer create efficient execution plans.

Related: SQL Performance

Storage Account

Azure resource that provides cloud storage for data objects including blobs, files, queues, and tables.

Related: Architecture Overview

Streaming

Continuous processing of data in real time, or near real time, as it arrives.

Related: Real-time Analytics

Synapse Studio

Web-based integrated development environment for Azure Synapse Analytics.

Related: Environment Setup

Synapse Workspace

Collaborative environment for cloud-based enterprise analytics in Azure.

Related: Platform Overview


T

Table Distribution

Strategy for spreading table data across compute nodes. Types: hash, round-robin, replicated.

Related: SQL Performance

Time Travel

Delta Lake feature allowing queries of historical versions of data.

Related: Delta Lake Guide

Transformation

Operation that modifies data from source format to desired destination format.

Related: Integration Guide

Trigger

Automation that determines when a pipeline should run (scheduled, tumbling window, event-based).

Related: Pipeline Optimization


U

Upsert

Combination of update and insert operations. Updates existing records or inserts new ones if they don't exist.

Related: CDC Tutorial


V

Vacuum

Delta Lake command to remove data files that are no longer referenced by the table and are older than the retention threshold.

Related: Table Optimization

VNet

Virtual Network. Isolated network in Azure that enables Azure resources to securely communicate with each other.

Related: Network Security

VNet Integration

Connecting Azure services to a virtual network for enhanced security and isolation.

Related: Private Link Architecture


W

Watermark

Marker used in incremental data loading to track which data has been processed.

Related: Pipeline Optimization
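A high-watermark load can be sketched in a few lines. The example assumes each source row carries a `modified_at` timestamp (a hypothetical column name) and that the pipeline persists the returned watermark between runs.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def incremental_load(source_rows: List[Row], last_watermark: datetime) -> Tuple[List[Row], datetime]:
    """Return rows modified after the watermark, plus the new watermark to store."""
    new_rows = [row for row in source_rows if row["modified_at"] > last_watermark]
    new_watermark = max(
        (row["modified_at"] for row in new_rows), default=last_watermark
    )
    return new_rows, new_watermark


source = [
    {"id": 1, "modified_at": datetime(2025, 1, 1)},
    {"id": 2, "modified_at": datetime(2025, 1, 2)},
    {"id": 3, "modified_at": datetime(2025, 1, 3)},
]

# First run picks up everything newer than the stored watermark.
first_batch, watermark = incremental_load(source, datetime(2025, 1, 1))
# A second run with the updated watermark finds nothing new.
second_batch, watermark_after_rerun = incremental_load(source, watermark)
```

The strict `>` comparison assumes timestamps are unique enough that no row arrives later with a value equal to the stored watermark; pipelines that cannot guarantee this often reprocess a small overlap window and deduplicate.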

Workspace

See Synapse Workspace.


X

XML

Extensible Markup Language. Markup language for encoding documents in a format that is both human-readable and machine-readable.


Y

YARN

Yet Another Resource Negotiator. Resource management layer in the Hadoop ecosystem. Not directly used in Synapse but relevant for understanding Spark.


Z

Z-Order

Delta Lake optimization technique that co-locates related information in the same set of files for faster queries.

Related: Table Optimization
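The underlying idea is a Z-order (Morton) curve: interleave the bits of several column values into one sort key, so rows that are close in every column end up close in the sort order. The sketch below shows the two-column case. It illustrates the general technique only and is not a description of Delta Lake's internal implementation; `interleave_bits` is a hypothetical helper.

```python
from typing import List, Tuple


def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Build a Morton code: bit i of x goes to position 2i, bit i of y to 2i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical (x, y) column values; sorting by the Morton code clusters
# points that are near each other in both dimensions.
points: List[Tuple[int, int]] = [(7, 7), (0, 1), (1, 1), (0, 0)]
ordered = sorted(points, key=lambda point: interleave_bits(*point))
```

When files are written in this order, each file covers a narrow range of both columns, so file-level min/max statistics can skip more files for filters on either column.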

Zone Redundancy

Azure storage redundancy option (zone-redundant storage, ZRS) that replicates data synchronously across availability zones within a region.

Related: Best Practices


Acronym Quick Reference

| Acronym | Full Term | Category |
|---------|-----------|----------|
| ACID | Atomicity, Consistency, Isolation, Durability | Database |
| ADF | Azure Data Factory | Service |
| ADLS | Azure Data Lake Storage | Service |
| CDC | Change Data Capture | Technique |
| CETAS | CREATE EXTERNAL TABLE AS SELECT | SQL |
| DIU | Data Integration Unit | Performance |
| DWU | Data Warehouse Unit | Performance |
| ELT | Extract, Load, Transform | Pattern |
| ETL | Extract, Transform, Load | Pattern |
| MPP | Massively Parallel Processing | Architecture |
| NSG | Network Security Group | Security |
| ORC | Optimized Row Columnar | File Format |
| RBAC | Role-Based Access Control | Security |
| RDD | Resilient Distributed Dataset | Spark |
| SCD | Slowly Changing Dimension | Data Warehouse |
| SLA | Service Level Agreement | Operations |
| VNet | Virtual Network | Networking |
| YARN | Yet Another Resource Negotiator | Hadoop |

| Resource | Description |
|----------|-------------|
| Architecture Overview | Architectural concepts and patterns |
| Best Practices | Implementation best practices |
| Tutorials | Hands-on learning materials |
| Code Examples | Practical code samples |
| FAQ | Frequently asked questions |

💡 Tip: Use Ctrl+F (or Cmd+F on Mac) to quickly search for specific terms on this page.

Last Updated: January 2025