Azure Analytics Glossary¶
Terminology reference: a comprehensive glossary of Azure analytics terms, acronyms, and concepts.
A¶
ACID¶
Atomicity, Consistency, Isolation, Durability: properties that guarantee database transactions are processed reliably. Delta Lake provides ACID guarantees for data lakes.
Related: Delta Lake Guide
ADF¶
Azure Data Factory: cloud-based data integration service for creating, scheduling, and orchestrating data workflows.
Related: Azure Data Factory Integration
ADLS¶
Azure Data Lake Storage: scalable and secure data lake for high-performance analytics workloads. ADLS Gen2 combines the capabilities of ADLS Gen1 and Azure Blob Storage.
Related: Architecture Overview
Apache Spark¶
Open-source distributed computing system for big data processing. Synapse Spark pools run Apache Spark workloads.
Related: Spark Performance
Auto Loader¶
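Databricks feature (the `cloudFiles` Structured Streaming source) for incrementally and efficiently processing new data files as they arrive in cloud storage. It is commonly used to load data into Delta tables but is not part of open-source Delta Lake.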
Related: Auto Loader Tutorial
Azure Active Directory (Azure AD)¶
Microsoft's cloud-based identity and access management service. Now known as Microsoft Entra ID.
Related: Security Best Practices
Azure Purview¶
Unified data governance service that helps manage and govern on-premises, multi-cloud, and SaaS data. Now part of Microsoft Purview.
Related: Azure Purview Integration
Azure Synapse Analytics¶
Unified analytics service that brings together enterprise data warehousing and big data analytics.
Related: Platform Overview
B¶
Batch Processing¶
Processing large volumes of data collected over a period of time. Contrasts with stream processing.
Related: Pipeline Optimization
Broadcast Join¶
Spark optimization technique where the smaller dataset is broadcast to all executors so that the larger dataset can be joined without being shuffled.
Related: Spark Performance
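To make the idea concrete, here is a minimal pure-Python sketch of the hash-join logic behind a broadcast join. It is an illustration only, not Spark's implementation; the function name `broadcast_hash_join` and the sample rows are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two row collections by building a lookup on the small side."""
    # "Broadcast" step: in a cluster, every executor would receive this lookup.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe step: the large side is streamed in place and never shuffled.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data for illustration only.
orders = [
    {"order_id": 1, "country": "DE"},
    {"order_id": 2, "country": "FR"},
    {"order_id": 3, "country": "XX"},  # no matching dimension row
]
countries = [
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]

joined = list(broadcast_hash_join(orders, countries, key="country"))
```

In PySpark, the same intent is usually expressed by wrapping the small DataFrame in the `broadcast()` hint from `pyspark.sql.functions` before joining.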
Built-in Serverless Pool¶
Pre-configured serverless SQL pool, named Built-in, that is included with every Synapse workspace. It has no provisioning cost; queries are billed by the amount of data processed.
Related: Serverless SQL Overview
C¶
CDC¶
Change Data Capture: process of identifying and capturing changes made to data in a database, typically for replication or synchronization.
Related: CDC Tutorial
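As a rough illustration of what a change feed contains, the sketch below classifies changes by comparing two snapshots keyed by primary key. Production CDC usually reads the database transaction log instead; snapshot comparison is a simplifying assumption here, and `diff_snapshots` is a hypothetical helper.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and classify the changes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v
        for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical snapshots for illustration only.
before = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
after = {1: {"name": "Ada"}, 2: {"name": "Robert"}, 4: {"name": "Dee"}}

changes = diff_snapshots(before, after)
```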
CETAS¶
CREATE EXTERNAL TABLE AS SELECT: T-SQL statement, available in both dedicated and serverless SQL pools, that creates an external table and exports the query results to storage.
Related: Serverless SQL Best Practices
Columnar Storage¶
Data storage format that stores data tables by column rather than by row. Examples: Parquet, ORC.
Related: Performance Optimization
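The difference between row and column layouts can be sketched in a few lines of plain Python. This is a conceptual illustration, not how Parquet or ORC encode data on disk; `to_columnar` and the sample rows are hypothetical, and every row is assumed to have the same fields.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]

columns = to_columnar(rows)

# An aggregate over one column touches only that column's values,
# which is why columnar formats suit analytical queries.
total_amount = sum(columns["amount"])
```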
Compute Node¶
Individual server in a distributed computing cluster that performs data processing tasks.
Related: Spark Configuration
Concurrency¶
Number of simultaneous operations or queries that can run at the same time.
Related: Performance Optimization
Copy Activity¶
Azure Data Factory activity that copies data from a source data store to a destination (sink), with optional schema mapping and format conversion.
Related: Azure Data Factory Integration
D¶
Data Distribution¶
Strategy for spreading data across compute nodes in a distributed system. Types include hash, round-robin, and replicated.
Related: SQL Performance
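The three strategies can be sketched as simple assignment functions. This is a conceptual sketch only: the function names are hypothetical, and CRC32 stands in for whatever hash function a real engine uses.

```python
import zlib


def hash_distribute(rows, key, node_count):
    """Send each row to the node chosen by a deterministic hash of its key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[digest % node_count].append(row)
    return nodes


def round_robin_distribute(rows, node_count):
    """Deal rows to nodes in turn, ignoring their content."""
    nodes = [[] for _ in range(node_count)]
    for index, row in enumerate(rows):
        nodes[index % node_count].append(row)
    return nodes


def replicate(rows, node_count):
    """Give every node a full copy of the rows."""
    return [list(rows) for _ in range(node_count)]


# Hypothetical sample data for illustration only.
sales = [
    {"customer": c, "amount": i}
    for i, c in enumerate(["a", "b", "a", "c", "b", "a"])
]

hashed = hash_distribute(sales, "customer", 3)
dealt = round_robin_distribute(sales, 4)
copies = replicate(sales, 2)
```

Hash distribution keeps all rows with the same key on one node, which helps joins and aggregations on that key; round-robin gives even sizes but no co-location; replication trades storage for join locality on small tables.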
Data Flow¶
Visual data transformation tool in Azure Data Factory and Synapse for building ETL logic without coding.
Related: Integration Guide
Data Lake¶
Storage repository that holds vast amounts of raw data in its native format until needed.
Related: Delta Lakehouse Architecture
Data Lakehouse¶
Architecture that combines the best features of data lakes and data warehouses.
Related: Delta Lakehouse Overview
Data Partitioning¶
Dividing large datasets into smaller, manageable pieces based on specific criteria (e.g., date, region).
Related: Delta Lake Optimization
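A small sketch of why partitioning helps: when data is grouped by a column, a query that filters on that column can skip whole partitions. The helper names `partition_by` and `read_with_pruning` and the sample events are hypothetical.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows into partitions keyed by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def read_with_pruning(partitions, wanted_values):
    """Scan only the requested partitions; return rows and the scan count."""
    scanned = [partitions[v] for v in wanted_values if v in partitions]
    rows = [row for part in scanned for row in part]
    return rows, len(scanned)


# Hypothetical sample data for illustration only.
events = [
    {"date": "2025-01-01", "value": 1},
    {"date": "2025-01-01", "value": 2},
    {"date": "2025-01-02", "value": 3},
    {"date": "2025-01-03", "value": 4},
]

partitions = partition_by(events, "date")
matching, partitions_scanned = read_with_pruning(partitions, ["2025-01-02"])
```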
Data Skew¶
Uneven distribution of data across partitions, causing some nodes to process more data than others.
Related: Spark Performance
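One simple way to quantify skew is to compare the largest key group with the average group size. The metric and the name `skew_ratio` are illustrative assumptions, not a Spark or Synapse API.

```python
from collections import Counter


def skew_ratio(keys):
    """Return largest group size divided by average group size (1.0 = even)."""
    counts = Counter(keys)
    average = len(keys) / len(counts)
    return max(counts.values()) / average


# Hypothetical key columns for illustration only.
skewed_keys = ["a"] * 8 + ["b", "c"]
balanced_keys = ["a", "b", "c"]

skewed = skew_ratio(skewed_keys)
balanced = skew_ratio(balanced_keys)
```

A ratio well above 1 suggests that the task handling the largest key will run much longer than the others.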
Data Warehouse Unit (DWU)¶
Measure of compute resources (CPU, memory, I/O) allocated to a dedicated SQL pool.
Related: Performance Optimization
Dedicated SQL Pool¶
Provisioned resource offering enterprise-scale data warehousing capabilities with guaranteed resources.
Related: Architecture Overview
Delta Lake¶
Open-source storage layer that brings ACID transactions to data lakes.
Related: Delta Lake Guide
Delta Table¶
Table format in Delta Lake that supports ACID transactions, schema enforcement, and time travel.
Related: Table Optimization
DIU¶
Data Integration Unit: measure of compute power in Azure Data Factory representing a combination of CPU, memory, and network resources.
Related: Pipeline Optimization
Driver¶
Master process in Apache Spark that coordinates and schedules work across executors.
Related: Spark Configuration
DW Unit (DWU)¶
See Data Warehouse Unit.
E¶
ETL¶
Extract, Transform, Load: traditional data integration process that extracts data from sources, transforms it, and then loads it into the destination.
Related: Integration Guide
ELT¶
Extract, Load, Transform: modern approach that loads raw data first, then transforms it in the destination system.
Related: Delta Lakehouse Architecture
Executor¶
Worker process in Apache Spark that runs tasks and stores data for the application.
Related: Spark Configuration
External Table¶
Table definition that references data stored outside the database, typically in a data lake.
Related: Serverless SQL Guide
F¶
Fault Tolerance¶
System's ability to continue operating properly in the event of failures.
Related: Best Practices
File Format¶
Structure in which data is stored. Common formats: Parquet, CSV, JSON, ORC, Avro.
Related: Serverless SQL Best Practices
Firewall Rule¶
Network security rule that controls incoming and outgoing traffic to Azure resources.
Related: Network Security
G¶
Graph Database¶
Database designed to treat relationships between data as equally important as the data itself.
Related: Architecture Patterns
H¶
Hive Metastore¶
Central repository of metadata for Apache Hive tables in the Hadoop ecosystem. Spark can use it to store table schemas and partition information.
Related: Shared Metadata
Hot Path¶
Real-time data processing path for immediate insights. Contrasts with cold path (batch processing).
Related: Real-time Analytics
I¶
Idempotent¶
Operation that produces the same result regardless of how many times it's executed.
Related: Pipeline Best Practices
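The sketch below contrasts a non-idempotent load (blind append) with an idempotent one (write by key), so that a retried pipeline run does not duplicate data. Both function names and the sample batch are hypothetical.

```python
def append_load(target, batch):
    """Non-idempotent: every run adds the batch again."""
    target.extend(batch)


def keyed_load(target, batch, key):
    """Idempotent: rerunning with the same batch leaves the target unchanged."""
    for row in batch:
        target[row[key]] = row


# Hypothetical batch for illustration only.
batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

appended = []
append_load(appended, batch)
append_load(appended, batch)  # simulated retry duplicates the rows

keyed = {}
keyed_load(keyed, batch, key="id")
keyed_load(keyed, batch, key="id")  # simulated retry has no extra effect
```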
Indexing¶
Database optimization technique that improves query performance by creating efficient data lookup structures.
Related: SQL Performance
Integration Runtime¶
Compute infrastructure used by Azure Data Factory to provide data integration across different network environments.
Related: Azure Data Factory Integration
J¶
JSON¶
JavaScript Object Notation: lightweight data interchange format that is easy to read and write.
Related: Serverless SQL Guide
K¶
Key Vault¶
Azure service for securely storing and accessing secrets, keys, and certificates.
Related: Security Best Practices
L¶
Lakehouse¶
See Data Lakehouse.
Lazy Evaluation¶
Execution model where transformations are not executed until an action is called. Used in Apache Spark.
Related: Spark Performance
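Python generators behave in a loosely analogous way and can illustrate the idea without a cluster. This is an analogy under that assumption, not Spark code: building the pipeline does no work, and only the final `list()` call (the "action") triggers it.

```python
calls = []


def traced_double(x):
    """Double a value and record that the function actually ran."""
    calls.append(x)
    return x * 2


# "Transformations": these lines only describe the work to be done.
pipeline = (traced_double(x) for x in range(5))
pipeline = (y for y in pipeline if y > 2)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces every step to execute.
result = list(pipeline)
calls_after_action = len(calls)
```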
Lineage¶
Tracking of data's origin, transformations, and movement through systems.
Related: Azure Purview Integration
Linked Service¶
Connection definition to external data sources or compute resources in Azure Synapse or Data Factory.
Related: Integration Guide
M¶
Managed Identity¶
Identity in Microsoft Entra ID (formerly Azure AD) that Azure creates and manages automatically, eliminating the need to store credentials in code.
Related: Security Best Practices
Managed Private Endpoint¶
Private endpoint managed by Azure Synapse for secure connectivity to Azure services.
Related: Private Link Architecture
Mapping Data Flow¶
Code-free data transformation feature in Azure Data Factory and Synapse.
Related: Integration Guide
Medallion Architecture¶
Data architecture pattern with bronze (raw), silver (cleaned), and gold (aggregated) layers.
Related: Delta Lakehouse Architecture
Merge Operation¶
Upsert operation (update if exists, insert if not) supported by Delta Lake.
Related: CDC Tutorial
Metadata¶
Data that provides information about other data (e.g., schema, statistics, lineage).
Related: Shared Metadata
MPP¶
Massively Parallel Processing: architecture that uses many processors working in parallel to quickly execute large-scale data operations.
Related: Architecture Overview
N¶
Notebook¶
Interactive document combining code, visualizations, and narrative text. Synapse supports Spark notebooks.
Related: PySpark Fundamentals
NSG¶
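Network Security Group: Azure resource containing security rules that allow or deny inbound and outbound network traffic for resources in a virtual network. Distinct from the Azure Firewall service.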
Related: Network Security
O¶
OPENROWSET¶
T-SQL function in serverless SQL pool for querying files in data lakes without creating external tables.
Related: Serverless SQL Guide
Optimize¶
Delta Lake command to compact small files into larger ones for better query performance.
Related: Table Optimization
ORC¶
Optimized Row Columnar: columnar storage file format optimized for Hadoop workloads.
Related: Performance Optimization
P¶
Parquet¶
Open-source columnar storage format designed for efficient data storage and retrieval.
Related: Serverless SQL Guide
Partition¶
Logical division of a large dataset for improved query performance and manageability.
Related: Delta Lake Optimization
Pipeline¶
Workflow that orchestrates data movement and transformation activities.
Related: Pipeline Optimization
PolyBase¶
Data virtualization feature for querying external data sources using T-SQL.
Related: SQL Performance
Private Endpoint¶
Network interface that connects privately and securely to Azure services using Azure Private Link.
Related: Private Link Architecture
PySpark¶
Python API for Apache Spark, enabling Spark programming using Python.
Related: PySpark Fundamentals
Q¶
Query Optimization¶
Process of improving query performance through various techniques like indexing, statistics, and query rewriting.
Related: Query Optimization
R¶
RBAC¶
Role-Based Access Control: authorization system for managing who has access to Azure resources and what they can do.
Related: Security Best Practices
RDD¶
Resilient Distributed Dataset: fundamental data structure in Apache Spark representing an immutable distributed collection.
Related: Spark Performance
Resource Group¶
Container that holds related resources for an Azure solution.
Related: Architecture Overview
S¶
Schema Evolution¶
Ability to handle changes in data schema over time without breaking existing queries.
Related: Delta Lake Guide
Schema on Read¶
Approach where data schema is applied when data is read, not when it's written. Used in data lakes.
Related: Serverless SQL Guide
Serverless SQL Pool¶
On-demand SQL query service with pay-per-query pricing model. No infrastructure to manage.
Related: Serverless SQL Overview
Service Principal¶
Identity created for use with applications, services, and automation tools to access Azure resources.
Related: Security Best Practices
Shuffle¶
Expensive operation in Spark where data is redistributed across partitions.
Related: Spark Performance
SLA¶
Service Level Agreement: commitment between service provider and customer regarding performance and availability.
Related: Best Practices
Slowly Changing Dimension (SCD)¶
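Dimension whose attribute values change slowly and unpredictably over time rather than on a regular schedule. Common handling types: Type 1 (overwrite the old value), Type 2 (add a new row to preserve history), and Type 3 (add a column holding the previous value).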
Related: CDC Tutorial
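A minimal sketch of Type 2 handling follows: when a tracked attribute changes, the current row is closed and a new version is appended. The function `apply_scd2` and the column names `valid_from` and `valid_to` are assumptions made for illustration.

```python
from datetime import date


def apply_scd2(history, key, incoming, effective):
    """Close the current version if attributes changed, then append a new one."""
    current = next(
        (r for r in history if r[key] == incoming[key] and r["valid_to"] is None),
        None,
    )
    if current is not None:
        tracked = {
            k: v for k, v in current.items() if k not in ("valid_from", "valid_to")
        }
        if tracked == incoming:
            return  # nothing changed; keep the current version open
        current["valid_to"] = effective
    history.append({**incoming, "valid_from": effective, "valid_to": None})


# Hypothetical customer dimension for illustration only.
history = []
apply_scd2(history, "customer_id", {"customer_id": 7, "city": "Oslo"}, date(2024, 1, 1))
apply_scd2(history, "customer_id", {"customer_id": 7, "city": "Oslo"}, date(2024, 3, 1))
apply_scd2(history, "customer_id", {"customer_id": 7, "city": "Bergen"}, date(2024, 6, 1))
```

After these calls the history holds two versions: the Oslo row closed on 2024-06-01 and an open Bergen row.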
Spark Pool¶
Managed Apache Spark cluster in Azure Synapse Analytics.
Related: Spark Configuration
SQL Pool¶
Collective term for both dedicated SQL pools and serverless SQL pools in Synapse.
Related: Architecture Overview
Statistics¶
Metadata about data distribution that helps query optimizer create efficient execution plans.
Related: SQL Performance
Storage Account¶
Azure resource that provides cloud storage for data objects including blobs, files, queues, and tables.
Related: Architecture Overview
Streaming¶
Continuous processing of data in real-time as it arrives.
Related: Real-time Analytics
Synapse Studio¶
Web-based integrated development environment for Azure Synapse Analytics.
Related: Environment Setup
Synapse Workspace¶
Collaborative environment for cloud-based enterprise analytics in Azure.
Related: Platform Overview
T¶
Table Distribution¶
Strategy for spreading table data across compute nodes. Types: hash, round-robin, replicated.
Related: SQL Performance
Time Travel¶
Delta Lake feature allowing queries of historical versions of data.
Related: Delta Lake Guide
Transformation¶
Operation that modifies data from source format to desired destination format.
Related: Integration Guide
Trigger¶
Automation that determines when a pipeline should run (scheduled, tumbling window, event-based).
Related: Pipeline Optimization
U¶
Upsert¶
Combination of update and insert operations. Updates existing records or inserts new ones if they don't exist.
Related: CDC Tutorial
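The update-or-insert logic can be sketched on plain Python lists. This is a conceptual illustration rather than Delta Lake's MERGE implementation; `upsert` and the sample rows are hypothetical.

```python
def upsert(target, source, key):
    """Update rows in target whose key matches; insert the rest."""
    index = {row[key]: position for position, row in enumerate(target)}
    stats = {"updated": 0, "inserted": 0}
    for row in source:
        if row[key] in index:
            position = index[row[key]]
            target[position] = {**target[position], **row}
            stats["updated"] += 1
        else:
            index[row[key]] = len(target)
            target.append(row)
            stats["inserted"] += 1
    return stats


# Hypothetical inventory rows for illustration only.
inventory = [{"id": 1, "qty": 5}, {"id": 2, "qty": 1}]
incoming = [{"id": 2, "qty": 9}, {"id": 3, "qty": 4}]

stats = upsert(inventory, incoming, key="id")
```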
V¶
Vacuum¶
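Delta Lake command that removes data files no longer referenced by the table and older than the retention threshold (7 days by default). Versions whose files have been vacuumed are no longer available for time travel.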
Related: Table Optimization
VNet¶
Virtual Network: isolated network in Azure that enables Azure resources to securely communicate with each other.
Related: Network Security
VNet Integration¶
Connecting Azure services to a virtual network for enhanced security and isolation.
Related: Private Link Architecture
W¶
Watermark¶
Marker used in incremental data loading to track which data has been processed.
Related: Pipeline Optimization
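A minimal sketch of watermark-based incremental loading is shown below. It assumes the source exposes a `modified_at` column holding ISO-8601 timestamps (which compare correctly as strings); the function name and sample rows are hypothetical.

```python
def incremental_load(source_rows, watermark, column="modified_at"):
    """Return rows changed after the watermark plus the advanced watermark."""
    new_rows = [row for row in source_rows if row[column] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((row[column] for row in new_rows), default=watermark)
    return new_rows, new_watermark


# Hypothetical source table for illustration only.
source = [
    {"id": 1, "modified_at": "2025-01-01T08:00:00"},
    {"id": 2, "modified_at": "2025-01-02T09:30:00"},
    {"id": 3, "modified_at": "2025-01-03T17:45:00"},
]

first_batch, watermark = incremental_load(source, "2025-01-01T23:59:59")
second_batch, watermark_after_rerun = incremental_load(source, watermark)
```

In a pipeline, the returned watermark would be persisted (for example in a control table) so that the next run starts from it.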
Workspace¶
See Synapse Workspace.
X¶
XML¶
Extensible Markup Language: markup language for encoding documents in a format that is both human-readable and machine-readable.
Y¶
YARN¶
Yet Another Resource Negotiator: resource management layer in the Hadoop ecosystem. Synapse Spark pools use YARN as their cluster manager, but the service manages it, so users do not configure it directly.
Z¶
Z-Order¶
Delta Lake optimization technique that co-locates related information in the same set of files for faster queries.
Related: Table Optimization
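The name comes from the Z-order (Morton) space-filling curve, which interleaves the bits of several column values into one sortable number so that rows close in every dimension end up close together. The sketch below shows the classic two-dimensional curve; Delta Lake's internal implementation may differ, and `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(x, y, bits=16):
    """Return the Morton code of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


# Sorting points by Morton code traces the Z-shaped curve.
points = [(x, y) for x in range(2) for y in range(2)]
z_ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```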
Zone Redundancy¶
Azure Storage redundancy option (ZRS) that replicates data synchronously across three availability zones in the primary region.
Related: Best Practices
Acronym Quick Reference¶
| Acronym | Full Term | Category |
|---|---|---|
| ACID | Atomicity, Consistency, Isolation, Durability | Database |
| ADF | Azure Data Factory | Service |
| ADLS | Azure Data Lake Storage | Service |
| CDC | Change Data Capture | Technique |
| CETAS | CREATE EXTERNAL TABLE AS SELECT | SQL |
| DIU | Data Integration Unit | Performance |
| DWU | Data Warehouse Unit | Performance |
| ELT | Extract, Load, Transform | Pattern |
| ETL | Extract, Transform, Load | Pattern |
| MPP | Massively Parallel Processing | Architecture |
| NSG | Network Security Group | Security |
| ORC | Optimized Row Columnar | File Format |
| RBAC | Role-Based Access Control | Security |
| RDD | Resilient Distributed Dataset | Spark |
| SCD | Slowly Changing Dimension | Data Warehouse |
| SLA | Service Level Agreement | Operations |
| VNet | Virtual Network | Networking |
| YARN | Yet Another Resource Negotiator | Hadoop |
Related Resources¶
| Resource | Description |
|---|---|
| Architecture Overview | Architectural concepts and patterns |
| Best Practices | Implementation best practices |
| Tutorials | Hands-on learning materials |
| Code Examples | Practical code samples |
| FAQ | Frequently asked questions |
💡 Tip: Use Ctrl+F (or Cmd+F on Mac) to quickly search for specific terms on this page.
Last Updated: January 2025