Data Pipeline Glossary

30 terms defined — plain language for practitioners.

Definitions written for engineers and analytics leads who need precision, not marketing copy. Each entry covers what the term means, how it works in practice, and when to use it.

Apache Kafka

Apache Kafka is a distributed event streaming platform in which producers publish records to topics and consumers subscribe to them, with high throughput and low latency. It is commonly used as the backbone of real-time data pipelines.

Apache Spark

Apache Spark is an open-source distributed computing framework that processes large-scale data in parallel across multiple machines, supporting batch processing, streaming, and machine learning.

Batch Processing

Batch processing is the technique of collecting and processing large volumes of data together at scheduled intervals rather than processing each record individually as it arrives.
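
As a rough illustration of the grouping half of this idea, the sketch below splits a sequence of records into fixed-size batches and handles each batch in one call. The names batches and process_batch are hypothetical, and the scheduling itself (cron, an orchestrator) is left out.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of up to batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    # Hypothetical stand-in for real work, e.g. one bulk insert per batch.
    return sum(batch)


# Seven records processed in groups of three: [1, 2, 3], [4, 5, 6], [7].
totals = [process_batch(b) for b in batches(range(1, 8), batch_size=3)]
```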

CDC (Change Data Capture)

Change Data Capture (CDC) is a technique that identifies and captures changes (inserts, updates, and deletes) made to data in source systems, so that only the changed rows are replicated downstream rather than the entire dataset.
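
A minimal sketch of the idea, assuming two in-memory snapshots keyed by primary key. Production CDC tools usually read the database's transaction log instead of comparing snapshots; the function capture_changes and the sample rows are hypothetical.

```python
def capture_changes(before, after):
    """Return (operation, key, row) tuples describing how `after` differs from `before`."""
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append(("insert", key, row))
        elif before[key] != row:
            changes.append(("update", key, row))
    for key in before:
        if key not in after:
            changes.append(("delete", key, None))
    return changes


before = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
after = {1: {"name": "Ada"}, 2: {"name": "Robert"}, 4: {"name": "Dee"}}

# Only rows 2, 3, and 4 appear in the change set; row 1 is unchanged.
changes = capture_changes(before, after)
```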

DAG (Directed Acyclic Graph)

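A DAG is a directed acyclic graph: a set of nodes joined by one-way edges that never loop back on themselves. In a data pipeline, the nodes are tasks and the edges are dependencies, so the graph shows which tasks must complete before others can begin and guarantees that no task depends on itself, directly or indirectly.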
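
To make the dependency idea concrete, the sketch below uses Python's standard graphlib module (Python 3.9 or later) to derive a valid run order and to detect a cycle. The task names and the helper has_cycle are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

run_order = list(TopologicalSorter(dependencies).static_order())


def has_cycle(graph):
    """Return True if the dependency graph contains a cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False
```

A graph such as {"a": {"b"}, "b": {"a"}} is rejected because neither task could ever run first.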

Data Catalog

A data catalog is a searchable inventory of all data assets in an organization—tables, dashboards, metrics, reports—with metadata describing what each asset is, who owns it, and how it's used.

Data Fabric

Data fabric is an architecture that connects disparate data sources through a shared metadata layer and automated integration, so that data can be discovered, accessed, and processed consistently across an organization's systems.

Data Governance

Data governance is the set of policies, processes, and standards that ensure data is accurate, secure, and used responsibly across an organization.

Data Ingestion

Data ingestion is the process of extracting data from source systems and loading it into a central repository. It is typically the first step in a data pipeline.

Data Lake

A data lake is a large, centralized repository that stores raw data in its native format, whether structured, semi-structured, or unstructured (from logs and images to JSON and Parquet), allowing exploration and analysis without upfront schema definition.

Data Lakehouse

A data lakehouse combines the cost-efficiency and scalability of a data lake with the schema enforcement and query performance of a data warehouse, enabling a unified analytics architecture.

Data Lineage

Data lineage is the complete chain of origin, movement, and transformation of data through a system, showing where each piece of data came from and how it was changed.
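
One simple way to picture lineage is as a mapping from each asset to its direct inputs, which can then be walked to find everything upstream. The sketch below assumes such a mapping is already available; the asset names and the function upstream are hypothetical.

```python
# Each asset maps to the assets it is built from.
LINEAGE = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw.orders"],
    "stg_customers": ["raw.customers"],
}


def upstream(asset, lineage):
    """Return every asset that `asset` depends on, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


sources = upstream("revenue_dashboard", LINEAGE)
```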

Data Mesh

Data Mesh is an organizational and architectural approach that decentralizes data ownership, treating data as a product managed by domain teams rather than a centralized IT function.

Data Orchestration

Data orchestration is the coordination and automation of multiple data tasks and dependencies, ensuring pipelines run reliably, in the correct order, and with proper error handling.

Data Pipeline

A data pipeline is an automated set of processes that extract, transform, and load data from source systems to destinations, enabling data to flow reliably through an organization.

Data Quality

Data quality refers to the accuracy, completeness, consistency, and timeliness of data, ensuring that data is fit for its intended use in business decisions and operations.
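
In practice these dimensions are often turned into automated checks. The sketch below counts three kinds of problems in a small, made-up set of order rows; the rules and the function check_quality are assumptions chosen for illustration, not a standard.

```python
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},   # incomplete: amount is missing
    {"order_id": 2, "amount": 15.0},   # inconsistent: duplicate order_id
    {"order_id": 4, "amount": -5.0},   # inaccurate: negative amount
]


def check_quality(rows):
    """Count rows that violate each (hypothetical) quality rule."""
    ids = [r["order_id"] for r in rows]
    return {
        "missing_amount": sum(1 for r in rows if r["amount"] is None),
        "duplicate_ids": len(ids) - len(set(ids)),
        "negative_amount": sum(
            1 for r in rows if r["amount"] is not None and r["amount"] < 0
        ),
    }


report = check_quality(rows)
```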

Data Transformation

Data transformation is the process of converting raw data into a clean, structured, and enriched format suitable for analysis or operational use.

Data Warehouse

A data warehouse is a centralized, organized repository designed for analytics, built to combine data from multiple operational systems into a unified, queryable schema optimized for reporting and business intelligence.

dbt Model

A dbt model is a SQL SELECT statement (or, on supported platforms, a Python function) that dbt materializes as a table or view in the warehouse. dbt resolves dependencies between models and supports tests and documentation defined alongside them.

ELT (Extract, Load, Transform)

ELT is a modern data integration pattern where data is extracted from sources, loaded directly into a target system (usually a cloud data warehouse), and transformed in-place using the target system's compute.
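
A small sketch of the pattern, with an in-memory SQLite database standing in for the cloud warehouse. The table and column names are hypothetical. Raw rows are loaded untouched, and the transformation then runs as SQL inside the target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Load: raw data lands in the target unchanged.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 900, "refunded"), (3, 4300, "paid")],
)

# Transform: done afterwards, using the target system's own SQL engine.
conn.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'paid'
    """
)

paid = conn.execute("SELECT id, amount FROM paid_orders ORDER BY id").fetchall()
```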

ETL (Extract, Transform, Load)

ETL is a data integration pattern that extracts data from source systems, transforms it to meet business requirements, and loads it into a target system like a data warehouse.
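
The three stages can be sketched as three functions, here with a CSV string as the source and an in-memory SQLite database as the target. All names and the sample data are hypothetical. The transformation happens in application code before anything reaches the target.

```python
import csv
import io
import sqlite3

SOURCE_CSV = "id,amount_cents,status\n1,1250,paid\n2,900,refunded\n3,4300,paid\n"


def extract(text):
    """Read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Keep paid orders only and convert cents to currency units."""
    return [
        (int(r["id"]), int(r["amount_cents"]) / 100)
        for r in rows
        if r["status"] == "paid"
    ]


def load(conn, rows):
    """Write the already-transformed rows into the target table."""
    conn.execute("CREATE TABLE paid_orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO paid_orders VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT id, amount FROM paid_orders ORDER BY id").fetchall()
```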

Idempotency

Idempotency is the property of an operation whose effect is the same whether it runs once or many times with the same input. It is crucial for data pipelines, which may be interrupted and retried, because a retried step must not duplicate or corrupt data.
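
The difference shows up when a load step is retried. In the sketch below, a plain append duplicates rows on the second run, while a write keyed by id leaves the target unchanged. Both loaders and the sample rows are hypothetical.

```python
rows = [{"id": 1, "total": 40}, {"id": 2, "total": 15}]


def append_load(target, rows):
    """Not idempotent: every run adds the rows again."""
    target.extend(rows)


def upsert_load(target, rows):
    """Idempotent: each row overwrites the entry with the same id."""
    for row in rows:
        target[row["id"]] = row


appended = []
upserted = {}
for _ in range(2):  # simulate the step running, failing late, and being retried
    append_load(appended, rows)
    upsert_load(upserted, rows)
```

After the retry, appended holds four rows and upserted still holds two.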

Materialized View

A materialized view is a pre-computed query result stored physically, like a table. It provides fast access to aggregated or joined data, but its contents can be stale until the view is refreshed, whether manually, on a schedule, or automatically by the database.
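
SQLite has no native materialized views, so the sketch below emulates one with a table rebuilt from a query. It is meant only to show the trade-off between fast reads and stale results; the table names and helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7)],
)


def refresh_sales_by_region(conn):
    """Recompute the stored result from the base table."""
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


def read_sales_by_region(conn):
    """Read the stored result without touching the base table."""
    return conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()


refresh_sales_by_region(conn)
first = read_sales_by_region(conn)

conn.execute("INSERT INTO sales VALUES ('south', 20)")
stale = read_sales_by_region(conn)   # new sale is not visible yet

refresh_sales_by_region(conn)
fresh = read_sales_by_region(conn)   # visible after the refresh
```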

Reverse ETL

Reverse ETL is the process of extracting data from a data warehouse and loading it into operational systems such as CRMs, marketing platforms, and other business applications, so that teams can act on warehouse data in the tools they already use.

Schema Drift

Schema drift occurs when the structure of data changes unexpectedly—new columns appear, types change, or fields are removed—breaking downstream pipelines that expect a fixed schema.
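
A pipeline can guard against this by comparing incoming records with the schema it expects. The sketch below reports added, removed, and retyped fields for a single record; the expected schema, the record, and the function detect_drift are hypothetical.

```python
EXPECTED = {"id": int, "email": str, "signup_ts": str}


def detect_drift(expected, record):
    """Compare one record against the expected field names and types."""
    expected_keys, actual_keys = set(expected), set(record)
    return {
        "added": sorted(actual_keys - expected_keys),
        "removed": sorted(expected_keys - actual_keys),
        "retyped": sorted(
            key
            for key in expected_keys & actual_keys
            if not isinstance(record[key], expected[key])
        ),
    }


# id arrives as a string, signup_ts is gone, and plan is new.
record = {"id": "42", "email": "a@example.com", "plan": "pro"}
drift = detect_drift(EXPECTED, record)
```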

Slowly Changing Dimension (SCD)

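A Slowly Changing Dimension (SCD) is a dimension table in a data warehouse whose attributes change infrequently over time, such as a customer's address. Handling strategies are commonly numbered by type: Type 1 overwrites the old value, while Type 2 adds a new row so that both current and historical values are kept.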
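
One common strategy for keeping history, usually called Type 2, closes the current row and appends a new one whenever an attribute changes. The sketch below assumes an in-memory list of rows with valid_from and valid_to fields; the function apply_scd2 and the field names are hypothetical.

```python
from datetime import date


def apply_scd2(history, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then add a new current row."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return  # nothing changed, keep the current row open
        current["valid_to"] = effective
    history.append(
        {"key": key, "attrs": new_attrs, "valid_from": effective, "valid_to": None}
    )


history = []
apply_scd2(history, 1, {"city": "Oslo"}, date(2023, 1, 1))
apply_scd2(history, 1, {"city": "Oslo"}, date(2023, 6, 1))    # no change, no new row
apply_scd2(history, 1, {"city": "Bergen"}, date(2024, 1, 1))  # closes Oslo, opens Bergen
```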

Snowflake Schema

A snowflake schema is a variation of the star schema in which dimension tables are further normalized into sub-dimensions, reducing redundancy at the cost of queries that need more joins.

Star Schema

A star schema is a dimensional data warehouse design with a central fact table (quantitative events) surrounded by dimension tables (descriptive attributes), optimizing for analytical queries.
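
The layout can be sketched with one fact table and two dimension tables in an in-memory SQLite database. The table names, columns, and data are hypothetical. A typical analytical query joins the fact table to its dimensions and aggregates a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, revenue INTEGER);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 30), (1, 11, 20), (2, 11, 50), (2, 11, 5);
    """
)

# Revenue by category and year: facts are measured, dimensions describe them.
revenue = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
```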

Stream Processing

Stream processing is the continuous, real-time processing of data as it arrives, enabling near-instantaneous analysis and action on flowing data rather than waiting to batch it up.
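
The sketch below shows the per-record style in miniature: a generator updates a running average as each reading arrives and emits a result immediately, without waiting for the full dataset. A real deployment would read from an unbounded source such as a message broker; the function running_average and the readings are hypothetical.

```python
def running_average(events):
    """Yield the average of all values seen so far, once per incoming event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# A finite iterator stands in for an unbounded stream of sensor readings.
readings = iter([10.0, 20.0, 30.0])
averages = list(running_average(readings))
```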