Data Lineage

Data lineage is the complete chain of origin, movement, and transformation of data through a system, showing where each piece of data came from and how it was changed.

Definition

Data lineage answers the question: 'Where did this number come from, and how was it calculated?' It is a map of data as it flows through pipelines: source tables, transformation steps, derived columns, and final destinations. Complete lineage can be read in both directions. Forward (downstream) dependencies answer 'which reports depend on this table?', and backward (upstream) dependencies answer 'which sources feed this report?'. Lineage is essential for compliance, debugging, and understanding business logic. When a metric is wrong, lineage narrows the search quickly: was it a source data issue, a calculation error, or a recent transformation change? Tools like Collibra, Alation, and dbt can derive lineage automatically by analyzing your SQL and pipeline code.
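To make the forward and backward directions concrete, here is a minimal Python sketch that treats lineage as a directed graph and walks it both ways. The table and report names are hypothetical, and the helper functions are illustrative only; they are not the API of any of the tools mentioned above.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "data flows from the first to the second".
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "marts.revenue"),
    ("staging.customers", "marts.revenue"),
    ("marts.revenue", "report.weekly_revenue"),
]


def _reachable(start, edges):
    """Return every node reachable from `start` by following `edges` (breadth-first)."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(node, edges=EDGES):
    """Forward dependencies: everything that is built from `node`."""
    return _reachable(node, edges)


def upstream(node, edges=EDGES):
    """Backward dependencies: everything that feeds into `node`."""
    # Reversing each edge turns the forward walk into a backward one.
    return _reachable(node, [(dst, src) for src, dst in edges])
```

Calling downstream("raw.orders") lists what would be affected by a change to that source table, while upstream("report.weekly_revenue") lists every table to inspect when that report looks wrong.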

How It Works

1. Capture: Tools ingest SQL, Python, or config files from your pipeline.
2. Parse: Identify source tables, transformations, and output tables.
3. Link: Build a graph showing data movement.
4. Visualize: Display lineage as a DAG in the tool's UI.
5. Query: Ask 'impact analysis' questions: if I change this table, what downstream reports are affected?
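The capture, parse, link, and query steps can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular tool works: the SQL statements and table names are hypothetical, and the regular expressions stand in for a real SQL parser. They only handle simple CREATE TABLE ... AS and INSERT INTO statements and will miss subqueries, CTEs, quoted identifiers, and forms such as CREATE TABLE IF NOT EXISTS.

```python
import re

# Capture: hypothetical pipeline statements, standing in for files read from a repo.
STATEMENTS = [
    "CREATE TABLE staging.orders AS SELECT * FROM raw.orders WHERE status <> 'test'",
    "CREATE TABLE marts.revenue AS "
    "SELECT c.region, SUM(o.amount) AS revenue "
    "FROM staging.orders o JOIN raw.customers c ON o.customer_id = c.id "
    "GROUP BY c.region",
]

# Simplified patterns (an assumption of this sketch, not a full SQL grammar).
TARGET_RE = re.compile(r"\b(?:CREATE\s+TABLE|INSERT\s+INTO)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def parse(statement):
    """Parse: return (output_table, [source_tables]) for one statement."""
    target = TARGET_RE.search(statement)
    sources = SOURCE_RE.findall(statement)
    return (target.group(1) if target else None), sources


def link(statements):
    """Link: build the set of (source, target) edges across all statements."""
    edges = set()
    for statement in statements:
        target, sources = parse(statement)
        if target is None:
            continue  # statement does not write a table, so it adds no edge
        for source in sources:
            edges.add((source, target))
    return edges


def impacted(table, edges):
    """Query: every table transitively downstream of `table`."""
    found = set()
    frontier = [table]
    while frontier:
        current = frontier.pop()
        for src, dst in edges:
            if src == current and dst not in found:
                found.add(dst)
                frontier.append(dst)
    return found


EDGES = link(STATEMENTS)
```

With these statements, impacted("raw.orders", EDGES) returns both staging.orders and marts.revenue, which is the kind of answer an impact analysis query gives before a table is changed. The visualize step is omitted; a real tool would render the same edges as a DAG.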

When to Use It

Invest in lineage tooling early. Lineage becomes indispensable for compliance (GDPR, HIPAA: 'show me the lineage of this customer data'), for debugging ('why is this metric wrong?'), and for preventing breaking changes ('who depends on this table I'm about to modify?'). For small teams, dbt provides model-level lineage out of the box: it builds the dependency graph from the ref() and source() calls in your models and displays it in the generated documentation.

Last updated: Jun 17, 2026