Data Ingestion

Data ingestion is the process of extracting data from source systems and loading it into a central repository. It is the first step in any data pipeline.

Definition

Data ingestion covers the mechanics of extracting raw data from its source (a production database, SaaS API, log file, IoT sensor, or stream) and moving it to a central location for processing. It handles connectivity, error recovery, and data movement so that downstream teams can focus on transformation and analysis. Ingestion can be batch-oriented (each run pulls all changes since the previous run) or streaming (changes are pulled continuously as they occur). Modern data ingestion tools abstract away source-specific protocols and authentication, making it easier to add new data sources without custom code.

How It Works

1. Connect: Establish an authenticated connection to the source system.
2. Extract: Query or poll for data (or subscribe to a stream).
3. Transfer: Move data over the network to your pipeline.
4. Land: Store data temporarily in a staging area or directly in the destination.
5. Checkpoint: Record what was ingested so the next run knows where to resume.
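
The five steps can be sketched as a small incremental batch job. This is a minimal illustration, not a reference implementation: the in-memory SQLite databases stand in for a real source and destination, and the table names (events, staging_events), the function names (ingest_batch, demo), and the use of an increasing id as the checkpoint are all assumptions made for the example.

```python
import sqlite3


def ingest_batch(source, dest, checkpoint):
    """Copy rows with id greater than the checkpoint; return the new checkpoint."""
    # Extract: query only the rows added since the last recorded position.
    rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (checkpoint,),
    ).fetchall()
    if not rows:
        return checkpoint  # nothing new, so the resume point is unchanged
    # Transfer and Land: write the rows into a staging table in one transaction.
    with dest:
        dest.executemany(
            "INSERT INTO staging_events (id, payload) VALUES (?, ?)", rows
        )
    # Checkpoint: the highest id landed becomes the next resume point.
    return rows[-1][0]


def demo():
    # Connect: in-memory databases stand in for real systems (assumption).
    source = sqlite3.connect(":memory:")
    dest = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    dest.execute(
        "CREATE TABLE staging_events (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    source.executemany(
        "INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)]
    )
    source.commit()

    checkpoint = ingest_batch(source, dest, 0)  # first run lands ids 1 and 2
    source.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
    source.commit()
    checkpoint = ingest_batch(source, dest, checkpoint)  # second run lands id 3

    landed = dest.execute("SELECT COUNT(*) FROM staging_events").fetchone()[0]
    return checkpoint, landed
```

Here the checkpoint is only held in a variable. A real job would persist it somewhere durable, ideally together with the landed data, so that a crash between landing and checkpointing does not cause rows to be skipped or loaded twice.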

When to Use It

Every data pipeline starts with ingestion. Plan the ingestion architecture early: decide whether to batch or stream, how frequently you need data, and what latency is acceptable. Common ingestion patterns include nightly batch exports from operational databases, change data capture (CDC) for real-time table replication, API polling for SaaS data, and streaming message queues for logs.
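
As one illustration of the API polling pattern, the sketch below walks a cursor-paginated endpoint until no further page is offered. It assumes a hypothetical API that returns a list of records plus an opaque next-page cursor; fake_fetch_page is a stand-in for a real HTTP call, and both function names are invented for the example.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[List[Dict], Optional[str]]


def poll_all(
    fetch_page: Callable[[Optional[str]], Page],
    cursor: Optional[str] = None,
) -> Iterator[Dict]:
    """Yield every record, following next-page cursors until none remains."""
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break  # the source reported no further pages


def fake_fetch_page(cursor: Optional[str]) -> Page:
    """Hypothetical stand-in for a SaaS API call: five records, two per page."""
    data = [{"id": i} for i in range(5)]
    start = int(cursor) if cursor else 0
    end = start + 2
    next_cursor = str(end) if end < len(data) else None
    return data[start:end], next_cursor


if __name__ == "__main__":
    for record in poll_all(fake_fetch_page):
        print(record)
```

Passing a saved cursor as the second argument resumes polling from that position, which is how this pattern would tie into the checkpoint step. Real APIs typically also need rate-limit handling and retries, which are omitted here.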

Last updated: Jun 17, 2026