Idempotency

Idempotency is the property of an operation that produces the same result no matter how many times it runs. It is crucial for reliable data pipelines, which may be interrupted and retried.

Definition

An idempotent operation leaves the system in the same state whether it runs once or a hundred times on the same input. In data pipelines, idempotency is essential because real-world systems fail and retry: a network connection drops mid-transfer, a process crashes, or an operator re-runs a job by mistake. Without idempotency, retrying a failed step could corrupt your data (duplicate rows, double-counted metrics, lost transactions). Idempotent pipelines are resilient and forgiving: you can replay any step and end up with the same data. Achieving idempotency typically involves deduplication logic (upserts instead of inserts), tracking processed records, or recomputing derived data from source facts rather than accumulating deltas.
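The difference between accumulating deltas and recomputing from source facts can be sketched in a few lines of Python. The order records, the customer names, and both function names below are hypothetical and exist only for illustration; this is a minimal sketch, not a prescribed implementation.

```python
from collections import defaultdict

# Hypothetical source facts: (order_id, customer, amount).
ORDERS = [
    ("o1", "alice", 30.0),
    ("o2", "bob", 12.5),
    ("o3", "alice", 7.5),
]


def accumulate_totals(totals, orders):
    """Non-idempotent: adds each batch onto whatever is already stored."""
    for _order_id, customer, amount in orders:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def recompute_totals(orders):
    """Idempotent: rebuilds the totals from the source facts on every run."""
    seen = set()
    totals = defaultdict(float)
    for order_id, customer, amount in orders:
        if order_id in seen:  # deduplicate records delivered more than once
            continue
        seen.add(order_id)
        totals[customer] += amount
    return dict(totals)


# A retry of the accumulating step counts every order twice.
stored = {}
accumulate_totals(stored, ORDERS)
accumulate_totals(stored, ORDERS)  # simulated retry
print(stored)

# Recomputing gives the same answer however often it runs.
print(recompute_totals(ORDERS) == recompute_totals(ORDERS))
```

The recomputing version also tolerates a batch that contains the same order twice, because it deduplicates on order_id before summing.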

How It Works

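1. Bad (non-idempotent): INSERT INTO sales (SELECT * FROM staging). Every retry adds duplicate rows.
2. Better: DELETE the batch's rows from sales, then INSERT them from staging. Re-runs no longer duplicate rows, but this is still not fully idempotent in practice: unless both statements share a transaction, a failure between them leaves sales incomplete.
3. Good (idempotent): MERGE INTO sales USING staging ON sales.id = staging.id WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ....
4. Verify: Re-run the job and confirm the data is unchanged.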
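The steps above can be tried end to end with Python's built-in sqlite3 module. SQLite has no MERGE statement, so this sketch swaps in its equivalent upsert form, INSERT ... ON CONFLICT DO UPDATE (available from SQLite 3.24). The table layouts, the sales_plain table, and the helper names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE sales_plain (id INTEGER, amount REAL);       -- no key
    CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL); -- keyed on id
    INSERT INTO staging VALUES (1, 10.0), (2, 20.0);
    """
)


def load_plain(conn):
    """Step 1: a plain INSERT, so every retry appends the batch again."""
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO sales_plain SELECT id, amount FROM staging")


def load_upsert(conn):
    """Step 3: insert new ids and update existing ones."""
    with conn:
        conn.execute(
            """
            INSERT INTO sales (id, amount)
            SELECT id, amount FROM staging WHERE true  -- WHERE avoids a parse ambiguity
            ON CONFLICT(id) DO UPDATE SET amount = excluded.amount
            """
        )


def count_rows(conn, table):
    # The table name comes from this script only, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Step 4: run the job three times and compare the row counts.
for _ in range(3):
    load_plain(conn)
    load_upsert(conn)

print(count_rows(conn, "sales_plain"))  # grows with every run
print(count_rows(conn, "sales"))        # stays at the size of the batch
```

The upsert depends on sales having a primary key or unique constraint on id; without one, the database has no way to recognize a row it has already loaded.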

When to Use It

Design every production pipeline to be idempotent. It's a non-negotiable requirement for reliability. Idempotent operations let your orchestrator safely retry failures, let you replay historical data, and let you re-run transformations without fear. The slight overhead of deduplication logic is worth the confidence and operational safety.
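One common way to make historical replays safe is to rebuild one slice of the output (for example, a single day) inside a single transaction, so that a retry replaces the slice instead of appending to it. The sketch below uses Python's sqlite3 module; the events and daily_users tables, the dates, and the rebuild_day function are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id TEXT)")
conn.execute("CREATE TABLE daily_users (day TEXT PRIMARY KEY, users INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2024-01-01", "a"), ("2024-01-01", "b"), ("2024-01-02", "a")],
)
conn.commit()


def rebuild_day(conn, day):
    """Replace one day's summary row inside a single transaction."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM daily_users WHERE day = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_users (day, users)
            SELECT day, COUNT(DISTINCT user_id)
            FROM events
            WHERE day = ?
            GROUP BY day
            """,
            (day,),
        )


# Replaying the same day any number of times leaves exactly one row for it.
for _ in range(3):
    rebuild_day(conn, "2024-01-01")

print(conn.execute("SELECT day, users FROM daily_users").fetchall())
```

Because the delete and the insert share a transaction, a crash between them rolls back to the previous state instead of leaving the day empty.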

Last updated: Jun 17, 2026