Data Lake

A data lake is a large, centralized repository that stores data in its raw, native format, whether structured (Parquet files), semi-structured (JSON, logs), or unstructured (images), so it can be explored and analyzed without defining a schema upfront.

Definition

A data lake is a storage repository that holds vast amounts of data in its raw, native format. Unlike data warehouses, which require data to be structured, cleaned, and loaded into a predefined schema, data lakes accept data as-is: JSON documents, logs, images, video, sensor data, clickstreams. The 'schema-on-read' approach means the structure is applied at query time, not at load time. Because they are typically built on object storage such as S3 or GCS, data lakes are low-cost, highly scalable, and well suited to exploratory analytics. However, without governance, they can become 'data swamps', where useful data is hard to find and its quality is hard to assess.
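To make 'schema-on-read' concrete, here is a minimal sketch in Python. The event records, the field names, and the read_with_schema helper are all hypothetical and exist only for illustration; a real lake would typically use a query engine rather than hand-written parsing. The point is that the raw lines are stored untouched, and types are applied only when someone reads them.

```python
import json

# Raw events kept exactly as they arrived; fields differ from record to record.
# (Hypothetical sample data, not taken from any real system.)
RAW_EVENTS = [
    '{"user": "ana", "ts": "2024-01-05T10:00:00", "amount": "19.90"}',
    '{"user": "ben", "ts": "2024-01-05T10:02:00"}',
    '{"user": "cy", "ts": "2024-01-05T10:03:00", "amount": 5, "coupon": "X1"}',
]

# The schema belongs to the reader, not to the storage layer.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and coerce each record to the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of rejecting the record.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, SCHEMA)
total = sum(r["amount"] for r in rows if r["amount"] is not None)
print(rows)
print(total)
```

A different reader could apply a different schema to the same raw lines (for example, one that keeps the coupon field), which is the flexibility a load-time schema would not allow.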

How It Works

1. Ingest: Raw data from any source lands in object storage (S3, GCS, ADLS).
2. Catalog: Metadata tools (Glue, Hive metastore) index what's there.
3. Explore: Data engineers or analysts query raw data with SQL or Spark.
4. Refine: High-value data flows into a warehouse for operational use.
5. Archive: Old or unused data remains available if needed.
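The steps above can be sketched end to end with only the Python standard library. This is a toy model under loud assumptions: a temporary local directory stands in for object storage, a plain list stands in for a metadata catalog, and an in-memory SQLite database stands in for a warehouse. The ingest helper, the file names, and the sample records are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Assumption: a local temp directory plays the role of an object store bucket.
lake = Path(tempfile.mkdtemp())


def ingest(source, name, payload):
    """Step 1 (Ingest): write raw data untouched, grouped by source."""
    path = lake / "raw" / source / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path


clicks_path = ingest(
    "clickstream",
    "2024-01-05.jsonl",
    '{"user": "ana", "page": "/home"}\n'
    '{"user": "ben", "page": "/cart"}\n'
    '{"user": "ana", "page": "/cart"}\n',
)
ingest("app_logs", "2024-01-05.log", "INFO start\nERROR timeout\n")

# Step 2 (Catalog): index what is in the lake (toy stand-in for a metastore).
catalog = [
    {"source": p.parent.name, "file": p.name, "bytes": p.stat().st_size}
    for p in sorted((lake / "raw").rglob("*"))
    if p.is_file()
]

# Step 3 (Explore): query the raw file directly, applying structure on read.
clicks = [json.loads(line) for line in clicks_path.read_text().splitlines()]
cart_views = [c for c in clicks if c["page"] == "/cart"]

# Step 4 (Refine): load the useful subset into a typed table
# (SQLite here is only a stand-in for a warehouse).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE page_views (user_id TEXT NOT NULL, page TEXT NOT NULL)"
)
warehouse.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(c["user"], c["page"]) for c in clicks],
)
views_per_user = dict(
    warehouse.execute("SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id")
)

# Step 5 (Archive): the raw files stay in the lake after refinement.
still_archived = clicks_path.exists()

print(catalog)
print(views_per_user)
```

Note that refinement copies data into the warehouse table rather than moving it, so the raw files remain available for later reprocessing.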

When to Use It

Use a data lake when you're ingesting diverse data types at scale, when you don't yet know how the data will be used (the exploration phase), or when you need to retain raw data for compliance. Data lakes are well suited to machine learning (raw training data), unstructured content (logs, images), and cost-sensitive archival. Pair a data lake with a data warehouse when you also need governed, high-quality datasets.

Last updated: Jun 17, 2026