Jump to content

Lakehouse

From Wikipedia, the free encyclopedia

Data lakehouses are a hybrid approach that can ingest a variety of raw data formats like a data lake, while also providing ACID transactions and enforced data quality like a data warehouse.[1]

The architecture was outlined in a 2020 paper by researchers at Databricks, who proposed combining the management features and transactional guarantees of a data warehouse with the low-cost storage and open file formats characteristic of a data lake.[1] The term was subsequently adopted by other data platform vendors offering similar architectures.[2]

Technically, a data lakehouse is typically built on open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi, which are layered over open file formats such as Apache Parquet on object storage. These table formats add transactional semantics, schema enforcement, and time-travel queries to data stored in conventional cloud object stores, allowing multiple compute engines to read and write the same tables.[1]

A common organizational pattern within lakehouse implementations is the "medallion" architecture, in which data is progressively refined through bronze (raw), silver (cleaned), and gold (aggregated) layers within the same storage system. This allows business intelligence, machine learning, and other analytic workloads to operate against a single managed copy of the data rather than separate warehouse and lake systems.[2]

References

[edit]
  1. ^ a b c Armbrust, Michael; Ghodsi, Ali; Xin, Reynold; Zaharia, Matei (2021). Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (PDF). 11th Annual Conference on Innovative Data Systems Research (CIDR).
  2. ^ a b Stedman, Craig. "What is a data lakehouse?". TechTarget. Retrieved 2026-06-02.