Data Warehouses, Data Lakes, and Lakehouses - Data Engineering for Beginners (Ch.6)

Chapter 6 is where the book zooms out from “how to design one database” to “where does all this data actually live in a real company.” The answer: it depends on what you are trying to do with it.

Nwokwu introduces three big storage concepts here: data warehouses, data lakes, and data lakehouses. Each one solves a different problem. And the fictional bakery, Dough & Delight, keeps growing throughout the chapter to show why you would need each one.

The Problem with Regular Databases

Dough & Delight has a normal transactional database with tables for orders, products, prices, and order details. It works great for taking orders and processing payments. But at the end of the week, someone asks: “How many cupcakes did we sell?”

To answer that, you need to join the Orders table with Products and OrderDetails, filter for cupcakes, sum the quantities, maybe pull in prices. It works, but here is the problem: this query has to scan thousands of rows, join multiple normalized tables, and compete with the live transactions happening at the same time. During peak hours, it slows everything down.
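A sketch of what that weekly question looks like in practice, using an in-memory SQLite database. The table names come from the chapter; the specific columns and sample rows are my own assumptions:

```python
import sqlite3

# In-memory stand-in for the bakery's transactional database.
# Table names follow the chapter; columns and data are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Name TEXT, Category TEXT, Price REAL);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, OrderDate TEXT);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);

INSERT INTO Products VALUES (1, 'Vanilla Cupcake', 'cupcake', 5.40),
                            (2, 'Sourdough Loaf', 'bread', 8.00);
INSERT INTO Orders VALUES (100, '2025-01-06'), (101, '2025-01-07');
INSERT INTO OrderDetails VALUES (100, 1, 3), (100, 2, 1), (101, 1, 2);
""")

# "How many cupcakes did we sell?" -- three tables, a filter, and a SUM,
# all competing with live transactions on the same database.
cupcakes_sold = conn.execute("""
    SELECT SUM(od.Quantity)
    FROM Orders o
    JOIN OrderDetails od ON o.OrderID = od.OrderID
    JOIN Products p      ON od.ProductID = p.ProductID
    WHERE p.Category = 'cupcake'
""").fetchone()[0]
```

Harmless at this toy scale; painful when those tables hold millions of rows and the query runs during the lunch rush.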

There is another problem. The price table only stores the current price. If cupcake prices changed mid-week, you have no record of what they used to be. Year-over-year analysis? Forget about it.

This is the difference between OLTP and OLAP:

  • OLTP (Online Transaction Processing): handles the day-to-day stuff. Fast reads and writes, simple operations. “Process this order.” “Update inventory.”
  • OLAP (Online Analytical Processing): handles the big questions. “What were our top products this month?” “Are there seasonal trends?” These queries scan huge amounts of data and do heavy aggregation.

Running OLAP queries on an OLTP database is like asking a sprinter to run a marathon. They can do it, but it is not what they are built for.

Data Warehouses

A data warehouse is a database built specifically for analytics. It stores large volumes of structured, historical data from multiple source systems. It is denormalized (fewer tables, fewer joins), and it is read-heavy by design.

Popular ones: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse Analytics.

ETL: Getting Data In

Before data lands in a warehouse, it goes through ETL: Extract, Transform, Load.

Extract pulls data from source systems. Databases, flat files, APIs, whatever.

Transform cleans it up. This includes:

  • Data enrichment (adding useful context like geographic info)
  • Reformatting (splitting or merging columns)
  • Cleaning (fixing nulls, removing duplicates)
  • Standardization (consistent date formats, etc.)
  • Aggregation (summing, averaging, counting)

Load puts it into the warehouse. Two strategies here:

  • Full load: dump everything in. Good for initial setup or small datasets.
  • Incremental load: only load what changed since last time. Much faster for ongoing updates.
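The usual way to implement an incremental load is a watermark: remember the latest timestamp you loaded, and next time pull only rows newer than that. A minimal sketch, assuming the source rows carry an `updated_at` column (the book does not specify this; it is a common convention):

```python
import sqlite3

# Toy source system; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (OrderID INTEGER, Total REAL, updated_at TEXT)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?, ?)", [
    (1, 12.50, "2025-01-01"),
    (2, 30.00, "2025-01-05"),
    (3, 7.80,  "2025-01-08"),
])

def incremental_extract(conn, watermark):
    """Pull only rows that changed since the last successful load."""
    return conn.execute(
        "SELECT OrderID, Total, updated_at FROM source_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

# Last load ran through 2025-01-04, so only orders 2 and 3 come back.
changed = incremental_extract(conn, "2025-01-04")
new_watermark = max(row[2] for row in changed)  # persisted for the next run
```

The watermark itself has to be stored somewhere durable (a control table, usually) so the pipeline survives restarts.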

Star Schema

The star schema is the bread and butter of data warehouse design. You have one fact table in the center that stores the numbers (quantities sold, revenue, etc.) and multiple dimension tables around it that store the descriptions (product names, customer details, dates, locations).

It looks like a star. Fact table in the middle, dimensions radiating out.

For Dough & Delight, the fact table is Sales:

Sales Fact Table:
SaleID | ProductID | CustomerID | StoreID | DateID | Quantity_Sold | Total_Sale_Amount

And the dimensions give context:

Product Dimension: ProductID, ProductName, Category, Flavor, Price
Customer Dimension: CustomerID, Name, Age, Email, LoyaltyLevel
Store Dimension: StoreID, StoreName, Location, Region
Date Dimension: DateID, Date, DayOfWeek, Month, Quarter, Year

Now a query like “total sales by product” is just one join:

SELECT p.ProductName, SUM(s.Total_Sale_Amount) AS Total_Sales
FROM Sales s
JOIN Product p ON s.ProductID = p.ProductID
GROUP BY p.ProductName
ORDER BY Total_Sales DESC;

Compare that to the transactional database version, which needs three joins and a multiplication. The warehouse query is simpler and runs on data that is already structured for this kind of analysis.

Snowflake Schema

The snowflake schema is a variation of the star schema where the dimension tables are further broken down into sub-tables. A Location dimension might split into Country, State, and City tables.

Advantages: less data duplication, saves storage, better for hierarchical data. Disadvantages: more joins, more complexity, slower queries.
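To make the "more joins" cost concrete, here is a snowflaked Store dimension where location details are normalized into their own tables. Table and column names are illustrative, not from the book:

```python
import sqlite3

# A snowflaked Store dimension: Region and City split out of Store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Region (RegionID INTEGER PRIMARY KEY, RegionName TEXT);
CREATE TABLE City   (CityID INTEGER PRIMARY KEY, CityName TEXT, RegionID INTEGER);
CREATE TABLE Store  (StoreID INTEGER PRIMARY KEY, StoreName TEXT, CityID INTEGER);

INSERT INTO Region VALUES (1, 'Southwest');
INSERT INTO City   VALUES (10, 'Phoenix', 1);
INSERT INTO Store  VALUES (500, 'Dough & Delight Downtown', 10);
""")

# Finding a store's region now takes two extra joins, where a flat
# star-schema Store dimension would hold Region as a plain column.
region = conn.execute("""
    SELECT r.RegionName
    FROM Store s
    JOIN City c   ON s.CityID = c.CityID
    JOIN Region r ON c.RegionID = r.RegionID
    WHERE s.StoreID = 500
""").fetchone()[0]
```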

Here is what I found: for most teams, the star schema is the right default. It is simpler, faster to query, and easier to maintain. Use a snowflake schema when you have complex hierarchical dimensions and storage savings really matter.

Slowly Changing Dimensions

Remember the cupcake pricing problem? Dimension data changes over time. Nwokwu covers three ways to handle this:

Type 1 (Overwrite): Just replace the old value with the new one. Simple, but you lose history. Good for fixing typos.

Type 2 (Add a new row): Keep the old record and add a new one with a flag or timestamp showing which is current. This is the most common approach when history matters.

ProductID | ProductName | Price | StartDate  | EndDate
93201     | cupcake     | 5.40  | 2024-11-13 | 2025-01-03
93201     | cupcake     | 7.80  | 2025-01-04 | null
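The Type 2 change in the table above boils down to two statements: close out the current row, then insert the new one. The schema matches the example; the helper function is my own sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ProductDim (
    ProductID INTEGER, ProductName TEXT, Price REAL,
    StartDate TEXT, EndDate TEXT)""")
conn.execute("INSERT INTO ProductDim VALUES (93201, 'cupcake', 5.40, '2024-11-13', NULL)")

def scd2_price_change(conn, product_id, new_price, change_date, prev_end_date):
    """Type 2 pattern: end-date the current row, add a new current row."""
    conn.execute(
        "UPDATE ProductDim SET EndDate = ? WHERE ProductID = ? AND EndDate IS NULL",
        (prev_end_date, product_id),
    )
    name = conn.execute(
        "SELECT ProductName FROM ProductDim WHERE ProductID = ? LIMIT 1", (product_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO ProductDim VALUES (?, ?, ?, ?, NULL)",
        (product_id, name, new_price, change_date),
    )

scd2_price_change(conn, 93201, 7.80, "2025-01-04", "2025-01-03")
history = conn.execute(
    "SELECT Price, StartDate, EndDate FROM ProductDim WHERE ProductID = 93201 ORDER BY StartDate"
).fetchall()
```

After the change, `history` holds both rows, and the `NULL` end date marks which one is current.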

Type 3 (Add a column): Store current and previous values side by side. Only tracks one change, but it is simple.

WarehouseID | Name                  | CurrentAddress | PreviousAddress
562819      | Cinnamon Square Depot | Riverside Mall | Parkside Avenue

Most production systems use Type 2 when they care about history.

Data Marts

A data mart is a smaller subset of a data warehouse, focused on one department or business area. Marketing gets a marketing data mart. Finance gets a finance data mart. Each one has only the data that team needs.

Benefits: faster queries (less data to scan), simpler for domain-specific work, cheaper to set up.

But here is the problem: too many independent data marts create data silos. Marketing’s revenue number does not match finance’s number. Nobody agrees on what “a customer” means. You lose your single source of truth. The book rightly warns about this. Build data marts with a unified strategy, not as standalone projects.

Data Lakes

Dough & Delight has expanded. They now have 50 stores, a website with customer reviews, recipe videos, and all kinds of unstructured data. A data warehouse handles structured data well, but it cannot store videos and raw text efficiently.

Enter the data lake. A data lake stores everything: structured, semi-structured, and unstructured data, all in its raw format. You dump it in first and figure out the structure later. Popular options: Amazon S3, Azure Data Lake Storage, Google Cloud Storage.

The key difference from a warehouse: data lakes use ELT (Extract, Load, Transform) instead of ETL. You load the raw data first, then transform it when you actually need it.

Data in a lake flows through three zones:

  1. Raw Zone: unprocessed data as it arrives. No cleaning, no transformations. Your safety net if you ever need to go back to the original.
  2. Cleansed Zone: basic fixes applied. Date formats standardized, missing values handled, duplicates removed. Usable but not business-ready.
  3. Curated Zone: enriched and transformed for specific uses. This is what dashboards and ML models consume.

In practice, this often looks like folders in S3:

s3://your-bucket/
    raw-zone/
        source1/2025-01-01/data.json
    cleansed-zone/
        source1/2025-01-01/cleaned_data.json
    curated-zone/
        reports/monthly_sales_2025-01.csv
        machine-learning-models/

Two important concepts here:

Data cataloging is like the index of a library. It organizes metadata (what data exists, where it came from, how it is structured) so people can actually find and use what is in the lake.

Schema-on-read means the data has no fixed structure when stored. You apply a schema only when you query it. A data scientist and a business analyst can use the same dataset with different schemas. Flexible, but it requires discipline.
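Schema-on-read in miniature: the same raw records, projected two different ways by two different consumers. The records and field names are invented for illustration:

```python
# One raw dataset in the lake, no fixed schema imposed at write time.
raw = [
    {"review_id": 1, "stars": 5, "text": "Best cupcakes ever", "customer_id": 7},
    {"review_id": 2, "stars": 2, "text": "Croissant was stale", "customer_id": 9},
]

# A business analyst reads only the fields needed for a ratings report...
analyst_view = [{"review_id": r["review_id"], "stars": r["stars"]} for r in raw]

# ...while a data scientist reads the text and derives a sentiment label.
scientist_view = [{"text": r["text"], "label": r["stars"] >= 4} for r in raw]
```

Neither reader changed the stored data; each applied its own schema at query time.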

The main risk with data lakes: without good governance, they turn into data swamps. Just a massive pile of files nobody can find or trust.

Data Lakehouses

Now Dough & Delight wants to combine structured analytics with unstructured data analysis. “Which recipe videos do customers like most?” “Are positive reviews correlated with more purchases?” These questions need data from both the warehouse and the lake.

A data lakehouse combines both into a single system. You get the raw storage of a data lake, the structured query performance of a warehouse, and ACID transaction support.

The architecture has three layers:

  • Storage layer: the data lake where raw data lives
  • Metadata layer: a catalog that structures and indexes everything
  • Consumption layer: a query engine that connects to both the lake and the catalog
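The three layers can be sketched as a toy: files in a "storage layer," a catalog dict as the "metadata layer," and a query function as the "consumption layer." Every path and dataset name here is invented:

```python
storage = {  # storage layer: raw files keyed by path
    "lake/sales/2025-01.parquet": [{"product": "cupcake", "qty": 120}],
    "lake/reviews/2025-01.json":  [{"stars": 5, "text": "Loved it"}],
}

catalog = {  # metadata layer: dataset name -> location and schema
    "sales":   {"path": "lake/sales/2025-01.parquet", "columns": ["product", "qty"]},
    "reviews": {"path": "lake/reviews/2025-01.json",  "columns": ["stars", "text"]},
}

def query(dataset, column):
    """Consumption layer: resolve through the catalog, read from storage."""
    entry = catalog[dataset]
    if column not in entry["columns"]:
        raise KeyError(f"{column} not in {dataset}")
    return [row[column] for row in storage[entry["path"]]]

cupcake_qty = query("sales", "qty")
```

Real lakehouses do this with table formats and catalogs at vastly larger scale, but the division of labor is the same: storage holds bytes, metadata knows what they mean, and the query engine ties the two together.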

This can help you avoid maintaining two separate systems (lake + warehouse) with ETL pipelines copying data between them. One system, one copy of the data.

The Comparison

Here is what I found useful from the book’s summary table:

|                | Database             | Data Warehouse    | Data Lake                      | Data Lakehouse    |
|----------------|----------------------|-------------------|--------------------------------|-------------------|
| Workload       | Transactional (OLTP) | Analytical (OLAP) | Analytical                     | Both              |
| Data types     | Structured           | Structured        | All types                      | All types         |
| Schema         | Schema-on-write      | Schema-on-write   | Schema-on-read                 | Both              |
| Best for       | Real-time apps       | Reporting, BI     | Raw data storage, ML           | Unified analytics |
| Storage format | Row-based            | Columnar          | Flexible (Parquet, JSON, etc.) | Flexible + tuned  |

The short version: databases run your app, warehouses answer your business questions, lakes store everything for future use, and lakehouses try to do it all in one place.

My Take

Chapter 6 covers a lot of ground, and it does a good job of building up from “why do we need this?” to “how does it work?” The bakery example running throughout the chapter makes each concept concrete. You see why a regular database is not enough, why a warehouse helps, why a lake is needed for unstructured data, and why a lakehouse tries to unify everything.

If you are just starting out, here is my practical advice: learn the warehouse first. Understand fact tables, dimension tables, and ETL. That is the foundation. Data lakes and lakehouses build on top of that.

Next up: Chapter 7 on data pipelines, where we get into batch vs. streaming processing.


This is part 9 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.

