The Evolution of Data Architecture

We’ve all heard the terms “Data Warehouse” and “Data Lake,” but do you actually know why we keep switching between them? In Chapter 4 of Big Data on Kubernetes, Neylson Crepalde gives a masterclass on how data architecture has evolved to keep up with the modern world.

Here is the breakdown of how we got to where we are today.

The Old Guard: Data Warehouses

For decades, the Data Warehouse was the king. It was great for structured data (think tables from your CRM). But it had a few big problems:

  • Schema-on-write: You had to define exactly what your data looked like before you saved it. If your requirements changed, you were in for a long weekend of refactoring.
  • Structured only: It couldn’t handle images, videos, or raw logs very well.
  • Batch latency: Data was usually updated daily or weekly. In a world that moves in seconds, that’s just too slow.
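To make the schema-on-write constraint concrete, here is a minimal sketch using Python's built-in sqlite3 module (table and column names are invented for illustration). The schema must be declared before any data lands, and rows that don't match it are rejected:

```python
# Schema-on-write sketch: the table's shape is fixed up front,
# and every insert must conform to it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        signup_date TEXT NOT NULL
    )
""")

# This insert matches the declared schema, so it succeeds.
conn.execute(
    "INSERT INTO customers (id, name, signup_date) VALUES (?, ?, ?)",
    (1, "Ada", "2024-01-15"),
)

# A row with a column the schema doesn't know about is rejected outright.
try:
    conn.execute(
        "INSERT INTO customers (id, name, signup_date, region) "
        "VALUES (?, ?, ?, ?)",
        (2, "Grace", "2024-02-01", "EMEA"),
    )
except sqlite3.OperationalError as e:
    print(f"Rejected: {e}")
```

Adding that `region` column for real means an `ALTER TABLE` and, in a production warehouse, a migration plan. That is the "long weekend of refactoring."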

The Wild West: Data Lakes

Then came the Data Lake. It was the complete opposite. You could dump anything into it—JSON, CSV, MP4—and figure out the schema later (Schema-on-read).

But without proper care, these “lakes” quickly turned into “data swamps.” Finding anything was impossible, and because they used basic object storage, you couldn’t easily update a single row of data. You had to rewrite the whole file.
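Schema-on-read can be sketched in a few lines of plain Python (the records and field names here are made up for illustration). Nothing stops differently-shaped records from landing in the lake; structure is imposed only at query time:

```python
# Schema-on-read sketch: store anything, decide on a shape when you read.
import json

# Two records landed in the lake with different shapes -- nothing stopped them.
raw_records = [
    '{"id": 1, "name": "Ada", "signup_date": "2024-01-15"}',
    '{"id": 2, "name": "Grace", "region": "EMEA"}',
]

def read_with_schema(lines, fields, default=None):
    """Impose a schema at read time, filling missing fields with a default."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field in fields}

rows = list(read_with_schema(raw_records, ["id", "name", "region"]))
print(rows)
```

The flexibility is real, but so is the downside: every reader has to know (or guess) what the data looks like, which is exactly how a lake drifts into a swamp.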

The Hybrid Heroes: Lambda and Kappa

To solve the speed vs. accuracy problem, two famous architectures emerged:

  1. Lambda Architecture: This is the “hybrid” approach. It has a Batch Layer for historical accuracy and a Speed Layer for real-time updates. A Serving Layer merges them together when you run a query. It’s complex, but it works.
  2. Kappa Architecture: This simplifies things by treating everything as a stream. No separate batch layer. It’s elegant but can be much harder to implement and scale.
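Here is a toy sketch of the Lambda pattern, boiled down to dictionaries in plain Python. All names are illustrative; a real system would use something like Spark for the batch layer and a stream processor for the speed layer, but the merge logic has the same shape:

```python
# Lambda architecture sketch: a query merges a precomputed batch view
# with an incrementally updated speed view.
from collections import Counter

# Batch layer: recomputed periodically over the full history (accurate, slow).
batch_view = Counter({"page_a": 1000, "page_b": 250})

# Speed layer: counts only the events that arrived since the last batch run.
speed_view = Counter({"page_a": 3, "page_c": 7})

def serve(page: str) -> int:
    """Serving layer: merge batch and speed views at query time."""
    return batch_view[page] + speed_view[page]

print(serve("page_a"))  # historical total plus fresh updates
print(serve("page_c"))  # a brand-new page, seen only by the speed layer
```

The complexity the chapter warns about is visible even here: you maintain two code paths that must compute the same metric and agree at the seam. Kappa removes the duplication by making the stream the only path.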

The New Standard: The Data Lakehouse

The industry is now settling on the Data Lakehouse. It’s exactly what it sounds like: the flexibility of a Data Lake combined with the ACID transactions and SQL performance of a Warehouse.

A common way to organize this is the Medallion Design:

  • Bronze: Raw data, exactly as it arrived.
  • Silver: Cleaned and integrated data. Analysis-ready.
  • Gold: Aggregated metrics and KPIs for your final dashboards.
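The Medallion flow can be sketched with plain Python (a real Lakehouse would typically run this as Spark or SQL jobs over table formats like Delta or Iceberg; the records below are invented for illustration):

```python
# Medallion sketch: bronze keeps raw records untouched,
# silver cleans and standardizes, gold aggregates for dashboards.

# Bronze: raw events exactly as they arrived, duplicates and all.
bronze = [
    {"user": "ada",   "amount": "10.5", "country": "br"},
    {"user": "ada",   "amount": "10.5", "country": "br"},   # duplicate
    {"user": "grace", "amount": "4.0",  "country": "US"},
]

# Silver: deduplicated, typed, and standardized -- analysis-ready.
seen = set()
silver = []
for row in bronze:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        silver.append({
            "user": row["user"],
            "amount": float(row["amount"]),      # string -> number
            "country": row["country"].upper(),   # consistent casing
        })

# Gold: an aggregated KPI -- total revenue per country.
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]

print(gold)
```

Keeping bronze untouched is the point: if a silver or gold rule turns out to be wrong, you can rebuild everything downstream from the raw layer.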

Understanding these patterns is crucial because Kubernetes is the perfect platform to host them: it has the flexibility to run batch jobs, streaming clusters, and SQL engines all in one place.

In the next post, we’ll look at the specific open-source tools that make this modern stack actually work.

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
