Data Engineering with AWS Chapter 2: Data Management Architectures for Analytics
Chapter 1 gave us the “who” and “why” of data engineering. Now it is time for the “where.” Where does all that data actually live? How do organizations store, organize, and serve billions of rows of information so that someone on the business side can pull up a dashboard and make a decision before lunch?
This is post 3 in my Data Engineering with AWS retelling series.
Chapter 2 walks through the evolution of data storage architectures – from old-school data warehouses to modern data lakehouses. If you have ever been confused by these terms, this post will sort it out for you.
A Quick History of Storing Data for Analytics
Back in the 1980s, companies processed data in big nightly batch runs on mainframe computers. Then the 1990s happened. Databases exploded everywhere. A single company might have tens or even hundreds of separate databases, each running a different part of the business. Sales had one. Inventory had another. HR had its own. They were all islands.
The problem? When the CEO asked “how did we do last quarter across all divisions?” nobody could answer quickly. Each database was built for running the business, not for answering big-picture questions.
That is when data warehouses entered the picture.
The Data Warehouse: One Place for All the Important Stuff
A data warehouse is a central repository that pulls data from all those scattered databases into one place. It stores well-structured, cleaned-up, highly trusted data. Think of it like a library where every book is cataloged, organized, and quality-checked before it goes on the shelf.
The key characteristics of a traditional data warehouse:
- Well-integrated – data from multiple sources is combined and consistent
- Highly structured – everything lives in neat rows and columns
- Curated – the data goes through serious validation and cleaning before it enters
- Trusted – business leaders can rely on it for real decisions
Early data warehouses were custom-built on powerful servers. By the mid-2000s, companies like Oracle, Teradata, and IBM started selling purpose-built hardware appliances designed specifically for warehousing at terabyte and petabyte scale.
Then the cloud showed up and changed everything. Amazon Redshift launched in 2013, and others followed: Snowflake, Google BigQuery, and Azure Synapse. Cloud data warehouses gave you petabyte scale without buying a single server. You pay for what you use. You scale up when you need to and scale down when you do not.
How Modern Data Warehouses Actually Work
Two technical ideas make modern warehouses fast: Massively Parallel Processing (MPP) and columnar storage.
Massively Parallel Processing
Imagine you have a huge pile of exams to grade. Instead of grading them all yourself, you hand stacks to twenty teaching assistants who work in parallel. That is MPP.
In Amazon Redshift, a cluster has a leader node and multiple compute nodes. When you run a query, the leader node takes it apart, creates an execution plan, and sends pieces of the work to all the compute nodes at once. Each compute node processes its own portion of the data. The results come back fast because everyone worked simultaneously.
Need more speed? Add more compute nodes. That is horizontal scaling.
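To make the analogy concrete, here is a toy sketch of the leader/compute split in plain Python. The `parallel_sum` function and the fixed node count are illustrative assumptions, not how Redshift actually distributes work:

```python
from concurrent.futures import ThreadPoolExecutor

def sum_chunk(chunk):
    # Each "compute node" aggregates only its own slice of the data.
    return sum(chunk)

def parallel_sum(values, nodes=4):
    # The "leader node" splits the work into one chunk per compute node...
    chunks = [values[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = pool.map(sum_chunk, chunks)
    # ...then combines the partial results into the final answer.
    return sum(partials)

sales = list(range(1, 1_000_001))   # pretend this is a huge fact-table column
print(parallel_sum(sales))          # -> 500000500000, same as sum(sales)
```

Adding more workers is the toy version of adding compute nodes: each one grades a smaller stack of exams.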
Columnar Storage
Traditional databases store data row by row. If you have a table with 50 columns and you only need 3 of them for your query, a row-based database still reads all 50 columns for every row. Wasteful.
Modern data warehouses flip this around. They store data column by column. So when your query asks for just sales amount, product name, and date, the warehouse reads only those three columns and skips everything else. Way less disk reading, way faster queries.
Bonus: since all values in a column share the same data type (all integers, all strings, etc.), compression works much better. Smaller data on disk means even faster reads.
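A rough way to see the benefit in plain Python. The JSON-on-disk layouts below are a stand-in for real storage formats (a warehouse would use something like its own block format or Parquet), so treat this as a sketch of the idea, not the mechanics:

```python
import json

# A 1,000-row table with three columns, stored two ways.
rows = [{"date": f"2024-01-{(i % 28) + 1:02d}",
         "product": ["widget", "gadget", "doohickey"][i % 3],
         "amount": round(i * 0.37, 2)}
        for i in range(1000)]

# Row-oriented layout: a query touches every column of every row.
row_blob = json.dumps(rows).encode()

# Column-oriented layout: each column is stored contiguously,
# so a query can read just the columns it needs.
columns = {name: [r[name] for r in rows]
           for name in ("date", "product", "amount")}

# Query needing only "amount": the column store reads one column's bytes.
amount_blob = json.dumps(columns["amount"]).encode()
print("row store reads:   ", len(row_blob), "bytes")
print("column store reads:", len(amount_blob), "bytes for the same query")
```

The homogeneous `date` column (28 repeating values) is also far friendlier to compression than the same values interleaved with product names and amounts.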
Star Schemas and Snowflake Schemas
Data in a warehouse is organized using dimensional models. The two most popular are the star schema and the snowflake schema.
Star Schema
Picture a star shape. In the center sits a fact table – this holds your measurable numbers like sales amount, quantity sold, or revenue. Around it sit dimension tables that give context: who bought it (customer), what was it (product), where was it sold (store), and when (date).
The fact table has foreign keys pointing to each dimension table. Business analysts slice and dice the facts through different dimensions to generate insights. What product sold most in Q3? Which store had the highest revenue in electronics?
Star schemas are simple and fast. But dimension tables can get bloated with duplicate data.
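Here is a minimal star schema sketched with SQLite. All table and column names (`fact_sales`, `dim_product`, `dim_store`) are invented for illustration; a real warehouse schema would be far larger:

```python
import sqlite3

# One fact table in the center, two dimension tables around it.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_store   VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO fact_sales  VALUES (1, 1, 1200.0), (1, 2, 1150.0), (2, 1, 300.0);
""")

# "Which category earned the most in each city?" --
# the facts get sliced through two dimensions at once.
query = """
SELECT s.city, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store   s ON f.store_id   = s.store_id
GROUP BY s.city, p.category
ORDER BY revenue DESC;
"""
for city, category, revenue in db.execute(query):
    print(city, category, revenue)
# Austin Electronics 1200.0
# Boston Electronics 1150.0
# Austin Furniture 300.0
```

Every analytical question follows the same shape: join the fact table to whichever dimensions give the context you need, then aggregate.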
Snowflake Schema
A snowflake schema takes the star schema and normalizes the dimension tables further. Instead of one big product dimension table, you might split it into a product table and a separate product category table. This reduces data duplication and keeps things cleaner.
The trade-off? More tables means more joins, which can slow down queries. You pick your poison based on your use case.
Data Marts: The Focused Subset
Not everyone in a company needs the full warehouse. The marketing team wants marketing data. Finance wants finance data. Making every team navigate a massive enterprise-wide schema is like handing someone a library card when they just want one specific book.
A data mart is a smaller, focused slice of the warehouse built for a specific team or business area. It has a simpler schema and only the data that group cares about. Faster to build, easier to understand, cheaper to run.
Data marts can be built two ways:
- Top-down: Take a slice from the existing warehouse
- Bottom-up: Pull data directly from source systems for a specific domain
ETL vs. ELT: Getting Data Into the Warehouse
Data does not magically appear in a warehouse. You need pipelines to extract it from source systems, clean it up, and load it in. There are two main patterns.
ETL (Extract, Transform, Load)
Pull data from sources. Land it in a staging area outside the warehouse. Transform it there – validate, clean, standardize, restructure. Then load the finished product into the warehouse.
ETL works well when:
- Source systems use different technologies than the warehouse
- Data volumes are small to moderate
- Transformations are complex and compute-heavy
Common ETL tools include AWS Glue, Informatica, Talend, and DataStage.
ELT (Extract, Load, Transform)
Pull data from sources. Load it directly into a staging area inside the warehouse as-is. Then use the warehouse engine itself to do the transformations.
ELT works well when:
- Source and warehouse use similar database technologies
- You have large volumes that need to land fast
- The warehouse engine (like Redshift with MPP) can handle the transformation work efficiently
The difference is basically where the heavy lifting happens. ETL transforms outside the warehouse. ELT transforms inside it.
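A toy sketch of the two patterns, using SQLite as a stand-in for the warehouse. The cleanup rules and table names are invented for illustration; the point is only *where* the transformation runs:

```python
import sqlite3

raw = [("  ALICE ", "120"), (" bob", "80")]  # messy extract from a source system

# --- ETL: transform in the pipeline code, load the finished rows ---
etl_wh = sqlite3.connect(":memory:")
etl_wh.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
cleaned = [(name.strip().title(), int(amount)) for name, amount in raw]
etl_wh.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

# --- ELT: load the raw data as-is into staging, then let the
# --- warehouse engine do the transformation in SQL
elt_wh = sqlite3.connect(":memory:")
elt_wh.execute("CREATE TABLE staging (customer TEXT, amount TEXT)")
elt_wh.executemany("INSERT INTO staging VALUES (?, ?)", raw)
elt_wh.execute("""
    CREATE TABLE sales AS
    SELECT trim(customer) AS customer,
           CAST(amount AS INTEGER) AS amount
    FROM staging
""")

print(list(etl_wh.execute("SELECT * FROM sales")))  # cleaned before loading
print(list(elt_wh.execute("SELECT * FROM sales")))  # cleaned inside the engine
```

Same destination table, different division of labor: ETL spends compute in the pipeline, ELT spends it in the warehouse.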
The Data Lake: When Warehouses Are Not Enough
Data warehouses are great for structured, tabular data. But what about video files? Audio recordings? JSON logs from web servers? Sensor data from IoT devices? Social media posts?
This is semi-structured and unstructured data, and it has been growing exponentially. Warehouses were not designed for it. They want neat rows and columns. They couple compute and storage together. And they only speak SQL.
A data lake is a massive, central repository (usually built on cloud object storage like Amazon S3) where you can dump any kind of data: structured, semi-structured, unstructured. No need to convert it into a specific format first. Just load it in.
Cloud data lakes give you:
- Virtually unlimited, cheap storage
- Ability to store any data format as-is
- Integration with diverse analytics tools (SQL, Spark, machine learning frameworks, BI tools)
- Decoupled compute and storage – scale each independently
Data Lake Zones
A well-designed data lake organizes data into zones:
- Landing/Raw Zone – data arrives here straight from source systems, completely untouched. This is your permanent archive of raw data.
- Clean/Transform Zone – data gets validated, cleaned, and often converted to optimized formats like Parquet. Sensitive data gets masked or removed here.
- Curated/Enriched Zone – data is refined with business logic, fully cataloged, partitioned, and optimized for consumption. This is the “ready to eat” zone.
Some lakes use two zones, some use five. It depends on complexity. The point is that data flows through stages, getting cleaner and more useful at each step.
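In practice the zones usually show up as key prefixes in the object store. Here is one possible naming convention sketched in Python; the zone names, the `year=/month=/day=` partitioning, and the `lake_key` helper are illustrative assumptions, not an AWS requirement:

```python
from datetime import date

def lake_key(zone, source, dataset, day, filename):
    """Build a zone-prefixed object key. This layout (zone/source/dataset
    plus date partitions) is one common convention, not a rule."""
    return (f"{zone}/{source}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
            f"{filename}")

d = date(2024, 3, 15)
# The same dataset moving through the zones, getting cleaner each time:
print(lake_key("raw",     "webshop", "orders",        d, "orders.json"))
print(lake_key("clean",   "webshop", "orders",        d, "orders.parquet"))
print(lake_key("curated", "sales",   "daily_revenue", d, "part-0000.parquet"))
```

The `year=/month=/day=` style prefixes double as partitions, so query engines can skip whole date ranges instead of scanning everything.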
The Five Layers of a Data Lake
Think of a data lake as five logical layers working together:
- Storage Layer – the actual storage (S3 buckets) organized into zones
- Cataloging and Search Layer – metadata about every dataset so people can find what they need (AWS Glue Data Catalog, Lake Formation)
- Ingestion Layer – tools that connect to source systems and bring data in (AWS DMS for databases, Kinesis Firehose for streams)
- Processing Layer – transforms raw data through the zones (AWS Glue, EMR with Spark)
- Consumption Layer – analytics tools that read the processed data (Athena for SQL queries, QuickSight for dashboards, SageMaker for ML)
The Data Lakehouse: Best of Both Worlds
Here is the problem many organizations hit: they built a data warehouse AND a data lake separately. Two ingestion pipelines. Two governance systems. Two sets of tools. Data gets duplicated and moved between them constantly. Things get out of sync.
The data lakehouse architecture tries to fix this by combining the best features of both:
- Store all data types (like a lake)
- Support ACID transactions – insert, update, delete records cleanly (like a warehouse)
- Handle petabyte scale (like a lake)
- Provide low-latency access (like a warehouse)
- Work with SQL, Spark, ML tools, and BI tools (the full toolkit)
Lakehouse on AWS
On AWS, the lakehouse is built on two key services:
Redshift Spectrum lets your Redshift data warehouse query data stored in S3 directly. You write one SQL query that combines data from both Redshift tables and S3 files. No copying data back and forth.
AWS Lake Formation provides a central catalog where all datasets (warehouse and lake) are registered, searchable, and governed. AWS Glue crawlers automatically discover datasets in S3, extract their metadata, and register them in the catalog.
Other lakehouse approaches exist too. Databricks offers Delta Lake, which brings ACID transactions directly to data lake storage. Apache Hudi is an open-source option for insert/update/delete operations on lake data. Microsoft offers PolyBase in Azure Synapse Analytics.
The big idea: stop treating your warehouse and lake as separate kingdoms. Unify them so analysts and data scientists can access everything from one place.
Hands-On: Setting Up the AWS CLI
The chapter wraps up with a practical exercise: installing and configuring the AWS CLI, then creating your first S3 bucket. Here is the quick version:
```shell
# Install the AWS CLI from https://aws.amazon.com/cli/

# Configure it with your credentials
aws configure
# It will ask for:
# - AWS Access Key ID
# - AWS Secret Access Key
# - Default region (e.g., us-east-2)
# - Default output format (just press Enter)

# Create your first S3 bucket
aws s3 mb s3://your-unique-bucket-name
# If it works, you will see:
# make_bucket: your-unique-bucket-name
```
Bucket names must be globally unique across all AWS accounts. If someone else already took the name, you get an error. Pick something specific to you.
If you already have the AWS CLI configured for a different account, you can set up a named profile:
```shell
aws configure --profile dataengbook

# Then use it like:
aws s3 ls --profile dataengbook
```
Wrapping Up
Chapter 2 is about understanding where data lives and how it gets there. The evolution went roughly like this:
- 1980s-1990s: Scattered databases everywhere, no unified analytics
- 1990s-2000s: Data warehouses centralized structured data for reporting
- 2010s: Data lakes handled the explosion of unstructured and semi-structured data
- Now: Data lakehouses combine both into a unified architecture
As a data engineer, you will encounter all of these in the real world. Legacy warehouses are not going away. Data lakes are everywhere. And lakehouses are the direction things are moving. Understanding all three gives you the vocabulary and mental models to work on any modern data platform.
Next chapter, we get into the actual AWS tools you will use to build all of this.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3
Previous: Chapter 1 - What Even Is Data Engineering? Next: Chapter 3 Part 1 - The AWS Data Engineer’s Toolkit