Data Engineering with AWS Chapter 5: Architecting Data Engineering Pipelines

This is post 8 in my Data Engineering with AWS retelling series.

You have learned about data engineering principles, data architectures, the AWS toolkit, and data governance. Now it is time to put it all together. Chapter 5 is about designing an actual data pipeline. Not writing code yet. Just thinking. Planning. Drawing on a whiteboard.

And honestly, this might be the most important chapter in the whole book.

Stop Trying to Boil the Ocean

The biggest mistake teams make when starting a data engineering project is trying to do everything at once. They look at the hundreds of data sources across the company, imagine all the possible dashboards and machine learning models, and try to design one massive system that handles it all on day one.

That almost always fails.

Gareth Eagar makes a great point here using the famous quote from Field of Dreams: “If you build it, they will come.” In the movie, that worked out. In data engineering, it usually does not. Multi-year projects that try to ingest every data source before anyone has asked for specific analytics tend to collapse under their own weight.

The better approach is simple. Pick one specific use case. Get executive sponsorship. Build a focused pipeline for that one thing. Deliver value. Then use that win as a case study to expand. You still keep the bigger picture in mind so you are not painting yourself into a corner, but you scope the first project tightly enough to actually finish it.

Think Like an Architect (the Building Kind)

Eagar draws a great analogy between building a house and building a data pipeline. When you hire an architect for a house, they do not start picking out bathroom tiles on day one. They sit down with you and ask questions. How many bedrooms? What materials? What is the lot like?

A data engineer does the same thing:

  • Talk to the people who will use the data. What do they need? What tools do they prefer? What questions are they trying to answer?
  • Understand the data sources. Where does the data live? What format is it in? Who owns it? How often does it change?
  • Figure out which tools fit. Based on what you learned in the first two steps, which AWS services make sense?

This gathering phase happens before you write a single line of code. The tool for this phase is a whiteboard.

The Whiteboarding Session

The chapter walks through a structured whiteboarding approach for pipeline design. You bring together stakeholders – data consumers, data owners, system administrators, business sponsors – and spend half a day sketching out a high-level architecture.

The key insight is that you work backward. You do not start with “what data do we have?” You start with “who needs data and what do they need it for?” Then you trace back to the sources.

Here is the sequence:

  1. Identify data consumers and their requirements. Who are they? Business users wanting dashboards? Data analysts running SQL queries? Data scientists building ML models?
  2. Determine what tools consumers will use. BI dashboards like Tableau or QuickSight? SQL query editors like Amazon Athena? Machine learning notebooks like SageMaker?
  3. Identify the data sources. Internal databases, third-party data feeds, log files, streaming events, marketplace data from AWS Data Exchange.
  4. Figure out ingestion. How does each data source get into the pipeline? Batch loads with AWS DMS? Streaming with Kinesis? Direct file drops to S3?
  5. Plan transformations at a high level. What needs to happen between raw data and consumer-ready data?

You sketch each piece on the whiteboard as you go. Consumers on the right side, sources on the left, transformations in the middle.

Know Your Data Consumers

The chapter identifies four main types of data consumers you will typically encounter:

Business users want interactive dashboards. They want to see last week’s sales by region, top products, campaign performance. They do not want to write SQL. They want to click and explore.

Business applications consume data programmatically. Think of Spotify Wrapped – that year-end summary of your listening habits. A data pipeline feeds that application. Your pipeline might power something similar.

Data analysts dig deeper. They write SQL queries to answer complex questions. What percentage of customers browsed more than 5 times in the last 2 weeks but never bought anything? That kind of thing.
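As a toy illustration (not from the book), that browsed-but-never-bought question is the kind of thing an analyst might express in a few lines of SQL or pandas. Here is a pandas sketch on made-up clickstream events:

```python
import pandas as pd

# Hypothetical events: customer 1 browses six times without buying,
# customer 2 browses once and then purchases.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1, 1, 2, 2],
    "event": ["browse"] * 6 + ["browse", "purchase"],
})

browse_counts = events[events["event"] == "browse"].groupby("customer_id").size()
buyers = set(events.loc[events["event"] == "purchase", "customer_id"])

# Customers with more than 5 browses and no purchase in the window.
window_shoppers = [c for c, n in browse_counts.items() if n > 5 and c not in buyers]
```

In practice the analyst would run the equivalent query in Athena against the curated zone; the point is that the pipeline has to deliver data in a shape that makes this easy.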

Data scientists build machine learning models. They need access to large, diverse datasets and tools like SparkML or SageMaker. Their requirements are usually the broadest in terms of data access.

During the whiteboarding session, you ask each group what they need, what tools they prefer, and whether there are existing corporate standards. You do not need to finalize every tool choice. Just get enough information to sketch out the right side of your architecture.

Map Your Data Sources

Next, you flip to the left side of the whiteboard. Where is the data coming from?

For each data source, you need to capture:

  • What system stores it? A MySQL database? Files on a server? A SaaS platform like Salesforce?
  • Who owns the system and the data? These are often different people.
  • How often does the data need to be ingested? Once a day? Every hour? Real-time streaming?
  • What format is the raw data in? CSV, JSON, database tables, log files?
  • Any governance concerns? PII, sensitive data, compliance requirements?

You might also note potential ingestion tools. For a relational database, AWS Database Migration Service (DMS) could replicate data to S3. For log files, a Kinesis Agent could stream them to Kinesis Data Firehose. For third-party data, AWS Data Exchange might deliver CSV files daily.
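To make the log-streaming path concrete, here is a minimal sketch of what a record headed for Kinesis Data Firehose looks like. The stream name and the wrapping of each log line in JSON are my assumptions, not something the book specifies; in the book's scenario the Kinesis Agent does this work for you:

```python
import json

def to_firehose_record(log_line: str) -> dict:
    # Firehose expects each record as bytes under the "Data" key.
    # A trailing newline keeps records separable once Firehose batches them in S3.
    return {"Data": (json.dumps({"raw": log_line}) + "\n").encode("utf-8")}

records = [to_firehose_record(line) for line in [
    '203.0.113.9 - - [12/Jan/2026] "GET /widgets HTTP/1.1" 200',
]]

# Sending them directly would look like this (requires AWS credentials
# and an existing delivery stream; "weblog-stream" is a hypothetical name):
# import boto3
# boto3.client("firehose").put_record_batch(
#     DeliveryStreamName="weblog-stream", Records=records)
```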

But remember, the goal of the whiteboard session is not to make final technical decisions. It is to get a high-level picture everyone agrees on.

Plan Your Transformations

The middle of the whiteboard is where transformations live. Raw data goes in on the left, consumer-ready data comes out on the right. In between, things happen.

The chapter covers the most common transformations:

File format optimization. Raw data often arrives as CSV, JSON, or XML. These are readable by humans but slow for analytics. Converting to Apache Parquet, a columnar binary format, can dramatically speed up queries and reduce storage costs.

Data standardization. Different source systems call the same thing different names. One system calls it DOB, another calls it dateOfBirth, another calls it birth_date. Dates might be mm/dd/yy or dd/mm/yyyy. Standardization means picking one convention and making everything consistent.
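The DOB example above can be sketched in a few lines. The canonical name `birth_date` and the ISO 8601 target format are my choices for illustration; the point is that one mapping is applied everywhere:

```python
import pandas as pd

# Two source extracts that name and format the same field differently.
src_a = pd.DataFrame({"DOB": ["01/31/1990"], "cust_id": [1]})
src_b = pd.DataFrame({"dateOfBirth": ["1990-01-31"], "cust_id": [2]})

CANONICAL_NAMES = {"DOB": "birth_date", "dateOfBirth": "birth_date"}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns=CANONICAL_NAMES)
    # Normalize every date representation to ISO 8601 (YYYY-MM-DD).
    df["birth_date"] = pd.to_datetime(df["birth_date"]).dt.strftime("%Y-%m-%d")
    return df

combined = pd.concat([standardize(src_a), standardize(src_b)], ignore_index=True)
```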

Data quality checks. Before you trust data, verify it. Are there missing values? Do the numbers make sense? Quality checks catch problems before they pollute downstream analytics.
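A quality check can be as simple as counting the rows that violate your expectations before promoting data to the clean zone. A minimal sketch with invented rules (no nulls, no negative amounts):

```python
import pandas as pd

# Toy batch with one missing and one negative amount.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [19.99, None, -5.0]})

def quality_report(df: pd.DataFrame) -> dict:
    # Each count flags rows that should be quarantined or investigated.
    return {
        "missing_amount": int(df["amount"].isna().sum()),
        "negative_amount": int((df["amount"] < 0).sum()),
    }

report = quality_report(df)
```

In a real pipeline the report would decide whether the batch moves forward or gets routed to a quarantine location for review.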

Data partitioning. When you store data in S3, you can organize it into prefixes by date, region, or other frequently queried fields. When someone queries sales for January 2026, the query engine only reads the January 2026 partition instead of scanning everything. This saves time and money.
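Partitioning in S3 is just a key-naming convention. A sketch of the Hive-style `key=value` prefixes that Athena and Glue recognize automatically (the bucket prefix and file name are hypothetical):

```python
from datetime import date

def partitioned_key(prefix: str, sale_date: date, filename: str) -> str:
    # Hive-style partitions: query engines prune by prefix, so a query
    # filtered to January 2026 only reads objects under year=2026/month=01/.
    return f"{prefix}/year={sale_date.year}/month={sale_date.month:02d}/{filename}"

key = partitioned_key("curated/sales", date(2026, 1, 15), "part-0001.parquet")
# → "curated/sales/year=2026/month=01/part-0001.parquet"
```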

Data denormalization. Relational databases normalize data into many small tables linked by keys. For analytics, joining tables on every query is expensive. Denormalization combines related tables into wider, flatter tables that are faster to query.
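Denormalization is essentially a join performed once, at transform time, instead of on every query. A minimal pandas sketch with invented orders and customers tables:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2], "customer_id": [10, 11], "amount": [19.99, 5.49],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["EU", "US"]})

# One wide table: consumers query it without joining at read time.
flat = orders.merge(customers, on="customer_id", how="left")
```

The trade-off is storage for speed: the wide table repeats customer attributes on every order row, but analytical queries against it are simple scans.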

Data cataloging. Every dataset in the lake should be registered in the data catalog with business metadata: who owns it, where it came from, how sensitive it is. AWS Glue Data Catalog handles this on AWS.
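A catalog entry is mostly metadata. Here is a hypothetical sketch of a Glue table definition; the database name, S3 location, columns, and the owner/PII parameters are all invented for illustration:

```python
# Hypothetical Glue Data Catalog table definition for a curated sales dataset.
table_input = {
    "Name": "sales_curated",
    # Business metadata: ownership and sensitivity travel with the table.
    "Parameters": {"owner": "marketing", "classification": "parquet",
                   "contains_pii": "false"},
    "StorageDescriptor": {
        "Location": "s3://example-lake/curated/sales/",
        "Columns": [{"Name": "order_id", "Type": "bigint"},
                    {"Name": "amount", "Type": "double"}],
    },
    "PartitionKeys": [{"Name": "year", "Type": "string"},
                      {"Name": "month", "Type": "string"}],
}

# Registering it would be (requires AWS credentials and an existing database):
# import boto3
# boto3.client("glue").create_table(DatabaseName="marketing_db",
#                                   TableInput=table_input)
```

In practice a Glue crawler often infers the columns and partitions for you; the business metadata in `Parameters` is what you add by hand.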

On the whiteboard, these transformations map to zones in your data lake. A typical setup has three zones:

  • Landing zone – raw data as ingested, in its original format.
  • Clean zone – data after quality checks, standardization, and format conversion to Parquet.
  • Curated zone – data after denormalization, enrichment, and business-specific transformations, partitioned for efficient querying.
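The three zones above usually map to nothing more exotic than S3 prefixes. A sketch with a hypothetical bucket name:

```python
# Hypothetical lake layout: each zone is just a prefix in one bucket.
ZONES = {
    "landing": "s3://example-lake/landing/",  # raw data, original format
    "clean": "s3://example-lake/clean/",      # validated, standardized, Parquet
    "curated": "s3://example-lake/curated/",  # denormalized, partitioned
}

def zone_path(zone: str, dataset: str) -> str:
    return ZONES[zone] + dataset + "/"
```

Keeping the zones as predictable prefixes makes it easy to grant different IAM permissions per zone and to point Glue jobs at the right inputs and outputs.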

Do You Need a Data Mart?

Many analytics tools like Athena and SageMaker can query data directly from S3. But sometimes you need faster performance or more structured schemas. That is where data marts come in.

A data mart is usually a data warehouse like Amazon Redshift with its own local storage and compute. You load a subset of the curated data from the lake into Redshift for use cases that need low-latency queries across large, heavily joined datasets. Your BI tool then connects to Redshift instead of querying S3 directly.

Not every pipeline needs a data mart. The whiteboarding session should surface whether one is necessary based on performance requirements and how data consumers plan to access the data.

The Hands-On Exercise: Project Bright Light

The chapter wraps up with a great hands-on scenario. A fictional company called GP Widgets Inc. holds a whiteboarding meeting for “Project Bright Light,” an analytics initiative for their marketing team.

The meeting includes VPs, team managers, a data engineer lead named Shilpa, and stakeholders from database administration, web server operations, data analytics, and data science. Over the course of the meeting, Shilpa maps out:

  • Consumers: Marketing specialists wanting real-time dashboards, data analysts running SQL queries on customer behavior, and a data science team building weather-based product popularity models.
  • Sources: On-premises SQL Server databases for customer, product, returns, and sales data. Apache HTTP Server clickstream logs from four web servers. Weather data from AWS Data Exchange.
  • Ingestion: AWS DMS for database replication (daily batch). Kinesis Agent plus Kinesis Firehose for streaming web logs. Direct S3 delivery for weather CSV files.
  • Transformations: Quality checks, Parquet conversion, data standardization into a clean zone. Then denormalization, enrichment with weather data, and partitioning into a curated zone. AWS Glue Data Catalog for metadata throughout.
  • Data marts: Potential use of Redshift for powering the BI dashboards.

The exercise asks you to create your own whiteboard architecture from the meeting notes and then compare it to Shilpa’s. It is a practical way to develop the skill of translating business conversations into technical architecture.

Key Takeaway

Chapter 5 is about something that does not get enough attention in tech: planning before building. The whiteboard session is not a formality. It is where you align stakeholders, surface requirements you did not know about, identify risks, and create a shared understanding of what you are building.

The biggest lesson here is to work backward from the consumer. Start with who needs the data, what they need it for, and how they want to access it. Everything else follows from there.

Next chapter, we start getting our hands dirty with actual data ingestion on AWS.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


