Data Engineering with AWS Chapter 7 Part 2: Transforming Data - Optimization and Business Logic

This is post 12 in my Data Engineering with AWS retelling series.

In Part 1, we covered the generic data preparation transforms: converting to Parquet, partitioning, PII protection, and data cleansing. Those transforms work on individual datasets and do not need much business context. Now we get to the transforms that actually create business value: the ones that combine multiple datasets, add context, flatten structures, and produce the tables that analysts and dashboards consume.

This is where data engineering stops being plumbing and starts being architecture.

Data Denormalization

Relational databases split data into many small, focused tables linked by foreign keys. A customer table holds customer info. A salesperson table holds salesperson info. To find which salesperson supports a customer, you join the two tables on sales_person_id. Great for transactional systems. Terrible for analytics when you are joining millions of rows across multiple tables.

Denormalization pre-joins tables into wider, flatter structures. Instead of joining two tables on every query, you create one table with all the fields already combined. One query, no joins. An analytics project might have tens or hundreds of these transforms. The key is understanding how the data will be used, then determining which joins make sense.
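As a minimal sketch of the idea, here is the customer/salesperson example using Python's built-in sqlite3 module. The table and column names are hypothetical; the point is that the join is materialized once into a wide table, so reads never repeat it.

```python
import sqlite3

# Hypothetical normalized schema: customers reference salespeople by ID.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE salesperson (sales_person_id INTEGER PRIMARY KEY, sp_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, cust_name TEXT,
                           sales_person_id INTEGER);
    INSERT INTO salesperson VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO customer VALUES (10, 'Acme', 1), (11, 'Globex', 2);
""")

# Denormalize once: materialize the join into a wide table so analytics
# queries never have to repeat it.
con.execute("""
    CREATE TABLE customer_denorm AS
    SELECT c.customer_id, c.cust_name, s.sales_person_id, s.sp_name
    FROM customer c
    LEFT JOIN salesperson s ON c.sales_person_id = s.sales_person_id
""")

# One query, no joins at read time.
rows = con.execute(
    "SELECT cust_name, sp_name FROM customer_denorm ORDER BY customer_id"
).fetchall()
print(rows)  # [('Acme', 'Ada'), ('Globex', 'Grace')]
```

In a real pipeline the same shape of transform would run in Spark or Glue over Parquet files rather than SQLite, but the trade-off is identical: pay the join cost once at write time instead of on every analytical query.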

Enriching Data with Third-Party Sources

Your organization’s data is valuable. But combining it with external data can make it dramatically more valuable.

A company that markets credit cards might purchase a database of consumer credit scores and match it against their customer records. A retailer that knows weather affects sales might subscribe to historical and forecast weather data. These enrichment joins combine company-owned datasets with third-party data to create richer, more insightful tables.

AWS Data Exchange is a marketplace with over 1,000 datasets available via subscription, including both paid and free options. Once you subscribe, data loads directly into your S3 landing zone through the Data Exchange API. From there, a transformation pipeline joins the external data with your internal datasets.
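The enrichment join itself is simple once the external data has landed. Here is a sketch with hypothetical field names, assuming the purchased weather dataset can be keyed by the same date column as the internal sales records:

```python
# Hypothetical internal sales records and third-party weather feed,
# both keyed by date (the field names are illustrative).
daily_sales = [
    {"date": "2024-06-01", "store": "S1", "revenue": 12000.0},
    {"date": "2024-06-02", "store": "S1", "revenue": 8500.0},
]
weather = {  # index the purchased dataset by its join key
    "2024-06-01": {"high_temp_f": 92, "precip_in": 0.0},
    "2024-06-02": {"high_temp_f": 71, "precip_in": 1.3},
}

# Enrichment join: attach the external columns to each internal record.
enriched = [{**row, **weather.get(row["date"], {})} for row in daily_sales]
print(enriched[0]["high_temp_f"])  # 92
```

Now an analyst can correlate revenue with temperature or precipitation directly, without knowing the data came from two different owners.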

Pre-Aggregating for Performance

Data lakes are great for storing detailed, granular data cheaply. You can keep years of raw transaction records and run ad-hoc queries whenever a new question comes up. But when the business has specific questions they ask repeatedly, running heavy computations on raw data every single time is wasteful.

Pre-aggregation solves this. You create transform jobs that run on a schedule, maybe nightly, to compute summary tables. For example, your denormalized sales table might have tens of millions of rows per day with hundreds of columns. A pre-aggregation job could compute sales by category at the store, city, and state level and write those results to new, much smaller tables.

Now hundreds of store managers can pull up their dashboards and see category-level sales without each dashboard query crunching through millions of raw rows. The heavy computation happens once during the nightly batch job.
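The core of such a pre-aggregation job is just a grouped sum at each rollup level. A toy sketch, with hypothetical columns standing in for the millions of real rows:

```python
from collections import defaultdict

# Hypothetical denormalized sales rows (in practice: millions per day).
sales = [
    {"store": "S1", "city": "Austin", "category": "toys",    "amount": 20.0},
    {"store": "S1", "city": "Austin", "category": "toys",    "amount": 35.0},
    {"store": "S2", "city": "Austin", "category": "grocery", "amount": 12.5},
]

# Nightly pre-aggregation: compute each rollup level once and write the
# results out as small summary tables.
by_store_cat = defaultdict(float)
by_city_cat = defaultdict(float)
for row in sales:
    by_store_cat[(row["store"], row["category"])] += row["amount"]
    by_city_cat[(row["city"], row["category"])] += row["amount"]

print(by_store_cat[("S1", "toys")])        # 55.0
print(by_city_cat[("Austin", "grocery")])  # 12.5
```

Each dashboard then reads the small summary table for its grouping level instead of scanning the raw transactions.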

Extracting Value from Unstructured Data

Data lakes also hold unstructured data like images, audio files, and video. You cannot run SQL against a photo. But you can use AI services to extract structured metadata from unstructured files.

A real estate company could use Amazon Rekognition to classify room types in property photos. A company recording customer service calls could use Amazon Transcribe for transcripts, then Amazon Comprehend for sentiment analysis. Join sentiment data with customer records and you have a targeted marketing list. AI turns seemingly useless files into valuable structured datasets.
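The final join in that call-center pipeline is ordinary structured-data work once the AI services have done their part. A sketch, assuming per-call results shaped like Amazon Comprehend's sentiment output and already matched to a customer ID (the IDs and emails are invented):

```python
# Hypothetical per-call results shaped like an Amazon Comprehend
# sentiment response, already matched to a customer_id.
call_sentiments = [
    {"customer_id": 101, "Sentiment": "NEGATIVE"},
    {"customer_id": 102, "Sentiment": "POSITIVE"},
    {"customer_id": 103, "Sentiment": "NEGATIVE"},
]
customers = {101: "a@example.com", 102: "b@example.com", 103: "c@example.com"}

# Join sentiment with customer records to build a targeted outreach list,
# e.g. customers whose last call went badly.
outreach = [
    customers[r["customer_id"]]
    for r in call_sentiments
    if r["Sentiment"] == "NEGATIVE"
]
print(outreach)  # ['a@example.com', 'c@example.com']
```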

The Challenge of Change Data Capture

One of the hardest problems in data lake management is handling updates. When you replicate a database into a data lake using AWS DMS (Database Migration Service), the initial load is straightforward. But after that, DMS captures ongoing changes: inserts (I), updates (U), and deletes (D). Each change is written to S3 as a CDC file.

Here is the problem. S3 is object storage. You can delete and replace entire files, but you cannot edit a single row inside a file. If you just append CDC records to the existing data, you end up with multiple copies of the same record reflecting different points in time.

Traditional approach one: upserts. A Spark job reads the existing dataset, reads the CDC data, merges them with custom logic, and overwrites the original. Works, but gets expensive as datasets grow.
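The merge logic at the heart of that Spark job can be sketched in a few lines. This toy version uses plain dicts, with CDC records carrying an Op column (I/U/D) in change order, similar to what DMS emits; the row shapes are hypothetical:

```python
# Hypothetical base dataset keyed by id, plus CDC records with an Op
# column (I/U/D) in change-sequence order, as DMS would emit them.
base = {1: {"id": 1, "name": "Ada"}, 2: {"id": 2, "name": "Grace"}}
cdc = [
    {"Op": "U", "id": 2, "name": "Grace Hopper"},
    {"Op": "I", "id": 3, "name": "Edsger"},
    {"Op": "D", "id": 1, "name": None},
]

# Upsert merge: apply each change in order, then overwrite the base
# dataset with the result (S3 cannot edit rows in place).
merged = dict(base)
for rec in cdc:
    if rec["Op"] == "D":
        merged.pop(rec["id"], None)
    else:  # I and U both upsert the latest image of the row
        merged[rec["id"]] = {"id": rec["id"], "name": rec["name"]}

print(sorted(merged))  # [2, 3]
```

The expensive part in production is not this logic but the rewrite: the job must read and write the full dataset (or at least the affected partitions) on every run, which is why cost grows with data volume.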

Traditional approach two: SQL views. An Athena view combines the base table and CDC table on the fly, returning the latest state at query time. Avoids nightly merges but performance degrades as the CDC table grows.

Both require custom solutions per organization, which is why transactional data lake formats emerged.

Transactional Data Lakes

Over the past few years, several projects have brought ACID properties to data lakes, properties that were previously only available in traditional databases:

  • Atomicity – writes are all-or-nothing
  • Consistency – failures do not leave data in a broken state
  • Isolation – concurrent transactions do not interfere with each other
  • Durability – completed transactions survive system failures

AWS Lake Formation governed tables were announced in December 2020, providing ACID transactions for S3-based data lakes. Lake Formation handles the complexity of concurrent readers and writers, plus automatic file compaction.

Apache Hudi originated at Uber for their massive data lake and became a top-level Apache project in 2020. It supports efficient upserts and low-latency queries against updated data. Runs on Amazon EMR.

Apache Iceberg was created by engineers at Netflix and Apple. It supports schema evolution, time travel queries, atomic table changes, and concurrent writers. Became a top-level Apache project in 2020.

Databricks Delta Lake is an open-format storage layer from the creators of Apache Spark. It provides ACID transactions, time travel, and support for both streaming and batch operations. Available as both open source and a commercial offering.

These solutions do not replace traditional databases. You are not going to run your CRM on a data lake. But they dramatically simplify the process of applying updates and deletes to analytical datasets, something that used to require painful custom engineering.

Hands-On: Denormalization with AWS Glue Studio

The chapter wraps up with a practical exercise that ties the concepts together. You build two Glue transformation jobs that take normalized film data (migrated from a MySQL database in Chapter 6) and streaming data, then produce denormalized tables ready for analytics.

Setting up the curated zone. Create an S3 bucket dataeng-curated-zone-<initials> and Glue database curatedzonedb. This is the third zone: landing (raw), clean (format-optimized), and now curated (business-ready).

Creating an IAM role. The Glue job needs read access to landing and clean zone buckets, write access to curated zone, plus Glue temp directories and CloudWatch logs. Custom policy plus the managed AWSGlueServiceRole.

Job 1: Film Category Denormalization. The goal is to flatten three normalized tables into one. The film table has movie details but no category info. The category table has category names. The film_category linking table connects films to categories. Using Glue Studio’s visual designer, you:

  1. Add three S3 data sources (film, film_category, category)
  2. Join film with film_category on film_id using a left join
  3. Apply a mapping to drop irrelevant columns like rental_duration and rental_rate
  4. Join the result with the category table on category_id
  5. Apply another mapping to drop duplicates and rename name to category_name
  6. Write the output to S3 as Parquet with Snappy compression
  7. Configure Glue to create a new table in curatedzonedb

The job runs on 2 workers, the minimum. Output lands in s3://dataeng-curated-zone-<initials>/filmdb/film_category/.
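Stripped of the Glue machinery, the transform those visual steps generate is two left joins plus a column mapping. A miniature sketch with invented sample rows (the real tables come from the Chapter 6 MySQL migration):

```python
# Hypothetical miniature versions of the three normalized film tables.
film = [{"film_id": 1, "title": "ALIEN", "rental_rate": 2.99}]
film_category = [{"film_id": 1, "category_id": 6}]
category = [{"category_id": 6, "name": "Horror"}]

# Index the lookup tables by their join keys.
cat_by_id = {c["category_id"]: c["name"] for c in category}
link_by_film = {l["film_id"]: l["category_id"] for l in film_category}

# Left-join film -> film_category -> category, drop the rental columns,
# and rename `name` to `category_name` (mirroring the mapping steps).
denorm = []
for f in film:
    cat_id = link_by_film.get(f["film_id"])
    denorm.append({
        "film_id": f["film_id"],
        "title": f["title"],
        "category_name": cat_by_id.get(cat_id),
    })

print(denorm)  # [{'film_id': 1, 'title': 'ALIEN', 'category_name': 'Horror'}]
```

Glue Studio generates the equivalent Spark code for you; the value of the visual designer is that the seven steps above stay readable as a diagram rather than a script.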

Job 2: Streaming Data Film Enrichment. This job joins the newly created denormalized film catalog with streaming data showing which movies viewers watched. You join on film_id and write the result as Parquet to s3://dataeng-curated-zone-<initials>/streaming/streaming-films/.

The result is a single table containing every stream event enriched with film details and category names. You can now query which movie categories are most popular, break it down by state, and analyze viewing patterns without any runtime joins.

Key Takeaway

Chapter 7 completes the transformation layer of the data pipeline. The flow is: raw data lands in the landing zone, gets format-optimized and cleansed in the clean zone, then gets denormalized and enriched in the curated zone. Each step adds value and brings the data closer to being useful for actual business decisions.

The biggest lesson: transformations split into two distinct categories. Generic ones like Parquet conversion can be applied without business context. Business-specific ones like denormalization require deep understanding of what questions the organization is trying to answer. Skip either and you pay the price in slow queries or analyst frustration.

Next chapter, we look at the other end of the pipeline: the data consumers who actually use everything we have built.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


