Data Engineering with AWS Chapter 7 Part 1: Transforming Data - The Basics
This is post 11 in my Data Engineering with AWS retelling series.
You have data sitting in your data lake. Raw CSV files, JSON dumps, database extracts. It is all there, technically available, but trying to run analytics on it is painfully slow and expensive. This chapter is about fixing that. Transforming raw data into something optimized, clean, and ready for actual use.
And honestly, this is where data engineering gets fun. You take messy ingredients and turn them into something useful.
Baking a Cake with Data
Gareth Eagar opens the chapter with a cooking analogy that works. Sugar, butter, eggs, and milk are fine on their own. But combine them with flour and baking powder in just the right way, and you get a cake that does not resemble any individual ingredient.
Data works the same way. Individual datasets have some value alone. But combine them through the right transformations and you create something entirely new. A sales dataset joined with weather data might reveal that umbrella sales spike three days before a storm, not during one. That insight did not exist in either dataset alone. And how you combine data depends on the business question, just like the same ingredients can become cake or fried eggs.
The Transformation Engines
Before you transform anything, you need to pick your tool. The chapter covers four main approaches.
Apache Spark is the big one. An in-memory engine that splits datasets across cluster nodes for parallel processing. On AWS, run Spark through AWS Glue (serverless) or Amazon EMR (managed cluster). Spark handles batch, near real-time streaming, SQL queries via Spark SQL, and machine learning with Spark ML.
Hadoop and MapReduce came before Spark and still power many legacy systems. The key difference: Spark processes in memory, MapReduce reads and writes to disk. Spark wins on performance for most use cases. Amazon EMR supports both.
SQL is the most accessible option. If your org is SQL-heavy, it works well. Spark SQL lets you write SQL that runs on the Spark engine. AWS Glue Studio lets users design SQL-based transforms visually while Spark handles execution.
GUI-based tools lower the barrier further. AWS Glue DataBrew offers 250+ built-in transformations through a visual interface. AWS Glue Studio provides drag-and-drop design that generates Spark code. Outside AWS, tools like Informatica, Matillion, and Talend offer similar capabilities.
Why Parquet Changes Everything
The single most impactful transformation you can apply to raw data is converting it from row-based formats like CSV or JSON into Apache Parquet, a columnar file format.
What does columnar mean? In a CSV file, data is stored row by row. To read a single column, the engine has to scan through every row. In a Parquet file, data is grouped by columns. If your query only needs three columns out of fifty, the engine reads just those three column chunks and skips the rest.
The numbers are dramatic. A 1 TB CSV dataset can compress down to roughly 130 GB in Parquet format. That is not a typo. Parquet uses compression algorithms like Snappy to shrink data significantly.
Parquet files also store metadata, including schema information, row counts, and min/max values per column chunk. This is where things get interesting for query costs. Amazon Athena charges about $5 per TB of data scanned. If you query a Parquet file and only need a few columns, you scan way less data. If you just want a row count, Athena reads the metadata without scanning any data at all. Zero bytes scanned. Zero cost.
Even smarter: if your query filters for sales above $10,000, Athena reads each column chunk’s metadata first. If the max value in a chunk is $8,000, Athena skips that chunk entirely. Less data scanned means faster queries and lower bills.
Converting incoming CSV, JSON, and XML files to Parquet format is one of the first transformations you should set up in any data pipeline.
Partitioning Your Data
File format is half the optimization story. The other half is how you organize files in storage.
Hive partitioning splits data from a single table into different folders based on one or more column values. The most common strategy is partitioning by date. Here is what it looks like in S3:
datalake_bucket/year=2021/file1.parquet
datalake_bucket/year=2020/file1.parquet
datalake_bucket/year=2019/file1.parquet
When someone runs a query with WHERE year = 2019, the analytics engine only opens the files in the 2019 folder. Everything else is skipped. For multi-level partitioning, you can go deeper:
datalake_bucket/year=2021/month=6/day=1/file1.parquet
This works great for daily-level queries while still supporting month and year filters.
But there are traps. If you partition by year but most queries filter by business unit, your partitioning does nothing. Queries that do not use partitioned columns may actually run slower because the engine has to navigate through all the partition folders.
Another common mistake is over-partitioning. If you partition by year, month, day, and hour but each partition only has a tiny file, you end up with thousands of small files. The optimal Parquet file size is 128 MB to 1 GB. Having lots of small files creates overhead for opening, reading metadata, and closing each file. That overhead can hurt performance worse than not partitioning at all.
The lesson: partition based on how people actually query the data. If you do not know the query patterns yet, it may be better to skip partitioning until you do.
Protecting PII Data
Before any other transformation runs, data containing personally identifiable information (PII) needs to be dealt with. Columns like names, social security numbers, and email addresses often cannot be stored openly in the data lake due to governance rules.
Common approaches include tokenization and hashing. The idea is to replace PII values with tokens that still allow analytics without exposing the actual personal data. This transformation typically runs in a separate, locked-down zone of the data lake, or even in a separate AWS account, with strict access controls. Once anonymized, the data moves to the general data lake raw zone.
A quick SHA-256 hash might seem like an easy solution, but courts in some countries have ruled it insufficient for privacy compliance. For serious anonymization, purpose-built tools like PKWARE’s PK Privacy running on AWS offer more robust protection.
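To make the keyed-versus-bare distinction concrete, here is a minimal tokenization sketch using Python's standard library. The key value is a hypothetical stand-in for a secret you would keep in a secrets manager, and this is illustration only, not a compliance-grade solution:

```python
# Keyed tokenization sketch: HMAC-SHA256 instead of a bare SHA-256 hash.
import hmac
import hashlib

# Hypothetical key for illustration; in practice, store and rotate this
# in a secrets manager, never in code.
SECRET_KEY = b"hypothetical-key-kept-in-a-secrets-manager"

def tokenize(value: str) -> str:
    """Replace a PII value with a deterministic keyed token.

    A keyed HMAC avoids the main weakness of a bare hash: without the key,
    an attacker cannot precompute hashes of likely inputs (emails, SSNs)
    and match them against the tokens.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always yields the same token, so joins and group-bys still work.
t1 = tokenize("jane.doe@example.com")
t2 = tokenize("jane.doe@example.com")
print(t1 == t2, t1 != "jane.doe@example.com")
```

Determinism is the point: analytics on the tokenized column (counting distinct customers, joining datasets) still works, while the raw value never enters the general data lake.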
Data Cleansing
Format optimization and partitioning change the structure of your data but do not touch the actual values. Data cleansing does. It is about making sure your data is valid, accurate, consistent, complete, and uniform.
Common cleansing tasks include:
- Standardizing column names. One source calls it date_of_birth, another calls it birthdate. Pick one and rename across all datasets.
- Fixing data types. If a column should contain integers but someone entered a string somewhere, replace that value with null so queries do not break.
- Standardizing formats. Dates are the classic problem. Is it MM-DD-YYYY or DD-MM-YYYY? Pick a corporate standard and convert everything.
- Removing duplicates. Streaming data may deliver the same record more than once. Identify and remove or flag duplicates.
- Handling missing values. Replace nulls with averages, medians, empty strings, or just drop the rows entirely, depending on the use case.
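The tasks above can be sketched in plain Python. Everything here, the column names, the date formats, the sample rows, is invented for illustration; a real pipeline would do this in Spark or DataBrew at scale:

```python
# Cleansing sketch: rename columns, standardize dates, fix types, dedupe.
from datetime import datetime

raw = [
    {"date_of_birth": "03-15-1990", "age": "34"},
    {"birthdate": "15-03-1990", "age": "thirty"},  # wrong column name, bad type
    {"date_of_birth": "03-15-1990", "age": "34"},  # exact duplicate of row one
]

def clean(rows):
    seen, out = set(), []
    for row in rows:
        # Standardize column names: birthdate -> date_of_birth.
        if "birthdate" in row:
            row["date_of_birth"] = row.pop("birthdate")
        # Standardize date formats to ISO 8601, trying MM-DD-YYYY then DD-MM-YYYY.
        for fmt in ("%m-%d-%Y", "%d-%m-%Y"):
            try:
                row["date_of_birth"] = (
                    datetime.strptime(row["date_of_birth"], fmt).date().isoformat()
                )
                break
            except ValueError:
                continue
        # Fix data types: a non-integer age becomes null instead of breaking queries.
        try:
            row["age"] = int(row["age"])
        except ValueError:
            row["age"] = None
        # Remove exact duplicates.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

cleaned = clean(raw)
print(cleaned)
```

Note the order of operations matters: deduplication runs last, so two rows that only become identical after standardization are still caught.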
Data from relational databases tends to arrive fairly clean. Data from web forms, surveys, manually entered fields, or IoT sensors is another story. That stuff needs serious scrubbing.
Within AWS, Glue DataBrew provides over 250 built-in cleansing transformations through a visual interface, covering the most common data quality tasks without writing code.
Key Takeaway
The transformations covered here, converting to Parquet, partitioning, PII protection, and data cleansing, are the generic, foundational ones. They do not require deep knowledge of the business use case. They prepare individual datasets so that downstream, more business-specific transformations can work efficiently.
Think of it as prep work in a kitchen. Before you start following a recipe, you wash the vegetables, sharpen the knives, and preheat the oven. These data preparation transforms are the same idea. They get your data into a state where the real magic can happen.
In Part 2, we get into the business-specific transforms: denormalization, enrichment, pre-aggregation, handling CDC data, and a hands-on exercise with AWS Glue Studio.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3
Previous: Chapter 6 Part 2 - Ingesting Streaming Data Next: Chapter 7 Part 2 - Optimization and Business Logic