Data Engineering with AWS Chapter 3 Part 2: The AWS Toolkit - Analytics and Processing
In Part 1 we covered how data gets into AWS. Now comes the good part: what do you actually do with it once it is there? This post covers the services for transforming raw data, orchestrating multi-step pipelines, and letting people query and visualize the results.
This is post 5 in my Data Engineering with AWS retelling series.
Transforming Data: Making the Raw Stuff Useful
Some ingestion tools like DMS and Kinesis Firehose can do light transformations (converting to Parquet, for example). But real data pipelines need heavier processing – cleaning, joining, aggregating, reshaping. AWS has two main services for this.
AWS Lambda: Quick and Serverless
AWS Lambda lets you run code without managing any servers. You write a function, define a trigger (like “a new file appeared in this S3 bucket”), and Lambda runs your code automatically. You pay only for the time your code actually executes, billed by the millisecond.
For data engineering, Lambda is perfect for lightweight tasks:
- Validate that an incoming CSV file is properly formatted
- Unzip a file and process each file inside it
- Convert CSV to Parquet and update a catalog entry
- Run a calculation on incoming data and update a database
Lambda supports up to 15 minutes of execution time and 10 GB of memory. It is also massively parallel. If 500 files land in your S3 bucket at the same time, Lambda spins up 500 separate instances to process them all simultaneously. The default concurrency limit is 1,000 per region, but you can request increases into the hundreds of thousands.
Lambda supports several runtimes, including Python, which has become the go-to language for data engineering work.
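As a sketch of the first task in the list above — validating an incoming CSV — a Lambda handler might look like the following. This is illustrative only: the expected column names are made up, and the handler assumes the standard S3 put-event shape.

```python
import csv
import io

# Columns we expect in incoming files -- an illustrative assumption.
EXPECTED_COLUMNS = ["order_id", "customer_id", "amount"]

def validate_csv(body: str) -> bool:
    """Return True if the CSV text has the expected header and every
    data row has the right number of fields."""
    reader = csv.reader(io.StringIO(body))
    try:
        header = next(reader)
    except StopIteration:
        return False  # empty file
    if header != EXPECTED_COLUMNS:
        return False
    return all(len(row) == len(EXPECTED_COLUMNS) for row in reader)

def lambda_handler(event, context):
    # S3 put events carry the bucket and key of each new object.
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    results = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        results[key] = validate_csv(body)
    return results
```

Because validation is a pure function, it can be unit-tested locally without any AWS setup — only the handler itself needs the S3 client.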
AWS Glue: The Swiss Army Knife
AWS Glue is actually three things bundled together, and understanding each piece matters.
Glue ETL Engine
At its core, Glue gives you a serverless environment for running data transformations. You get two options:
- Glue Python Shell – a single-node Python environment. Good for small to medium datasets where you do not need distributed processing.
- Glue Spark – a multi-node Apache Spark cluster. Spark splits your data across multiple machines and processes it all in memory, which makes it extremely fast for large datasets.
You do not manage any servers. You just tell Glue how many Data Processing Units (DPUs) you want, submit your code, and it runs. You pay for DPUs multiplied by execution time.
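For a sense of scale: a Glue Python Shell job is just a Python script, so a small cleanup job can be little more than a pandas transformation like this one (the column names and the S3 path in the comment are made-up placeholders):

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows and normalize types -- the kind of light
    transformation a single-node Python Shell job handles well."""
    out = df.dropna(subset=["order_id", "amount"]).copy()
    out["amount"] = out["amount"].astype(float)
    out["order_id"] = out["order_id"].astype(int)
    return out

if __name__ == "__main__":
    # In a real job you would read from and write back to S3, e.g.:
    #   df = pd.read_csv("s3://my-raw-bucket/orders.csv")  # hypothetical path
    df = pd.DataFrame({"order_id": [1, 2, None], "amount": ["9.99", "5.00", "1.00"]})
    print(clean_orders(df))
```

If the same logic needed to run over terabytes instead of megabytes, that is the cue to move from Python Shell to a Glue Spark job.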
Glue Data Catalog
The Data Catalog is a metadata store that gives you a logical view of your data. Imagine DMS replicates your HR database to S3. You end up with a bunch of CSV files scattered across S3 prefixes. The Glue Data Catalog organizes all of this into databases and tables, just like a traditional database. It stores column names, data types, and the S3 location where the actual data lives.
The real power: once data is cataloged, services like Athena, EMR, and Glue ETL can all reference those tables directly. The catalog is Hive metastore-compatible, which means it works with a wide ecosystem of tools beyond just AWS.
Glue Crawlers
Do not want to manually catalog everything? Glue Crawlers can scan your S3 data, figure out the file format (CSV, Parquet, JSON), infer the schema (column names and types), and automatically populate the Data Catalog. Point a crawler at your S3 path, run it, and your data shows up as a queryable table.
You can also add tables manually using the Glue API or SQL statements in Athena. Crawlers are convenient, not required.
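To illustrate the manual route, here is a hedged sketch of registering a CSV table through the Glue API. The database name, S3 location, and columns are placeholders; the final `create_table` call needs AWS credentials, so it is shown commented out.

```python
def build_table_input(table_name, s3_location, columns):
    """Assemble the TableInput structure that glue.create_table expects
    for a comma-delimited CSV table stored in S3."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }

table_input = build_table_input(
    "employees",                        # hypothetical table
    "s3://my-data-lake/hr/employees/",  # hypothetical location
    [("employee_id", "bigint"), ("name", "string"), ("dept", "string")],
)

# With credentials configured, this registers the table in the catalog:
# import boto3
# boto3.client("glue").create_table(DatabaseName="hr", TableInput=table_input)
```

The Hive SerDe and input/output format class names are what make the entry readable by the broader Hive-compatible ecosystem mentioned above.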
Amazon EMR: Full Hadoop Ecosystem
Amazon EMR (Elastic MapReduce) is a managed platform for running big data frameworks like Apache Spark, Hive, HBase, Presto, and Pig. If Glue is the easy-mode Spark environment, EMR is the full-control version.
So why do both exist? It comes down to control vs. convenience:
| | AWS Glue | Amazon EMR |
|---|---|---|
| Setup | Serverless, minimal config | You configure the cluster |
| Cost | Higher per-unit, less management | Lower per-unit, more management |
| Tuning | Limited options | Full control over Spark settings |
| Frameworks | Spark and Python only | Spark, Hive, Presto, HBase, Pig, and more |
| Best for | Simple Spark jobs, quick ETL | Complex workloads, existing Hadoop teams |
If your team knows Spark inside and out and needs to fine-tune everything, go with EMR. If you just want to run some Spark code and get results, Glue will save you a lot of headaches.
Orchestrating Pipelines: Making It All Work Together
Real data pipelines have many steps: ingest, crawl, transform, catalog again, validate, load. You need something to coordinate all of this. AWS gives you three main options.
Glue Workflows
If your entire pipeline uses only Glue components (crawlers and ETL jobs), Glue Workflows keeps things simple. You define an ordered sequence – run a crawler, then run a Spark job, then run another crawler. Each step can pass state information to the next one. But it only works with Glue components, nothing else.
AWS Step Functions
Step Functions is the general-purpose orchestrator. It is serverless and can integrate with almost any AWS service. You define workflows (state machines) using JSON — the Amazon States Language — or a visual drag-and-drop editor. Your workflow can:
- Run Lambda functions
- Trigger Glue jobs
- Make choices based on results (if job failed, go to error handler)
- Wait for a period before continuing
- Loop back to previous steps
A common pattern: S3 upload triggers Step Functions, which runs a Glue job to convert CSV to Parquet, checks if it succeeded, runs a Glue Crawler to catalog the output, and sends a notification if anything fails. Billing is consumption-based, so you only pay when workflows actually run.
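That pattern can be sketched as an Amazon States Language document — here built as a Python dict for readability. The job name, crawler name, and SNS topic ARN are hypothetical placeholders:

```python
import json

# Minimal state machine for the pattern above: run a Glue job, crawl the
# output if it succeeds, publish a failure notification otherwise.
state_machine = {
    "StartAt": "ConvertCsvToParquet",
    "States": {
        "ConvertCsvToParquet": {
            "Type": "Task",
            # .sync makes Step Functions wait for the Glue job to finish
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "csv-to-parquet"},
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "CrawlOutput",
        },
        "CrawlOutput": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "clean-zone-crawler"},
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
                "Message": "CSV-to-Parquet pipeline failed",
            },
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

Note how the error branch is expressed declaratively with `Catch` rather than in application code — that is the main appeal of state machines for pipelines.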
Amazon MWAA: Managed Apache Airflow
Managed Workflows for Apache Airflow (MWAA) is for teams that already know and love Airflow. Apache Airflow was created at Airbnb in 2014 and became a top-level Apache project in 2019. It lets you define pipelines as Python code and provides a web UI for monitoring.
MWAA handles the deployment, scaling, and upgrades. But unlike Step Functions, it is not serverless. You pick an environment size (small, medium, large) and pay a base monthly fee whether you run one job or a thousand. Additional workers auto-scale as needed.
When to use MWAA: Your team already uses Airflow, or you need the wide range of third-party integrations Airflow supports (AWS, Azure, GCP, and more).
When to use Step Functions: You are building something new and want pay-per-use pricing with deep AWS integration.
Consuming Data: Getting Answers Out
After all that ingesting, transforming, and orchestrating, someone actually needs to use the data. AWS offers different tools for different types of consumers.
Amazon Athena: SQL on Your Data Lake
Amazon Athena is serverless SQL for your data lake. Once your data is in S3 and cataloged in the Glue Data Catalog, anyone can run SQL queries against it without setting up any infrastructure. No databases to manage, no servers to provision. Just write SQL and get results.
Athena connects through JDBC or ODBC drivers, so you can use it with tools like SQL Workbench or any BI tool that supports these standard connections.
The killer feature is Athena Federated Query. It lets you write a single SQL statement that queries data from S3, DynamoDB, PostgreSQL, and CloudWatch Logs all at once. One query, multiple data sources.
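Because Athena is API-driven, queries can also be submitted programmatically. Below is a hedged sketch of building the request for boto3's Athena client — the database, table, and results bucket are placeholders, and the actual submit-and-poll calls are shown commented out since they require AWS credentials:

```python
def build_athena_request(sql, database, output_s3):
    """Assemble the keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

request = build_athena_request(
    "SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept",
    database="hr",                        # hypothetical catalog database
    output_s3="s3://my-athena-results/",  # hypothetical results bucket
)

# With credentials configured:
# import boto3
# athena = boto3.client("athena")
# qid = athena.start_query_execution(**request)["QueryExecutionId"]
# ...poll athena.get_query_execution(QueryExecutionId=qid) until the
# state is SUCCEEDED, then fetch rows with
# athena.get_query_results(QueryExecutionId=qid)
```

The `OutputLocation` matters: Athena always writes query results back to S3, so every query needs a results bucket.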
Amazon Redshift: The Data Warehouse
Amazon Redshift is AWS’s cloud data warehouse, launched in 2012 and still one of their most popular services. It is built for OLAP (Online Analytical Processing) workloads – the kind of queries that scan millions of rows, join multiple tables, and aggregate results.
Think questions like: “What was the average sale amount by ZIP code last month?” or “Which products saw a 20% increase in sales between Q4 and Q1?” Redshift’s clustered architecture distributes the work across multiple compute nodes to answer these queries fast.
A common pattern is to load the last 12 months of data into Redshift for fast queries, while keeping historical data in S3. Redshift Spectrum bridges the gap – it lets you write a single query that hits both Redshift tables and S3 data lake tables (via the Glue Data Catalog). The Spectrum layer uses thousands of worker nodes to scan, filter, and aggregate S3 data, then streams results back to your Redshift cluster for final processing.
Amazon QuickSight: Dashboards and Visuals
Most business users do not want to write SQL. They want charts. Amazon QuickSight is the AWS visualization service that turns data into interactive dashboards, bar graphs, and drill-down reports.
A sales manager can glance at a dashboard to compare quarterly performance across territories, filter by segment, and drill down to monthly detail. QuickSight is serverless and charges a simple per-user monthly fee (authors who create visuals, and readers who view them).
QuickSight can pull data from Athena, Redshift, S3, and other sources, making it the presentation layer on top of everything we have built.
Hands-On: Lambda Converting CSV to Parquet
The chapter ends with a practical exercise that ties several services together. Here is the flow:
- Create two S3 buckets: a landing zone (raw files) and a clean zone (processed files)
- Create a Lambda function that uses the AWS Data Wrangler library (a Python library from AWS, since renamed the AWS SDK for pandas, that simplifies common ETL tasks)
- Configure an S3 trigger so the Lambda fires whenever a .csv file lands in the landing zone
- The Lambda reads the CSV, converts it to Parquet, writes it to the clean zone, and registers it in the Glue Data Catalog
The core Lambda code looks like this:
```python
import awswrangler as wr

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Derive database and table names from the S3 path
        key_list = key.split("/")
        db_name = key_list[-3]
        table_name = key_list[-2]

        input_path = f"s3://{bucket}/{key}"
        output_path = f"s3://your-clean-zone-bucket/{db_name}/{table_name}"

        # Read CSV into a DataFrame
        input_df = wr.s3.read_csv([input_path])

        # Create Glue database if it does not exist
        current_databases = wr.catalog.databases()
        if db_name not in current_databases.values:
            wr.catalog.create_database(db_name)

        # Write as Parquet and register in Glue Catalog
        result = wr.s3.to_parquet(
            df=input_df,
            path=output_path,
            dataset=True,
            database=db_name,
            table=table_name,
            mode="append"
        )
    return result
```
You can test it by uploading a simple CSV:
```shell
aws s3 cp test.csv s3://your-landing-zone-bucket/testdb/csvparquet/test.csv
```
If everything is wired correctly, a Parquet file appears in your clean zone and a new table shows up in the Glue Data Catalog. That is a working mini-pipeline.
The Big Picture
Chapter 3 covered a lot of ground. Here is the mental model to take away:
Ingest (DMS, Kinesis, AppFlow, Transfer Family, DataSync, Snow) → Transform (Lambda, Glue, EMR) → Orchestrate (Glue Workflows, Step Functions, MWAA) → Consume (Athena, Redshift, QuickSight)
Every real pipeline is some combination of these layers. The next chapter dives into something that touches every layer: data cataloging, security, and governance. Not the most exciting topic on paper, but absolutely critical to getting data engineering right.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3
Previous: Chapter 3 Part 1 - Storage and Databases Next: Chapter 4 Part 1 - Data Cataloging and Security