Data Engineering with AWS Chapter 14: Wrapping Up the Learning Journey

This is post 20 in my Data Engineering with AWS retelling series.

We made it. Thirteen chapters of pipelines, transforms, orchestration, security, querying, visualization, and machine learning. Chapter 14 is the final chapter. It does not introduce a new AWS service. Instead, it zooms out. Way out. It looks at how data engineering works in the real world, shows you case studies from Spotify and Netflix, and points you toward where the field is heading next.

Think of it as the conversation you would have with a senior engineer after your first year on the job. The tools matter. But the processes, the patterns, and the ability to keep learning matter more.

The Big Picture: Teams, Environments, and DataOps

Throughout this book, we worked as a single person clicking buttons in one AWS account. Real companies do not operate that way. Data engineering teams have multiple people working on different parts of the pipeline, and most organizations run at least three environments: development, test (QA), and production.

Managing all of this by hand is a recipe for disaster. That is where DataOps comes in.

DataOps borrows heavily from DevOps. If you have worked in software engineering, the ideas will feel familiar. The core pieces are:

  • Infrastructure as Code (IaC). Instead of clicking through the AWS console, you write template files (YAML or JSON) that describe everything you want to deploy. AWS CloudFormation reads those templates and builds the infrastructure. Same template, same result, every time.

  • Source control. All your code – transformation scripts, orchestration definitions, infrastructure templates – lives in a repository like AWS CodeCommit, GitLab, or Bitbucket. Every change is tracked.

  • CI/CD (Continuous Integration / Continuous Delivery). When someone commits new code, automated tests run immediately. If they pass, the code gets deployed to the test environment. More tests run. If those pass too, someone approves deployment to production. If anything breaks, automated rollback kicks in.
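
To make the Infrastructure as Code idea concrete, here is a minimal sketch of a single CloudFormation template, written as a Python dict and rendered to JSON. The bucket name prefix and the `Environment` parameter are my own illustrative choices, not something taken from the book. The point is that one template, parameterized by environment, can describe the same resource for development, test, and production.

```python
import json

# One template for every environment. The parameter decides which copy you get.
# "example-datalake-landing" is a hypothetical prefix chosen for illustration.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Landing-zone bucket, deployed once per environment",
    "Parameters": {
        "Environment": {
            "Type": "String",
            "AllowedValues": ["dev", "test", "prod"],
        }
    },
    "Resources": {
        "LandingZoneBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                # Fn::Sub substitutes the parameter value at deploy time.
                "BucketName": {"Fn::Sub": "example-datalake-landing-${Environment}"},
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

# This JSON text is what you would commit to source control.
template_json = json.dumps(TEMPLATE, indent=2)
print(template_json)
```

Because the template is just a text file, it goes through the same review, versioning, and automated testing as any other code.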

The opposite of DataOps is someone manually copying files to production at 2 AM. DataOps exists so we never have to do that again.
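
The CI/CD flow described above is really a sequence of gates. The sketch below models those gates in plain Python so the order is easy to see. The function name, the check callables, and the log messages are all hypothetical; a real pipeline would be defined in a CI/CD service rather than in a script like this.

```python
from typing import Callable, List

Check = Callable[[str], bool]

def promote(commit: str, unit_tests: Check, integration_tests: Check,
            approved: Check, healthy_in_prod: Check) -> List[str]:
    """Walk one commit through the gates and return a log of what happened."""
    log: List[str] = []
    if not unit_tests(commit):
        return log + ["unit tests failed, nothing deployed"]
    log.append("deployed to test")
    if not integration_tests(commit):
        return log + ["integration tests failed, stopped before production"]
    if not approved(commit):
        return log + ["waiting for approval"]
    log.append("deployed to production")
    if not healthy_in_prod(commit):
        # Automated rollback: production returns to the previous version.
        log.append("rolled back production")
    return log

ok = lambda commit: True    # stand-in for a check that passes
bad = lambda commit: False  # stand-in for a check that fails

print(promote("abc123", ok, ok, ok, ok))
print(promote("abc123", ok, ok, ok, bad))
```

Every failure path stops before the next environment is touched, which is the whole reason for having gates.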

Real-World Pipeline: Spotify Wrapped

Every December, Spotify shows you your year in review: top artist, top track, top genre, total minutes streamed. In 2019, they went bigger – a full decade of listening history for 248 million monthly active users. That is a lot of data to crunch.

They had learned hard lessons from 2018, when they needed to work closely with Google (their cloud provider) just to handle the scale. For 2019, with ten times the data scope, they rethought the architecture.

The key insight was treating each statistic as a separate data story. Your top artist is one story, your top track is another. Each became its own decoupled job that could run independently and in parallel.

For storage, they used Google Cloud Bigtable (similar to Amazon DynamoDB). Every user had a single row with columns for each data story for each year of the decade. When a job finished computing “top artist for 2015,” it wrote to the correct column in the correct row. Once all jobs finished, the full picture was ready to serve.
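
Here is a small sketch of that pattern, assuming an in-memory dict as a stand-in for the wide table and a handful of made-up play events. None of the names below come from Spotify; they only illustrate how independent jobs can each fill in their own columns of a shared per-user row.

```python
import threading
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical play events: (user_id, year, artist, track)
PLAYS = [
    ("u1", 2015, "Artist A", "Track 1"),
    ("u1", 2015, "Artist A", "Track 1"),
    ("u1", 2015, "Artist B", "Track 3"),
    ("u1", 2016, "Artist B", "Track 3"),
    ("u2", 2015, "Artist C", "Track 4"),
]

table = {}  # stand-in for the wide table: one row (dict) per user
_lock = threading.Lock()

def write_cell(user: str, column: str, value: str) -> None:
    """Write one column of one user's row."""
    with _lock:
        table.setdefault(user, {})[column] = value

def run_story(story: str, field: int) -> None:
    """Compute one statistic per (user, year) and write only that story's columns."""
    counts = defaultdict(Counter)
    for play in PLAYS:
        counts[(play[0], play[1])][play[field]] += 1
    for (user, year), counter in counts.items():
        write_cell(user, f"{story}:{year}", counter.most_common(1)[0][0])

# Each data story is its own job. They never touch each other's columns.
with ThreadPoolExecutor() as pool:
    jobs = [pool.submit(run_story, "top_artist", 2),
            pool.submit(run_story, "top_track", 3)]
    for job in jobs:
        job.result()

print(table["u1"])
```

If one story fails or needs a rerun, only its columns are affected. The other stories stay as they are.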

Three lessons from this:

  1. Iterate on your architecture. What worked last year might not work this year. Spotify rewrote their approach almost every year based on what they learned.
  2. Break big jobs into small, modular pieces. One giant job that does everything is fragile. Many small, decoupled jobs are easier to debug, scale, and run in parallel.
  3. Use the right tool for the job. A NoSQL database was the right fit here, even though most of the book focused on data lakes and warehouses.

Real-World Pipeline: Netflix VPC Flow Logs

Netflix runs on AWS with over 200 million subscribers and countless microservices. They need to understand how network traffic flows between all their systems. AWS provides VPC Flow Logs for this – records of network communication between interfaces in a Virtual Private Cloud.

The problem? AWS uses dynamic IP addresses. Raw flow logs just show “IP A talked to IP B.” Without knowing which application owned each IP at that moment, the logs are nearly useless. Netflix built an internal system called Sonar that tracks IP address changes, so they could enrich every flow log with application metadata.
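
The enrichment step boils down to a lookup: which application held this IP at this moment? The sketch below shows one way to do it with a sorted ownership history per IP. The data, field names, and functions are invented for illustration and say nothing about how Sonar actually works internally.

```python
from bisect import bisect_right

# Hypothetical ownership history per IP: (start_timestamp, application),
# sorted by start time. An owner holds the IP until the next entry begins.
IP_HISTORY = {
    "10.0.0.5": [(0, "checkout-service"), (100, "search-service")],
    "10.0.0.9": [(50, "playback-api")],
}

def owner_at(ip: str, ts: int):
    """Return the application that held `ip` at time `ts`, or None if unknown."""
    history = IP_HISTORY.get(ip, [])
    starts = [start for start, _ in history]
    i = bisect_right(starts, ts)  # number of entries that started at or before ts
    return history[i - 1][1] if i else None

def enrich(record: dict) -> dict:
    """Attach source and destination application names to a flow log record."""
    return {**record,
            "src_app": owner_at(record["src"], record["ts"]),
            "dst_app": owner_at(record["dst"], record["ts"])}

flow = {"src": "10.0.0.5", "dst": "10.0.0.9", "ts": 120}
enriched = enrich(flow)
print(enriched)
```

The same IP maps to different applications depending on the timestamp, which is exactly why the raw logs are not enough on their own.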

Originally, they ran this through a 1,000-shard Kinesis data stream. Then in 2018, AWS updated VPC Flow Logs so logs could be delivered directly to S3. Netflix re-architected.

The new approach used S3 as the landing zone and SQS queues to coordinate processing. When new flow log files landed in S3, event notifications went to an SQS queue. A Lambda function grouped files into optimally sized batches – what Netflix called a “mouthful of files” – by reading only file size metadata from the SQS messages.
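
Grouping files into a “mouthful” needs nothing more than each file's name and size. Here is a minimal greedy sketch of that idea. The function name, the target size, and the sample files are my assumptions, and Netflix's real batching logic may well differ.

```python
def batch_by_size(files, target_bytes):
    """Greedily group (key, size) pairs so each batch stays at or under target_bytes.

    A single file larger than the target ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)  # close the batch before it overflows
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Hypothetical (object key, size) pairs, as they might be read from notifications.
notifications = [
    ("log-001.gz", 40), ("log-002.gz", 30), ("log-003.gz", 50),
    ("log-004.gz", 120), ("log-005.gz", 10),
]
batches = batch_by_size(notifications, target_bytes=100)
print(batches)
```

Nothing here opens a file. The decision is made entirely from metadata, which keeps the batching step fast and cheap.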

But they hit a wall. A standard Amazon SQS queue has a default limit of roughly 120,000 messages in flight at any time, meaning messages that a consumer has received but not yet deleted. Their solution was clever: use two SQS queues. The first queue held individual file notifications. A Lambda quickly read from this queue and wrote a single message to a second queue listing all files in one batch. If a batch contained 100 files on average, the second queue had 99% fewer messages. The Spark jobs read from the second queue. Problem solved.
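
The arithmetic behind the two-queue trick is easy to check. The sketch below uses plain Python lists as stand-ins for the two queues and a made-up volume of 500,000 file notifications. Only the 100 files per batch and the 120,000 figure come from the text above.

```python
import json

IN_FLIGHT_LIMIT = 120_000  # in-flight limit mentioned above

def compact(file_keys, files_per_batch):
    """Turn one message per file (queue 1) into one message per batch (queue 2)."""
    return [json.dumps({"files": file_keys[i:i + files_per_batch]})
            for i in range(0, len(file_keys), files_per_batch)]

# Hypothetical backlog: 500,000 individual file notifications.
first_queue = [f"flow-logs/part-{n:06d}.gz" for n in range(500_000)]
second_queue = compact(first_queue, files_per_batch=100)

reduction = 1 - len(second_queue) / len(first_queue)
print(len(first_queue), "->", len(second_queue), f"({reduction:.0%} fewer messages)")
print("fits under the in-flight limit:", len(second_queue) < IN_FLIGHT_LIMIT)
```

The slow consumers (the Spark jobs) only ever hold batch messages, so the number of in-flight messages on the second queue stays far below the limit.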

Two lessons from this:

  1. Stay current with AWS updates. A new feature in VPC Flow Logs let Netflix potentially eliminate a 1,000-shard Kinesis data stream. That is a significant cost and complexity reduction.
  2. Know your service quotas. Every AWS service has limits. Some are soft limits you can raise by contacting support. Some are hard limits. Either way, you need to design around them.

Emerging Trends in Data Engineering

Chapter 14 closes with a look at where data engineering is heading. The book was published in 2021, so some of these “emerging” trends have progressed since then, but the directions remain relevant.

ACID Transactions on Data Lakes

Traditionally, data lakes had a big weakness: no support for ACID transactions. If two processes tried to write to the same dataset at the same time, things could get messy. You could not easily update or delete individual records either.

Technologies like Delta Lake (from Databricks), Apache Hudi, and AWS Lake Formation Governed Tables fix this. They bring database-style transactional guarantees to data stored in S3. This is a big deal because it means data lakes can handle use cases that previously required a traditional database or data warehouse.
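
To get a feel for what a transactional guarantee buys you, here is a toy in-memory table where every commit must name the version it was based on. This is only an illustration of the general idea of rejecting conflicting writes. It is not how Delta Lake, Hudi, or Governed Tables are implemented, and the class and method names are made up.

```python
import threading

class VersionedTable:
    """Toy table: a commit succeeds only if nobody else committed in between."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.rows = {}

    def snapshot(self):
        """Return (version, copy of rows) as a consistent read."""
        with self._lock:
            return self.version, dict(self.rows)

    def commit(self, based_on: int, new_rows: dict) -> bool:
        """Apply the change all-or-nothing, or reject it if the table moved on."""
        with self._lock:
            if based_on != self.version:
                return False
            self.rows = dict(new_rows)
            self.version += 1
            return True

table = VersionedTable()
table.commit(0, {"user-1": "active", "user-2": "active"})

# Two writers start from the same snapshot.
v_a, rows_a = table.snapshot()
v_b, rows_b = table.snapshot()

rows_a["user-1"] = "suspended"    # writer A updates a record
del rows_b["user-2"]              # writer B deletes a record

first = table.commit(v_a, rows_a)   # succeeds
second = table.commit(v_b, rows_b)  # rejected: B must re-read and retry
print(first, second, table.snapshot())
```

Without the version check, writer B would silently overwrite writer A's update. That is the kind of mess described above.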

More Streaming, Less Batch

The volume of data keeps growing, and more of it is arriving in real time. IoT sensors, point-of-sale devices, event-driven architectures, social media feeds. Batch processing is not going away, but streaming is becoming a larger share of the pie. If you are starting your data engineering career, investing time in streaming technologies like Kinesis, Kafka, and Flink will pay off.

Multi-Cloud

This book focuses on AWS, but many large organizations use more than one cloud provider. They might run workloads on AWS, Azure, and Google Cloud simultaneously. That introduces challenges: different service names, different APIs, different pricing models, different quirks. Data engineers increasingly need to be comfortable working across multiple clouds, not just one.

Data Mesh

This is the big conceptual shift. In 2019, Zhamak Dehghani from ThoughtWorks published a blog post that questioned the entire model of centralized data engineering teams.

The traditional model works like this: every department sends its data to a central data lake, and a central data engineering team processes everything. The problem is that the central team becomes a bottleneck. They do not have deep domain knowledge about every department’s data. They cannot keep up with every team’s requests.

Data mesh flips this. Instead of a central team owning all data processing, each business domain (sales, marketing, logistics, whatever) owns its own data as a data product. The domain team that generates the data is also responsible for cleaning it, transforming it, and making it available to others.

A centralized platform team still exists, but their job changes. Instead of processing data, they provide the infrastructure: managed Spark environments, data catalogs, governance controls, access management. They build the platform. Domain teams use the platform to serve their own data products.

It is like the difference between a company cafeteria (central kitchen feeds everyone) and a food court (each vendor makes their own food, the building provides the space and utilities). Both models work. Data mesh argues the food court scales better.

Closing Your AWS Account

The hands-on section is cleanup, not a new lab. The book walks you through checking your AWS billing dashboard for leftover charges (stopped EC2 instances with attached volumes, forgotten RDS snapshots, expired QuickSight trials) and optionally closing your account entirely. Practical advice: always check your billing console before walking away. Forgotten resources have a way of quietly running up charges.

Key Takeaway

Chapter 14 is a reminder that tools are only part of the story. Knowing how to use Glue, Step Functions, and Athena matters. But knowing how to deploy changes safely through CI/CD, how to design pipelines that scale like Spotify and Netflix, and how to stay current as the field evolves – that is what separates a junior data engineer from a senior one.

The book ends with a simple message: this is just the beginning. Data engineering is a field that changes fast. The best thing you can do is keep building, keep reading, and keep learning.

That goes for all of us.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


