Data Engineering with AWS Chapter 1: What Even Is Data Engineering?

If someone told you twenty years ago that data would become more valuable than oil, you would have laughed. But here we are. The most valuable companies on the planet are not drilling for crude. They are collecting, processing, and squeezing insights out of massive piles of data. And behind every one of those companies, there is a team of data engineers making it all work.

This is post 2 in my Data Engineering with AWS retelling series. If you missed the intro, start there.

Chapter 1 of Gareth Eagar’s book sets the stage. It answers the basic questions: why does data matter so much now, what problems do companies face when they try to use it, and who are the people that actually make data useful? Let us break it all down.

Data Became the Most Valuable Thing a Company Owns

Look at the top companies by market value. Microsoft, Apple, Google, Amazon, Tesla. What do they all have in common? They are insanely good at collecting and using data. Compare that to a couple of decades ago, when the list was dominated by oil and gas giants like ExxonMobil. Those companies are barely in the top 30 now.

The shift happened because data, when used well, gives you a massive edge. TikTok uses data to figure out which video to show you next. Amazon uses your purchase history to recommend products you did not even know you wanted. Healthcare and finance are mining data to find patterns the human eye would never catch.

Every organization today falls into one of three camps:

  1. They already have a solid data analytics program and it is giving them an edge over competitors.
  2. They are running proof-of-concept projects to see if data analytics can help them.
  3. Their executives are losing sleep worrying that competitors are already doing it better.

No matter which camp you are in, the path forward requires people who can build the infrastructure to handle all that data. Enter the data engineer.

The Problem: Data Grew Faster Than Our Ability to Handle It

Companies have always had data. Customer records, sales numbers, inventory counts. But for a long time, that data lived in separate databases that did not talk to each other. Think of it like having a hundred filing cabinets in a hundred different offices and nobody has a master key.

A company starts with one database, then grows. More products, more customers, more teams. With modern microservices, companies now routinely have hundreds or thousands of databases. Each one is its own little silo.

To solve this, companies built data warehouses: central locations where you pull data from all those separate databases so you can analyze it in one place. Great idea, but traditional data warehouses were expensive. So companies could only keep a subset of data. They aggregated numbers instead of keeping raw details. They deleted old data because storage costs were brutal. Always making compromises.

Then came Hadoop. Born out of the open source Nutch web-crawler project and scaled up at Yahoo in the mid-2000s to index billions of web pages, Hadoop gave companies a way to store and process much larger datasets. Breakthrough, yes, but running a Hadoop cluster was complex and required specialized skills.

The next leap was Apache Spark. Where Hadoop read and wrote to disk constantly, Spark did most processing in memory, making it dramatically faster. Today, Spark is the gold standard for big data processing. Hadoop clusters still exist in production, but Spark is where the momentum is.

Alongside Spark came the data lake concept. Instead of expensive proprietary storage, a data lake uses cheap object storage (like Amazon S3) to hold all your data: structured, semi-structured, unstructured, whatever. Store everything as-is and process it later with whatever tool fits the job.

No more throwing away data because storage is too expensive. No more forcing everything into rigid table structures before you can use it. Dump it all in the lake and figure out the best way to analyze it later.
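To make the "store everything as-is, decide later" idea concrete, here is a minimal sketch in plain Python. It is not from the book: a local temp directory stands in for an S3 bucket, and the date-partitioned key layout (`raw/orders/2024/01/15/events.json`) is just one common convention, not a requirement.

```python
import json
import tempfile
from pathlib import Path

# A local temp directory standing in for an S3 bucket (illustrative only).
lake = Path(tempfile.mkdtemp())

raw_events = [
    {"type": "order", "id": 1, "amount": 19.99},   # structured
    {"type": "pageview", "url": "/products/42"},    # semi-structured
]

# Ingest: dump everything untouched. No upfront schema enforcement.
key = lake / "raw" / "orders" / "2024" / "01" / "15" / "events.json"
key.parent.mkdir(parents=True)
key.write_text("\n".join(json.dumps(e) for e in raw_events))

# Later, schema-on-read: parse only what this particular analysis needs.
orders = [
    json.loads(line)
    for line in key.read_text().splitlines()
    if json.loads(line).get("type") == "order"
]
print(orders)
```

The point is the ordering: ingestion imposes no structure, and each downstream read applies only the schema it needs.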

The Three Key Roles in the Data World

The book uses a practical example. A sales manager wants to know: what products do customers compare before buying ours, and can we predict demand based on weather? Answering that requires pulling data from customer databases, order records, web logs, marketplace sales data, and weather datasets.

Making it happen requires three types of people.

The Data Engineer

This is the focus of the entire book. The data engineer builds the pipes. They design and maintain the pipelines that pull raw data from various sources, transform it into something useful, and make it available to everyone who needs it.

Think of a data engineer like a civil engineer for a new neighborhood. The civil engineer builds roads, bridges, and train stations so people can move around. The data engineer builds the infrastructure so data can flow from where it is created to where it needs to go.

Their day-to-day involves tools like Apache Spark, Apache Kafka, and Presto. They handle batch ingestion versus real-time streaming, build data quality checks, maintain data catalogs, and manage the lifecycle of transformation code.
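To show the shape of that work without standing up Spark or Kafka, here is a toy batch pipeline in plain Python: extract raw records, transform and validate them, and track what failed the quality check. Every name and record in it is made up for the sketch; a real pipeline would run this logic at scale on Spark.

```python
# Raw records as they might arrive from a source system: messy strings.
raw_rows = [
    {"customer": " Alice ", "amount": "19.99"},
    {"customer": "Bob", "amount": "5.00"},
    {"customer": "", "amount": "oops"},  # a bad record
]

def transform(row):
    """Normalize one raw record, or return None if it fails quality checks."""
    name = row["customer"].strip()
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    if not name or amount < 0:
        return None
    return {"customer": name, "amount": amount}

clean = [t for t in (transform(r) for r in raw_rows) if t is not None]
rejected = len(raw_rows) - len(clean)

print(clean)     # the two valid, normalized records
print(rejected)  # 1 record failed the data-quality check
```

The interesting part for a data engineer is rarely the transform itself; it is making this run reliably every day, surfacing the rejected-record count, and versioning the transformation code.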

The Data Scientist

If the data engineer builds the roads, the data scientist builds the cars. Data scientists use machine learning and artificial intelligence to find non-obvious patterns in data and make predictions about the future. They combine skills in computer science, statistics, and math to answer complex questions.

In our sales manager example, the data scientist might build a model that correlates past sales with weather data, then predicts which categories will sell best on future dates based on the forecast.

The Data Analyst

The data analyst is the skilled driver. They take the roads the engineer built and the vehicles the scientist created, and use them to get business users where they need to go. They run queries, join datasets together, and create reports that drive better decisions.

In our example, the analyst might create a report showing which alternative products customers browse most before buying a specific product. That insight helps the sales team make smarter calls about pricing and marketing.
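A toy version of that report, with invented session data: count which alternative products customers viewed before buying a hypothetical product "X-100". In practice the analyst would express this as a SQL join over web logs and order records; a `Counter` makes the logic visible in a few lines.

```python
from collections import Counter

# Made-up browsing sessions: products viewed, and what was bought.
sessions = [
    {"viewed": ["X-100", "Y-200", "Z-300"], "bought": "X-100"},
    {"viewed": ["Y-200", "X-100"], "bought": "X-100"},
    {"viewed": ["Z-300", "W-400"], "bought": "W-400"},
]

# For sessions that ended in an X-100 purchase, tally the other
# products the customer looked at first.
alternatives = Counter()
for s in sessions:
    if s["bought"] == "X-100":
        alternatives.update(p for p in s["viewed"] if p != "X-100")

print(alternatives.most_common())  # [('Y-200', 2), ('Z-300', 1)]
```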

In smaller companies, one person wears all three hats. In larger organizations, these are distinct teams. You will also see variations like “big data architect” or “data visualization developer,” but they are subsets of these core three roles.

Why the Cloud Changes Everything

For years, companies ran data infrastructure in their own data centers. Racks of servers, dedicated storage, teams of people just to keep the lights on. Scaling was painful. Need more storage? Order hardware, wait weeks, rack it, configure it.

AWS launched in 2006 and changed the game. One of its earliest services was Amazon S3, an object storage service with essentially unlimited scalability at low cost. S3 has become the storage backbone for thousands of data lake projects worldwide. On top of it, AWS built an entire ecosystem of analytics tools for ingestion, transformation, querying, visualization, and machine learning.

The cloud gives you three things that on-premises data centers struggle with:

  • Scalability. Need to process ten times more data next month? Scale up with a few clicks, then scale back down when the job is done.
  • Cost efficiency. Pay only for what you use. No more buying servers that sit idle 90% of the time.
  • Speed. Spin up a new analytics environment in minutes instead of weeks.

For data engineers, this means you can focus on building pipelines and solving data problems instead of babysitting hardware.

Getting Your Hands Dirty: Setting Up AWS

The rest of Chapter 1 walks through creating an AWS account. If you want to follow along with the book’s hands-on exercises, you will need an account with administrator privileges. Here is the short version:

  1. Go to aws.amazon.com and create an account.
  2. You will need an email address, phone number, and a credit or debit card.
  3. Once your account is activated, log in and create a new IAM user with AdministratorAccess instead of using the root account for daily work.
  4. Enable Multi-Factor Authentication (MFA) on both the root account and your IAM user. Seriously, do this.
  5. Set up billing alerts so you do not get surprised by charges.

A tip from the book: if you already used your email for an AWS account, many providers like Gmail let you add a + suffix to create a unique address. So something like yourname+aws@gmail.com still delivers to yourname@gmail.com's inbox but counts as a new address for registration.

Heads-up: following along will cost real money. Some services fall under AWS Free Tier, but many do not. Watch your billing dashboard.

The Big Takeaway

Data is the most valuable asset a modern company has, and the demand for people who can wrangle it keeps growing. Data engineers build the plumbing that makes everything else possible: the analytics, the dashboards, the machine learning models. The cloud, especially AWS, has made it dramatically easier to do all of this without owning a single server.

Chapter 1 is the foundation. The real fun starts in Chapter 2, where we dig into the architectures that organize all this data: data warehouses, data lakes, and the newer data lakehouse concept.

See you there.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


