Data Engineering with AWS Chapter 8: Who Actually Uses All This Data?

This is post 13 in my Data Engineering with AWS retelling series.

We have spent the last several chapters ingesting data, transforming data, optimizing data. Pipelines everywhere. But here is the question nobody asks often enough: who is actually going to use all of this?

Chapter 8 flips the perspective. Instead of thinking about what data we have, we think about who needs it and how they want to consume it. Turns out, this should have been the starting point all along.

Data Democratization and Data Gravity

For decades, if you wanted data in a company, you went to the IT department. You filed a request. You waited. Maybe you got a spreadsheet back a few weeks later. Maybe you did not.

Those days are over. Eagar introduces the concept of data democratization, which is a fancy way of saying everyone in the organization expects access to data now. Not next month. Not after a request goes through three approval layers. Now.

He also brings up an interesting idea called data gravity, a term coined by Dave McCrory. The concept is simple: data has mass. As a dataset grows bigger, it attracts more users and applications. It becomes harder to move. Think of it like a planet. The bigger it gets, the more things orbit around it.

This is why modern data pipelines should store data in a way that lets people interact with it where it already lives. You do not want to copy terabytes of data every time someone needs to run a query. You want tools that can reach into the data lake and work with it directly.

The Four Types of Data Consumers

Eagar breaks data consumers down into four distinct groups. Each one has different needs, different tools, and different expectations. As a data engineer, your job is to serve all of them.

Business Users

These are the executives, managers, and operational staff who need data to make decisions. They are not writing SQL. They are not building models. They want dashboards that update automatically and show them what they need to know at a glance.

Some of them are Excel power users who love pivot tables. Others just want a clean chart showing last quarter’s revenue by region. The range is wide.

On AWS, the primary tool for business users is Amazon QuickSight. It is a cloud-based BI tool that lets you build interactive dashboards pulling from multiple data sources. Users can filter, sort, drill down into specifics, and access dashboards from their phone or embedded in existing web portals.

QuickSight connects to a lot of sources: S3 data lakes, Redshift, MySQL, Oracle, Salesforce, Jira, ServiceNow, and more. As a data engineer, you might help set up those connections or build new curated datasets so business users can self-serve without needing to join tables themselves.

The key shift here is that business users no longer want to go through data gatekeepers. They want direct access. Your pipeline needs to deliver data in a format and location that QuickSight (or whatever BI tool your org uses) can consume easily.

Data Analysts

If business users are the people who look at data to make decisions, data analysts are the people whose entire job is the data itself. They dig deep. They write queries. They find patterns nobody else noticed.

A data analyst’s typical work includes:

  • Cleansing data and ensuring quality when dealing with new or ad hoc sources
  • Becoming a domain specialist for their part of the business
  • Running SQL queries to answer specific business questions
  • Creating visualizations for business users using BI tools
  • Performing statistical analysis to identify trends and areas of concern

The questions they answer are specific and often complex. What percentage of customers browsed the site more than 5 times in 2 weeks but never bought anything? Which products are most popular among different age demographics?
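A question like the first one boils down to a grouped count over a clickstream. Here is a rough sketch in plain Python over made-up event tuples (the customer IDs and event names are invented for illustration; this is not any AWS API):

```python
from collections import defaultdict

# Hypothetical clickstream events within a 2-week window: (customer_id, event_type).
events = [
    ("c1", "page_view"), ("c1", "page_view"), ("c1", "page_view"),
    ("c1", "page_view"), ("c1", "page_view"), ("c1", "page_view"),
    ("c2", "page_view"), ("c2", "purchase"),
    ("c3", "page_view"),
]

views = defaultdict(int)
buyers = set()
for customer, event in events:
    if event == "page_view":
        views[customer] += 1
    elif event == "purchase":
        buyers.add(customer)

# Customers who browsed more than 5 times but never bought anything.
browsed_no_buy = {c for c, n in views.items() if n > 5 and c not in buyers}
all_customers = views.keys() | buyers
pct = 100 * len(browsed_no_buy) / len(all_customers)
print(f"{pct:.1f}% browsed >5 times without buying")  # → 33.3% (c1, out of 3 customers)
```

In production the same logic would run as a SQL query against the data lake rather than in application code, but the shape of the computation is identical.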

On AWS, data analysts use several tools:

  • Amazon Athena for running SQL queries against data in S3, Redshift, and other sources. Athena is serverless. You write a query, it runs, and you pay for the data it scans. No infrastructure to manage.
  • AWS Glue DataBrew for visual data cleansing and transformation. Over 250 built-in transforms, no code required. Analysts can connect to Redshift, Snowflake, S3, and more.
  • Python and R for advanced analysis, running on Lambda, Glue Python Shell, EC2, or Amazon EMR for large datasets.
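An analyst would typically express the browse-but-never-buy question as SQL and submit it to Athena. Here is a hedged sketch using boto3's `start_query_execution` call (the table, database, and bucket names are made up; the call itself requires AWS credentials, so it is shown but not invoked):

```python
# Presto/Trino-style SQL, as Athena expects. Table name is hypothetical.
QUERY = """
SELECT 100.0 * COUNT_IF(views > 5 AND purchases = 0) / COUNT(*) AS pct_browsed_no_buy
FROM (
    SELECT customer_id,
           COUNT_IF(event_type = 'page_view') AS views,
           COUNT_IF(event_type = 'purchase') AS purchases
    FROM clickstream_events  -- hypothetical table in the Glue Data Catalog
    WHERE event_date >= date_add('day', -14, current_date)
    GROUP BY customer_id
)
"""

def run_athena_query(query: str, database: str, output_s3: str) -> str:
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    client = boto3.client("athena")
    resp = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# Not executed here; names are hypothetical:
# run_athena_query(QUERY, "curated_zone", "s3://my-athena-results/")
```

Athena runs the query asynchronously; you would poll `get_query_execution` for completion and then read the results from the S3 output location.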

One thing Eagar emphasizes is the relationship between data analysts and data engineers. Analysts often build ad hoc pipelines to answer one-off questions. When those pipelines become something the business depends on regularly, a data engineer should step in to formalize them. Put them in source control. Add proper monitoring. Make them production-grade.

Data Scientists

Data scientists are the folks building machine learning models. While analysts look at what happened and what is happening now, data scientists try to predict what will happen next.

Their work involves:

  • Identifying non-obvious patterns in data (given these blood test results, what is the likelihood of a specific condition?)
  • Predicting future outcomes (will this customer default on their loan?)
  • Extracting metadata from unstructured data (is the person in this photo smiling? wearing sunglasses?)

Data scientists are hungry for data. Most ML approaches need large volumes of raw, non-aggregated data to train models. They do not want your neatly summarized quarterly reports. They want the raw transaction logs, the full clickstream, the unprocessed sensor readings.

AWS offers a suite of tools under Amazon SageMaker:

  • SageMaker Ground Truth for labeling datasets. Say you have 10,000 images of dogs and cats but none are labeled. Ground Truth uses its own ML model to auto-label what it can, then routes uncertain items to human labelers (either your team or Amazon Mechanical Turk contractors). This saves weeks of manual work.
  • SageMaker Data Wrangler for data preparation. Data scientists reportedly spend up to 70% of their time cleaning data. Data Wrangler offers over 300 built-in transforms and supports custom transformations in PySpark and pandas. You can export the flow as a Jupyter Notebook or Python code.
  • SageMaker Clarify for detecting bias in training data. If your credit risk model is trained mostly on middle-aged customers, it might be unreliable for younger or older people. Clarify flags these imbalances before they become problems in production.
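Clarify's bias reporting goes well beyond this, but the class-imbalance check it starts from can be sketched in a few lines of plain Python. The age bands, records, and 20% threshold below are all invented for illustration:

```python
from collections import Counter

# Hypothetical credit-risk training records: (age_band, defaulted) pairs.
training_rows = [
    ("18-29", 0), ("30-49", 0), ("30-49", 1), ("30-49", 0),
    ("30-49", 0), ("30-49", 1), ("30-49", 0), ("50+", 0),
]

counts = Counter(age for age, _ in training_rows)
total = len(training_rows)

# Flag any group contributing less than 20% of the sample (threshold is arbitrary).
underrepresented = {g: n / total for g, n in counts.items() if n / total < 0.20}
print(underrepresented)  # the 18-29 and 50+ bands, each at 12.5%
```

A model trained on this sample would see six times as many middle-aged customers as young or older ones, which is exactly the kind of skew Clarify surfaces before it shows up as unreliable predictions in production.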

Machine-to-Machine Consumers

This one is easy to overlook. Data consumers are not always humans. Applications consume data too.

Call centers want real-time transcripts of audio calls for sentiment analysis. Engagement platforms track every email open, every click, every ignored notification to personalize the customer journey. Manufacturing systems pull IoT sensor data to predict when machines need maintenance before they break.

These application-level consumers often have the strictest latency requirements. They need data now, not tomorrow morning when the batch job runs.

Hands-On: Building a Mailing List with Glue DataBrew

The chapter includes a practical exercise where you play the role of a data analyst. The scenario: your video store is closing its physical locations and going streaming-only. Marketing needs a mailing list to notify customers.

You use AWS Glue DataBrew to:

  1. Connect two datasets from the Glue Data Catalog: the customer table and the address table (both originally ingested from MySQL in Chapter 6)
  2. Join the tables using a left join on address_id
  3. Select only the columns marketing needs: customer_id, first_name, last_name, email, address, district, postal_code
  4. Apply formatting transforms: change first_name and last_name to capital case, convert email to lowercase
  5. Run the job and write the output as a CSV file to S3

The whole thing is done through DataBrew’s visual interface. No code. The exercise also highlights an important distinction: Glue Studio generates Spark code you can edit and run anywhere, while Glue DataBrew is a closed system. DataBrew jobs only run inside DataBrew. Glue Studio is more flexible for engineers. DataBrew is more accessible for analysts.
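For engineers curious what those five steps amount to, here is a rough stdlib-only equivalent. The in-memory rows stand in for the Glue Catalog tables and the sample values are invented; only the column names follow the exercise:

```python
import csv
import io

# Stand-ins for the customer and address tables from the Glue Data Catalog.
customers = [
    {"customer_id": "1", "first_name": "mary", "last_name": "SMITH",
     "email": "Mary.Smith@Example.com", "address_id": "5"},
]
addresses = [
    {"address_id": "5", "address": "1913 Hanoi Way",
     "district": "Nagasaki", "postal_code": "35200"},
]

# Step 2: left join on address_id (customers with no match keep empty address fields).
addr_by_id = {a["address_id"]: a for a in addresses}
joined = [{**c, **addr_by_id.get(c["address_id"], {})} for c in customers]

# Steps 3-4: select marketing's columns, capital-case the names, lowercase the email.
columns = ["customer_id", "first_name", "last_name", "email",
           "address", "district", "postal_code"]
rows = []
for r in joined:
    r["first_name"] = r["first_name"].title()
    r["last_name"] = r["last_name"].title()
    r["email"] = r["email"].lower()
    rows.append({k: r.get(k, "") for k in columns})

# Step 5: write CSV (to a string here; the real DataBrew job writes to S3).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The point of the exercise is that an analyst gets the same result through DataBrew's visual interface without writing any of this.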

Key Takeaway

Chapter 8 reinforces something that came up in the pipeline design chapter: start with the consumer. Everything a data engineer builds is ultimately in service of someone who needs that data to do their job or some application that needs that data to function.

The variety of consumers is growing. Business users want real-time dashboards on their phones. Analysts want serverless SQL engines. Data scientists want raw data and labeling tools. Applications want streaming feeds. Each group has different requirements for format, latency, volume, and tooling.

Your job as a data engineer is to understand all of them and build pipelines that deliver the right data, in the right format, through the right tools, at the right time.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


Previous: Chapter 7 Part 2 - Transforming Data
Next: Chapter 9 Part 1 - Data Mart
