Data Engineering with AWS Chapter 13: Enabling AI and Machine Learning

This is post 19 in my Data Engineering with AWS retelling series.

Throughout this book, we have been ingesting data, transforming data, storing data, querying data, and visualizing data. All of that is incredibly useful on its own. But there is a whole other level where data gets really powerful: when you use it to teach a machine to make predictions.

Chapter 13 is where the book steps into AI and machine learning territory. And before you close the tab thinking this is only for PhD researchers, hold on. A big chunk of this chapter is about services where you do not write a single line of ML code. You just call an API and get back results.

What Is Machine Learning, Really?

Strip away the hype and ML is a straightforward idea. You take a large amount of data, feed it into an algorithm, and the algorithm learns patterns. The result is a model – a trained thing that can take new, unseen data and make predictions about it.

Traditional software works with rules you write by hand. Machine learning flips that. Instead of writing rules, you give the system thousands of examples and it figures out the rules itself. Nobody sat down and wrote ten thousand if-else statements to catch every spam email. They fed millions of labeled emails into an algorithm and let it learn what spam looks like.

Where ML Is Already Everywhere

The book gives some impressive real-world examples to show this is not just academic stuff.

Healthcare: Cerner, a huge healthcare IT company, uses Amazon SageMaker to predict optimal hospital staff schedules and patient workflows. Researchers have used ML to detect early-stage cancer from CT scans with 94% accuracy. That is the kind of accuracy that saves lives.

Sports: The NFL uses ML to track player movements on the field and identify situations that carry high injury risk. The goal is to change rules or equipment before players get hurt.

Everyday life: When Netflix recommends a show you end up binge-watching, that is ML. When Amazon suggests a product and you think “how did they know,” that is ML. When Alexa understands your question even though you mumbled half of it, that is ML too. Sales forecasting, inventory management, fraud detection – ML is already baked into most of the services you use daily.

The AWS AI/ML Stack: Three Layers

AWS organizes its ML offerings into three layers, from easiest to hardest.

Top layer: AI Services. Pretrained models behind simple APIs. You send data in, you get predictions out. Think of it like ordering food at a restaurant. You do not need to know how to cook.

Middle layer: ML Services. This is Amazon SageMaker. Tools to build, train, and deploy your own custom models. You need some ML knowledge, but SageMaker handles the heavy infrastructure.

Bottom layer: ML Frameworks and Infrastructure. For the deep experts. TensorFlow, PyTorch, MXNet running on EC2 GPU instances. Full control, full responsibility.

For data engineers, the top two layers matter most. Let us look at each.

Amazon SageMaker: The Full ML Workshop

SageMaker is like a complete workshop where data scientists do everything from preparing data to deploying models in production. The book walks through its key pieces:

Data Preparation:

  • Ground Truth handles data labeling. If you have 50,000 images but none are tagged as “cat” or “dog,” Ground Truth uses a combination of automated labeling and human reviewers to get it done efficiently.
  • Data Wrangler offers a visual interface with over 300 built-in transformations for cleaning and preparing data. Data scientists reportedly spend up to 70% of their time just cleaning data, so this is a big deal.
  • Clarify checks your training data for bias. If your credit risk model was trained mostly on data from one demographic, Clarify flags that before the model goes to production and starts making unfair decisions.

Building Models:

  • Studio Notebooks provide a Jupyter Notebook environment for writing and testing code.
  • Autopilot is for when you want SageMaker to do the heavy lifting. Give it a dataset, tell it what you want to predict, and Autopilot automatically tries different algorithms, tunes them, and picks the best one.
  • JumpStart offers prebuilt model solutions you can deploy quickly. Need a text classification model? There is probably one ready to go.

Training and Deployment:

  • SageMaker handles distributed training across multiple machines for large datasets.
  • Hyperparameter tuning tests different model settings to find the best combination.
  • Once trained, you can deploy models for real-time predictions (API endpoint) or batch predictions (process a whole file at once).
  • Model Monitor watches deployed models for quality drift. Models degrade over time as real-world data shifts away from the training data. Monitor catches that before your predictions go bad.
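Once a model is deployed behind a real-time endpoint, calling it from a pipeline is a single API call. A minimal boto3 sketch, assuming a hypothetical endpoint named `churn-predictor` that was trained on CSV input (the feature format depends entirely on how your model was trained):

```python
def to_csv_row(features: list) -> str:
    """Serialize a feature vector into the single CSV line the
    endpoint expects as its request body."""
    return ",".join(str(f) for f in features)

def predict(features: list) -> str:
    """Send one observation to a real-time SageMaker endpoint.
    'churn-predictor' is a hypothetical endpoint name."""
    import boto3  # deferred so to_csv_row stays usable offline
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="churn-predictor",
        ContentType="text/csv",
        Body=to_csv_row(features),
    )
    # The response body is a streaming object; read and decode it.
    return response["Body"].read().decode("utf-8")
```

For batch predictions over a whole file, you would instead create a Batch Transform job and let SageMaker spin up, process, and tear down the infrastructure for you.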

AWS AI Services: The No-Code ML Option

This is the layer that matters most to data engineers who are not ML specialists. These are pretrained models you access through simple API calls. No training. No tuning. Just send data and get results.

Amazon Transcribe converts speech to text. It can identify different speakers in a conversation, handle multiple languages, and even remove personally identifiable information (PII) from transcripts automatically.
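As a sketch of what that looks like in practice, here is a boto3 call that starts a transcription job with speaker separation and automatic PII redaction turned on. The job name and S3 URI are placeholders; the request is split into a pure builder function so the parameters are easy to inspect:

```python
def build_transcribe_request(job_name: str, media_uri: str) -> dict:
    """Assemble start_transcription_job parameters: speaker
    separation for up to two speakers, plus automatic PII
    redaction in the output transcript."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp3",
        "LanguageCode": "en-US",
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        },
    }

def transcribe_call(job_name: str, media_uri: str) -> None:
    """Kick off an asynchronous transcription job."""
    import boto3  # deferred so the builder stays usable offline
    boto3.client("transcribe").start_transcription_job(
        **build_transcribe_request(job_name, media_uri)
    )
```

Transcription is asynchronous: you poll the job (or listen for an EventBridge event) and fetch the finished transcript from S3 when it completes.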

Amazon Textract pulls text out of scanned documents, PDFs, and images. Not just printed text – it handles handwriting too. If your pipeline receives thousands of scanned invoices daily, Textract can extract the data without anyone manually reading them.
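A minimal sketch of that invoice scenario with boto3: Textract returns the document as a list of `Blocks` (pages, lines, and words), so extracting the readable text is mostly a matter of filtering for the `LINE` blocks. The bucket and key here are placeholders:

```python
def extract_lines(textract_response: dict) -> list:
    """Pull the detected text lines out of a Textract response.
    Blocks come back as PAGE, LINE, and WORD types; the LINE
    blocks carry the human-readable text."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]

def scan_invoice(bucket: str, key: str) -> list:
    """Run synchronous text detection on a document stored in S3."""
    import boto3  # deferred so extract_lines stays usable offline
    client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)
```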

Amazon Comprehend does natural language processing. Feed it a block of text and it identifies entities (people, places, organizations), detects the language, and determines the sentiment (positive, negative, neutral, mixed).

The book gives a great example. Imagine a review that says: “I recently visited Jack’s Cafe and the food was delicious.” Comprehend would extract entities like “Jack’s Cafe” (tagged as Organization) and determine the overall sentiment as Positive with 99% confidence. Simple input, structured output. Very useful for processing customer feedback at scale.
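The review example above maps to a couple of lines of boto3. `detect_sentiment` returns the overall label plus a score for each of the four categories; a small helper pulls out the winning label and its confidence:

```python
def top_sentiment(response: dict) -> tuple:
    """Return the overall sentiment label and the confidence
    score Comprehend assigned to it."""
    sentiment = response["Sentiment"]  # e.g. "POSITIVE"
    # SentimentScore keys are title-cased: Positive, Negative, ...
    score = response["SentimentScore"][sentiment.capitalize()]
    return sentiment, score

def analyze_review(text: str) -> tuple:
    """Call Comprehend's sentiment API for a single review."""
    import boto3  # deferred so top_sentiment stays usable offline
    client = boto3.client("comprehend")
    response = client.detect_sentiment(Text=text, LanguageCode="en")
    return top_sentiment(response)
```

For the Jack's Cafe review, `analyze_review` would come back with something like `("POSITIVE", 0.99)`.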

Amazon Rekognition analyzes images and video. Upload a photo and it tells you what is in it. The book example: a photo of a dog in snow near a car. Rekognition returns labels like “Snow” (99% confidence), “Dog” (93%), and “Car” (90%). It can also detect faces, identify celebrities, find inappropriate content, and track objects across video frames.
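The dog-in-snow example translates directly to a `detect_labels` call. A sketch with boto3, with the bucket and key as placeholders and a helper that keeps only high-confidence labels:

```python
def confident_labels(response: dict, threshold: float = 90.0) -> dict:
    """Keep only labels at or above the confidence threshold,
    mapped to their confidence scores."""
    return {
        label["Name"]: label["Confidence"]
        for label in response.get("Labels", [])
        if label["Confidence"] >= threshold
    }

def label_image(bucket: str, key: str) -> dict:
    """Detect labels in an image stored in S3."""
    import boto3  # deferred so confident_labels stays usable offline
    client = boto3.client("rekognition")
    response = client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,
        MinConfidence=80.0,
    )
    return confident_labels(response)
```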

Amazon Forecast handles time-series predictions. Feed it historical sales data and it predicts future demand. The clever part is that it can incorporate external factors like weather data. An ice cream shop’s sales depend heavily on temperature, and Forecast can factor that in automatically.

Amazon Fraud Detector does exactly what the name says. It analyzes transaction patterns to catch fraud. You define a model based on your historical data (legitimate and fraudulent transactions), and it scores new transactions in real time.

Amazon Personalize powers recommendation engines. The same technology behind “customers who bought this also bought that,” available as a service. Feed it user interaction data and it generates personalized recommendations.
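Once a Personalize campaign is trained and deployed, fetching recommendations is a single runtime call. A sketch with boto3, where the campaign ARN is a placeholder for one you would create yourself:

```python
def item_ids(response: dict) -> list:
    """Flatten the itemList in a get_recommendations response
    into a plain list of item IDs."""
    return [item["itemId"] for item in response.get("itemList", [])]

def recommend_for(user_id: str, campaign_arn: str) -> list:
    """Fetch up to ten personalized recommendations for a user."""
    import boto3  # deferred so item_ids stays usable offline
    client = boto3.client("personalize-runtime")
    response = client.get_recommendations(
        campaignArn=campaign_arn,
        userId=user_id,
        numResults=10,
    )
    return item_ids(response)
```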

The Hands-On: Sentiment Analysis Pipeline

The chapter exercise builds a practical pipeline that processes hotel reviews through Comprehend:

  1. Reviews arrive in an Amazon SQS (Simple Queue Service) queue.

  2. A Lambda function picks up each review from the queue.
  3. Lambda sends the review text to Amazon Comprehend for sentiment analysis.
  4. Comprehend returns the sentiment (Positive, Negative, Neutral, Mixed) with confidence scores.
  5. The results get stored for further analysis.

This is a realistic pattern. Hotels, restaurants, and e-commerce companies process thousands of reviews daily. Instead of hiring people to read every single review and categorize them, you pipe them through Comprehend and get instant classification. You can then alert the customer service team only when negative reviews come in, or track sentiment trends over time.
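The Lambda step in that pipeline might look like the sketch below. The SQS message format (a JSON body with `id` and `text` fields) and the DynamoDB table name are assumptions for illustration; your own pipeline would define both:

```python
import json

def parse_review(record: dict) -> dict:
    """Extract one review from an SQS record; the message body
    is assumed to be JSON like {"id": "...", "text": "..."}."""
    body = json.loads(record["body"])
    return {"id": body["id"], "text": body["text"]}

def handler(event, context):
    """Lambda entry point: SQS delivers batches of reviews,
    Comprehend scores each one, and the results land in a
    DynamoDB table (name is hypothetical)."""
    import boto3  # available by default in the Lambda runtime
    comprehend = boto3.client("comprehend")
    table = boto3.resource("dynamodb").Table("review-sentiment")

    for record in event["Records"]:
        review = parse_review(record)
        result = comprehend.detect_sentiment(
            Text=review["text"], LanguageCode="en"
        )
        table.put_item(Item={
            "review_id": review["id"],
            "sentiment": result["Sentiment"],
        })
```

From there, a DynamoDB stream or a scheduled query can drive the alerting and trend-tracking described above.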

Why Data Engineers Should Care

You might think ML is someone else’s job. But the lines are blurring. Data engineers increasingly build pipelines that feed ML models or consume their output. You might prepare training data for a data scientist. You might build a real-time pipeline that sends customer messages to Comprehend and routes them based on sentiment. You might pipe scanned documents through Textract into your data lake.

Understanding what these AI services can do makes you a better pipeline designer. You know what is possible, so you build for it.

Key Takeaway

AI and ML are not just buzzwords reserved for research labs anymore. AWS has made it accessible at every skill level. If you are a data scientist who wants full control, SageMaker gives you the complete toolkit. If you are a data engineer who just wants to add intelligence to a pipeline, the AI Services layer lets you call an API and get predictions back in seconds.

The most practical takeaway from this chapter: you do not need to become an ML expert to use ML in your data pipelines. Amazon Comprehend, Rekognition, Textract, Forecast, and the other AI services are just API calls. If you can write a Lambda function, you can add ML to your workflow.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


Previous: Chapter 12: Visualizing Data with Amazon QuickSight
Next: Chapter 14: Wrapping Up the Learning Journey
