Data Engineering with AWS Chapter 3 Part 1: The AWS Toolkit - Storage and Databases

Chapter 3 is massive. It is basically a catalog of every AWS service a data engineer will touch, from getting data in to getting answers out. So I am splitting it into two posts. This first part covers how data gets into AWS – all the ingestion services, the streaming tools, and the physical devices AWS will literally ship to your door.

This is post 4 in my Data Engineering with AWS retelling series.

Why Ingestion Matters

Before you can analyze anything, you need to actually get the data into the cloud. Sounds obvious, but it is harder than it looks. Your data might live in an old Oracle database in some closet, or stream in real time from millions of IoT sensors, or sit in a Salesforce account managed by your marketing team. Each source needs a different approach. AWS built a bunch of services to handle these different scenarios so you do not have to build everything from scratch.

AWS DMS: Syncing Databases to the Cloud

AWS Database Migration Service (DMS) is your go-to when you need to pull data out of traditional databases and into your AWS data lake. The classic use case: you have a production database running Oracle or MySQL, and you want a copy of that data in S3 for analytics.

DMS does two things really well:

  1. Full load – it grabs everything from your source database and writes it to S3 in CSV or Parquet format.
  2. Ongoing replication – after the full load, it keeps watching for changes using the database transaction logs.

That second part is called Change Data Capture (CDC). Every time someone inserts, updates, or deletes a record in the source database, DMS writes that change to a file in S3 with an extra column that tells you what happened. An “I” means insert, “U” means update, “D” means delete. A separate process then applies those changes to build a fresh snapshot.
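That apply step is easy to picture in plain Python. The sketch below assumes CDC rows are CSV with the operation flag in the first column, followed by an id and one field; the real column layout depends on your DMS task settings:

```python
import csv
import io

def apply_cdc(snapshot, cdc_file):
    """Apply DMS-style change rows (Op, id, field) to a snapshot dict.

    'I' inserts a row, 'U' overwrites it, 'D' removes it.
    The column layout here is illustrative, not the exact DMS output.
    """
    for op, row_id, name in csv.reader(cdc_file):
        if op in ("I", "U"):
            snapshot[row_id] = name
        elif op == "D":
            snapshot.pop(row_id, None)
    return snapshot

# Start from the full-load snapshot, then replay a batch of changes.
snapshot = {"1": "alice", "2": "bob"}
changes = io.StringIO("I,3,carol\nU,1,alicia\nD,2,bob\n")
apply_cdc(snapshot, changes)
print(snapshot)  # {'1': 'alicia', '3': 'carol'}
```

In practice this runs as a Glue or Spark job over the CDC files in S3, but the insert/update/delete logic is the same idea.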

When to use it: You need to keep your data lake in sync with production databases on an ongoing basis. Works great for moving data from one database engine to a completely different one.

When to skip it: If you are migrating to the same engine (say, MySQL to MySQL on AWS), native database tools usually work better.

Amazon Kinesis: Real-Time Streaming

If DMS is for database-to-lake syncing, Amazon Kinesis is for everything that moves fast. Website clicks, log files, IoT sensor data, video feeds. Kinesis is actually a family of four services:

  • Kinesis Data Firehose – the simplest option. It buffers incoming data until a time threshold (60 seconds to 15 minutes) or a size threshold is reached, then writes it to S3, Redshift, OpenSearch (formerly Elasticsearch), or third-party services like Splunk. You can even convert data to Parquet or ORC on the way through. Great for “I just want my log files in S3” use cases.

  • Kinesis Data Streams – for when you need low latency. Data becomes available to your consuming applications within about 70 milliseconds. Netflix uses this to process terabytes of log data daily. You can consume the stream with Lambda functions, custom EC2 applications, or other Kinesis services.

  • Kinesis Data Analytics – lets you run SQL queries or Apache Flink code directly on streaming data. Perfect for questions like “how many sales of product X happened in the last 5 minutes?”

  • Kinesis Video Streams – handles streaming video, audio, thermal imagery, and radar data. Think video doorbells, security cameras, baby monitors.
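Firehose’s flush-on-size-or-time behavior is worth internalizing, since it explains the delivery latency. Here is a toy model with made-up thresholds (real Firehose buffers 1–128 MB or 60–900 seconds):

```python
import time

class Buffer:
    """Toy model of Firehose buffering: flush when either the size
    threshold or the time threshold is hit, whichever comes first."""

    def __init__(self, max_bytes, max_seconds, sink):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.sink = sink          # callable that receives a flushed batch
        self.records = []
        self.nbytes = 0
        self.started = None

    def put(self, record, now=None):
        now = time.monotonic() if now is None else now
        if self.started is None:
            self.started = now
        self.records.append(record)
        self.nbytes += len(record)
        if self.nbytes >= self.max_bytes or now - self.started >= self.max_seconds:
            self.flush()

    def flush(self):
        if self.records:
            self.sink(self.records)
        self.records, self.nbytes, self.started = [], 0, None

batches = []
buf = Buffer(max_bytes=10, max_seconds=60, sink=batches.append)
for rec in ["log1", "log2", "log3"]:   # 4 bytes each
    buf.put(rec, now=0)
print(batches)  # [['log1', 'log2', 'log3']]
```

The third record pushes the buffer past 10 bytes, so the whole batch flushes at once – which is exactly why Firehose writes a few large objects to S3 instead of one object per record.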

AWS also provides the Kinesis Agent, a Java app you install on your servers. It watches files (like Apache web server logs), buffers the data, and sends it to Firehose or Data Streams automatically.
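The kind of question Kinesis Data Analytics answers (“sales of product X in the last 5 minutes”) boils down to a windowed aggregation. A minimal trailing-window count in plain Python makes the idea concrete; the actual service would run this as SQL or Flink over the live stream:

```python
from collections import deque

class WindowedCounter:
    """Count events per key within a trailing time window (in seconds)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key) pairs in arrival order

    def add(self, ts, key):
        self.events.append((ts, key))

    def count(self, key, now):
        # Evict events older than the window, then count matching keys.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        return sum(1 for _, k in self.events if k == key)

sales = WindowedCounter(window_seconds=300)       # 5-minute window
for ts, product in [(0, "X"), (100, "X"), (250, "Y"), (400, "X")]:
    sales.add(ts, product)

x_recent = sales.count("X", now=390)
print(x_recent)  # 2 -- the sale at t=0 has aged out of the window
```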

One thing worth noting: Kinesis is not bulletproof. In November 2020, Kinesis in the Northern Virginia region had hours of increased error rates. Roomba vacuums, Ring doorbells, The Washington Post, and Roku were all affected. Werner Vogels, Amazon’s CTO, likes to say “everything fails all the time.” Design accordingly.

Amazon MSK: Managed Kafka

Amazon Managed Streaming for Apache Kafka (MSK) is for teams that already use or want Apache Kafka. Kafka is wildly popular for building streaming data pipelines, but running it yourself is painful. MSK handles the deployment, monitoring, and failed component replacement.

When to use it: You are migrating an existing Kafka cluster, or you want the rich third-party integration ecosystem that Kafka offers.

When to skip it: If you are starting fresh, Kinesis is serverless and you only pay for throughput. MSK charges you for the cluster whether data is flowing or not.

Amazon AppFlow: SaaS Data Connector

Got data in Salesforce, Marketo, Google Analytics, ServiceNow, Slack, or Zendesk? Amazon AppFlow pulls data from these SaaS services and writes it to S3, Redshift, or Snowflake. It can run on a schedule or respond to events, and it handles filtering, masking, and validation along the way.

This is huge for data engineers. Instead of building custom API integrations for every SaaS product your company uses, AppFlow gives you a point-and-click setup to get that data flowing into your lake.

AWS Transfer Family: Old-School File Transfers

Some organizations still exchange data via FTP and SFTP. It is not glamorous, but it works and it is everywhere. AWS Transfer Family gives you a managed FTP/SFTP endpoint that writes directly to S3.

A real estate company receiving MLS listing files via SFTP, for example, can migrate to Transfer Family with almost zero change on the sender’s side. The files just land in S3 instead of an on-premises server, and your data pipeline takes it from there.

AWS DataSync: On-Premises to Cloud

AWS DataSync handles high-performance data transfers from on-premises storage (NFS file shares, SMB shares, S3-compatible object storage) into AWS. Perfect for syncing end-of-day transaction files from a data center or moving large amounts of historical data into your S3 data lake over a network connection.

The AWS Snow Family: When the Internet Is Not Enough

Sometimes your dataset is so enormous that sending it over the network would take months or years. That is when AWS will literally ship you a device:

  • Snowcone – 4.5 pounds, 8 TB of storage. Fits in a backpack.
  • Snowball Edge – about 50 pounds, 80 TB of storage. For serious data moves.
  • Snowmobile – a 45-foot shipping container pulled by a semi-truck. Up to 100 petabytes. Yes, really.

You load your data onto the device, ship it back to AWS, and they transfer it to S3. All encrypted at rest.
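A quick back-of-envelope calculation shows why shipping hardware wins at Snowmobile scale. Assume a fully saturated 1 Gbps link with no protocol overhead (both generous assumptions):

```python
# Time to move 100 PB over a dedicated 1 Gbps link, ignoring overhead.
petabytes = 100
bits = petabytes * 10**15 * 8          # total bits to move
link_bps = 1 * 10**9                   # 1 Gbps
seconds = bits / link_bps
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")            # roughly 25 years
```

A truck on the highway, by comparison, takes weeks. As the old saying goes: never underestimate the bandwidth of a station wagon full of tapes.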

Wrapping Up Part 1

That covers the ingestion side of the AWS data engineering toolkit. You now know how to get data from databases (DMS), real-time streams (Kinesis), SaaS apps (AppFlow), file transfers (Transfer Family), on-premises storage (DataSync), and massive datasets (Snow devices) into your AWS data lake.

In Part 2, we will look at what happens after the data arrives: transforming it with Lambda and Glue, orchestrating pipelines with Step Functions and Airflow, querying with Athena, warehousing with Redshift, and visualizing with QuickSight. That is where the real fun starts.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


