Data Engineering with AWS Chapter 6 Part 1: Ingesting Batch Data

This is post 9 in my Data Engineering with AWS retelling series.

You have your whiteboard architecture from Chapter 5. You know who your data consumers are and what they need. Now it is time to actually move data. Chapter 6 covers data ingestion – getting data from wherever it lives into your AWS data lake. This first part focuses on batch ingestion from databases and files. Part 2 covers streaming.

The Five Vs of Data

Before you pick an ingestion tool, you need to understand your data. Eagar frames this around the five Vs:

  • Variety – what format is the data in? Structured tables, semi-structured JSON, unstructured images?
  • Volume – how big is the dataset? Megabytes, terabytes, petabytes?
  • Velocity – how fast is the data generated? Once a day or thousands of records per second?
  • Veracity – how trustworthy is the data? Are there gaps, errors, missing fields?
  • Value – does this data actually help the business? If not, do not bother ingesting it.

The last one is easy to forget. You could build a beautiful pipeline that moves terabytes every night, but if nobody uses the output, you wasted time and money. Always start with value.

Know Your Data Types

The chapter breaks data variety into three categories. Structured data lives in rows and columns with a fixed schema – databases, CSV files, spreadsheets. Semi-structured data uses formats like JSON and XML where records can have different fields (jeans might have fit_type and size while batteries do not). Unstructured data covers everything else: PDFs, images, videos, audio. You cannot query a photo with SQL, but services like Amazon Rekognition and Amazon Comprehend can extract structured metadata from unstructured sources.
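To make the jeans-versus-batteries point concrete, here is a minimal Python sketch of what "different fields per record" looks like when you land semi-structured data into a fixed-schema table. The product records are made up for illustration:

```python
# Hypothetical product records: jeans carry fit_type and size,
# while batteries carry pack_count instead.
products = [
    {"sku": "JN-101", "category": "jeans", "fit_type": "slim", "size": "32x32"},
    {"sku": "BT-220", "category": "batteries", "pack_count": 4},
]

def to_fixed_schema(records):
    """Flatten semi-structured records into one fixed schema (the union
    of all keys seen), filling missing fields with None, roughly what
    happens when varied JSON is landed into a columnar table."""
    columns = sorted({key for record in records for key in record})
    return [{col: record.get(col) for col in columns} for record in records]

rows = to_fixed_schema(products)
```

Every output row now has the same columns, with None where a record never had the field, which is exactly the schema-reconciliation problem structured stores avoid by design.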

For each data source, sit down with the data owner. Ask about format, historical volume, growth rate, update frequency, and business value. These answers drive every tool and infrastructure decision.

Ingesting from Relational Databases

Most organizations have data locked up in relational databases. MySQL, PostgreSQL, SQL Server, Oracle – the usual suspects. The goal is to get that data into an S3-based data lake where it can be combined with other sources and analyzed at scale.

AWS Database Migration Service (DMS)

DMS is the primary AWS tool for database-to-S3 ingestion. It handles both one-time historical loads and ongoing replication using Change Data Capture (CDC).

Here is how it works. You spin up a replication instance (essentially a managed EC2 instance). That instance connects to your source database, reads data, formats it for the target, and writes it to S3. DMS is a managed service but not serverless – you are paying for that instance the whole time it runs.

For ongoing replication, DMS reads the database’s transaction log files to detect inserts, updates, and deletes. This is CDC. It puts less load on the source database than running full queries repeatedly.
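A toy model can show why log-based CDC is gentler than repeated full queries: the reader resumes from a checkpoint position in the log instead of rescanning the table. The log positions and table name below are invented for illustration, not DMS's actual log format:

```python
# A toy transaction log: each entry has a monotonically increasing
# log position (lsn) plus the operation a CDC reader would see.
transaction_log = [
    {"lsn": 101, "op": "insert", "table": "film", "pk": 1},
    {"lsn": 102, "op": "update", "table": "film", "pk": 1},
    {"lsn": 103, "op": "delete", "table": "film", "pk": 7},
    {"lsn": 104, "op": "insert", "table": "film", "pk": 8},
]

def changes_since(log, checkpoint_lsn):
    """Return only the entries written after the last checkpoint,
    the way a log-based CDC reader resumes. No table scan needed."""
    return [entry for entry in log if entry["lsn"] > checkpoint_lsn]

new_changes = changes_since(transaction_log, checkpoint_lsn=102)
```

Only the delete and the new insert come back; everything already replicated is skipped without touching the source tables.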

One thing to watch out for: DMS has specific requirements for each database engine. For MySQL CDC, you need binary logging enabled with specific settings. For SQL Server 2008, certain audit configurations will break DMS entirely. Always review the DMS documentation with your database admin team before committing to this approach.

AWS Glue for Database Ingestion

Glue can also connect to databases via JDBC and pull data into S3. It is a good fit for two scenarios:

Full table loads. Glue connects to the database, runs a SELECT *, reads everything into Spark memory, and writes it out to S3 in Parquet format.

Incremental loads with bookmarks. Glue tracks which records it has already processed using a bookmark key – a column that always increases in value, like a transaction ID. On the first run, it loads everything. On the next run, it only loads records with a higher ID than the last bookmark. This works great for append-only tables like audit logs. It does not work for tables where rows get updated or deleted.

Lake Formation blueprints. Not a third scenario, but a shortcut: if you want to skip the manual setup for either of the above, AWS Lake Formation has ingestion blueprints that auto-create the Glue job, Glue crawlers, and Glue workflow for you with a few clicks.
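The bookmark behavior is worth sketching in a few lines of Python. This is a simulation of the mechanism, not Glue's actual implementation, and the audit-log rows are made up:

```python
# An append-only audit table keyed by an ever-increasing
# transaction_id, the kind of column Glue can use as a bookmark key.
audit_log = [
    {"transaction_id": 1, "action": "login"},
    {"transaction_id": 2, "action": "purchase"},
    {"transaction_id": 3, "action": "logout"},
]

def incremental_load(table, bookmark):
    """Load only rows past the bookmark, then advance it -- a rough
    model of how a Glue job bookmark behaves between runs."""
    new_rows = [row for row in table if row["transaction_id"] > bookmark]
    new_bookmark = max((r["transaction_id"] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark

# First run: bookmark starts at 0, so everything loads.
batch1, bookmark = incremental_load(audit_log, 0)

# New rows arrive; the next run picks up only those.
audit_log.append({"transaction_id": 4, "action": "login"})
batch2, bookmark = incremental_load(audit_log, bookmark)
```

Notice why this breaks on mutable tables: if an already-loaded row is updated in place, its transaction_id never moves past the bookmark, so the change is silently missed.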

Other Approaches

  • Amazon EMR with Apache Spark or Apache Sqoop can also connect to databases via JDBC. More flexibility, more operational overhead.
  • RDS Snapshot Export is the simplest option if your database runs on Amazon RDS (MariaDB, MySQL, or PostgreSQL). RDS exports a snapshot directly to S3 in Parquet format. Zero load on your production database.
  • Third-party tools like Qlik Replicate (formerly Attunity) offer advanced CDC capabilities across many source and target types, but they come at a premium price.
  • Native database tools like Oracle GoldenGate can generate CDC data as a Kafka stream. Powerful but often separately licensed.

How to Choose the Right Ingestion Tool

Eagar lays out three main factors for deciding which approach fits your use case.

Database Size

For large databases (tens of gigabytes or more), a full nightly load is impractical. It takes too long and hammers the source system. The better approach: do an initial full load, then use DMS with CDC to sync ongoing changes.

For very large databases, DMS can work with an Amazon Snowball device. You load data onto the physical device in your data center, ship it back, and AWS loads it into S3. DMS captures all CDC changes while the device is in transit, so you do not lose anything.

For small databases, a scheduled full load with Glue or RDS snapshot export is often the simplest and cheapest option.

Database Load

If your source database has heavy production traffic around the clock, you want to minimize additional strain. Use DMS with CDC against the transaction logs (lower impact than full queries). Better yet, run the initial full load against a read replica instead of the primary.

If the database runs on RDS, the snapshot export approach puts zero load on the production instance since it works from a snapshot, not the live database.

Ingestion Frequency

If the business needs data as soon as it changes, DMS with CDC is the right call. Remember though – CDC data tells you what changed (inserts, updates, deletes). You still need an ETL process to apply those changes to your existing dataset. Chapter 7 covers that.

If the business is fine with a nightly refresh, a scheduled full load or a nightly CDC apply process works well and is simpler to manage.
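As a preview of that apply step, here is a minimal Python sketch that folds a batch of CDC changes into an existing keyed dataset. The sample rows are illustrative (borrowed from Sakila film titles), and real pipelines would do this with Spark or SQL merge logic rather than a dict:

```python
# Existing dataset keyed by primary key, plus a batch of CDC changes.
current = {1: {"title": "ACADEMY DINOSAUR"}, 2: {"title": "ACE GOLDFINGER"}}
cdc_batch = [
    {"op": "update", "pk": 1, "row": {"title": "ACADEMY DINOSAUR (4K)"}},
    {"op": "delete", "pk": 2, "row": None},
    {"op": "insert", "pk": 3, "row": {"title": "ADAPTATION HOLES"}},
]

def apply_cdc(dataset, changes):
    """Apply inserts, updates, and deletes in order to produce the
    refreshed dataset -- the 'apply' step the chapter defers to ETL."""
    result = dict(dataset)
    for change in changes:
        if change["op"] == "delete":
            result.pop(change["pk"], None)
        else:  # insert and update both become an upsert
            result[change["pk"]] = change["row"]
    return result

refreshed = apply_cdc(current, cdc_batch)
```

The point is that CDC output is a change stream, not a table: until something replays those operations against the existing data, your lake holds a pile of deltas rather than the current state.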

Technical Compatibility

This one bites people. A data engineering team picks DMS based on requirements, starts the project, and then discovers the database admin team will not enable binary logging. Or the source database version has a known incompatibility.

Always review tool requirements with the database owners and admin team before finalizing your approach. The DMS documentation lists every requirement and limitation for each supported engine. Read it together with the people who manage the source system.
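The factors above can be condensed into a rough decision helper. The thresholds below are my own illustrative numbers, not guidance from the book or AWS, and real decisions should also weigh the compatibility review just described:

```python
def suggest_ingestion_approach(size_gb, on_rds, needs_near_real_time):
    """Rough decision helper encoding the chapter's heuristics.
    The 50 GB threshold is illustrative, not an official cutoff."""
    if needs_near_real_time:
        return "DMS with CDC (initial full load, then ongoing replication)"
    if on_rds and size_gb < 50:
        return "RDS snapshot export to S3 (zero load on the source)"
    if size_gb >= 50:
        return "Initial full load, then scheduled DMS CDC syncs"
    return "Scheduled full load with Glue"

choice = suggest_ingestion_approach(size_gb=10, on_rds=True,
                                    needs_near_real_time=False)
```

For a small RDS database with nightly refresh needs, this lands on snapshot export, the same answer the chapter gives.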

Hands-On: DMS from MySQL to S3

The chapter includes a step-by-step exercise for replicating a MySQL database to S3 using DMS. Here is what you do:

  1. Create a MySQL RDS instance using the free tier (db.t2.micro).
  2. Load the Sakila demo database using an EC2 instance with a user data script that installs MariaDB, downloads the Sakila schema and data files, and runs them against your MySQL instance.
  3. Create an IAM role granting DMS permission to write to your S3 landing zone bucket.
  4. Set up DMS – create a replication instance (dms.t3.micro), a source endpoint pointing to your MySQL database, and a target endpoint pointing to your S3 bucket with the AddColumnName setting enabled.
  5. Create and run a migration task configured for “Migrate existing data” (full load).
  6. Query with Athena – after DMS loads the CSV files to S3 and your Lambda function converts them to Parquet, query the data using SELECT * FROM film LIMIT 20.
  7. Clean up – delete the DMS task and replication instance to avoid ongoing charges.

The key command for the EC2 user data script:

#!/bin/bash
yum install -y mariadb
curl https://downloads.mysql.com/docs/sakila-db.zip -o sakila.zip
unzip sakila.zip
cd sakila-db
mysql --host=<HOST> --user=admin --password=<PASSWORD> -f < sakila-schema.sql
mysql --host=<HOST> --user=admin --password=<PASSWORD> -f < sakila-data.sql

Replace <HOST> with your RDS endpoint and <PASSWORD> with your database password.

Key Takeaway

Batch ingestion is not glamorous, but it is the foundation of every data pipeline. The most important skill is not knowing how to configure DMS or write a Glue job. It is knowing which tool to pick for which situation. A small database on RDS with nightly refresh needs? Snapshot export. A large production database with real-time replication needs? DMS with CDC. An append-only audit table? Glue with bookmarks.

Pick the simplest tool that meets your requirements. You can always upgrade later.

In Part 2, we tackle streaming ingestion with Amazon Kinesis and MSK.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


