Data Engineering with AWS Chapter 6 Part 2: Ingesting Streaming Data

This is post 10 in my Data Engineering with AWS retelling series.

Part 1 covered batch ingestion – pulling data from databases into S3 on a schedule. But not all data waits politely for a nightly load. IoT sensors, vehicle telemetry, live gameplay events, social media mentions – this data streams in continuously and often needs to be processed in near-real-time.

Chapter 6 Part 2 is about catching that stream and landing it in your data lake.

Why Streaming Matters

The volume of real-time data has exploded. Boeing’s Airplane Health Management system collects in-flight data from aircraft and relays it to ground systems in real time so maintenance crews can act immediately. BMW ingests telemetry from millions of vehicles, processing terabytes of anonymous data every day. Mobile games track gameplay events as they happen to detect cheating, trigger rewards, or personalize experiences.

If your pipeline only runs nightly batch loads, you are 24 hours behind reality. For some use cases, that is fine. For others, it is a dealbreaker. The key question for each data source is: how quickly does the business need insights from this data?

Amazon Kinesis vs Amazon MSK

AWS offers two primary services for streaming ingestion: Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (MSK). Both are pub-sub systems where producers write messages and consumers read them. Both scale to millions of messages per second.

But they are very different under the hood.

Serverless vs Managed

Kinesis is serverless. You never think about servers. With Kinesis Data Streams, you pick the number of shards (throughput units) and AWS handles everything else. With Kinesis Data Firehose, you do not even pick shards – it auto-scales based on incoming traffic.
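For provisioned streams, shard math is straightforward: each shard accepts roughly 1 MB/s or 1,000 records/s of writes, whichever limit is hit first. A quick sizing sketch — the limits are AWS-documented defaults, and the helper function is mine, not part of any SDK:

```python
import math

# Per-shard write limits for Kinesis Data Streams (provisioned mode):
# 1 MB/s or 1,000 records/s, whichever is reached first.
SHARD_MB_PER_SEC = 1
SHARD_RECORDS_PER_SEC = 1000

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Return the number of shards required for the expected write load."""
    by_throughput = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    by_record_rate = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_throughput, by_record_rate, 1)

# e.g. 3.5 MB/s of small events arriving at 12,000 records/s:
# the record rate, not the byte rate, dictates the shard count here.
print(shards_needed(3.5, 12000))  # 12
```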

MSK is managed. AWS runs the infrastructure, but you still choose instance types, configure VPC networking, pick a Kafka version, and tune Kafka settings. More control, more knobs to turn, more things that can go wrong.

If your team already knows Kafka and needs fine-grained control, MSK makes sense. If you are new to streaming or want the fastest path to production, Kinesis wins on simplicity.

Open Source vs AWS Native

MSK runs Apache Kafka, which is open source with a massive ecosystem. There are hundreds of connectors for different sources and sinks – PostgreSQL, Elasticsearch, S3, you name it. The community is huge and actively contributing.

Kinesis is proprietary AWS software. It integrates tightly with S3, Redshift, Elasticsearch, and a handful of third-party services like Splunk and Datadog through Firehose. The integration list is shorter but the integrations that exist are deep and well-tested.

Check which integrations your use case actually needs before deciding. If you need to connect to something niche, Kafka’s connector ecosystem is hard to beat.

At-Least-Once vs Exactly-Once

This is a big one for data quality.

Kinesis guarantees at-least-once delivery. Every message reaches a consumer, but some messages might be delivered more than once. Your application needs to handle duplicates. AWS provides guidance on this in the Kinesis documentation under “Handling Duplicate Records.”

Kafka (MSK) has supported exactly-once delivery since version 0.11. In Kafka Streams you enable it with processing.guarantee=exactly_once; outside Streams, the same guarantee is built from idempotent producers and transactions. Each message is processed exactly one time. No duplicates.

If your use case cannot tolerate duplicates – financial transactions, for example – MSK with exactly-once semantics is the safer bet. If occasional duplicates are manageable (and you build deduplication into your processing layer), Kinesis works fine.
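One common way to make a consumer tolerate at-least-once delivery is idempotent processing: track identifiers of records you have already handled (Kinesis sequence numbers work well) and skip repeats. A minimal in-memory sketch — a real pipeline would keep the seen set in a durable store such as DynamoDB, not process memory:

```python
def process_once(records, handler, seen=None):
    """Apply handler to each record, skipping duplicates by sequence number.

    `seen` holds sequence numbers already processed; in production this
    would live in a durable store, not a local set.
    """
    seen = set() if seen is None else seen
    results = []
    for record in records:
        seq = record["sequence_number"]
        if seq in seen:
            continue  # duplicate delivery -- already handled, skip it
        seen.add(seq)
        results.append(handler(record))
    return results

# A redelivered record (same sequence number) is processed only once.
records = [
    {"sequence_number": "101", "data": "rent"},
    {"sequence_number": "102", "data": "buy"},
    {"sequence_number": "101", "data": "rent"},  # duplicate delivery
]
print(process_once(records, lambda r: r["data"]))  # ['rent', 'buy']
```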

Specialized Sub-Services

Kinesis has something MSK does not: purpose-built sub-services for specific use cases.

  • Kinesis Data Streams – the core pub-sub engine, comparable to Kafka topics.
  • Kinesis Data Firehose – a delivery pipeline that automatically writes streaming data to S3, Redshift, or Elasticsearch with zero code. You configure it and it just works.
  • Kinesis Video Streams – designed specifically for streaming audio and video data. If you are ingesting camera feeds or audio streams, this simplifies things enormously.
  • Kinesis Data Analytics – run SQL queries or Apache Flink applications against your stream in real time.

Kafka is one powerful general-purpose engine. Kinesis is a family of specialized tools. Sometimes a specialized tool is exactly what you need.

How to Decide

Start with Kinesis. Evaluate whether one of its sub-services meets your requirements. If it does, you get faster setup, less maintenance, and tight AWS integration.

Consider MSK if you need:

  • Exactly-once message delivery
  • Fine-tuned performance control
  • Integration with third-party products not available in Kinesis
  • An existing team with Kafka expertise

Hands-On: Streaming Data with Kinesis Firehose

The chapter walks through a practical exercise that connects Kinesis Data Firehose to S3. You generate fake streaming data using the Amazon Kinesis Data Generator (KDG), an open source tool from AWS.

Setting Up Firehose

  1. Create a Firehose delivery stream with source set to “Direct PUT” and destination set to Amazon S3.
  2. Configure the S3 prefix to organize data by year and month: streaming/!{timestamp:yyyy/MM/}.
  3. Set the error prefix to !{firehose:error-output-type}/!{timestamp:yyyy/MM/}.
  4. Configure buffering – 1 MB buffer size and 60 second buffer interval. Whichever threshold is hit first triggers a write to S3.
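The same setup can also be scripted. Here is a sketch of the destination configuration as you might pass it to boto3's create_delivery_stream call — the bucket ARN, role ARN, and stream name are placeholders, and the API call itself is left commented out since it requires AWS credentials:

```python
# Sketch of the Firehose configuration from the steps above.
# The ARNs below are placeholders for your own account and resources.
s3_destination = {
    "BucketARN": "arn:aws:s3:::my-data-lake-bucket",        # placeholder
    "RoleARN": "arn:aws:iam::123456789012:role/firehose",   # placeholder
    "Prefix": "streaming/!{timestamp:yyyy/MM/}",
    "ErrorOutputPrefix": "!{firehose:error-output-type}/!{timestamp:yyyy/MM/}",
    "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
}

# With boto3, this config would be passed along these lines:
# import boto3
# firehose = boto3.client("firehose")
# firehose.create_delivery_stream(
#     DeliveryStreamName="sakila-streaming",      # placeholder name
#     DeliveryStreamType="DirectPut",
#     ExtendedS3DestinationConfiguration=s3_destination,
# )
print(s3_destination["BufferingHints"])
```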

The buffer settings matter. With a 128 MB buffer and 900 second interval (the maximums), data could sit in memory for up to 15 minutes before landing in S3. For the exercise, the smaller settings mean data appears in S3 within a minute.
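The size-or-time trigger is easy to picture as a predicate: flush to S3 whenever either threshold is crossed. A toy illustration of the logic (not Firehose's actual implementation):

```python
def should_flush(buffered_bytes, seconds_since_flush,
                 size_limit_bytes=1 * 1024 * 1024, interval_seconds=60):
    """Firehose-style trigger: flush when EITHER threshold is reached."""
    return (buffered_bytes >= size_limit_bytes
            or seconds_since_flush >= interval_seconds)

# A burst fills the 1 MB buffer in 5 seconds -> size triggers the write.
print(should_flush(1_200_000, 5))    # True
# A trickle: only 10 KB after 60 seconds -> time triggers the write.
print(should_flush(10_000, 60))      # True
# 500 KB after 30 seconds -> neither threshold hit, keep buffering.
print(should_flush(500_000, 30))     # False
```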

Generating Test Data with KDG

The Kinesis Data Generator is a browser-based tool hosted on GitHub. It connects to your AWS account through Amazon Cognito. You deploy a CloudFormation template that creates the Cognito user, then log into KDG with those credentials.

The exercise simulates a streaming data feed from movie distribution partners. Remember the Sakila database from Part 1? That fictional DVD rental company has gone digital. Now they receive real-time data when their classic movies are rented, purchased, or previewed on streaming platforms.

Here is the KDG template that generates the fake events:

{
    "timestamp": "{{date.now}}",
    "eventType": "{{random.weightedArrayElement(
      {
        "weights": [0.3, 0.1, 0.6],
        "data": ["rent", "buy", "trailer"]
      }
    )}}",
    "film_id": {{random.number({"min":1, "max":1000})}},
    "distributor": "{{random.arrayElement(
        ["amazon prime", "google play", "apple itunes",
         "vudo", "fandango now", "microsoft", "youtube"]
    )}}",
    "platform": "{{random.arrayElement(
        ["ios", "android", "xbox", "playstation",
         "smart tv", "other"]
    )}}",
    "state": "{{address.state}}"
}

Notice the weighted randomness: 60% of events are trailer views, 30% are rentals, and only 10% are purchases. That is realistic. Most people browse before buying.

You set the generator to 10 records per second, let it run for 5 to 10 minutes, and stop it. That produces enough data for the exercises in future chapters.
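The KDG template maps almost one-to-one onto Python's random.choices, which supports the same kind of weighted selection. A local sketch that produces equivalently shaped events — the fixed state value is a simplification of KDG's address.state helper:

```python
import random
from datetime import datetime, timezone

EVENT_TYPES = ["rent", "buy", "trailer"]
WEIGHTS = [0.3, 0.1, 0.6]  # same weights as the KDG template
DISTRIBUTORS = ["amazon prime", "google play", "apple itunes",
                "vudu", "fandango now", "microsoft", "youtube"]
PLATFORMS = ["ios", "android", "xbox", "playstation", "smart tv", "other"]

def fake_event():
    """Generate one event shaped like the KDG template output."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "eventType": random.choices(EVENT_TYPES, weights=WEIGHTS)[0],
        "film_id": random.randint(1, 1000),
        "distributor": random.choice(DISTRIBUTORS),
        "platform": random.choice(PLATFORMS),
        "state": "California",  # KDG uses address.state; fixed for brevity
    }

events = [fake_event() for _ in range(1000)]
trailer_share = sum(e["eventType"] == "trailer" for e in events) / len(events)
print(trailer_share)  # hovers around 0.6, matching the template weights
```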

Cataloging and Querying

After the data lands in S3, you run a Glue Crawler against the streaming/ prefix. The crawler infers the schema from the JSON files, creates a table in the Glue Data Catalog, and detects the year/month partitions from the S3 prefix structure. (Because the prefix is not in Hive-style key=value form, the crawler gives the partition columns generic names such as partition_0 and partition_1.)

Then you query with Athena:

SELECT * FROM streaming LIMIT 20;

The results show the generated events with all the fields from the KDG template, plus the partition columns the crawler detected.

Batch vs Streaming: It Is Not Either-Or

Most real-world pipelines use both. Your customer database gets batch-loaded nightly with DMS. Your web clickstream data flows in real time through Kinesis. Your weather data arrives as a daily CSV file drop. Each source gets the ingestion method that fits its velocity.

The architecture you sketched in Chapter 5 probably has lines from both batch and streaming sources feeding into the same data lake. That is normal. The data lake does not care how data arrived – it is all files in S3 at the end of the day. The important thing is that your ingestion strategy matches the velocity and value of each data source.

Key Takeaway

Streaming ingestion is simpler than it sounds. Kinesis Data Firehose in particular is almost comically easy to set up – pick a source, pick a destination, set your buffer sizes, and it handles everything else. The hard part is not the plumbing. It is understanding your data well enough to know whether streaming is worth the added complexity versus a batch approach.

Not everything needs to be real-time. But when it does, AWS gives you solid options.

Next up, Chapter 7 covers transforming all this ingested data to make it actually useful for analytics.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


