Building an End-to-End Big Data Pipeline - Part 3

Batch processing is great for historical reports, but what if you need to know what’s happening right now? In the final part of Chapter 10, Neylson Crepalde shows us how to build a world-class Real-Time Pipeline on Kubernetes.

This is the “final boss” of data engineering projects. We’re connecting four major systems into a single flowing stream of data.

The Source: RDS Postgres

We start with a standard relational database on AWS. To simulate real traffic, we run a Python script that continuously “upserts” fake customer data into a Postgres table.
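A minimal sketch of such a generator, assuming a hypothetical `customers` table with `id`, `name`, `birthdate`, and `updated_at` columns (the book's actual schema and script may differ). Only the SQL and the fake-row builder are shown; the connection loop is sketched in a comment, assuming a Postgres driver such as psycopg2:

```python
import random
import datetime

# Illustrative name pool -- the book may use a library like Faker instead.
FIRST_NAMES = ["Ana", "Bruno", "Carla", "Diego"]

def fake_customer(customer_id: int) -> dict:
    """Generate one fake customer row keyed by id."""
    return {
        "id": customer_id,
        "name": random.choice(FIRST_NAMES),
        "birthdate": datetime.date(1990, 1, 1)
        + datetime.timedelta(days=random.randint(0, 10000)),
        "updated_at": datetime.datetime.now(datetime.timezone.utc),
    }

# An "upsert": insert the row, or update it if the id already exists.
# The updated_at column is what the JDBC connector will later watch.
UPSERT_SQL = """
INSERT INTO customers (id, name, birthdate, updated_at)
VALUES (%(id)s, %(name)s, %(birthdate)s, %(updated_at)s)
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    birthdate = EXCLUDED.birthdate,
    updated_at = EXCLUDED.updated_at;
"""

# With psycopg2 (an assumption; any Postgres driver works), the traffic
# loop would look roughly like:
#
#   conn = psycopg2.connect(host=..., dbname=..., user=..., password=...)
#   with conn, conn.cursor() as cur:
#       while True:
#           cur.execute(UPSERT_SQL, fake_customer(random.randint(1, 100)))
#           conn.commit()
#           time.sleep(1)
```

Reusing a small id range is what makes it an upsert test rather than a pure insert test: the same customers keep getting fresh `updated_at` timestamps.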

The Ingestion: Kafka Connect

Instead of writing custom code to poll the database, we use Kafka Connect with a JDBC Source Connector.

This connector acts like a bridge. It polls the Postgres table and, whenever it finds rows with a newer timestamp, grabs them and publishes each one as a JSON message to a Kafka topic called src-customers. No code required, just a YAML configuration file.
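If Kafka Connect runs under the Strimzi operator (an assumption; the book's exact setup may differ), that configuration can be a `KafkaConnector` custom resource. Every name, endpoint, and credential below is illustrative; note how `topic.prefix: src-` plus the table name `customers` yields the topic `src-customers`:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: jdbc-source-customers
  labels:
    strimzi.io/cluster: my-connect-cluster  # must match your KafkaConnect resource
spec:
  class: io.confluent.connect.jdbc.JdbcSourceConnector
  tasksMax: 1
  config:
    connection.url: jdbc:postgresql://<rds-endpoint>:5432/mydb
    connection.user: postgres
    connection.password: change-me  # in practice, injected from a Kubernetes Secret
    table.whitelist: customers
    mode: timestamp                 # pick up rows with a newer timestamp
    timestamp.column.name: updated_at
    topic.prefix: src-
```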

The Engine: Spark Structured Streaming

Now the data is in Kafka, but it’s still raw. We want to transform it. We deploy a Spark Streaming job on our cluster that:

  1. Subscribes to the src-customers topic.
  2. Parses the JSON payload.
  3. Performs a real-time calculation (like computing a customer’s age from their birthdate).
  4. Re-packages the data into a special JSON format that Elasticsearch understands.
  5. Publishes the transformed data to a new Kafka topic: customers-transformed.

The Destination: Elasticsearch & Kibana

The final step is getting that data into a dashboard. We use another Kafka Connector—the Elasticsearch Sink Connector.

This connector listens to the customers-transformed topic and immediately indexes every message into an Elasticsearch cluster. From there, we open Kibana, create a Data View, and we can literally watch our customer data pop up on charts in real time.
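Under the same Strimzi assumption as before, the sink side is another declarative resource. The cluster URL and names are illustrative; `key.ignore` and `schema.ignore` let the connector index plain JSON documents without a schema registry:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: es-sink-customers
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
  tasksMax: 1
  config:
    topics: customers-transformed
    connection.url: https://elasticsearch-master.elastic.svc:9200
    key.ignore: "true"     # let Elasticsearch assign document IDs
    schema.ignore: "true"  # index the JSON as-is
```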

Why this is the future

Building this on Kubernetes is what makes it feasible. You have separate namespaces for Kafka, Elastic, and Spark. You use Kubernetes Secrets to manage the SSL certificates and database passwords. If your stream gets too heavy, you just scale up your Spark executors or Elasticsearch data nodes.
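The secrets part can be as simple as a standard Kubernetes Secret that the connector and Spark pods reference instead of hardcoding credentials. Names and values here are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-credentials
  namespace: kafka  # illustrative; one namespace per subsystem
type: Opaque
stringData:
  username: postgres
  password: change-me
```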

You’ve now built both a Batch and a Real-Time engine. You have a complete, professional data platform. But there’s one more “frontier” to explore: Generative AI.

Next: Generative AI on Kubernetes
Previous: Building an End-to-End Big Data Pipeline - Part 2

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
