Building a Kafka Cluster - Study Notes from Data Engineering with Python Ch 12

Up to this point in the book, everything has been batch processing. You query a database, get a full dataset, transform it, load it somewhere. The data sits still while you work on it.

Chapter 12 of Data Engineering with Python by Paul Crickard changes direction. Now the data is moving. It is streaming in real time, and it might never stop. Welcome to Apache Kafka.

This is a short, hands-on chapter. No theory essays. You download Kafka, configure it, build a three-node cluster, and test it. By the end, you have a working streaming infrastructure on your local machine.

What Is Kafka and Why Do You Need ZooKeeper?

Kafka is a tool for building real-time data streams. It was originally developed at LinkedIn and is now an Apache project. The idea is simple: producers send messages to topics, and consumers read messages from those topics. The data flows continuously.

But here is the thing. Kafka does not run alone. It needs another application called ZooKeeper to manage the cluster. ZooKeeper handles coordination between nodes, tracks which brokers are alive, and elects leaders when something goes down. You cannot run Kafka without it.

The good news is that Kafka ships with ZooKeeper scripts included. You do not need to install ZooKeeper separately.

Building a Three-Node Cluster on One Machine

Most tutorials show you how to run a single Kafka node. That is fine for learning the basics. But it tells you nothing about how Kafka works in production, where you always run multiple nodes for fault tolerance.

Crickard takes a different approach. He builds a three-node cluster on a single machine. Each node gets its own folder, its own config, and its own ports. Each folder simulates a separate server. If you wanted to run this on actual separate servers, the only change would be swapping localhost for real IP addresses.

The Folder Structure

You start by downloading Kafka and extracting it. Then you copy the extracted folder three times to create three independent instances. Each instance also gets its own log directory. On top of that, you create a data directory with three ZooKeeper subdirectories inside it.

Each ZooKeeper instance needs a unique ID. You create a file called myid in each ZooKeeper data folder containing just a number: 1, 2, or 3.

The result is a clean separation. Three Kafka folders, three log folders, three ZooKeeper data folders. Everything is isolated.
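The layout described above can be sketched with a few commands. All names here are illustrative (the book's actual folder names may differ); the point is the one-to-one pairing of Kafka copies, log directories, and ZooKeeper data directories:

```shell
# Three copies of the extracted Kafka folder, one per simulated server
mkdir -p kafka_1 kafka_2 kafka_3

# One log directory per broker
mkdir -p logs_1 logs_2 logs_3

# One data directory per ZooKeeper instance
mkdir -p data/zookeeper_1 data/zookeeper_2 data/zookeeper_3

# Each ZooKeeper instance gets a unique ID in a file named myid
echo 1 > data/zookeeper_1/myid
echo 2 > data/zookeeper_2/myid
echo 3 > data/zookeeper_3/myid
```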

Configuring ZooKeeper

Each Kafka folder has a config directory with a zookeeper.properties file. The key settings you need to change for each instance are:

  • dataDir pointing to the correct ZooKeeper data folder (zookeeper_1, zookeeper_2, or zookeeper_3)
  • clientPort set to a unique port for each instance (2181, 2182, 2183)
  • Server list telling each instance where all three ZooKeeper nodes live

You also add timing properties: tickTime for heartbeat intervals, initLimit for how long followers can take to connect to a leader, and syncLimit for how far out of sync a follower can be before it gets dropped.

The server list is the same across all three config files. Each entry includes the server ID, a hostname, and two ports (one for follower-to-leader communication, one for leader election).
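Putting those settings together, a minimal sketch of the first instance's zookeeper.properties might look like this. The paths and the two server-entry ports (follower-to-leader and election) are illustrative assumptions, not values taken from the book:

```properties
# zookeeper.properties for instance 1 (paths and ports are illustrative)
dataDir=/home/user/data/zookeeper_1
clientPort=2181
tickTime=2000
initLimit=5
syncLimit=2
# Same three lines in all three config files: id=host:followerPort:electionPort
server.1=localhost:2666:3666
server.2=localhost:2667:3667
server.3=localhost:2668:3668
```

The other two instances differ only in dataDir and clientPort; the server list stays identical.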

Configuring Kafka

Each Kafka instance has a server.properties file in the same config directory. The key settings:

  • broker.id set to a unique integer (1, 2, or 3)
  • listeners set to a unique port (9092, 9093, 9094)
  • log.dirs pointing to the correct log folder
  • zookeeper.connect listing all three ZooKeeper instances

The ZooKeeper connection string is identical across all three Kafka configs. Every Kafka broker needs to know about every ZooKeeper node.
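As a sketch, broker 1's server.properties might contain the following (paths are illustrative; only broker.id, the listener port, and log.dirs change between brokers):

```properties
# server.properties for broker 1 (paths are illustrative)
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/home/user/logs_1
# Identical in all three broker configs
zookeeper.connect=localhost:2181,localhost:2182,localhost:2183
```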

Starting Everything Up

Here is where it gets a little intense. You need six terminal windows. Three for ZooKeeper, three for Kafka.

In the first three terminals, you navigate to each Kafka folder and run the ZooKeeper startup script pointing to the ZooKeeper config file. When all three start, you will see a wall of text as the nodes discover each other and hold an election to pick a leader. Once the election finishes, things calm down.

Then in the remaining three terminals, you do the same for Kafka. Each terminal runs the Kafka startup script with the Kafka config file. When the Kafka brokers connect to ZooKeeper, you should see a log line saying “Connected.”
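The startup sequence looks roughly like this, repeated once per instance in its own terminal (folder names are the illustrative ones from above; the scripts themselves ship with Kafka):

```shell
# Terminals 1-3: start each ZooKeeper instance
cd ~/kafka_1
bin/zookeeper-server-start.sh config/zookeeper.properties

# Terminals 4-6: start each Kafka broker, after ZooKeeper is up
cd ~/kafka_1
bin/kafka-server-start.sh config/server.properties
```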

Six terminals, six processes, two clusters. Not the most elegant setup, but it works. Crickard notes that Docker Compose would be a cleaner way to manage this, but containers are outside the scope of this book.

Testing the Cluster

Kafka ships with command-line scripts for basic operations. You can create topics, produce messages, and consume messages without writing any code.

Creating a Topic

You run the topic creation script and pass it the ZooKeeper connection string, a replication factor of 2 (meaning each message is stored on two brokers), one partition, and a topic name. If it works, you get a single line confirming the topic was created.

You can verify by running the same script with a list flag to see all topics in the cluster.
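A sketch of both commands, using a hypothetical topic name (the flags are the standard kafka-topics.sh options for ZooKeeper-era Kafka):

```shell
# Create a topic replicated across two of the three brokers
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181,localhost:2182,localhost:2183 \
  --replication-factor 2 --partitions 1 --topic mytopic

# List all topics in the cluster to verify
bin/kafka-topics.sh --list \
  --zookeeper localhost:2181,localhost:2182,localhost:2183
```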

Sending and Receiving Messages

To send messages, you start a console producer. You give it the list of Kafka broker addresses and the topic name. It opens a prompt where you can type messages.

To read messages, you start a console consumer in a separate terminal. You give it the broker addresses, the topic name, and a flag to read from the beginning. Every message the producer sends shows up in the consumer window after a short delay.
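A sketch of the producer/consumer pair, reusing the hypothetical topic name from above:

```shell
# Terminal A: console producer (type a message at each > prompt)
bin/kafka-console-producer.sh \
  --broker-list localhost:9092,localhost:9093,localhost:9094 \
  --topic mytopic

# Terminal B: console consumer, replaying the topic from the start
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092,localhost:9093,localhost:9094 \
  --topic mytopic --from-beginning
```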

That is the test. If the consumer reads what the producer sent, your cluster works.

Key Takeaways

  • Kafka handles real-time streaming data. Unlike batch processing where you work with complete datasets, streaming data may be infinite and incomplete.
  • ZooKeeper is required. It manages cluster coordination, node discovery, and leader election. Kafka bundles ZooKeeper scripts so you do not need a separate install.
  • A three-node cluster gives you fault tolerance. A single-node setup is fine for tutorials, but production Kafka always runs as a cluster.
  • Producers send messages to topics. Consumers read from topics. That is the core model. Everything else builds on that.
  • Replication factor determines how many copies of each message exist. A replication factor of 2 means two brokers hold each message. If one broker goes down, the data is still available.

My Take

This is a setup chapter, and it reads like one. There is not much conceptual depth here. You download, configure, start, and test. That is the whole chapter.

But I actually appreciate that Crickard shows the multi-node setup instead of the single-node shortcut. Most Kafka tutorials start with a single broker, and you learn nothing about how the cluster actually works. Running three nodes, even on the same machine, forces you to think about ports, broker IDs, replication, and coordination. These are the things that matter in production.

The six-terminal approach is rough. If you are following along, you will probably want to use Docker Compose or at least a terminal multiplexer like tmux. Managing six separate terminal windows gets old fast.

One thing this chapter does not cover is what happens when a node goes down. It would have been nice to see a test where you kill one ZooKeeper or Kafka instance and verify the cluster keeps running. That would really demonstrate why you build a cluster in the first place.

Crickard says the next chapter will cover Kafka concepts in depth, plus how to use Kafka with NiFi and Python. This chapter just gets the infrastructure in place. Think of it as the foundation for the streaming work that follows.

