Streaming Data with Apache Kafka - Study Notes from Data Engineering with Python Ch 13
Up to this point in the book, data pipelines have been about moving data that already exists. Query a database, read a file, process it, store it. The data sits still and you go get it.
Chapter 13 of Data Engineering with Python by Paul Crickard changes that. Now the data is moving. It is streaming in, and you need to catch it as it arrives. This chapter covers how Apache Kafka handles streaming data, how to build producers and consumers with both NiFi and Python, and what you need to think about differently when your data never stops flowing.
What Is a Log, Really?
Before talking about Kafka, Crickard starts with something familiar: logs.
If you have written any Python, you have used the logging module. You log debug messages, warnings, errors. Each entry goes into a file in the order it happened. Add a timestamp and you know exactly when each event occurred.
Web servers do the same thing. Databases do it too, for replication and recording transactions. They all look a little different on the surface. But underneath, they share the same structure.
A log is an ordered collection of events that is append-only.
That is it. New records go at the end. Old records stay where they are. Nothing gets removed or reordered. Simple concept. But it is the foundation of how Kafka works.
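The append-only idea is small enough to sketch in a few lines. This is an illustrative toy, not Kafka's actual storage format; the class name and fields are my own.

```python
import time

class AppendOnlyLog:
    """A minimal append-only log: new entries go at the end,
    nothing is ever removed or reordered."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        # Each record is timestamped and added at the end; the
        # position (offset) of an entry never changes afterwards.
        self._entries.append((time.time(), event))
        return len(self._entries) - 1  # offset of the new entry

    def read(self, offset):
        # Old entries stay where they are, so reads by offset are stable.
        return self._entries[offset][1]

log = AppendOnlyLog()
first = log.append("user logged in")
second = log.append("user clicked buy")
```

Everything Kafka does with topics, partitions, and offsets builds on this one guarantee: an entry's position never changes once written.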
How Kafka Uses Logs
Kafka stores data in logs called topics. A topic is similar to a table in a database. When you create a topic (like the dataengineering topic from the previous chapter), Kafka saves it as a log file on disk.
Partitions
A topic can be a single log, but usually it gets split into partitions for horizontal scaling. Each partition is its own log file and can live on a different server.
Here is the important detail: message order is only guaranteed within a single partition, not across the whole topic. If you have a topic with three partitions, messages in partition 1 are in order. Messages in partition 2 are in order. But the overall sequence across all three partitions is not guaranteed.
You can control which partition a message lands in by assigning a key. All messages with the same key go to the same partition. So if you need ordering for a specific customer or transaction type, assign it a key and all those messages end up in the same partition, in order.
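The key-to-partition routing can be sketched as a hash modulo the partition count. Kafka's default partitioner uses murmur2 on the key bytes; `md5` here is just a deterministic stand-in for illustration.

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Hash the key and take it modulo the partition count.
    # (Kafka's default partitioner uses murmur2, not md5.)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always maps to the same partition, so all messages
# for "customer-42" land in one partition, in order.
p1 = partition_for("customer-42")
p2 = partition_for("customer-42")
```

Because the mapping is deterministic, per-key ordering holds even though ordering across the whole topic does not.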
Producers
Producers send data to a topic and partition. They can distribute messages round-robin across partitions, or use keys to target specific ones.
There are three ways a producer can send messages:
- Fire and Forget sends the message and moves on. No waiting for confirmation. Fast, but messages can get lost.
- Synchronous sends the message and waits for Kafka to confirm receipt before moving on. Safer, but slower.
- Asynchronous sends the message with a callback function. You move on immediately, and the callback handles the response when it arrives.
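The three styles can be contrasted with an in-memory stand-in. A real client such as confluent-kafka talks to a broker and delivers acknowledgments later; here "the broker" is a list and acks are immediate, so the sketch only shows the calling patterns, not real delivery semantics.

```python
class StubProducer:
    """In-memory stand-in for a Kafka producer, to contrast the
    three send styles. Not a real client."""

    def __init__(self):
        self.broker = []

    def send_fire_and_forget(self, msg):
        # No ack requested: fast, but the caller never learns of loss.
        self.broker.append(msg)

    def send_sync(self, msg):
        # Caller blocks until the ack comes back: safer, slower.
        self.broker.append(msg)
        return {"status": "ok", "offset": len(self.broker) - 1}

    def send_async(self, msg, callback):
        # Caller moves on; the callback handles the response.
        # (A real client invokes it later, from poll().)
        self.broker.append(msg)
        callback(None, {"offset": len(self.broker) - 1})

p = StubProducer()
p.send_fire_and_forget("a")
ack = p.send_sync("b")
results = []
p.send_async("c", lambda err, meta: results.append(meta["offset"]))
```

The trade-off is throughput versus certainty: fire-and-forget is fastest, synchronous is safest, asynchronous gets most of both at the cost of callback bookkeeping.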
Consumers
Consumers read messages from a topic. They run in a continuous poll loop, waiting for new data.
A consumer can start from the beginning of a topic and read the entire history, then wait for new messages once it catches up. The consumer’s position in the topic is called the offset. Think of it as a bookmark. If a consumer reads five messages, its offset is five. It can always pick up where it left off.
Consumer Groups
Here is where it gets interesting. Say your topic has three partitions and one consumer. That single consumer has to read all three partitions by itself. If data is coming in fast, it falls behind.
The solution is consumer groups. You put multiple consumers in a group, and Kafka distributes the partitions among them. With two consumers and three partitions, one consumer reads two partitions and the other reads one. With three consumers and three partitions, each gets exactly one.
But here is the thing: if you have more consumers than partitions, the extras sit idle. There is no point having four consumers for three partitions. The fourth one has nothing to do.
You can also have multiple consumer groups reading from the same topic. Each group tracks its own offset independently. This is useful when different applications need the same data. One group feeds your dashboard. Another feeds your data warehouse. They each read the full stream without interfering with each other.
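The partition-splitting behavior can be mimicked with a simple round-robin assignment. This is only the distribution the text describes, not Kafka's actual rebalance protocol.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partition assignment within a consumer group.
    Extra consumers beyond the partition count get nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Two consumers, three partitions: one reads two, the other reads one.
two = assign_partitions([0, 1, 2], ["c1", "c2"])

# Four consumers, three partitions: the fourth sits idle.
four = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
```

Separate groups would each get their own full assignment and their own offsets, which is why a dashboard group and a warehouse group can read the same topic without interfering.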
Building Kafka Pipelines with NiFi
Crickard shows how to create both a producer and consumer pipeline in NiFi. If you followed along with the production data pipeline from Chapter 11, the producer reuses that setup.
The NiFi Producer
The producer pipeline takes output from the ReadDataLake processor group and sends each record to a Kafka topic called users (created with three partitions).
The key NiFi processors:
- ControlRate slows down the data flow so it simulates real streaming instead of dumping everything at once. In the example, it lets one flowfile through per minute.
- PublishKafka_2_0 sends data to Kafka. You configure it with your broker addresses, the topic name, and a delivery guarantee. The book uses “Guarantee Replicated Delivery” for reliability.
That is all it takes. Two processors and your NiFi pipeline is producing to Kafka.
The NiFi Consumer
The consumer side is just as straightforward. A ConsumeKafka_2_0 processor connects to the same Kafka cluster, subscribes to the users topic, and starts reading from the earliest offset.
You set a Group ID on the processor. This is the consumer group name. If you add a second ConsumeKafka processor with the same Group ID, they form a consumer group and split the partitions between them.
In the book’s example, two consumers share three partitions. One consumer handles two partitions, the other handles one. You see roughly two flowfiles processed by one consumer for every one processed by the other.
Adding a second consumer group is easy. Drop another ConsumeKafka processor, give it a different Group ID, and it reads the entire topic independently. In the demo, the second consumer group had no ControlRate processor attached, so it consumed the full 17-record history of the topic immediately while the first group was still throttled.
The output port from the consumer pipeline connects to the rest of your data pipeline. Kafka becomes just another data source. Once the data is in NiFi, processing it works exactly the same as if it came from a database or file.
Stream Processing vs. Batch Processing
The processing tools stay the same whether you are working with streams or batches. But there are two things you need to think about differently.
Unbounded Data
Batch data is bounded. Last year’s sales numbers have a beginning and an end. You can see all of it at once, run queries, calculate averages and maximums, validate ranges.
Streaming data is unbounded. A traffic sensor counting cars on a highway never stops producing data. You cannot wait until you have “all the data” because there is no end.
This matters for data pipelines because you cannot run the same kind of analysis on unbounded data. You cannot compute the average of a field that keeps growing. You need a different approach.
Windowing
The answer is windowing. You take a slice of the unbounded stream and treat it as bounded data within a time frame.
There are three types of windows:
Fixed (Tumbling) Windows cover a set time period with no overlap. A one-minute fixed window captures everything from 0:00 to 1:00, then 1:00 to 2:00, and so on. Each record falls into exactly one window.
Sliding Windows overlap. You define a window length (say, one minute) and a slide interval (say, 30 seconds). The first window covers 0:00 to 1:00. The next covers 0:30 to 1:30. Records near the boundary appear in multiple windows. This is useful for rolling averages.
Session Windows are based on activity, not fixed time. A user logs in, interacts with your app, and logs out. That login session is the window. Session boundaries come from the data itself (like a session token), not from the clock.
Which Time Do You Use?
When you are windowing streaming data, you also need to decide what “time” means. There are three options:
- Event Time is when the thing actually happened. A sensor recorded a car at 1:05 PM.
- Ingest Time is when Kafka received the record. Network latency means this could be slightly later than event time.
- Processing Time is when your pipeline reads and processes the record. This could be significantly later if your consumer is behind.
Which one you pick depends on your use case. If you care about when things actually happened in the real world, use event time. If you care about when your system received the data, use ingest time.
Producing and Consuming with Python
NiFi is not the only way to interact with Kafka. You can write producers and consumers directly in Python.
Crickard uses the confluent-kafka library. There are other options like kafka-python and pykafka, but the patterns are the same regardless of which library you choose.
Writing a Python Producer
The producer workflow has a few steps:
- Create the producer by pointing it at your Kafka brokers
- Define a callback function to handle acknowledgments. Each response is either an error or a success message with the topic, partition, and value
- Loop through your data, serialize each record to JSON, and call produce() with the topic name, the encoded data, and your callback
- Call poll() before each produce to pick up any pending acknowledgments
- Flush at the end to make sure everything gets sent
The producer in the book uses Faker to generate fake user records (name, age, street, city, state, zip) and sends ten of them to the users topic. The callback prints the timestamp, topic, partition number, and the full record value for each successful delivery.
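The workflow above can be sketched end to end. The `produce`/`poll`/`flush` call shape mirrors confluent-kafka's `Producer`, but this is an in-memory stand-in so it runs without a cluster: stdlib `random` replaces Faker, a list replaces the broker, and acks are delivered immediately.

```python
import json
import random

FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]  # stand-in for Faker

def make_user():
    return {"name": random.choice(FIRST_NAMES),
            "age": random.randint(18, 80),
            "city": "Springfield"}

class SketchProducer:
    """Mirrors the Producer call pattern; not the real client."""

    def __init__(self):
        self.sent, self._pending = [], []

    def produce(self, topic, value, callback):
        self.sent.append((topic, value))
        self._pending.append((callback, topic, value))

    def poll(self, timeout=0):
        # Deliver any pending acknowledgments to their callbacks.
        while self._pending:
            cb, topic, value = self._pending.pop(0)
            cb(None, {"topic": topic, "value": value})

    def flush(self):
        self.poll()

delivered = []

def on_delivery(err, msg):
    # Each response is either an error or a success with the details.
    if err is None:
        delivered.append(msg["value"])

producer = SketchProducer()
for _ in range(10):
    record = json.dumps(make_user()).encode("utf-8")
    producer.poll(0)                    # pick up earlier acks first
    producer.produce("users", record, callback=on_delivery)
producer.flush()                        # make sure everything is sent
```

Swapping `SketchProducer` for a real `confluent_kafka.Producer` pointed at your brokers keeps the same loop structure.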
Writing a Python Consumer
The consumer is even simpler:
- Create the consumer with your broker addresses, a group ID, and an offset reset policy (earliest means start from the beginning)
- Subscribe to the topic you want to read
- Enter an infinite loop that calls poll() on each iteration. Each poll returns one of three things: nothing (no new messages), an error, or a message
- Decode the message value from bytes to a string and do something with it
- Close the connection when done
The consumer sits in that loop forever, waiting for messages. When it gets one, it decodes the JSON and prints it. If there is nothing yet, it continues polling. The offset is tracked by Kafka, so if you stop and restart the consumer, it picks up where it left off (assuming you use the same group ID).
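The consumer loop can be sketched the same way. The `subscribe`/`poll`/`close` call shape mirrors confluent-kafka's `Consumer`, but the broker is an in-memory list so the sketch runs standalone and the loop exits when the history runs out; a real consumer would keep polling for new data.

```python
import json

class SketchConsumer:
    """Mirrors the Consumer call pattern; not the real client."""

    def __init__(self, messages, group_id, offset_reset="earliest"):
        self._messages = messages
        # "earliest" starts from the beginning of the topic.
        self._offset = 0 if offset_reset == "earliest" else len(messages)
        self.group_id = group_id

    def subscribe(self, topics):
        self.topics = topics

    def poll(self, timeout=1.0):
        if self._offset >= len(self._messages):
            return None                  # nothing new yet
        msg = self._messages[self._offset]
        self._offset += 1                # the bookmark moves forward
        return msg

    def close(self):
        pass

# Two records already sitting in the topic's history.
broker = [json.dumps({"name": "Ada"}).encode("utf-8"),
          json.dumps({"name": "Grace"}).encode("utf-8")]

consumer = SketchConsumer(broker, group_id="python-consumers")
consumer.subscribe(["users"])
seen = []
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break                            # a real consumer would keep polling
    seen.append(json.loads(msg.decode("utf-8")))
consumer.close()
```

In Kafka the offset lives on the broker side, keyed by group ID, which is why restarting a consumer with the same group ID resumes where it left off.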
Key Takeaways
- Topics are logs split into partitions. Message order is guaranteed per partition, not per topic. Use keys to ensure related messages land in the same partition.
- Consumer groups let you scale reads. Match consumers to partitions for best throughput. More consumers than partitions means wasted resources.
- Multiple consumer groups can read the same topic independently. One group for your dashboard, one for your warehouse, one for your alerting system.
- NiFi makes Kafka integration simple. One processor to publish, one to consume. Configuration is just broker addresses, topic name, and group ID.
- Streaming data is unbounded. You cannot treat it like batch data. Use windowing to create bounded slices you can analyze.
- Windowing has three flavors. Fixed for non-overlapping time slices, sliding for rolling calculations, session for activity-based grouping.
- Time matters in streams. Event time, ingest time, and processing time can all be different. Pick the right one for your use case.
My Take
This chapter is the bridge between “Kafka as infrastructure” (Chapter 12) and “Kafka as part of a real pipeline.” Crickard does a good job of starting from first principles with the log concept and building up to working code.
The NiFi section is practical. If you already have NiFi pipelines, adding Kafka as a data source or sink takes minutes. One processor, a few config fields, done. The ControlRate trick for simulating real-time streaming is a nice touch for learning purposes.
The stream vs. batch processing section is the most important conceptual content in the chapter. If you are coming from a batch processing background, the shift to unbounded data is the biggest mental adjustment. You cannot compute a global average. You cannot validate against the full dataset. You have to think in windows and accept that your view of the data is always partial and always moving.
The Python producer and consumer examples are basic but solid. They show you the structure: create a client, configure it, send or receive data in a loop, handle responses. Once you have that pattern, building something more complex is just adding error handling, serialization logic, and business rules on top.
One thing worth noting: the book uses confluent-kafka, which is maintained by Confluent, the company founded by Kafka’s original creators. If you are in a corporate environment, this is probably the right choice. If you prefer fully open source, kafka-python works fine and follows the same patterns.
The next chapter gets into Apache Spark, which is where the processing of all this streaming data gets serious.