Stream Processing with Apache Flink: True Real-Time Analytics
Previous: Flink DataSet API: Transformations, Joins, and Aggregations
We’ve talked about how Spark handles streaming using micro-batches. It’s a great approach, but some people argue it’s not “true” streaming. If you need the absolute lowest latency possible, you want Apache Flink.
In Chapter 9, Sridhar Alla takes us into the world of Flink’s DataStream API. This is where Flink really shines.
Streaming vs. Micro-batching
In Spark, you wait a few seconds, collect a bunch of events, and then process them as a batch. Flink doesn’t wait. It processes each event as it arrives. This “event-at-a-time” processing is what allows Flink to achieve sub-second latency.
But Flink isn’t just fast; it’s smart. It provides:
- Exactly-once stateful processing: Even if a node fails, Flink ensures your totals and counts remain accurate.
- Handling out-of-order data: Using event-time semantics and watermarks, it can correctly process events that arrive later than they should.
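To see the idea behind out-of-order handling, here is a plain-Scala sketch (no Flink dependency; the names and values are illustrative, not Flink's API) of bounded out-of-orderness: the watermark trails the largest timestamp seen so far by a fixed bound, and an event whose timestamp falls behind the watermark is considered too late.

```scala
// Conceptual sketch of a watermark with bounded out-of-orderness.
// In real Flink this is configured via a watermark strategy on the source.

case class Event(word: String, ts: Long)

val bound = 2000L // tolerate events up to 2 seconds out of order

// Arrival order differs from timestamp order: "c" is out of order.
val arrivals = Seq(Event("a", 1000L), Event("b", 5000L), Event("c", 2500L))

var maxTs = 0L
val accepted = arrivals.map { e =>
  maxTs = math.max(maxTs, e.ts)
  val watermark = maxTs - bound
  e.word -> (e.ts >= watermark) // behind the watermark -> too late
}
// "a" and "b" are on time; "c" arrives after the watermark has advanced to 3000
```

The point is that a modestly late event (here, anything within 2 seconds) still lands in the right place, while the watermark gives Flink a principled cutoff for deciding when a window's results can be finalized.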
The DataStream API
Writing a Flink streaming program feels very similar to writing a batch program. You get an execution environment, define your source, apply transformations, and send the result to a sink.
Here’s a simple example of a “Socket Window WordCount” in Flink:
import org.apache.flink.streaming.api.windowing.time.Time

// The case class that carries each word and its running count
case class WordWithCount(word: String, count: Long)

// `senv` is the StreamExecutionEnvironment pre-bound in Flink's Scala shell
val text = senv.socketTextStream("localhost", 9000)
val counts = text
.flatMap { _.split("\\s+") }
.map { WordWithCount(_, 1) }
.keyBy("word")
.timeWindow(Time.seconds(5))
.sum("count")
counts.print()
This code connects to a network socket, splits the text into words, and counts them in a 5-second tumbling window (timeWindow with a single argument creates tumbling windows; pass a second, smaller argument to make them slide). It’s concise, readable, and incredibly powerful.
Windows: Slicing the Infinite
Since streams are infinite, you often need to “slice” them into chunks to do any useful math. Flink gives you a few different types of windows:
- Tumbling Windows: Fixed-size, non-overlapping (e.g., every 10 minutes).
- Sliding Windows: Overlapping windows (e.g., look at the last 10 minutes, but update every 1 minute).
- Session Windows: Group events together based on periods of activity (e.g., all the clicks a user makes during a single visit to your site).
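To make the three shapes concrete, here’s a plain-Scala sketch (no Flink APIs; the helper names and sizes are mine) of how each window type maps event timestamps to windows:

```scala
// Each window is a (start, end) pair in milliseconds, end exclusive.

val size  = 10 * 60 * 1000L // 10-minute windows
val slide =      60 * 1000L // 1-minute slide

// Tumbling: every timestamp lands in exactly one window.
def tumblingWindow(ts: Long): (Long, Long) = {
  val start = ts - (ts % size)
  (start, start + size)
}

// Sliding: every timestamp lands in size/slide overlapping windows.
def slidingWindows(ts: Long): Seq[(Long, Long)] = {
  val lastStart = ts - (ts % slide)
  (0L until size by slide).map { off =>
    (lastStart - off, lastStart - off + size)
  }
}

// Session: consecutive timestamps closer together than `gap` merge
// into one window; a quiet period of at least `gap` starts a new one.
def sessionWindows(sorted: Seq[Long], gap: Long): Seq[(Long, Long)] =
  sorted.foldLeft(List.empty[(Long, Long)]) {
    case ((start, end) :: rest, ts) if ts - end < gap => (start, ts) :: rest
    case (acc, ts)                                    => (ts, ts) :: acc
  }.reverse
```

Note the sliding case: with a 10-minute size and 1-minute slide, every event belongs to ten windows at once, which is exactly why sliding windows give you a fresh 10-minute view every minute.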
Flink’s windowing system is among the most flexible of any stream processor. It allows you to model real-world behavior much more accurately than a simple batch-based system.
In the next post, we’ll look at how Flink connects to the rest of the world (like Kafka and Cassandra) and how it masters the tricky problem of “Event Time.”