Data Pipelines: Batch vs Streaming and When to Use Each
This is Part 1 of Chapter 7. Part 2 covers orchestration and transformations.
Chapter 7 of Data Engineering for Beginners is probably where things start feeling real. You stop talking about storage and tables and start talking about how data actually moves. And the answer is: through pipelines.
A data pipeline is just a system that moves data from point A to point B. Along the way it might clean, filter, join, or aggregate that data. Nothing fancy in the definition. But the execution? That is where it gets interesting.
Batch Pipelines: The Workhorse
Batch processing is the older, more predictable sibling. You collect a big chunk of data, process it all at once, and store the results. Think monthly financial reports, payroll runs, or nightly data warehouse loads.
Nwokwu uses a fintech company as an example. At the end of each month, a pipeline kicks off at midnight, grabs all transaction data, crunches the numbers, and puts everything into a data warehouse for reporting. Classic batch.
Here is the thing about batch pipelines: they can trigger in two ways. Either on a time schedule (every hour, every day, every week) or when the data hits a certain size (say 10,000 records or 1 GB). The first approach is more common in data warehousing. The second works well when data arrives in unpredictable bursts.
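A size-or-time trigger can be sketched in a few lines of Python. This is just an illustration of the decision logic; the thresholds (10,000 records, one day) are the ones from the example above, not anything prescriptive:

```python
from datetime import datetime, timedelta

def should_run_batch(pending_records: int, last_run: datetime,
                     max_records: int = 10_000,
                     max_interval: timedelta = timedelta(days=1)) -> bool:
    """Trigger the batch job when either threshold is crossed:
    enough data has piled up, or enough time has passed."""
    size_hit = pending_records >= max_records
    time_hit = datetime.now() - last_run >= max_interval
    return size_hit or time_hit
```

A scheduler-driven pipeline only ever uses the time condition; a burst-driven one leans on the size condition.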
Components of a Batch Pipeline
A typical batch pipeline has these stages:
- Data Sources - where your raw data lives (databases, CSV files, APIs)
- Staging Area - a temporary holding zone before processing. This is your safety net. If the batch job fails, you do not need to re-pull everything from source
- Data Transformation - the business logic layer where cleaning, filtering, and aggregation happen
- Data Storage - where processed data lands (data warehouse, data lake, database)
- Data Consumption - dashboards, reports, ML models that use the processed data
- Job Scheduler - the trigger that makes it all run on time
For simple cases, a cron job on a Unix system works fine as a scheduler. For anything more complex, tools like Apache Airflow give you dependency management, retry logic, and monitoring.
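The stages above can be sketched as plain functions wired together. In practice each step would be a real system (a database, object storage, a warehouse), but the flow is the same; the names and sample records here are purely illustrative:

```python
def extract(source: list[dict]) -> list[dict]:
    """Pull raw records from the source into a staging copy.
    If a later step fails, we can re-run from here without
    hitting the source again."""
    return list(source)

def transform(staged: list[dict]) -> list[dict]:
    """Business logic: drop invalid rows, normalize amounts."""
    return [
        {**row, "amount": round(row["amount"], 2)}
        for row in staged
        if row.get("amount") is not None
    ]

def load(rows: list[dict], warehouse: list[dict]) -> None:
    """Append processed rows to the destination store."""
    warehouse.extend(rows)

# One batch run, end to end.
source = [{"id": 1, "amount": 10.456}, {"id": 2, "amount": None}]
warehouse: list[dict] = []
load(transform(extract(source)), warehouse)
```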
ETL vs ELT
Quick note on two batch processing methods. ETL (Extract, Transform, Load) transforms data before loading it into the destination. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the destination system.
ELT is the more modern approach. It works well with cloud data warehouses that have enough compute power to handle transformations in place. If you are working with big data and cloud infrastructure, ELT is probably what you want.
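The difference is simply where the transform step runs relative to the load. A toy sketch, with plain Python lists standing in for the source and the warehouse (in a real ELT setup, the late transform would typically be SQL running inside the cloud warehouse):

```python
raw = [{"user": "a", "spend": "12.5"}, {"user": "b", "spend": "7"}]

def clean(rows: list[dict]) -> list[dict]:
    """Cast spend from string to float."""
    return [{**r, "spend": float(r["spend"])} for r in rows]

# ETL: transform BEFORE the data reaches the warehouse.
etl_warehouse = clean(raw)

# ELT: load the raw data first, transform later in the destination.
elt_warehouse = list(raw)               # raw strings land as-is
elt_transformed = clean(elt_warehouse)  # transform runs downstream
```

Both end up in the same place; ELT just keeps the untouched raw data around, which is handy when you need to re-derive a table with new logic.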
Stream Pipelines: When Every Millisecond Counts
Now imagine that same fintech company wants fraud detection. They cannot wait until midnight to process transactions. By then, the fraudster has already emptied the account and booked a flight. Every millisecond matters.
This is where streaming pipelines come in. They process data continuously, as it arrives, in real time or near real time.
Streaming pipelines have three key traits:
- Event-driven - they react to individual events (a card swipe, a button click) instead of waiting for a scheduled batch
- Continuous processing - data is processed the moment it enters the pipeline
- Low latency - the time between an event happening and the system responding is milliseconds to seconds
How a Streaming Pipeline Works
The architecture looks like this:
Producers generate data continuously. Think IoT sensors, stock trades, social media posts. They push data into the pipeline as events.
Message Queues sit between producers and processors. They buffer data, ensure ordering, and handle delivery guarantees. A message broker manages the queue and follows a Pub/Sub (Publish and Subscribe) pattern.
Here is how Pub/Sub works in plain terms. Producers publish messages to topics (like “order/placed” or “order/shipped”). Consumers subscribe to the topics they care about. The customer app subscribes to all order topics. The warehouse system only subscribes to “order/placed.” Nobody needs to know about each other directly.
Messages that cannot be delivered or processed (no consumer subscribed to the topic, or a handler that keeps failing) go to a dead letter queue (DLQ) instead of crashing the system. Engineers can review them later.
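A minimal in-memory sketch of the Pub/Sub pattern, including a dead letter queue for messages nobody is subscribed to. Real brokers (Kafka, RabbitMQ, cloud Pub/Sub services) do vastly more, but the shape is the same; the topic names follow the order example above:

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks
        self.dead_letter_queue = []

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        handlers = self.subscribers.get(topic)
        if not handlers:
            # No consumer for this topic: park it for later review.
            self.dead_letter_queue.append((topic, message))
            return
        for handler in handlers:
            handler(message)

broker = Broker()
placed, app_feed = [], []
broker.subscribe("order/placed", placed.append)     # warehouse system
broker.subscribe("order/placed", app_feed.append)   # customer app...
broker.subscribe("order/shipped", app_feed.append)  # ...follows both topics

broker.publish("order/placed", {"order_id": 1})
broker.publish("order/cancelled", {"order_id": 2})  # nobody subscribed -> DLQ
```

Note that the producer never names a consumer and vice versa; the topic is the only coupling point.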
Schema Registry is where the structure of messages is tracked. When someone adds a new field to an event, the schema registry makes sure old consumers do not break. Forward and backward compatibility. Very practical.
Stream Processor is the engine. It filters, aggregates, transforms, and enriches data in real time. And this is where windowing comes in.
Windowing: Making Sense of Infinite Streams
A data stream is unbounded; you cannot just keep processing and storing everything forever. Windowing groups events into time-based chunks so you can actually compute useful things like “total transactions in the last 5 minutes.”
Three types of time matter here:
- Event time - when the event actually happened
- Ingestion time - when the system received the event
- Processing time - when the system actually processed the event
A sensor reads temperature at 10:00:00. Due to network delay, the system gets it at 10:00:05. The system processes it at 10:00:10. Those are three different timestamps, and which one you use depends on your use case.
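The three timestamps from that sensor example, side by side (the date is arbitrary, chosen just to make the arithmetic concrete):

```python
from datetime import datetime

event_time      = datetime(2024, 1, 1, 10, 0, 0)   # sensor took the reading
ingestion_time  = datetime(2024, 1, 1, 10, 0, 5)   # system received it
processing_time = datetime(2024, 1, 1, 10, 0, 10)  # system processed it

network_delay = (ingestion_time - event_time).total_seconds()   # 5.0
end_to_end    = (processing_time - event_time).total_seconds()  # 10.0
```

Windowing on event time gives you results that reflect when things actually happened; windowing on processing time is simpler but shifts late arrivals into the wrong window.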
The Four Window Types
Tumbling Windows are fixed-size, non-overlapping chunks. A 30-second tumbling window splits your stream into 0-30s, 30-60s, 60-90s, and so on. Each event belongs to exactly one window. Simple and clean. Good for comparing equal time periods. But pick the wrong size and you either miss details (too big) or drown in noise (too small).
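Assigning events to tumbling windows is just integer division on the timestamp. A sketch with the 30-second window from the text, using (timestamp_in_seconds, value) pairs as a stand-in for real events:

```python
from collections import defaultdict

def tumbling_windows(events, size=30):
    """Bucket (timestamp, value) events into fixed, non-overlapping
    windows of `size` seconds: [0, 30), [30, 60), [60, 90), ..."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // size) * size
        windows[window_start].append(value)
    return dict(windows)

events = [(5, "a"), (29, "b"), (30, "c"), (61, "d")]
# tumbling_windows(events) -> {0: ["a", "b"], 30: ["c"], 60: ["d"]}
```

Each event lands in exactly one bucket, which is what makes tumbling windows cheap to compute and easy to compare.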
Hopping Windows are fixed-size but they overlap. A 60-second window that hops every 30 seconds means each window overlaps with the previous one. Events can appear in multiple windows. This is great for catching spikes that might fall at the edge of a tumbling window. The tradeoff is more computation since data gets processed multiple times.
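Hopping windows are the same idea except one event can belong to several buckets. A sketch with the 60-second window hopping every 30 seconds described above:

```python
def hopping_windows(events, size=60, hop=30):
    """Assign each (timestamp, value) event to every window
    whose [start, start + size) range contains it."""
    windows = {}
    for ts, value in events:
        # Earliest window start that still covers ts, clipped at 0.
        start = max(((ts - size) // hop + 1) * hop, 0)
        while start <= ts:
            windows.setdefault(start, []).append(value)
            start += hop
    return windows

# An event at t=45s falls in both the [0, 60) and [30, 90) windows:
# hopping_windows([(45, "x")]) -> {0: ["x"], 30: ["x"]}
```

That duplication is exactly the tradeoff the text mentions: edge-straddling spikes get caught, at the cost of processing each event more than once.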
Sliding Windows move forward every time a new event arrives. A 5-minute sliding window always shows you the most recent 5 minutes of data, dropping old events as new ones come in. Perfect for continuous monitoring like IoT sensor data. But they need more computational power since the window adjusts constantly.
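A sliding window can be sketched with a deque: append new events on one end, evict expired ones from the other. The 5-minute span mirrors the monitoring example above:

```python
from collections import deque

class SlidingWindow:
    """Keeps only events from the last `span` seconds."""
    def __init__(self, span=300):  # 5 minutes
        self.span = span
        self.events = deque()      # (timestamp, value), oldest first

    def add(self, ts, value):
        self.events.append((ts, value))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.span:
            self.events.popleft()

    def values(self):
        return [v for _, v in self.events]
```

Every arrival can trigger evictions, which is the constant readjustment (and extra compute) the text refers to.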
Session Windows are different. They group events by activity. A session opens when an event arrives and stays open as long as events keep coming within a timeout period. When there is a gap (inactivity), the session closes. New event? New session. Great for tracking user engagement in apps or games. The challenge is picking the right timeout. Too short and you split sessions that belong together. Too long and unrelated events end up in the same session.
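Session windows have no fixed size at all; a gap longer than the timeout is what closes one. A sketch over a sorted list of event timestamps, with an illustrative 30-second timeout:

```python
def session_windows(timestamps, timeout=30):
    """Split sorted event times into sessions separated by
    inactivity gaps longer than `timeout` seconds."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > timeout:
            sessions.append(current)  # gap detected: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# A gap of 80s (> 30s timeout) splits the stream into two sessions:
# session_windows([0, 10, 20, 100, 110]) -> [[0, 10, 20], [100, 110]]
```

Shrink the timeout and the first session splits apart; grow it past 80 and everything merges into one. That sensitivity is the tuning challenge described above.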
Other Stream Processing Features
The book also covers checkpointing (saving progress so the system can recover from crashes without reprocessing everything), watermarking (handling late-arriving events by defining a lateness threshold), and stateful processing (keeping track of information across multiple events, like a user’s spending pattern for fraud detection).
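Of those, watermarking is the easiest to show in miniature: a threshold that trails the newest event time seen so far, with anything older than it routed aside. The 10-second lateness allowance here is illustrative, not from the book:

```python
def split_late_events(events, max_event_time, allowed_lateness=10):
    """Partition (timestamp, value) events against a watermark.

    The watermark trails the newest event time by `allowed_lateness`
    seconds; events older than it are diverted (e.g. dropped, or
    sent to a side output for later correction)."""
    watermark = max_event_time - allowed_lateness
    on_time = [e for e in events if e[0] >= watermark]
    late = [e for e in events if e[0] < watermark]
    return on_time, late

# With the newest event at t=100 and 10s of allowed lateness,
# an event stamped t=85 is beyond the watermark (90) and diverted.
```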
Lambda Architecture: Why Not Both?
Here is the problem. What if you need both real-time dashboards AND historical reports? A streaming pipeline alone lacks historical depth. A batch pipeline is too slow for live insights.
The Lambda architecture combines both. It has three layers:
- Batch Layer - stores the full raw data and processes it on schedule for accuracy and completeness
- Speed Layer - processes data in real time for low-latency insights, filling the gaps between batch runs
- Serving Layer - merges results from both layers so users can query a complete picture of historical and recent data
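The serving layer's merge boils down to: trust the batch view up to the last batch run, then add the speed layer's counts for everything since. A sketch with hypothetical per-metric counters:

```python
def serve_total(batch_view: dict, speed_view: dict) -> dict:
    """Merge per-key counts from the batch layer (complete but stale)
    with speed-layer counts covering the time since the last batch run."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"visitors": 12_000}  # through last night's batch run
speed_view = {"visitors": 340}     # since then, from the stream
# serve_total(batch_view, speed_view) -> {"visitors": 12340}
```

When the next batch run lands, its results supersede the speed layer's approximation for that period, and the speed view resets.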
Take an e-commerce company with both needs: the speed layer powers real-time dashboards while the batch layer handles historical analysis. The serving layer lets executives ask “How many visitors today?” and “How does that compare to last month?” from the same interface.
The advantages are real: fault tolerance (if streaming crashes, batch still has your data), scalability, and flexibility. But it is not free. You maintain two separate processing systems, which means more complexity, potential data duplication, and harder debugging.
In the next part, we will look at how orchestration tools like Apache Airflow help you manage all of this, plus hands-on data transformation techniques.
This is part 10 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: Data Warehouses, Lakes, and Lakehouses | Next: Pipeline Orchestration and Transformations > |