Building an End-to-End Big Data Pipeline - Part 2
In our last post, we covered the infrastructure. Now, let’s build the actual pipeline. Neylson Crepalde uses the IMDb dataset to demonstrate a professional batch workflow.
This isn’t just about running a script; it’s about a multi-stage process that follows the Medallion Architecture (Bronze, Silver, Gold).
Phase 1: Landing the Data
The first step is data acquisition. We use an Airflow task to download five massive TSV files from the IMDb servers and upload them directly to our S3 landing zone.
Why S3? Because it’s cheap, virtually infinite, and outlasts any specific processing job.
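As a rough sketch, the acquisition task can stream each dump straight from IMDb’s public download host into S3. The exact file list, bucket name, and key layout below are assumptions for illustration, not the book’s exact code:

```python
import urllib.request

# Public host for the IMDb dataset dumps.
IMDB_BASE = "https://datasets.imdbws.com"

# Which five dumps the pipeline pulls is an assumption; adjust as needed.
FILES = [
    "name.basics.tsv.gz",
    "title.basics.tsv.gz",
    "title.crew.tsv.gz",
    "title.principals.tsv.gz",
    "title.ratings.tsv.gz",
]

def landing_key(filename: str) -> str:
    # Hypothetical key layout: landing/<dataset>/<filename>
    dataset = filename.replace(".tsv.gz", "")
    return f"landing/{dataset}/{filename}"

def download_to_landing(bucket: str) -> None:
    # Deferred import so the sketch reads without boto3 installed;
    # at runtime this needs AWS credentials configured.
    import boto3
    s3 = boto3.client("s3")
    for fname in FILES:
        with urllib.request.urlopen(f"{IMDB_BASE}/{fname}") as resp:
            # Stream the HTTP response body directly into S3.
            s3.upload_fileobj(resp, bucket, landing_key(fname))
```

Wrapped in a `PythonOperator`, this becomes the first task in the DAG.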
Phase 2: The Bronze Layer
Raw TSV files are great for transport but terrible for processing. They are slow to read and don’t have built-in schemas.
Airflow triggers our first Spark job to read these raw files and convert them into Parquet format in the Bronze bucket. Parquet is a columnar storage format that makes Spark much faster and significantly reduces your storage costs on AWS.
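A minimal sketch of that Bronze job, assuming illustrative `s3a://` bucket paths (IMDb dumps are tab-separated, with a header row and `\N` for nulls):

```python
def tsv_to_parquet(spark, src: str, dst: str) -> None:
    # Read a raw IMDb TSV dump: tab-delimited, header row, "\N" as null.
    df = (spark.read
          .option("sep", "\t")
          .option("header", True)
          .option("nullValue", "\\N")
          .csv(src))
    # Rewrite it as columnar Parquet in the Bronze bucket.
    df.write.mode("overwrite").parquet(dst)

def bronze_path(dataset: str) -> str:
    # Hypothetical bucket layout for the Bronze zone.
    return f"s3a://bronze/{dataset}"
```

Running `tsv_to_parquet(spark, "s3a://landing/title.ratings/", bronze_path("title.ratings"))` for each dataset completes the layer.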
Phase 3: The Silver Layer (OBT)
Now for the complex part. We have five different tables (names, basics, ratings, etc.). To make this data useful for analysts, we want to join them into One Big Table (OBT).
Airflow triggers a second Spark job that:
- Reads the Parquet files from the Bronze bucket.
- Explodes columns that hold multiple values (such as the comma-separated director IDs).
- Joins everything based on unique IDs.
- Writes the final, consolidated table to the Silver bucket.
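The steps above can be sketched as a single PySpark job. Table names match the IMDb datasets, but the join keys kept and the final column selection are simplified assumptions:

```python
def build_obt(spark, bronze: str, silver: str) -> None:
    # Deferred import: requires pyspark at runtime.
    from pyspark.sql import functions as F

    basics = spark.read.parquet(f"{bronze}/title.basics")
    ratings = spark.read.parquet(f"{bronze}/title.ratings")
    crew = spark.read.parquet(f"{bronze}/title.crew")
    names = spark.read.parquet(f"{bronze}/name.basics")

    # title.crew stores directors as comma-separated person IDs (nconst);
    # split and explode so each director gets its own row.
    directors = crew.select(
        "tconst",
        F.explode(F.split("directors", ",")).alias("nconst"))

    # Join everything on the shared title (tconst) and person (nconst) IDs.
    obt = (basics
           .join(ratings, "tconst", "left")
           .join(directors, "tconst", "left")
           .join(names.select("nconst", "primaryName"), "nconst", "left"))

    obt.write.mode("overwrite").parquet(silver)
```

Left joins keep titles that lack ratings or director entries, so no rows are silently dropped.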
Phase 4: Cataloging for the World
The data is now ready, but Trino doesn’t know it exists yet. The final step in our DAG is to trigger an AWS Glue Crawler.
The crawler scans the new files in the Silver bucket, updates the schema in the Glue Data Catalog, and—voilà—the data is immediately available for SQL queries in Trino or DBeaver.
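That final task boils down to one Glue API call plus a polling loop. A sketch using boto3, with the crawler name as a placeholder:

```python
import time

def run_crawler(name: str, poll_seconds: int = 30) -> None:
    # Deferred import; needs AWS credentials at runtime.
    import boto3
    glue = boto3.client("glue")
    # Kick off the crawl of the Silver bucket.
    glue.start_crawler(Name=name)
    # Poll until the crawler returns to the READY state.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)
```

Airflow’s Amazon provider also ships a `GlueCrawlerOperator` that wraps this same start-and-wait logic, if you prefer not to hand-roll it.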
The Power of Automation
By the end of this workflow, you have a fully automated system. You push your code to Git, Airflow pulls it, triggers the Spark jobs on Kubernetes, and your analysts see fresh data in their dashboards.
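Putting it together, the DAG is four tasks in a straight line. The sketch below assumes the Spark jobs run via `SparkKubernetesOperator` (from Airflow’s cncf.kubernetes provider); the task IDs, namespace, manifest filenames, and crawler name are all placeholders:

```python
import datetime as dt

def build_dag():
    # Deferred imports: requires apache-airflow with the
    # cncf.kubernetes and amazon providers installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
        SparkKubernetesOperator)
    from airflow.providers.amazon.aws.operators.glue_crawler import (
        GlueCrawlerOperator)

    with DAG("imdb_batch", start_date=dt.datetime(2024, 1, 1),
             schedule="@daily", catchup=False) as dag:
        # Phase 1: pull the TSV dumps into the landing zone.
        land = PythonOperator(task_id="land_tsv_files",
                              python_callable=lambda: None)  # stub
        # Phase 2: submit the Bronze Spark job to Kubernetes.
        bronze = SparkKubernetesOperator(task_id="tsv_to_bronze",
                                         namespace="processing",
                                         application_file="spark_bronze.yaml")
        # Phase 3: submit the Silver/OBT Spark job.
        silver = SparkKubernetesOperator(task_id="bronze_to_silver_obt",
                                         namespace="processing",
                                         application_file="spark_silver.yaml")
        # Phase 4: refresh the Glue Data Catalog.
        crawl = GlueCrawlerOperator(task_id="run_glue_crawler",
                                    config={"Name": "silver-crawler"})
        land >> bronze >> silver >> crawl
    return dag
```

The `>>` chain is the whole orchestration story: each phase runs only after the previous one succeeds.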
But what if you can’t wait for the next batch run? In the next post, we’re going to look at the Real-Time Pipeline using Kafka and Spark Structured Streaming.
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0