Building Your Own Data Images
In my last post, we talked about why containers are the bedrock of modern data engineering. But honestly, just running other people’s images only gets you so far. The real magic happens when you start packaging your own custom code.
Neylson Crepalde’s book walks through a very practical example of this: containerizing a Python batch processing job. If you’ve ever struggled with dependency hell, this is your way out.
The Batch Processing Job
Imagine you have a simple Python script (run.py) that uses pandas to process some CSV data from a URL. It looks something like this:
import pandas as pd

# Read the CSV straight from the URL; the file has no header row,
# so pandas assigns integer column labels 0, 1, 2, ...
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
df = pd.read_csv(url, header=None)

# Derive a new column by doubling every value in column 5.
df["newcolumn"] = df[5].apply(lambda x: x*2)
print(df.head())
To run this normally, you’d need Python and pandas installed on your host. But with Docker, we create a Dockerfile to define the environment once and for all.
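Before baking the script into an image, it helps to sanity-check the transformation itself. Here is a minimal sketch of the same apply logic against a small inline DataFrame (my stand-in for the remote CSV, so no network access is needed; the column index and doubling lambda match the script above):

```python
import pandas as pd

# Small stand-in for the remote CSV: two rows, six headerless columns,
# so columns are labeled 0..5 just like with header=None.
df = pd.DataFrame([
    [6, 148, 72, 35, 0, 33.6],
    [1, 85, 66, 29, 0, 26.6],
])

# Same transformation as run.py: double every value in column 5.
df["newcolumn"] = df[5].apply(lambda x: x * 2)

print(df["newcolumn"].tolist())  # → [67.2, 53.2]
```

Once this behaves as expected locally, the only remaining question is packaging the environment, which is exactly what the Dockerfile below handles.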
Crafting the Dockerfile
Here is the “recipe” for our data job:
FROM python:3.11.6-slim
RUN pip3 install pandas
COPY run.py /run.py
CMD python3 /run.py
Let’s break down those four lines:
- FROM: We start with an official Python “slim” image. Pro tip: prefer slim images for data jobs. They keep your container size small, which means faster transfers and lower costs.
- RUN: This installs our dependencies (pandas).
- COPY: We copy our script from our local machine into the image.
- CMD: This is the default command that runs when the container starts.
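One refinement worth knowing (a common Docker practice, not from the book): an unpinned pip install pandas can pull a different version every time you rebuild. Pinning dependencies in a requirements.txt keeps builds reproducible. A sketch, assuming a requirements.txt file sitting next to run.py:

```dockerfile
FROM python:3.11.6-slim

# requirements.txt pins exact versions, e.g. a line like: pandas==2.1.1
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

COPY run.py /run.py
CMD python3 /run.py
```

With this layout, Docker's layer cache also works in your favor: as long as requirements.txt is unchanged, rebuilding after editing run.py skips the pip install step entirely.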
Building and Running
Once your Dockerfile is ready, you build the image with a tag. Since this example saves the file as Dockerfile_job rather than the default name Dockerfile, the -f flag points Docker at it:
docker build -f Dockerfile_job -t data_processing_job:1.0 .
And then you run it:
docker run --name data_processing data_processing_job:1.0
And just like that, your job runs in a perfectly isolated environment. No version conflicts, no “missing library” errors.
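One practical extension (my suggestion, not something the book's example does): run.py hard-codes the URL, so processing a different file means rebuilding the image. Reading the URL from an environment variable lets the same image handle different inputs via docker run -e. A minimal sketch, where INPUT_URL is an assumed variable name:

```python
import os

# Default to the CSV baked into the original script; override at run time with
#   docker run -e INPUT_URL=<some other csv> data_processing_job:1.0
DEFAULT_URL = (
    "https://raw.githubusercontent.com/jbrownlee/Datasets/"
    "master/pima-indians-diabetes.data.csv"
)

def load_input_url() -> str:
    """Return the CSV URL for this run, preferring the INPUT_URL env var."""
    return os.environ.get("INPUT_URL", DEFAULT_URL)

print(load_input_url())
```

The rest of the script stays the same; it just calls pd.read_csv(load_input_url(), header=None) instead of using the literal URL.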
Why this matters for Big Data
This might seem simple for one script, but imagine you have a complex pipeline with Spark, Kafka, and specialized ML libraries. Being able to bundle each part of the pipeline into its own container is how you scale without losing your mind.
But data jobs aren’t just batch scripts. Sometimes you need to serve that data via an API. In the next post, we’ll dive into how to containerize services and finally start talking about the big orchestrator itself: Kubernetes.
Next: Decoding Kubernetes Architecture - Part 1
Previous: Why Containers are a Must for Data Engineers
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0