Why Containers are a Must for Data Engineers

If you are working with data today, you can’t really ignore containers. They have become the standardized unit for how we develop, ship, and deploy software. But why do we care so much about them in the big data world?

In the first part of his book, Neylson Crepalde breaks down the “why” and “how” of containers, and it’s a great refresher for anyone trying to build stable data pipelines.

What’s the big deal?

The world is exploding with data. We are talking about mobile devices, social media, sensors—everything is pumping out info. Managing the complexity of storing and processing this “big data” is a nightmare if you are doing it on raw servers.

Kubernetes helps by automating the deployment and scaling, but the foundation of Kubernetes is the container. Containers let you package your code and all its dependencies into one neat package. No more “it works on my machine” excuses.
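As a minimal sketch of what that packaging looks like (the file names and the Python base image here are my own illustration, not from the book), you describe the environment in a Dockerfile and build it into an image:

```shell
# Hypothetical example: bundle a small Python job and its dependencies.
# Assumes pipeline.py and requirements.txt exist in the current directory.
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY pipeline.py .
CMD ["python", "pipeline.py"]
EOF

docker build -t my-data-job .
docker run --rm my-data-job
```

Anyone with Docker can build and run this image and get the same environment, regardless of what happens to be installed on their machine.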

Containers vs. Virtual Machines

One thing that often confuses people is the difference between a Container and a Virtual Machine (VM).

Here is the simple version:

  • VMs virtualize at the hardware level. Each one has its own full operating system. They are heavy and slow to start.
  • Containers virtualize at the OS level. They share the host’s kernel, making them incredibly light and fast.

This matters for data engineering because we often need to spin up hundreds of small tasks. If each one had to boot a whole OS, we’d never get anything done.
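You can see the startup difference for yourself. A container based on a tiny image typically starts, runs its command, and exits in well under a second once the image is cached (the alpine image here is just a convenient small example, not something from the book):

```shell
# Time how long it takes to start a container, run one command,
# and tear the whole thing down again.
time docker run --rm alpine:3 echo "hello from a container"
```

A VM running a full operating system would need tens of seconds just to boot before it could run that same `echo`.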

Getting Hands-On with Docker

The book suggests getting comfortable with Docker first. It’s the industry standard for a reason. One of my favorite examples from the chapter is running a Julia environment without actually installing Julia.

If you have Docker installed, you can just run:

docker run -it --rm julia:1.9.3-bullseye

And just like that, you are in a Julia REPL. You can write functions, do math, and play with data without ever touching your system’s global configuration. When you exit, the --rm flag tells Docker to remove the container entirely. It’s like it was never there.
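The same image also works non-interactively, which is closer to how you would use it inside a pipeline. This one-liner is my own extension of the book’s example:

```shell
# Run a single Julia expression in a throwaway container.
# sum(1:10) is 55, so the container prints 55 and is then removed.
docker run --rm julia:1.9.3-bullseye julia -e 'println(sum(1:10))'
```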

Why this matters for your pipelines

This isolation is key. You can run NGINX, Python, or even legacy tools in their own little bubbles. They won’t fight over libraries or versions.
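For instance (the container names and version tags below are my own illustration), you can run NGINX and a Python process side by side, each with its own filesystem and libraries:

```shell
# Start two isolated containers in the background.
docker run -d --rm --name web -p 8080:80 nginx:1.25
docker run -d --rm --name worker python:3.11-slim sleep 300

# List the running containers, then clean both up.
docker ps
docker stop web worker
```

Neither container can see the other’s packages, and stopping them leaves the host untouched.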

But running someone else’s image is just the start. The real power comes when you start building your own images for your specific data jobs.


Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0

About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.