Decoding Kubernetes Architecture - Part 2
In the last post, we talked about the “brain and muscles” of a Kubernetes cluster. But how do we actually tell that brain what to do? We use Objects.
Kubernetes objects are persistent entities that represent the state of your cluster. Think of them as the building blocks you use to construct your data platform. Neylson Crepalde’s book does a great job of explaining which blocks to use for which task.
The Smallest Unit: Pods
A Pod is the smallest unit you can deploy in Kubernetes. It usually wraps a single container (like your Python data job). Here is the thing: you almost never create Pods directly. Why? Because Pods are mortal. If a bare Pod is deleted or its node fails, nothing brings it back.
To make them resilient, we use higher-level controllers.
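To make the idea of an "object" concrete, here is a minimal sketch of a bare Pod manifest, written as a Python dictionary that mirrors the YAML you would normally write. The name `etl-pod`, the image `python:3.12-slim`, and the command are placeholders I picked for illustration, not something from the book.

```python
import json


def pod_manifest(name: str, image: str, command: list[str]) -> dict:
    """Build a minimal single-container Pod manifest (API group: core/v1)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # A bare Pod like this is not recreated if it is deleted
            # or if its node fails.
            "restartPolicy": "Never",
            "containers": [
                {"name": name, "image": image, "command": command}
            ],
        },
    }


# Hypothetical values, for illustration only.
pod = pod_manifest("etl-pod", "python:3.12-slim", ["python", "-c", "print('hello')"])
print(json.dumps(pod, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, the printed output could be saved to a file and applied as-is, though in practice you would wrap this Pod spec in one of the controllers described next.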
Deployments vs. StatefulSets
This is a classic distinction in data engineering:
- Deployments are for stateless workloads. If you have an NGINX frontend or a simple API, use a Deployment. If one replica dies, Kubernetes just spins up an identical one. They are interchangeable.
- StatefulSets are for stateful workloads. Think databases (MySQL, Postgres) or distributed systems (Kafka, Cassandra). These need “sticky” identities and persistent storage. If a database pod dies, it needs to come back with the same name and the same data disk attached.
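The difference shows up directly in the manifests. The sketch below builds both as Python dictionaries; the app names, images, mount path, and storage size are assumptions chosen for illustration. Note the two fields that only the StatefulSet has: `serviceName` (assumed here to point at a headless Service with the same name) and `volumeClaimTemplates`, which gives each replica its own disk.

```python
def pod_template(app: str, image: str) -> dict:
    """Pod template shared by both controllers."""
    return {
        "metadata": {"labels": {"app": app}},
        "spec": {"containers": [{"name": app, "image": image}]},
    }


def deployment(app: str, image: str, replicas: int) -> dict:
    """Stateless workload: replicas are interchangeable."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": app}},
            "template": pod_template(app, image),
        },
    }


def stateful_set(app: str, image: str, replicas: int, storage: str) -> dict:
    """Stateful workload: stable names plus one volume claim per replica."""
    template = pod_template(app, image)
    # Mount path is a placeholder; real databases document their own.
    template["spec"]["containers"][0]["volumeMounts"] = [
        {"name": "data", "mountPath": "/var/lib/data"}
    ]
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": app},
        "spec": {
            "serviceName": app,  # assumed headless Service of the same name
            "replicas": replicas,
            "selector": {"matchLabels": {"app": app}},
            "template": template,
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


def stateful_pod_names(sts: dict) -> list[str]:
    """StatefulSet pods are named <name>-<ordinal>, starting at 0 by default."""
    name = sts["metadata"]["name"]
    return [f"{name}-{i}" for i in range(sts["spec"]["replicas"])]


web = deployment("api", "nginx:1.27", replicas=3)
db = stateful_set("postgres", "postgres:16", replicas=3, storage="10Gi")
print(stateful_pod_names(db))  # ['postgres-0', 'postgres-1', 'postgres-2']
```

Those predictable names are the "sticky" identity: if `postgres-1` dies, its replacement is also called `postgres-1` and reattaches to the same volume claim.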
The Data Engineer’s Best Friend: Jobs
For those of us running batch processing, Jobs are essential. Unlike a web server that runs forever, a Job is meant to run to completion and then stop. This is exactly what you want for an ETL task or a machine learning training session.
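Here is a rough sketch of what a Job manifest for a batch task might look like, again as a Python dictionary. The job name, image, and the `etl.py` script are hypothetical. The two details worth noticing are the `batch/v1` API group and the `restartPolicy`, which for a Job must be `Never` or `OnFailure`.

```python
import json


def job_manifest(name: str, image: str, command: list[str], backoff_limit: int = 3) -> dict:
    """Build a minimal run-to-completion Job manifest (API group: batch/v1)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the Job is marked as failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Jobs only allow "Never" or "OnFailure" here.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical ETL task, for illustration only.
job = job_manifest("nightly-etl", "python:3.12-slim", ["python", "etl.py"])
print(json.dumps(job, indent=2))
```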
Connecting the Dots: Services and Routing
Once your pods are running, how does anyone talk to them?
- Services: Provide a stable IP address or DNS name. Since pods can die and move, you need a Service to act as a permanent front door.
- Ingress & Gateway API: These are the “traffic controllers” that handle external access (like HTTP/HTTPS). The book mentions that the Ingress API is frozen, meaning it no longer receives new features, and that the more powerful Gateway API is its intended successor, though most big data tools still use Ingress for now.
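To see how a Service acts as that permanent front door, here is a small sketch of a Service manifest and the in-cluster DNS name it produces. The service name, namespace, and port are assumptions for illustration, and `cluster.local` is the default cluster domain, which your cluster may override.

```python
def service_manifest(name: str, app_label: str, port: int, target_port: int) -> dict:
    """Build a minimal Service that routes to pods labeled app=<app_label>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # The selector, not a pod name or IP, decides where traffic goes,
            # so pods can die and be replaced without clients noticing.
            "selector": {"app": app_label},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


def service_dns(name: str, namespace: str = "default", cluster_domain: str = "cluster.local") -> str:
    """Return the in-cluster DNS name for a Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


# Hypothetical Postgres service in a "data" namespace.
svc = service_manifest("postgres", app_label="postgres", port=5432, target_port=5432)
print(service_dns("postgres", namespace="data"))  # postgres.data.svc.cluster.local
```

Your data job would connect to that DNS name rather than to any individual pod.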
Storage and Config
Finally, we have the “supporting cast”:
- Persistent Volumes (PV): These decouple your storage from the pod. A pod requests storage through a PersistentVolumeClaim (PVC), and your data lives on a disk that outlasts any individual pod.
- ConfigMaps & Secrets: Never hardcode your database URLs or passwords in your code. Use ConfigMaps for general settings and Secrets for the sensitive stuff.
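On the application side, ConfigMaps and Secrets are commonly exposed to the container as environment variables. Here is a minimal sketch of how a Python data job might read them. The variable names `DB_HOST`, `DB_PORT`, and `DB_PASSWORD` are assumptions for illustration; yours will be whatever you define in the pod spec.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read database settings from the environment instead of hardcoding them."""
    return {
        # Non-sensitive values, typically sourced from a ConfigMap.
        "db_host": env.get("DB_HOST", "localhost"),
        "db_port": int(env.get("DB_PORT", "5432")),
        # Sensitive value, typically sourced from a Secret.
        # No default: fail loudly with a KeyError if it is missing.
        "db_password": env["DB_PASSWORD"],
    }


# Simulated environment, standing in for what Kubernetes would inject.
fake_env = {"DB_HOST": "postgres.data.svc.cluster.local", "DB_PASSWORD": "not-a-real-password"}
settings = load_settings(fake_env)
print(settings["db_host"], settings["db_port"])
```

Passing the environment in as a parameter keeps the function easy to test locally without a cluster.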
Understanding these objects is like learning the vocabulary of a new language. Once you know what a Job or a PersistentVolumeClaim is, you can start describing complex data architectures that Kubernetes can actually build for you.
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0