Decoding Kubernetes Architecture - Part 1
If you want to run big data workloads on Kubernetes, you have to understand how the system is actually put together. It’s not just “magic cloud stuff”—it’s a carefully coordinated cluster of machines.
In his book, Neylson Crepalde breaks the architecture down into two main groups: the Control Plane (the brain) and the Worker Nodes (the muscles).
The Brain: The Control Plane
The Control Plane is responsible for making global decisions about the cluster. It’s where the “desired state” of your system lives. If you tell Kubernetes you want 5 copies of a Spark worker running, the Control Plane is what makes sure that actually happens.
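That “desired state” idea is the heart of Kubernetes’ declarative model. As a toy sketch (plain Python, not real Kubernetes API objects—the names and shapes here are invented for illustration), you declare what you want, and the system works out the difference from what exists:

```python
# Toy illustration of the declarative model: you state the desired number
# of replicas, and the control plane computes what must change.

desired = {"spark-worker": 5}   # what you asked for
actual = {"spark-worker": 3}    # what is currently running

def diff(desired, actual):
    """Return how many replicas to add (positive) or remove (negative)."""
    return {name: desired.get(name, 0) - actual.get(name, 0)
            for name in set(desired) | set(actual)}

print(diff(desired, actual))  # {'spark-worker': 2} -> start 2 more workers
```

The real control plane runs this comparison continuously, which is why Kubernetes self-heals: the moment actual drifts from desired, there is a nonzero diff to act on.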
Here are the key players:
- kube-apiserver: Think of this as the “frontend” of the cluster. It’s the only way we (or any other component) talk to Kubernetes—it validates every request, handles authentication and authorization, and is designed to scale horizontally by running more instances.
- etcd: This is the cluster’s memory. It’s a distributed key-value store that keeps track of everything—every node, every pod, every config. If etcd goes down, the cluster loses its mind.
- kube-scheduler: This component is the delegator. It looks at new tasks and decides which worker node has enough “gas in the tank” (CPU/RAM) to run them.
- kube-controller-manager: The regulator. It runs background loops that compare the actual state of the cluster with the state you asked for. If a node fails, the controller manager notices and starts the recovery process.
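To make the scheduler’s job concrete, here is a deliberately simplified sketch (invented node names and capacities; the real kube-scheduler uses a much richer filtering-and-scoring pipeline): find a node with enough free CPU and memory, and reserve the resources there.

```python
# Toy scheduler: place a pod on the first node with enough free resources.
# Node capacities below are made up for illustration.

nodes = {
    "node-a": {"cpu": 2.0, "mem_gb": 4},
    "node-b": {"cpu": 8.0, "mem_gb": 32},
}

def schedule(pod, nodes):
    """Return the name of a node that can fit the pod, or None."""
    for name, free in nodes.items():
        if free["cpu"] >= pod["cpu"] and free["mem_gb"] >= pod["mem_gb"]:
            # Reserve the resources so later pods see the reduced capacity.
            free["cpu"] -= pod["cpu"]
            free["mem_gb"] -= pod["mem_gb"]
            return name
    return None

executor = {"cpu": 4.0, "mem_gb": 16}
print(schedule(executor, nodes))  # node-b (node-a doesn't have the CPU)
```

The real scheduler also scores the surviving candidates (spreading, affinity rules, and so on) rather than taking the first fit, but the core question—“who has gas in the tank?”—is the same.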
The Muscles: Worker Nodes
While the Control Plane is busy thinking, the Worker Nodes are busy doing the actual work. Every node in your cluster has a few essential components:
- Container Runtime: This is the software that actually runs the containers (like Docker or containerd). It pulls the images and manages their lifecycle.
- kubelet: The “on-site manager.” It’s an agent that runs on every node and makes sure the containers that were assigned to that node are actually running and healthy.
- kube-proxy: The networker. It maintains the networking rules on each node (typically via iptables or IPVS) so your pods can talk to each other and the outside world. In effect, it load-balances Service traffic across the pods behind it.
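Conceptually, what kube-proxy gives you is a stable address in front of a rotating set of pod endpoints. A minimal sketch of that idea (plain Python round-robin; real kube-proxy programs kernel-level rules rather than routing requests itself, and the addresses here are invented):

```python
# Toy Service: one stable entry point that rotates across pod endpoints,
# mimicking what kube-proxy's rules achieve at the kernel level.
from itertools import cycle

class Service:
    def __init__(self, endpoints):
        self._endpoints = cycle(endpoints)

    def route(self):
        """Pick the next backend pod for an incoming request."""
        return next(self._endpoints)

svc = Service(["10.0.0.1:8080", "10.0.0.2:8080"])
print([svc.route() for _ in range(4)])
# ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.1:8080', '10.0.0.2:8080']
```

This is why clients never need to know which pods exist: they hit the Service, and the routing layer handles pods coming and going.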
Why this matters for Data Workloads
When you are running a massive data processing job, the scheduler is your best friend. It knows exactly where to place your Spark executors so they don’t fight for resources. And if a node crashes mid-job? The controller manager notices the node has gone unhealthy, and the affected pods are rescheduled onto surviving nodes, where the local kubelet brings them back up.
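That recovery path can be sketched as a single reconciliation pass (again a toy model with invented names—real Kubernetes does this through controllers and the scheduler, with eviction timeouts in between): find every pod placed on a dead node and move it to a healthy one.

```python
# Toy reconciliation: pods on failed nodes get moved to healthy nodes
# (round-robin). Pod and node names are invented for illustration.
from itertools import cycle

def reconcile(placements, healthy_nodes):
    """Return updated pod->node placements with dead nodes evacuated."""
    targets = cycle(healthy_nodes)
    return {pod: (node if node in healthy_nodes else next(targets))
            for pod, node in placements.items()}

placements = {"exec-1": "node-a", "exec-2": "node-b", "exec-3": "node-a"}
# node-a has crashed; only node-b and node-c are healthy now.
print(reconcile(placements, ["node-b", "node-c"]))
# exec-1 and exec-3 leave the dead node-a; exec-2 stays put on node-b
```

The key property is that only the pods on the failed node move—everything already in a healthy place is left alone, which keeps recovery cheap.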
Understanding this “brain vs. muscle” dynamic is the first step toward building resilient data platforms. In the next post, we’ll look at the actual objects we deploy into this architecture—Pods, Deployments, and Services.
Next: Decoding Kubernetes Architecture - Part 2
Previous: Building Your Own Data Images
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0