Big Data and Distributed Systems - Chapter 11 Retelling
At some point, your data gets too big for one machine. That’s not a hypothetical: Netflix, Google, and Amazon all hit that wall years ago. The question is: what do you do when a single server can’t keep up?
Chapter 11 answers that question. It covers what makes data “big,” how distributed systems work, and the two frameworks that defined modern data processing: Hadoop and Apache Spark.
The Five V’s of Big Data
Big data isn’t just “a lot of data.” It has specific characteristics. The book walks through five of them:
Volume - the sheer amount. Meta processes terabytes daily. Traditional databases can’t handle that scale.
Velocity - the speed. Data streams in real time now. Fraud detection and trading systems need answers in seconds, not hours.
Variety - the formats. Not just rows and columns. JSON, XML, videos, images, social media posts. Most valuable big data is semi-structured or unstructured.
Veracity - the quality. More data means more noise. Duplicates, incomplete entries, fake accounts. Faulty data means faulty decisions.
Value - the point. Collecting data is meaningless unless you extract something useful. Spotify recommends songs from your listening habits. The patterns only appear at scale.
Distributed Systems: The Short Version
A distributed system is a collection of independent computers that look like one system to the user. A big task arrives, gets split into chunks, each chunk goes to a different node, they all work simultaneously, and the results get combined. The user sees one output as if one computer did it all.
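That split-work-combine pattern can be sketched in a few lines of plain Python. This is a toy illustration, not a real cluster: the "nodes" here are just threads on one machine, and `distributed_sum` is a name I made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "node" works on its own slice of the data independently.
    return sum(chunk)

def distributed_sum(data, num_nodes=4):
    # Split the big task into roughly equal chunks, one per node.
    size = max(1, len(data) // num_nodes)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # All chunks are processed simultaneously, then the partial
    # results get combined into one output.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = pool.map(process_chunk, chunks)
    return sum(partials)

print(distributed_sum(list(range(1, 101))))  # 5050, as if one computer did it all
```

The user calls one function and sees one answer; the splitting and combining are invisible, which is the whole point.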
Key Properties
The book covers nine properties. Here are the key ones:
Scalability - vertical (bigger machine) or horizontal (more machines). Vertical has limits. Horizontal is what powers big data.
Fault tolerance - the system keeps working when nodes crash. Spark uses checkpointing to resume from the last successful state.
Consistency - all nodes agree on the data. Strong consistency means every read sees the latest write immediately. Eventual consistency (Twitter, Facebook) lets replicas sync up over time.
Availability - the system stays up. AWS promises 99.99% uptime across multiple data centers.
Load balancing - distributes work evenly so no node gets overwhelmed.
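The simplest load-balancing strategy is round robin: hand each incoming request to the next node in the rotation. A minimal sketch (the class name and node labels are my own invention for illustration):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy load balancer: route each request to the next node in turn,
    so no single node gets overwhelmed."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def route(self, request):
        node = next(self._nodes)
        return node, request

lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
for req in ["r1", "r2", "r3", "r4"]:
    print(lb.route(req))  # r4 wraps back around to node-a
```

Real balancers also weigh nodes by capacity and current load, but round robin is the baseline everything else builds on.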
Hadoop: Where It All Started
In 2003-2004, Google published papers about the Google File System and MapReduce. Doug Cutting and Mike Cafarella built an open-source version called Hadoop. Its main components are HDFS for storage, MapReduce for processing, and YARN for resource management; the first two are covered below.
HDFS (Hadoop Distributed File System)
HDFS is the storage layer. It takes large files, breaks them into blocks (usually 128 MB or 256 MB each), and spreads those blocks across multiple machines.
The architecture has two key parts:
- NameNode - the master. It tracks which data blocks live on which machines. It doesn’t store actual data, just metadata.
- DataNodes - the workers. They store the actual data blocks.
Every block is replicated three times across different nodes. If one node dies, the data still exists on two others. No single point of failure.
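Here is a toy version of the bookkeeping the NameNode does: split a file into fixed-size blocks and record which DataNodes hold each replica. Real HDFS placement is rack-aware and more sophisticated; this simple rotation, and the function name `place_blocks`, are assumptions for illustration only.

```python
def place_blocks(file_size_bytes, datanodes,
                 block_size=128 * 1024 * 1024, replication=3):
    """Split a file into 128 MB blocks and assign each block to
    `replication` distinct DataNodes -- metadata only, no actual data,
    just like the NameNode."""
    num_blocks = -(-file_size_bytes // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Rotate through the nodes so each block's three replicas
        # land on three different machines.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 400 MB file becomes 4 blocks (3 full + 1 partial), each stored 3 times.
plan = place_blocks(400 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

If `dn1` dies, every block it held still exists on two other nodes, so nothing is lost.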
MapReduce
MapReduce is the processing engine. A large file gets split into blocks. The Map phase processes each block independently, producing key-value pairs. Shuffle and sort groups values by key. The Reduce phase aggregates grouped values into final results. Classic example: counting words. Map outputs ("hello", 1) for each occurrence. Reduce adds them up: ("hello", 47).
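The word-count example above fits in a few lines of plain Python. This runs on one machine, so it only mimics the phases; in real MapReduce each map call would run on a different node, and the shuffle would move data across the network.

```python
from collections import defaultdict

def map_phase(block):
    # Map: emit ("word", 1) for every occurrence in this block.
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    # Shuffle and sort: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in grouped.items()}

blocks = ["hello world hello", "world of data"]          # two independent blocks
pairs = [p for block in blocks for p in map_phase(block)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'hello': 2, 'world': 2, 'of': 1, 'data': 1}
```

Each phase only needs its own input, which is why the map work parallelizes so cleanly across machines.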
The Problem with MapReduce
MapReduce worked, but it was slow. After every step, it wrote results to disk, then the next step read from disk again. Like saving your essay after every sentence and reopening the file. It also couldn’t handle real-time workloads or iterative tasks like machine learning.
Apache Spark: The Faster Way
Spark was created to fix MapReduce’s problems. Open sourced in 2010, Apache top-level project by 2014. Up to 100 times faster than MapReduce. The secret? In-memory computing.
RDDs, DataFrames, and Lazy Evaluation
The core data structure is the RDD (Resilient Distributed Dataset), which splits data across machines for parallel processing. Unlike MapReduce, Spark caches RDDs in memory. Later, Spark added DataFrames (tables with rows and columns) and Datasets (type-safe tables for Scala and Java).
Spark uses lazy evaluation: when you filter or sort, it doesn’t run right away. It builds a plan. Only when you ask for results does it execute, choosing the most efficient path.
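Lazy evaluation is easy to mimic in plain Python. This toy class (my own sketch, not Spark's API) records transformations into a plan and only executes when you call `collect`, which is how Spark's transformation/action split works:

```python
class LazyDataset:
    """Toy lazy evaluation: filter/map only record a step in the plan;
    nothing runs until collect() -- the 'action' -- is called."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []

    def filter(self, predicate):
        return LazyDataset(self._data, self._plan + [("filter", predicate)])

    def map(self, fn):
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def collect(self):
        # Only now does the recorded plan actually execute.
        result = self._data
        for op, fn in self._plan:
            if op == "filter":
                result = [x for x in result if fn(x)]
            else:
                result = [fn(x) for x in result]
        return result

ds = LazyDataset(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# No work has happened yet. collect() triggers the whole pipeline:
print(ds.collect())  # [0, 4, 16, 36, 64]
```

Because Spark sees the whole plan before running anything, it can reorder and fuse steps into the most efficient execution path; this sketch skips that optimization and just replays the plan in order.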
Spark Architecture
Spark follows a driver-worker model:
```
Driver Program
      |
      v
Cluster Manager (YARN, Kubernetes, or Standalone)
      |
      v
Worker Nodes -> Executors -> Tasks
```
The Driver Program runs your code and builds a DAG (Directed Acyclic Graph), which is the execution plan. The Cluster Manager (YARN, Kubernetes, or standalone) allocates resources. Executors on worker nodes do the actual computation in parallel and cache data in memory. The DAG Scheduler breaks the plan into stages of parallelizable tasks.
Spark vs MapReduce
| Feature | Spark | MapReduce |
|---|---|---|
| Speed | In-memory, much faster | Disk-based, slower |
| Processing | DAG for optimized execution | Step-by-step, sequential |
| Languages | Python, Scala, Java, R, SQL | Mostly Java |
| Fault tolerance | RDD lineage for recovery | HDFS replication |
| Use cases | Batch, streaming, ML, graph | Batch only |
| ML support | Built-in MLlib | None |
Spark wins on almost every metric. MapReduce is still around in legacy systems, but for new projects, Spark is the default.
Big Data File Formats
When you work with big data, CSV and JSON won’t cut it. You need specialized formats designed for scale. The book covers three:
Avro
Row-based, binary format. Stores the schema alongside the data (in JSON). Best for write-heavy workloads and real-time streaming (like Kafka pipelines). Great schema evolution support, meaning you can add or remove fields without breaking things.
Parquet
Column-based, binary format. The go-to for analytics. Need three columns out of fifty? Parquet reads just those three. Compresses well too, because similar values in a column pack tighter than mixed values in a row. Widely supported across AWS, GCP, and most analytics tools.
ORC (Optimized Row Columnar)
Another column-based format, optimized for Hadoop and Hive. Built-in indexes for faster queries. Sometimes smaller files than Parquet, but less cross-platform support.
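The row-versus-column distinction is the key idea behind all three formats, and it's easy to see in plain Python. This sketch (the `to_columnar` helper and the sample fields are mine, not any library's API) regroups row-based records into columns the way Parquet and ORC lay data out on disk:

```python
def to_columnar(rows):
    """Regroup row-based records (how Avro stores data) into columns
    (how Parquet and ORC store it). An analytics query that needs one
    field can then read just that column and skip everything else."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

rows = [
    {"user": "ana", "country": "PT", "plays": 12},
    {"user": "ben", "country": "US", "plays": 7},
]
cols = to_columnar(rows)
# Reading only "plays" touches none of the other data:
print(cols["plays"])  # [12, 7]
```

It also hints at why columnar formats compress so well: `[12, 7]` is a run of same-typed, similar values, which packs much tighter than a mixed-type row.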
Which One to Pick?
Analytics queries? Parquet or ORC. Write-heavy or streaming? Avro. AWS? Parquet is the default. Hadoop/Hive? ORC. Schema evolution? Avro handles it best.
My Take
This is where data engineering starts to feel different from regular software engineering. You’re not just writing code. You’re thinking about how data moves across machines and how to handle failures.
Hadoop is good for historical context, and you’ll still see HDFS in older systems. But Spark is where you should invest your learning time. Don’t memorize every property of distributed systems. Understand the trade-offs instead. When the network partitions, you can’t have perfect consistency and perfect availability at the same time (that’s the CAP theorem). Knowing that will serve you better than definitions.
This is part 15 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: Data Governance | Next: Cloud Data Engineering > |