Data Processing With Apache Spark - Study Notes From Data Engineering With Python Ch 14
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
At some point, your data gets too big for one machine. That’s not a hypothetical. Netflix, Google, and Amazon all hit that wall years ago. The question is: what do you do when a single server can’t keep up?
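Spark is the answer this chapter reaches for: one programming model that runs the same way on your laptop and on a cluster. To ground the rest of these notes, here’s a minimal PySpark sketch of that core workflow: start a session, load data into a distributed DataFrame, run an aggregation in parallel. It assumes you’ve installed pyspark, and the file name events.csv and its user_id/bytes columns are hypothetical stand-ins, not anything from the book.

```python
# A minimal PySpark sketch: read a CSV and aggregate it with the DataFrame API.
# Assumes pyspark is installed (pip install pyspark); events.csv and its
# user_id/bytes columns are hypothetical stand-ins for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is the entry point. master("local[*]") uses every local
# core; pointed at a real cluster, the same code scales out unchanged.
spark = (
    SparkSession.builder
    .appName("ch14-notes")
    .master("local[*]")
    .getOrCreate()
)

# Spark splits the file into partitions that can be processed in parallel
# across cores or machines.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# groupBy/agg are lazy transformations; nothing actually runs until an
# action like show() forces Spark to build and execute the physical plan.
totals = (
    events.groupBy("user_id")
    .agg(F.sum("bytes").alias("total_bytes"))
    .orderBy(F.desc("total_bytes"))
)

totals.show(10)
spark.stop()
```

The design point worth noticing: because the transformations are lazy, Spark can plan the whole job before running any of it, and swapping `local[*]` for a real cluster master changes where the work happens, not the code you write.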