Posts

Terraform Cert Guide Chapter 8: Understanding Terraform Configuration Files

Chapter 8 of Ravi Mishra’s book is one of those chapters that sounds basic on the surface but actually ties a lot of loose ends together. You’ve been writing Terraform code for seven chapters now, but this is where you stop and really understand the anatomy of a configuration file. What goes where, why it matters, and how the same patterns work across GCP, AWS, and Azure.

Golang DSA Chapter 9 Part 1: Graphs and Network Representation

Chapter 9 of “Learn Data Structures and Algorithms with Golang” by Bhagvan Kommadi shifts gears into graphs and network structures. If you’ve been following along, we spent the last few chapters on searching, sorting, and hashing. Now we’re getting into something that models the real world more directly: connections between things.

Golang DSA Chapter 8 Part 2: Searching, Recursion, and Hashing

Welcome back. In Part 1 we covered the sorting side of Chapter 8, from bubble sort all the way to quick sort. Now we’re picking up the second half: searching algorithms, recursion, and hashing. These are the tools you use when you already have your data and need to find stuff in it, or when you need to transform it for fast lookups.

Golang DSA Chapter 8 Part 1: Sorting Algorithms in Go

Chapter 8 of “Learn Data Structures and Algorithms with Golang” by Bhagvan Kommadi is called “Classic Algorithms.” It covers sorting, searching, recursion, and hashing. That’s a lot of ground, so we’re splitting it into two parts. This first part is all about sorting.

Golang DSA Chapter 7 Part 2: Sequences and Anti-Patterns

Welcome back. In Part 1 we went through dictionaries and TreeSets. This second half of Chapter 7 wraps up TreeSets with synchronized and mutable variants, then moves into some cool mathematical sequences implemented in Go. We also talk about common anti-patterns the book warns about when working with these data structures.

Golang DSA Chapter 7 Part 1: Dictionaries and TreeSets

We’re into Chapter 7 of “Learn Data Structures and Algorithms with Golang” by Bhagvan Kommadi, and this is where things get interesting. The chapter is about dynamic data structures, which are basically collections that can grow and shrink as needed. No fixed sizes, no guessing how much memory you need upfront.

Golang DSA Chapter 6 Part 1: Singly and Doubly Linked Lists

Chapter 6 of “Learn Data Structures and Algorithms with Golang” is all about heterogeneous data structures. That’s a fancy way of saying “data structures that can hold different types of data.” Think integers, floats, strings, whatever you need, all mixed together. Linked lists and ordered lists are the main examples here.

Golang DSA Chapter 5 Part 1: Arrays and Multi-Dimensional Arrays

Chapter 5 of “Learn Data Structures and Algorithms with Golang” by Bhagvan Kommadi shifts gears from trees and hash tables into something more math-heavy: homogeneous data structures. That basically means data structures where every element is the same type. Think arrays of integers, matrices of floats, that kind of thing.

Golang DSA Chapter 4 Part 1: Trees in Go

Up to this point in the book, everything we covered was linear. Lists, stacks, queues, heaps, all of them store data in a straight line. One element after another. Chapter 4 is where things get interesting because we’re moving into non-linear data structures.

Data Engineering With AWS Chapter 12: Visualizing Data With Amazon QuickSight

This is post 18 in my Data Engineering with AWS retelling series.

We have spent eleven chapters ingesting data, transforming data, cataloging data, querying data. But here is a simple truth: nobody wants to stare at 10,000 rows in a spreadsheet. Our brains are not built for that. We process pictures way faster than text. A well-designed chart can tell you in two seconds what would take twenty minutes to figure out from raw numbers.

Golang DSA Chapter 2 Part 2: Slices, Maps, and Go Patterns

Welcome back. In Part 1 we covered arrays, basic slices, two-dimensional slices, and maps. That was the foundation. Now Kommadi moves into the more interesting Go patterns: variadic functions, defer and panic, and a full CRUD web application that ties it all together. He also shows more advanced slice operations along the way.

Wrapping Up: Big Data on Kubernetes

We have reached the end of our deep dive into Big Data on Kubernetes by Neylson Crepalde. It has been a massive journey, moving from basic Docker containers to complex, real-time AI pipelines.

Beyond the Basics: The Kubernetes Ecosystem

We have built some incredible pipelines over the last few posts. But if you were to take what we’ve built and put it into production today, you’d quickly realize that there is a lot more to managing a platform than just getting the YAML files right.

Action Models With Bedrock Agents

In the last post, we saw how to give an AI model a “memory” using RAG. But the real game-changer in the Generative AI world is when you let the model actually do things.

Building an End-to-End Big Data Pipeline - Part 3

Batch processing is great for historical reports, but what if you need to know what’s happening right now? In the final part of Chapter 10, Neylson Crepalde shows us how to build a world-class Real-Time Pipeline on Kubernetes.

Building an End-to-End Big Data Pipeline - Part 1

We have spent the last few weeks looking at individual tools like Spark, Airflow, and Kafka. But in the real world, these tools don’t live in isolation. They need to talk to each other to form a complete data pipeline.

Data Engineering With AWS Chapter 9 Part 2: Bridging Data Lake and Data Warehouse

This is post 15 in my Data Engineering with AWS retelling series.

In Part 1, we looked at Redshift internals – clusters, slices, distribution styles, sort keys. All the pieces that make a data warehouse fast. But a warehouse sitting in isolation is not very useful. Data needs to flow in from your data lake, and sometimes it needs to flow back out. Part 2 of Chapter 9 covers that bridge between S3 and Redshift, including Redshift Spectrum, the COPY and UNLOAD commands, and a hands-on exercise that ties it all together.

Real-Time Visualization With Elasticsearch and Kibana

Trino is great for querying your historical data on S3, but for real-time streams and text-heavy search, you need something different. In the second half of Chapter 9, Neylson Crepalde introduces the industry standard for real-time analytics: Elasticsearch and Kibana.

The Data Consumption Layer - Querying With Trino

You’ve built your ingestion, you’ve processed your data with Spark, and it’s all sitting neatly in your S3 “Gold” bucket. Now what? You can’t ask every business analyst to learn PySpark just to see last month’s sales.

Blockchain and Banking: Why This Tech Is Actually a Big Deal

Ever feel like the banking world is just a bunch of old buildings and slow apps? Well, things are actually moving pretty fast behind the scenes. I just finished reading Blockchain and Banking: How Technological Innovations Are Shaping the Banking Industry by Pierluigi Martino, and it’s a real eye-opener.

Deploying the Big Data Stack on Kubernetes - Part 1

We’ve explored Spark, Airflow, and Kafka as individual tools. But the real goal of Neylson Crepalde’s book is to show you how to run them all as a cohesive “stack” on Kubernetes. In Chapter 8, we finally start the heavy lifting of deployment.

Real-Time Streaming With Apache Kafka - Part 2

Architecture is great, but let’s actually run some code. In the second half of Chapter 7, Neylson Crepalde walks us through setting up a multi-node Kafka cluster right on our local machine using Docker Compose.

Real-Time Streaming With Apache Kafka - Part 1

In the world of big data, “batch” is no longer enough. We need data the second it happens. Whether it’s tracking stock prices, monitoring website traffic, or detecting fraud, you need a system that can handle massive streams of events with zero downtime.

Orchestrating Pipelines With Apache Airflow - Part 1

If Spark is the engine, then Apache Airflow is the conductor. In a modern data stack, you rarely have just one job running in isolation. You have ingestion, cleaning, processing, and delivery—and they all have to happen in a specific order.

Distributed Processing With Apache Spark - Part 1

If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.

The Tools of the Modern Data Stack

We’ve talked about the architecture, but what about the actual tools? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.

The Evolution of Data Architecture

We’ve all heard the terms “Data Warehouse” and “Data Lake,” but do you actually know why we keep switching between them? In Chapter 4 of Big Data on Kubernetes, Neylson Crepalde gives a masterclass on how data architecture has evolved to keep up with the modern world.

Scaling to the Cloud With Amazon EKS

Testing things locally with Kind is great, but big data usually needs big iron. In this part of the hands-on journey, Neylson Crepalde shows us how to scale up to a managed cloud environment.

Local Kubernetes With Kind

Reading about architecture is one thing, but actually seeing a cluster run is where it sticks. In the third chapter of Big Data on Kubernetes, Neylson Crepalde moves from theory to practice.

Decoding Kubernetes Architecture - Part 1

If you want to run big data workloads on Kubernetes, you have to understand how the system is actually put together. It’s not just “magic cloud stuff”—it’s a carefully coordinated cluster of machines.

Building Your Own Data Images

In my last post, we talked about why containers are the bedrock of modern data engineering. But honestly, just running other people’s images only gets you so far. The real magic happens when you start packaging your own custom code.

Why Containers Are a Must for Data Engineers

If you are working with data today, you can’t really ignore containers. They have become the standardized unit for how we develop, ship, and deploy software. But why do we care so much about them in the big data world?

Rethinking Data Infrastructure: Big Data on Kubernetes

We are living in a world where data is basically everywhere. From your phone to social media and every single online purchase, the amount of info we generate is staggering. But here’s the thing: just having data isn’t enough. You have to be able to process it, and that’s where things get complicated.

Data Engineering With AWS Chapter 7 Part 2: Transforming Data - Optimization and Business Logic

This is post 12 in my Data Engineering with AWS retelling series.

In Part 1, we covered the generic data preparation transforms: converting to Parquet, partitioning, PII protection, and data cleansing. Those transforms work on individual datasets and do not need much business context. Now we get to the transforms that actually create business value. The ones that combine multiple datasets, add context, flatten structures, and produce the tables that analysts and dashboards consume.

Data Engineering for Beginners - Closing Thoughts on the Full Series

And that’s it. Eighteen posts. Thirteen chapters. One complete walkthrough of “Data Engineering for Beginners” by Chisom Nwokwu.

When I started this series, I said I wanted to retell the book in my own words. Not a summary, not a copy. My take on what each chapter covers and why it matters. Now that I’m at the end, let me step back and share my overall impressions.

Final Thoughts on Data Science Foundations by Mariadas and Huke

Nineteen posts. Sixteen chapters. One book. And here we are at the end.

When I started this retelling of Data Science Foundations: Navigating Digital Insight by Stephen Mariadas and Ian Huke (ISBN: 978-1-78017-6994, BCS 2025), I was not sure how it would go. Some books lose steam halfway. Some start strong and fizzle. But this one stayed consistent from first chapter to last.

Final Thoughts on Python and R for the Modern Data Scientist

So we made it through the whole book. And honestly? It was worth the ride.

What This Book Got Right

The biggest thing Scavetta and Angelov got right is the framing. They didn’t write a “Python is better” or “R is better” book. They wrote a “both are useful, here’s when to use which” book. And that’s the mature take.

Data Security for Data Engineers - Chapter 9 Retelling

In 2016, hackers stole personal data of 57 million Uber users and drivers. How? Someone left API credentials in a private GitHub repo. The attackers grabbed those keys, got into AWS, and downloaded everything. Uber didn’t even notice for a year. When they finally found out, they paid the hackers $100,000 to delete the data and kept quiet about it.

When to Use Python vs R - Data Format Context Explained

Chapter 4 is where the book stops teaching you the languages and starts telling you when to use which one. This is Part III, “The Modern Context,” and Boyan Angelov takes the lead here. The question is simple: given a specific data format, which language gives you a better experience?

Pipeline Orchestration With Airflow, DAGs, and Data Transformations

This is Part 2 of Chapter 7, continuing from batch and streaming basics.

In Part 1, we covered how batch and streaming pipelines move data around. But here is the thing: having a pipeline is one thing. Making sure all its parts run in the right order, at the right time, without you babysitting it? That is orchestration. And this is where Chapter 7 gets really practical.

Data Pipelines: Batch vs Streaming and When to Use Each

This is Part 1 of Chapter 7. Part 2 covers orchestration and transformations.

Chapter 7 of Data Engineering for Beginners is probably where things start feeling real. You stop talking about storage and tables and start talking about how data actually moves. And the answer is: through pipelines.

NiFi Registry Version Control - Study Notes From Data Engineering With Python Ch 8

You’ve been building data pipelines for several chapters now. They work. They move data. But here’s the problem: none of them have version control. If you break something, there’s no going back. Chapter 8 of Data Engineering with Python by Paul Crickard fixes that. It introduces the NiFi Registry, a sub-project of Apache NiFi that handles version control for your data pipelines.

The Origin Stories of Python and R - Chapter 1 Retelling

Chapter 1 is titled “In the Beginning” and it’s written by Rick Scavetta. He opens with a tongue-in-cheek Dickens reference, saying it’s just the best of times for data science. But to understand where we are, we need to look at where Python and R came from. Their origin stories explain why they feel so different today.

Data Engineering With GCP Chapter 7: Making Data Visual With Looker Studio

You spend weeks building pipelines, modeling data, setting up orchestration. Everything works. Data lands in BigQuery clean and on time. And then someone from the business side asks: “So… where do I see the numbers?” That is exactly where Chapter 7 picks up. All that upstream work has to end somewhere useful, and for most organizations that somewhere is a dashboard.

Data Engineering With GCP Chapter 6 Part 1: Real-Time Data With Pub/Sub

Chapter 6 is where Adi Wijaya switches gears from batch to real-time. After spending Chapters 3 through 5 on batch pipelines with BigQuery, Cloud Composer, and Dataproc, now it is time to talk about streaming data. Two GCP services carry this chapter: Pub/Sub and Dataflow. This post covers the streaming concepts and Pub/Sub. Dataflow gets its own post in Part 2.

Data Science Foundations Chapter 5: The Discovery Phase and Asking the Right Questions

You got a data science project. Great. But before you touch any data, before you write a single line of code, you need to stop and think. That is what Chapter 5 of “Data Science Foundations” by Stephen Mariadas and Ian Huke is about. The discovery phase. The part most people want to skip. And it is the part that saves you from wasting months on something that never had a chance.

SQL Basics: SELECT, WHERE, and Aggregate Functions

This is Part 1 of Chapter 4. Part 2 covers joins and advanced queries.

Chapter 4 is where Nwokwu puts SQL in your hands. No more theory. You write queries, you get results, you learn by doing. If Chapter 3 was about understanding what databases are, this chapter is about talking to them.

Data Engineering With AWS Chapter 6 Part 1: Ingesting Batch Data

This is post 9 in my Data Engineering with AWS retelling series.

You have your whiteboard architecture from Chapter 5. You know who your data consumers are and what they need. Now it is time to actually move data. Chapter 6 covers data ingestion – getting data from wherever it lives into your AWS data lake. This first part focuses on batch ingestion from databases and files. Part 2 covers streaming.

Data Engineering With GCP Chapter 1: What Is Data Engineering Anyway?

Chapter 1 starts with a confession most of us in the data world can relate to. Adi Wijaya says he used to think data was clean. Neatly organized, ready to go. Then he actually worked with data in real organizations and realized most of the effort goes into collecting, cleaning, and transforming it. Not the fun machine learning part. The plumbing part.

Data Engineering With AWS Chapter 1: What Even Is Data Engineering?

If someone told you twenty years ago that data would become more valuable than oil, you would have laughed. But here we are. The most valuable companies on the planet are not drilling for crude. They are collecting, processing, and squeezing insights out of massive piles of data. And behind every one of those companies, there is a team of data engineers making it all work.

Reading the Room: Stock Sentiment Analysis With NLP

Stocks aren’t just driven by math; they’re driven by people. And people are emotional. In Chapter 14 of Data Analytics for Finance Using Python, we look at Natural Language Processing (NLP)—a way to turn human chatter into useful data.

Systems Thinking Chapter 11: Systems Leadership

Chapter 11 is about leadership. But not the kind you see on LinkedIn where someone posts a sunset photo and writes “leaders eat last.” Diana is talking about something very different. Systems leadership is about improving how knowledge flows through your organization. Not about your title, not about your authority, not about how many people report to you.

Standing Out From the Mean: Assessing Stock Risk With the Z-Score

If you’ve ever heard someone say a stock’s price is “three standard deviations away from the mean,” they’re talking about Z-Scores. In Chapter 11 of Data Analytics for Finance Using Python, we explore how to use this tool to find the “weird” data points that might actually be opportunities.

Systems Thinking Chapter 10: Modeling Together - Part 1

Chapter 10 is a big one, so I’m splitting it into two parts. This is Part 1 of 2.

Diana opens with a Donella Meadows quote that sets the tone for everything that follows: get your model out where people can see it, invite others to challenge it. That’s the whole chapter in one sentence, really. But of course there’s much more to unpack.

Which One Is Riskier? Assessing Stock Risk With the F-Test

If you’re choosing between two stocks, you don’t just want to know which one has a higher return; you want to know which one is more likely to give you a heart attack. In Chapter 9 of Data Analytics for Finance Using Python, we look at the F-Test as a way to compare risk.

Systems Thinking Chapter 8: Designing Feedback Loops

When you hear “feedback loop” you probably think about monitoring dashboards. Or autoscaling. Or maybe that annoying annual performance review your manager gives you. Diana Montalion says all of that is too narrow. Chapter 8 is about feedback loops for thinking. Not for servers.

Big Data for the Rest of Us: A Deep Look at Hadoop 3

So, you’ve heard about big data. It’s everywhere. But how do you actually handle it? If you’re looking for the OG of big data platforms, you’re looking at Hadoop. And honestly, it’s still the foundation for almost everything we do in data today.

About

About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
