NLP Pipeline Fundamentals: Data Pipelines, the Agentic Reliability Crisis, and Why Classic NLP Still Matters
Chapter 1 of Laura Funderburk’s book opens with something I wish more people in the AI space would say out loud: the era of pure experimentation with LLMs is over. We’re past the “look what ChatGPT can do” stage. The real question now is: can you trust this thing in production?
That sets the tone for the whole chapter. Let me walk you through it.
The Agentic Reliability Crisis
Here’s the problem Funderburk frames at the very start. We’re in 2026. Companies are moving AI agents from isolated pilots to real enterprise workflows. And things are breaking.
She calls this the agentic reliability crisis. An AI agent is only as good as the data and tools it gets. Feed it flawed, unverified, or messy data and you don’t just get bad answers. You get cascading hallucinations, wasted compute, and security holes where the agent takes damaging actions based on garbage input.
The book’s core thesis is simple: you don’t fix this with better prompts. You fix it with better data pipelines. The same boring, rigorous data engineering practices from classic data science? Those are the foundation for reliable agentic systems.
I like this take because it’s honest. Most AI discourse focuses on model capabilities. Funderburk says the plumbing matters more than the engine.
What Are Data Pipelines, Really?
A data pipeline is just a set of processes that move data from one system to another, transforming it along the way. You collect it, process it, store it, analyze it, model it, and serve the results. Nothing flashy.
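Those stages can be sketched as a chain of small functions. This is a toy illustration, not code from the book; the function names and the in-memory "database" are my own stand-ins:

```python
# Toy pipeline sketch: collect -> process -> store -> serve.
# Names and the in-memory "db" list are illustrative assumptions.

def collect() -> list[str]:
    # Stand-in for pulling raw records from a source system.
    return ["  Hello World ", "DATA pipelines!!"]

def process(records: list[str]) -> list[str]:
    # Transform step: trim whitespace and lowercase each record.
    return [r.strip().lower() for r in records]

def store(records: list[str], db: list[str]) -> None:
    # Stand-in for writing to a database or warehouse.
    db.extend(records)

def serve(db: list[str]) -> list[str]:
    # Downstream consumers read the cleaned records.
    return db

db: list[str] = []
store(process(collect()), db)
print(serve(db))  # ['hello world', 'data pipelines!!']
```

The point isn't the code, it's the shape: each stage has one job, and the output of one stage is the input of the next.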
Funderburk uses a coffee analogy that I actually think works well. Raw data is like whole coffee beans. You need to grind them, brew them, and serve them right. The method you use changes the result completely. Same with data.
She also brings in data mesh concepts, originated by Zhamak Dehghani and published on Martin Fowler's site. Four principles worth knowing:
- Domain-oriented ownership - the people closest to the data manage it
- Data as a product - treat your pipeline output like a real product with quality standards
- Self-serve infrastructure - give developers standard tools to build their own pipelines
- Federated governance - decentralized pipelines still follow global security and interoperability rules
Here’s the thing. In old-school data science, the consumer of your pipeline output was a human looking at a dashboard. A human can look at a weird chart and still figure things out. They have intuition.
But now the consumer is an AI agent. And agents have zero tolerance for ambiguity. For an agent, data is not an insight. It’s a command. Bad data doesn’t just produce a bad report. It produces cascading failures, hallucinations, and the agent doing things it was never supposed to do.
So the pipeline’s job has changed. It’s no longer just about transforming raw data for human insights. It’s about transforming raw data for reliable agentic reasoning.
Text Processing: The Classic Toolkit
Since this book focuses on natural language, Funderburk covers the key text processing techniques. Here’s the quick rundown:
Tokenization - breaking text into smaller pieces. “Hello my name is” becomes individual tokens. This is foundational for everything that follows.
Stop word removal - getting rid of words like “the,” “is,” and “a” that don’t carry much meaning. Just be careful not to remove words that change the context: dropping “not” can flip the meaning of a sentence entirely.
Stemming and lemmatization - reducing words to their base form. “Running” becomes “run.” Helps standardize the text.
Part-of-speech tagging - figuring out if a word is a noun, verb, adjective. Helps the system understand what role each word plays.
Named entity recognition (NER) - finding and classifying entities like people, organizations, and places in text.
Text normalization - converting everything to a consistent format: lowercasing, cleaning up whitespace and punctuation, standardizing the character encoding.
TF-IDF - a statistical measure of how important a word is in a collection of documents. If “cat” shows up a lot in one document but rarely in others, it’s probably important for that document.
Text embeddings - converting words, sentences, or documents into vectors of numbers that capture semantic relationships.
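A few of these steps are easy to sketch in plain Python. The tiny stop-word list, the regex-based normalizer, and the toy corpus below are all made up for illustration; a real pipeline would lean on libraries like NLTK or spaCy:

```python
import math
import re

STOP_WORDS = {"the", "is", "a", "my", "in"}  # tiny illustrative list

def normalize(text: str) -> str:
    # Text normalization: lowercase and strip punctuation.
    return re.sub(r"[^\w\s]", "", text.lower())

def tokenize(text: str) -> list[str]:
    # Word-level tokenization by whitespace (real tokenizers are subtler).
    return normalize(text).split()

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    # TF: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # IDF: penalize terms that appear in many documents.
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

docs = [tokenize(t) for t in [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "A report on data pipelines.",
]]
print(remove_stop_words(tokenize("Hello, my name is Ada")))
# ['hello', 'name', 'ada']
print(tf_idf("cat", docs[0], docs))  # positive: "cat" is rare-ish in the corpus
```

Note how "cat" gets a positive score in the first document because it appears there but not everywhere, which is exactly the intuition from the TF-IDF definition above.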
The two most critical techniques for modern NLP pipelines are tokenization and embeddings. Tokenization breaks continuous text into discrete units that a model can track. Embeddings turn those units into numerical vectors that capture the actual meaning and relationships between words. Without these two steps, an LLM literally cannot see patterns in language.
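To make the embedding idea concrete, here's a toy sketch with hand-invented 3-dimensional vectors. Real embeddings come from trained models and have hundreds of dimensions; the numbers here are fabricated purely to show how cosine similarity reads semantic closeness off vectors:

```python
import math

# Invented toy "embeddings": similar meanings get similar vectors.
embeddings = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.75, 0.20],
    "car":    [0.10, 0.20, 0.90],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # close to 1.0
print(cosine_similarity(embeddings["cat"], embeddings["car"]))     # much lower
```

"cat" and "kitten" point in nearly the same direction, so their similarity is near 1; "cat" and "car" don't, so it's much lower. That geometric closeness is what "capturing semantic relationships" means in practice.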
Classic NLP Is Not Dead. It’s Reborn.
This is my favorite part of the chapter. A common misconception in 2025 is that LLMs make classic NLP techniques obsolete. Funderburk says the opposite is true.
Sure, an LLM can do NER or sentiment analysis. But here’s the problem: an LLM is a probabilistic, non-deterministic system. A fine-tuned classifier or a rule-based NER model? That’s predictable, fast, and cheap.
This distinction motivates a clean separation of concerns. You build deterministic pipelines using classic NLP techniques, then use those pipelines as tools for an LLM-powered agent. Best of both worlds. Reliability of tested approaches, plus the nuance of LLM reasoning.
This brings up a core architectural pattern the book keeps coming back to: tool layer versus orchestration layer.
- Orchestration layer - the brain. A reasoning engine like LangGraph that manages high-level logic and decides what to do.
- Tool layer - specialized, high-performance tools the orchestrator calls for specific tasks.
Classic NLP pipelines become the building blocks of these tools. Tokenization, stemming, NER, sentiment analysis. They’re not standalone artifacts from a previous era. They’re the components inside reliable agentic tools.
Instead of asking one massive LLM to detect locations AND do sentiment analysis AND answer questions, you build each as a separate microservice tool. The LLM orchestrator delegates to them. Much more robust, debuggable, and governable than a monolithic approach.
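A minimal sketch of that pattern: deterministic tool functions in the tool layer, and a dispatcher standing in for the orchestration layer. The tool implementations and the keyword-keyed router are my own stand-ins, not the book's code; a real system would use a reasoning engine like LangGraph to decide which tool to call.

```python
# Tool layer: deterministic, fast, individually testable functions.

def detect_locations(text: str) -> list[str]:
    # Stand-in for a rule-based or fine-tuned NER model.
    known = {"Paris", "Tokyo", "Berlin"}
    return [w.strip(".,!") for w in text.split() if w.strip(".,!") in known]

def classify_sentiment(text: str) -> str:
    # Stand-in for a cheap fine-tuned sentiment classifier.
    return "positive" if "love" in text.lower() else "neutral"

TOOLS = {"locations": detect_locations, "sentiment": classify_sentiment}

def orchestrate(task: str, text: str):
    # Stand-in for the orchestration layer: route the task to a tool.
    return TOOLS[task](text)

print(orchestrate("locations", "I love Paris and Tokyo."))  # ['Paris', 'Tokyo']
print(orchestrate("sentiment", "I love Paris and Tokyo."))  # positive
```

Each tool can be unit-tested and versioned on its own, which is exactly what makes this architecture more debuggable and governable than one monolithic prompt.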
The takeaway: mastering classic data science skills (building pipelines for preprocessing, tokenization, embedding) is a prerequisite for building reliable agents. Not optional. Required.
In Part 2, we’ll get into tokenization types, how embeddings actually work, the dual role of LLMs in modern systems, and the evolution from classic pipelines to full agentic architectures.
This is post 2 of 24 in the Building Natural Language and LLM Pipelines series.
Based on Chapter 1 of “Building Natural Language and LLM Pipelines” by Laura Funderburk (ISBN: 978-1-83546-799-2, Packt Publishing, 2025).
Previous: Book Intro and Series Overview
Next: Chapter 1, Part 2: Tokenization, Embeddings, and the Dual Role of LLMs