NLP Pipeline Fundamentals Part 2: Tokenization, Embeddings, LLM Roles, and the Road to Agentic Pipelines

In Part 1 we covered the agentic reliability crisis, what data pipelines are, and why classic NLP techniques are being reborn as tools for AI agents. Now let’s get into the specifics: how tokenization and embeddings actually work, what LLMs are, and the two very different roles they play in modern agentic systems.

Tokenization: More Than One Way to Split Text

Tokenization is the process of breaking continuous text into discrete units. Words, characters, sub-words, phrases. It sounds simple but there are real trade-offs between the approaches.

Word tokenization splits text on spaces and delimiters. “Hello my name is” becomes four tokens. The problem? It chokes on words not in its vocabulary. If it never saw “defenestration” during training, it has no idea what to do with it. These are called out-of-vocabulary (OOV) words.

Character tokenization breaks everything into individual characters. No OOV problem since every word is just a sequence of known characters. But the input length explodes and the model struggles to learn relationships between characters. Think about it: learning that “c-a-t” means a furry animal is much harder than just knowing the word “cat.”
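A two-line comparison makes the length trade-off concrete:

```python
# Compare word-level and character-level tokenization of the same sentence.
sentence = "Hello my name is"

word_tokens = sentence.split()  # split on whitespace
char_tokens = list(sentence)    # every character, including spaces

print(word_tokens)       # ['Hello', 'my', 'name', 'is'] -> 4 tokens
print(len(char_tokens))  # 16 tokens for the same text
```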

Sub-word tokenization is the middle ground. It splits text into smaller meaningful units (sub-word tokens), so rare words can be built from pieces the model already knows. “unhappiness” might become “un” + “happi” + “ness.” This handles unknown words while keeping sequences manageable.

Byte pair encoding (BPE) is a popular sub-word method. It iteratively merges the most frequent character pairs. Most modern LLMs use some variant of this.
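As a rough illustration (a toy sketch, not any production tokenizer), one BPE training step counts adjacent symbol pairs across the corpus and merges the most frequent pair into a single new symbol. The corpus and helper functions below are my own examples:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: each word is a tuple of characters with a frequency.
corpus = {tuple("lower"): 5, tuple("low"): 7, tuple("newest"): 3}
pair = most_frequent_pair(corpus)   # ('l', 'o') -- seen 12 times
corpus = merge_pair(corpus, pair)   # "lower" is now ('lo', 'w', 'e', 'r')
```

Real tokenizers repeat this merge step thousands of times, building a vocabulary of progressively larger sub-word units.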

SentencePiece is a data-driven tokenizer used for neural network text generation. It treats the input as raw text without assuming word boundaries, which makes it work across different languages.

Each approach has drawbacks. There’s no perfect tokenizer. The choice depends on your language, domain, and performance needs.

Embeddings: Turning Words Into Math

Once text is tokenized, you need to turn those tokens into numbers. That’s where embeddings come in. An embedding converts a word, sentence, or document into a vector of numbers that captures semantic relationships.

Funderburk uses a simple example with four words. Imagine representing them as 3D vectors:

"king"  = [0.8, 0.6, 0.1]
"queen" = [0.8, 0.6, -0.1]
"man"   = [0.4, 0.2, 0.1]
"woman" = [0.4, 0.2, -0.1]

Two things to notice. First, words with similar meanings end up close together in vector space. “King” and “queen” have nearly identical vectors. Second, the relationships between vectors are consistent. The difference between “king” and “queen” is [0, 0, 0.2]. The difference between “man” and “woman”? Also [0, 0, 0.2]. The model has learned a “gender” dimension.

This is how embeddings capture meaning. Not through dictionary definitions, but through the geometric relationships between vectors.
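You can verify the toy vectors above with plain arithmetic; the cosine-similarity helper below is standard, nothing model-specific:

```python
import math

king  = [0.8, 0.6,  0.1]
queen = [0.8, 0.6, -0.1]
man   = [0.4, 0.2,  0.1]
woman = [0.4, 0.2, -0.1]

def cosine(u, v):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The "gender" offset is the same for both pairs: [0, 0, 0.2].
offset_royal  = [k - q for k, q in zip(king, queen)]
offset_common = [m - w for m, w in zip(man, woman)]

# The classic analogy: king - man + woman lands on queen in this toy space.
analogy = [k - m + w for k, m, w in zip(king, man, woman)]
```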

Here’s something Funderburk really stresses: if you’re building a RAG pipeline, you must use the same embedding model for both indexing (storing your documents) and querying (searching them). Using different models is like “using a map of Paris to navigate the streets of Tokyo.” The vector spaces won’t match and your pipeline will fail completely.

And there’s a cost angle too. OpenAI’s text-embedding-3-large might give marginal performance gains over text-embedding-3-small, but it costs 6.5x more. Is that worth it? The book teaches you how to measure this with evaluation frameworks and observability tools so you can make data-driven decisions, not guesses.
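A quick back-of-the-envelope check. The per-million-token prices below are illustrative assumptions from the time of writing; check OpenAI’s pricing page for current figures:

```python
# Embedding cost comparison. Prices are illustrative assumptions
# (USD per 1M tokens) -- verify against OpenAI's current pricing.
PRICE_SMALL = 0.02   # text-embedding-3-small
PRICE_LARGE = 0.13   # text-embedding-3-large

tokens_to_index = 500_000_000  # e.g. a 500M-token document corpus

cost_small = tokens_to_index / 1_000_000 * PRICE_SMALL
cost_large = tokens_to_index / 1_000_000 * PRICE_LARGE

print(f"small: ${cost_small:,.2f}  large: ${cost_large:,.2f}  "
      f"ratio: {PRICE_LARGE / PRICE_SMALL:.1f}x")
```

At corpus scale, that ratio turns into real money, which is exactly why the book pushes measurement over guesswork.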

LLMs: The Quick Version

Large language models are deep learning models designed to understand and generate human-like text. Parameter counts range from hundreds of millions into the hundreds of billions. Most are based on the Transformer architecture from the famous “Attention Is All You Need” paper.

They’re pre-trained on massive text corpora, then can be fine-tuned for specific tasks. Well-known examples: OpenAI’s GPT, Google’s BERT.

Their strength is transfer learning. Knowledge from one task transfers to another. Their weakness? They need serious computational resources and they sometimes hallucinate. That’s where techniques like RAG come in, grounding the LLM’s responses in real data from a document store.

The Dual Role of LLMs in Agentic Systems

This is the section that really clicked for me. In a modern agentic workflow, an LLM is not just one thing. It plays two completely different roles in two architectural layers.

Role 1: The LLM as a Specialist (Tool Layer)

Inside a RAG pipeline, the LLM acts as a constrained, specialized engine. It’s not allowed to reason freely. Its job is strictly defined:

  1. It receives a prompt that’s been augmented with verified, retrieved context
  2. It synthesizes an answer using only that provided context

That’s it. No freestyling. No making things up from general knowledge. By constraining the LLM this way, you transform it from an unpredictable creative engine into a reliable, predictable tool. Then you package the entire pipeline as a microservice, deploy it with Docker, and it becomes a production-grade tool that other systems can call.

At this layer, the LLM is not the brain. It’s a high-performance part inside a reliable machine.
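A minimal sketch of that constraint in code. The prompt wording and function names here are my own illustration, not the book’s; the LLM client is passed in as a plain callable:

```python
# Tool-layer pattern: the LLM sees only retrieved, verified context
# and is instructed to answer from that context alone.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def answer_with_rag(question: str, retrieved_chunks: list[str], llm) -> str:
    """Build the augmented prompt and delegate synthesis to the LLM.

    `llm` is any callable taking a prompt string and returning text --
    a placeholder for your real model client.
    """
    context = "\n\n".join(retrieved_chunks)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return llm(prompt)
```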

Role 2: The LLM as a Generalist (Orchestration Layer)

A different LLM acts as the brain of the whole agentic system. Think LangGraph. This LLM’s job is not to know facts or synthesize data. Its job is to reason, plan, and delegate.

The orchestrator gets a user goal (“Find me the best-rated restaurant and book a table”), a set of available tools, and a prompt with instructions. Then it works through a loop:

  1. Thought: “I should search for restaurant options first”
  2. Action: calls the search tool
  3. Observation: receives results
  4. Thought: “Now I need to book one”
  5. Action: calls the booking API tool

The tools it calls are not simple functions. They’re the robust microservices you built at the tool layer. Your RAG pipeline. Your NER service. Your sentiment analyzer.
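In skeleton form, the loop might look like this. Everything here is hypothetical scaffolding (the planner callable, the tool registry, the step limit); frameworks like LangGraph manage this state machine for you:

```python
# A stripped-down sketch of the orchestration loop. `planner` stands in
# for the orchestrator LLM: given the history so far, it returns a
# (thought, tool_name, tool_input) triple.
def run_agent(goal: str, tools: dict, planner, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # Thought: the orchestrator decides the next action.
        thought, tool_name, tool_input = planner(history)
        history.append(f"Thought: {thought}")
        if tool_name == "finish":
            return tool_input  # the final answer
        # Action: delegate to a tool-layer microservice (RAG, NER, ...).
        observation = tools[tool_name](tool_input)
        # Observation: feed the result back into the next reasoning step.
        history.append(f"Action: {tool_name}({tool_input!r})")
        history.append(f"Observation: {observation}")
    return "Step limit reached without finishing."
```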

Shallow Agents vs. Deep Agents

The book makes a useful distinction between two classes of agents.

Shallow agents are simple loops around an LLM. Receive, reason, respond. They use only the context window as memory. Fine for simple tasks but they fall apart on complex, multi-step problems. Context overflow, goal loss, and no error recovery.

Deep agents are hierarchical systems designed for reliability. They have three key properties:

  • Hierarchical delegation - they delegate to specialized tools instead of doing everything themselves
  • Explicit planning - they use orchestrators to create and maintain structured plans
  • Persistent memory - they use external memory (like vector databases) to overcome context window limits

The book’s central argument: a reliable deep agent is impossible without a reliable tool layer. The agent’s reasoning is only as good as the tools it delegates to.

The MLOps/AgentOps Lifecycle

The chapter closes with the modern project lifecycle. It’s no longer a linear path from scoping to deployment. It’s a continuous loop. Build, evaluate, improve, redeploy. Repeat.

The book maps out the journey from pipeline to production agent across several pillars: reliability (RAGAS evaluation, Weights & Biases observability), scalability (Docker, Hayhooks, Kubernetes), interoperability (MCP and A2A protocols), and self-improvement (context engineering, LangSmith).

The shift from “it works on my machine” to enterprise-grade systems. That’s what the rest of the book teaches you to do.

Next up: Chapter 2, where Funderburk goes deep on how large language models actually work. Transformers, attention mechanisms, context engineering, and the techniques for reducing costs when working with non-private LLMs.


This is post 3 of 24 in the Building Natural Language and LLM Pipelines series.

Based on Chapter 1 of “Building Natural Language and LLM Pipelines” by Laura Funderburk (ISBN: 978-1-83546-799-2, Packt Publishing, 2025).

Previous: Chapter 1, Part 1: Data Pipelines and the Agentic Reliability Crisis

Next: Chapter 2, Part 1: Diving Deep into Large Language Models


About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
