Haystack Pipelines: Indexing, Multimodal Processing, and Your First RAG System
Chapter 4 is where you stop reading about components and actually start wiring them together. Laura Funderburk calls it “Bringing Components Together,” and that’s exactly what it is. You take all those building blocks from Chapter 3 and connect them into working pipelines.
Here’s the thing about Haystack pipelines. They are not simple linear chains. They are directed multigraphs. That sounds fancy, but it just means your data can flow in multiple directions at once. You can branch, loop, and merge. This is what makes Haystack flexible enough for real work.
The Six Steps to Any Pipeline
Funderburk lays out a clear recipe that works for every pipeline you’ll build:
- Pick your components for the job (embedders, retrievers, generators, etc.)
- Create the pipeline object with `Pipeline()`
- Add components using `pipeline.add_component(name, component)`
- Connect them with `pipeline.connect("source.output", "target.input")`
- Run it with `pipeline.run({"component": {"input": value}})`
- Visualize it with `pipeline.draw("path/to/image.png")` to get a Mermaid diagram
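To make the recipe concrete, here is a toy plain-Python analogue of that six-step shape. This is not the real Haystack API (the actual `Pipeline` handles multigraphs, named input/output sockets, and validation); it only mirrors the add-connect-run mechanics for a simple linear chain:

```python
class ToyPipeline:
    """Toy stand-in for Haystack's Pipeline (assumes a linear chain)."""

    def __init__(self):
        self.components = {}   # name -> callable component
        self.connections = []  # (source_name, target_name) edges

    def add_component(self, name, component):
        self.components[name] = component

    def connect(self, source, target):
        # Real Haystack connects named sockets ("source.output");
        # here we just wire component to component.
        self.connections.append((source, target))

    def run(self, inputs):
        # Feed the starting component, then follow connections downstream.
        (current, value), = inputs.items()
        value = self.components[current](value)
        moved = True
        while moved:
            moved = False
            for src, dst in self.connections:
                if src == current:
                    value = self.components[dst](value)
                    current = dst
                    moved = True
                    break
        return {current: value}

# Usage: a two-step "cleaner -> splitter" chain.
pipe = ToyPipeline()
pipe.add_component("cleaner", str.strip)
pipe.add_component("splitter", str.split)
pipe.connect("cleaner", "splitter")
result = pipe.run({"cleaner": "  hello world  "})
print(result)  # {'splitter': ['hello', 'world']}
```

The point of the analogue: components are just named callables, and `connect` is what turns a bag of components into a graph.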
That last step is more useful than you’d think. When your pipeline has 10+ components with branches, being able to see the graph as a picture saves you a lot of debugging time.
Branching with Routers
Pipelines get interesting when you add routers. A ConditionalRouter can inspect each incoming query and send it down different paths based on Jinja2 conditions. For example, you could route factual questions (“What is X?”) to one prompt template, semantic questions (“How does X compare to Y?”) to another, and complex multi-part queries to a third.
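The decision itself can be sketched in plain Python. The real ConditionalRouter evaluates Jinja2 expressions over the incoming data, but the logic looks like this (the category names and patterns here are invented for illustration):

```python
def route_query(query: str) -> str:
    """Classify a query the way a ConditionalRouter's conditions might."""
    q = query.lower()
    if " compare " in q or " versus " in q or " vs " in q:
        return "semantic"   # e.g. "How does X compare to Y?"
    if q.startswith("what is") or q.startswith("who is"):
        return "factual"    # e.g. "What is X?"
    return "complex"        # everything else: multi-part queries

print(route_query("What is retrieval-augmented generation?"))   # factual
print(route_query("How does BM25 compare to dense retrieval?")) # semantic
```

Each return value corresponds to one output path, and each path feeds a different prompt template downstream.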
There’s also a FileTypeRouter that sorts documents by MIME type, a MetadataRouter for routing based on document metadata, and a TextLanguageRouter for handling multilingual content. The pattern is the same: one input, multiple output paths.
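The MIME-type sorting that a FileTypeRouter performs can be approximated with the standard library's `mimetypes` module; this sketch just groups paths into lanes by guessed type:

```python
import mimetypes
from collections import defaultdict

def sort_by_mime(paths):
    """Group file paths by guessed MIME type, one lane per type."""
    lanes = defaultdict(list)
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        lanes[mime or "unclassified"].append(path)
    return dict(lanes)

print(sort_by_mime(["notes.txt", "report.pdf", "models.csv", "page.html"]))
# {'text/plain': ['notes.txt'], 'application/pdf': ['report.pdf'],
#  'text/csv': ['models.csv'], 'text/html': ['page.html']}
```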
The Indexing Pipeline
Before you can ask questions, you need to prepare your knowledge base. That’s what the indexing pipeline does. It is an offline process, and it is the foundation of everything.
Funderburk builds an indexing pipeline that handles web pages, text files, PDFs, and CSV files all at once. Here is how the data flows:
Stage 1: Sort by file type. The FileTypeRouter looks at each incoming file and sends it to the right lane: text/plain, application/pdf, text/html, or text/csv.
Stage 2: Process in two branches. Unstructured data (web, text, PDF) goes through converters, gets joined together, cleaned up, and split into 150-word chunks. Structured data (CSV) goes through a separate path where each row becomes its own document. This is key: the CSV splitter uses split_mode="row-wise" so each row like “Company: OpenAI, Model: GPT-4…” becomes a standalone searchable document.
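The row-wise split can be sketched with the standard `csv` module: each row is flattened into one self-describing text chunk. The column names below are invented for illustration, not taken from the book's dataset:

```python
import csv
import io

def rows_to_documents(csv_text: str) -> list[str]:
    """Turn each CSV row into a standalone searchable document string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ", ".join(f"{col}: {val}" for col, val in row.items())
        for row in reader
    ]

docs = rows_to_documents("Company,Model\nOpenAI,GPT-4\nAnthropic,Claude")
print(docs)
# ['Company: OpenAI, Model: GPT-4', 'Company: Anthropic, Model: Claude']
```

Embedding each row as its own document is what lets a query like "which model does OpenAI make?" retrieve exactly one relevant row instead of the whole table.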
Stage 3: Merge, embed, and store. Both branches merge back together. Every chunk gets vectorized by an embedding model, then written to InMemoryDocumentStore.
One smart detail: the pipeline has three layers of error handling. The FileTypeRouter ignores unknown file types. The LinkContentFetcher is set to skip broken URLs instead of crashing. And a custom DocumentSanitizer component filters out empty documents before they reach the splitter. Three simple guards that make the difference between a demo and something production-ready.
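The third guard is the simplest of the three. A custom sanitizer component's core logic reduces to a filter like this (the function name is hypothetical, and the book's version operates on Haystack Document objects rather than raw strings):

```python
def sanitize(documents: list[str]) -> list[str]:
    """Drop documents with no usable text before they reach the splitter."""
    return [doc for doc in documents if doc and doc.strip()]

print(sanitize(["Real content", "", "   "]))  # ['Real content']
```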
Naive RAG: The Baseline
Once your documents are indexed, you build the query pipeline. Naive RAG is the simplest version: retrieve documents, stuff them into a prompt, and let the LLM answer.
It has three stages:
Stage 1: Vectorize the query. The user’s question goes through SentenceTransformersTextEmbedder, the same model used during indexing. This turns the question into a vector in the same space as your documents.
Stage 2: Retrieve context. The InMemoryEmbeddingRetriever compares the query vector against all document vectors and returns the top 3 most similar ones.
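That ranking boils down to cosine similarity plus a top-k cut. Here is a self-contained sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the in-memory retriever is more optimized than this):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=3):
    """Return indices of the top_k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

docs = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(retrieve((1.0, 0.05, 0.0), docs, top_k=3))  # [0, 1, 2]
```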
Stage 3: Augment and generate. The PromptBuilder takes the original question and the retrieved documents and assembles a prompt like:
```
Given the following information...
Context:
[Document 1 content]
[Document 2 content]
Question: [The user's question]
Answer:
```
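PromptBuilder fills a Jinja2 template with these values; the assembly itself is just string formatting. A plain-Python stand-in (not the book's actual template):

```python
def build_prompt(question: str, documents: list[str]) -> str:
    """Assemble a grounded prompt from retrieved context, PromptBuilder-style."""
    context = "\n".join(documents)
    return (
        "Given the following information...\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_prompt("What is Haystack?", ["Haystack is an LLM framework."]))
```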
This goes to the LLM (like OpenAIGenerator), which reads the context and produces an answer grounded in your actual data.
Funderburk calls this a “glass box” architecture, and I think that’s a good description. Every connection is explicit. You can trace exactly how data flows from question to answer. No hidden magic. If something goes wrong, you know where to look.
The Problem with Naive RAG
But here’s the problem. Naive RAG relies 100% on semantic similarity. That works great for conceptual questions. It fails when your query depends on specific keywords, acronyms, or product codes.
Example: searching for a specific error code. The semantic embedder might not match the document containing that exact string if the overall context doesn’t seem similar enough. The words are right there in the document, but the vector math doesn’t care about exact string matches.
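A toy illustration of the gap (the error code is invented): an exact keyword match finds the document immediately, while a crude bag-of-words overlap score, standing in here as a rough proxy for similarity-based retrieval, barely registers it:

```python
def keyword_hit(query: str, doc: str) -> bool:
    """Exact substring match, the kind a keyword retriever relies on."""
    return query.lower() in doc.lower()

def word_overlap(query: str, doc: str) -> float:
    """Crude similarity proxy: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

doc = "Restart the service if the logs show ERR_4031 during startup."
query = "fix database timeout ERR_4031"

print(keyword_hit("ERR_4031", doc))  # True: the exact string is in the doc
print(word_overlap(query, doc))      # 0.25: only 1 of 4 query words match
```

The document is the right answer, but a retriever scoring on overall similarity can rank it below less relevant documents that happen to share more general vocabulary with the query.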
This is called the vocabulary mismatch problem, and it’s the reason Chapter 4 doesn’t stop at naive RAG. The next post covers hybrid retrieval, which fixes exactly this weakness by combining semantic search with keyword search.
This is post 9 of 24 in the Building Natural Language and LLM Pipelines series.