Haystack 2.0 by deepset: Components, Pipelines, Document Stores, and Retrievers
Chapter 3 of Laura Funderburk’s book is where the rubber meets the road. We stop talking theory and start looking at an actual framework you can use to build real NLP pipelines. That framework is Haystack 2.0 by a company called deepset.
Who Is deepset?
deepset was founded in 2018 by Milos Rusic, Malte Pietsch, and Timo Möller. They started around the same time BERT models were becoming a big deal. Their original project was called FARM, a framework for fine-tuning transformer models. They also published popular Hugging Face models like roberta-base-squad2.
In 2021, FARM got folded into Haystack, their main framework. Then when the industry shifted from fine-tuning specialized models to using general-purpose LLMs, deepset completely rewrote Haystack from scratch. That rewrite is Haystack 2.0.
What Changed from 1.x to 2.0
Here’s the thing about Haystack 1.x: it was built for a different era. It used YAML files and implicit data passing between nodes in a linear sequence. Hard to debug, hard to extend.
Haystack 2.0 is a totally different animal. The key changes:
- Pipeline definition: went from YAML and implicit naming to pure Python with explicit add_component() and connect() methods
- Component definition: went from inheriting a BaseComponent class to just slapping a @component decorator on any Python class
- Data flow: went from passing dictionaries around (good luck tracking that) to typed input/output sockets with strict contracts
- Core abstraction: went from a linear sequence to a directed graph that supports branching, merging, and loops
- Debugging: went from needing framework-specific knowledge to standard Python debugging plus a .draw() method that renders your pipeline as a visual graph
The big idea: everything is explicit. You can see exactly where data comes from and where it goes. No magic.
The Building Blocks
Haystack 2.0 has a clear hierarchy. Let me walk through each layer.
Components
A component is a Python class that does one thing. Clean text, embed a query, retrieve documents, call an LLM. Whatever. It just needs two things:
- The @component decorator on the class
- A run() method that does the work
What makes this robust is typed sockets. Each component declares exactly what types of data it takes in and what it produces. A retriever might take a str query and output a List[Document]. The framework checks these types when components are connected, so you catch mismatches before the pipeline ever runs.
from typing import List

from haystack import component
from haystack.dataclasses import Document

@component
class MyRetriever:
    @component.output_types(documents=List[Document])
    def run(self, query: str):
        results = ...  # fetch relevant docs from your document store
        return {"documents": results}
Pipelines
Pipelines wire components together into a graph. Three steps: create the pipeline, add named components, connect their sockets.

pipe = Pipeline()
pipe.add_component("retriever", MyRetriever(doc_store))
pipe.add_component("prompt_builder", PromptBuilder(template="Context: {{ documents }}\nQuestion: {{ query }}"))
pipe.add_component("generator", OpenAIGenerator())
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "generator.prompt")

Those connect() calls are the whole point. You're explicitly saying "the documents output of the retriever feeds into the documents input of the prompt builder, whose rendered prompt feeds into the generator." No ambiguity. (Note that OpenAIGenerator takes a prompt, not documents, which is why the prompt builder sits in between.)
And you can call pipe.draw("my_pipeline.png") to get a visual diagram of what you just built. Super useful for debugging.
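To see why explicit wiring is so easy to reason about, here is a tiny framework-free toy that mimics the add_component()/connect() idea. This is not Haystack code; ToyPipeline, DOCS, and the single-chain run() are invented purely for illustration:

```python
# Toy re-creation of explicit pipeline wiring (illustrative, NOT the real
# Haystack Pipeline class): components are named, every edge is declared.
DOCS = ["haystack pipelines", "typed sockets", "haystack components"]

class ToyPipeline:
    def __init__(self):
        self.components = {}   # name -> callable
        self.connections = {}  # "sender.socket" -> "receiver.socket"

    def add_component(self, name, fn):
        self.components[name] = fn

    def connect(self, sender, receiver):
        self.connections[sender] = receiver

    def run(self, start, value):
        # Follow the declared edges from component to component.
        name = start
        while name is not None:
            value = self.components[name](value)
            nxt = [r for s, r in self.connections.items()
                   if s.split(".")[0] == name]
            name = nxt[0].split(".")[0] if nxt else None
        return value

pipe = ToyPipeline()
pipe.add_component("retriever", lambda q: [d for d in DOCS if q in d])
pipe.add_component("builder", lambda docs: "Context: " + "; ".join(docs))
pipe.connect("retriever.documents", "builder.documents")
print(pipe.run("retriever", "haystack"))
# Context: haystack pipelines; haystack components
```

Because every edge lives in one dictionary, "where does this data go?" has a single, inspectable answer, which is exactly the property the real framework buys you.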
SuperComponents
As your pipelines grow, you’ll notice the same patterns repeating. A document indexing workflow might always chain together a file converter, cleaner, splitter, embedder, and writer. Rebuilding that every time is tedious.
SuperComponents let you wrap a whole sub-pipeline into a single reusable component. You expose just the inputs and outputs the outside world needs. Then you drop it into a larger pipeline with one line:
pipe.add_component("indexer", IndexingSupercomponent())
Think of it like functions in programming. Same idea, applied to data pipelines.
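The function analogy can be made literal. A framework-free sketch (ToyIndexer and its helpers are invented names, not the Haystack SuperComponent API): bundle a fixed clean-then-split chain behind a single run() with one input and one output:

```python
# Illustrative sketch of the SuperComponent idea, not Haystack's actual API.
def clean(text):
    return " ".join(text.split())  # collapse runs of whitespace

def split(text, size=3):
    # naive splitter: chunks of `size` words
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class ToyIndexer:
    """Hides the clean -> split sub-pipeline behind one input and one output."""
    def run(self, text):
        return {"chunks": split(clean(text))}

out = ToyIndexer().run("  Haystack   pipelines  are  directed   graphs  ")
print(out["chunks"])  # ['Haystack pipelines are', 'directed graphs']
```

Callers never see the internal steps, only the exposed socket, which is the whole point of the abstraction.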
How Retrieval Works: Sparse vs Dense
A massive chunk of Chapter 3 covers how RAG systems actually find relevant documents. And here’s the problem: there’s no single search method that handles everything.
Sparse retrieval (BM25) is keyword-based search. It scores each document by how often the query terms appear in it, weighted by how rare those terms are across the corpus, and ranks documents by that score. It's fast, needs no training, and nails exact-match queries like error codes or product names. But if someone searches "AI safety concerns" and your document says "risks of artificial intelligence," BM25 will miss it. It knows what words are but not what they mean.
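A stripped-down keyword scorer (plain term overlap, without BM25's rarity weighting and length normalization, but with the same failure mode) makes the vocabulary mismatch easy to reproduce:

```python
# Toy keyword scoring: count overlapping terms between query and document.
def keyword_score(query, doc):
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)

doc = "risks of artificial intelligence"
print(keyword_score("AI safety concerns", doc))  # 0 -> missed entirely

# Exact-match queries, by contrast, are a slam dunk:
print(keyword_score("error code E1234", "fix for error code E1234"))  # 3
```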
Dense retrieval (embeddings) uses a transformer model to convert text into vectors that capture meaning. Two texts about the same concept will have similar vectors even if they use totally different words. Great for conceptual queries. But it might miss exact strings because it’s focused on the big picture rather than specific tokens.
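Dense retrieval compares vectors instead of terms. With hand-made illustrative vectors (a real embedder produces hundreds of dimensions; these three-dimensional ones are fabricated just to show the mechanics), cosine similarity shows why paraphrases can match:

```python
import math

def cosine(a, b):
    # cosine similarity: dot product over the product of vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hand-crafted toy "embeddings": the two AI-risk texts point in a similar
# direction despite sharing no words; the cooking text points elsewhere.
ai_safety = [0.9, 0.8, 0.1]  # "AI safety concerns"
ai_risks  = [0.8, 0.9, 0.2]  # "risks of artificial intelligence"
cooking   = [0.1, 0.2, 0.9]  # "how to bake bread"

assert cosine(ai_safety, ai_risks) > cosine(ai_safety, cooking)
```

The ranking depends only on vector direction, not on shared vocabulary, which is exactly what the keyword scorer above cannot do.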
Here’s the bottom line: their weaknesses are complementary. Where one fails, the other usually succeeds.
Hybrid Retrieval: Best of Both
The best RAG systems combine both methods. Funderburk calls this the pipeline approach to hybrid retrieval. It works in three steps:
- Parallel fetch: send the query to both a BM25 retriever and an embedding retriever at the same time
- Fusion: merge both result lists using something like Reciprocal Rank Fusion (RRF), which reranks documents based on their position in each list
- Reranking: pass the merged list through a cross-encoder model that looks at the query and each document together for a final, high-accuracy ranking
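The fusion step itself is only a few lines. A minimal sketch of the standard RRF formula, score = sum over lists of 1/(k + rank), with the conventional k = 60 (the document IDs here are made up):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each doc earns 1/(k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits      = ["d1", "d3", "d5"]  # keyword retriever's ranking
embedding_hits = ["d3", "d2", "d1"]  # dense retriever's ranking
print(rrf([bm25_hits, embedding_hits]))  # ['d3', 'd1', 'd2', 'd5']
```

Documents found by both retrievers (d3 and d1) float to the top, which is why RRF works without any score calibration between the two retrievers.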
This maps cleanly onto Haystack’s architecture. Each step is a component. The parallel branches are possible because the pipeline is a graph, not a linear sequence. The fusion is a joiner component. The reranker is another component. Everything connects with typed sockets.
And then the whole thing gets passed to a prompt builder and an LLM generator for the final answer.
What I Like About This Chapter
Funderburk does a good job explaining why Haystack 2.0 exists, not just what it does. The comparison table between 1.x and 2.0 makes the design philosophy clear. And the retrieval section gives you enough depth to actually understand the trade-offs without drowning in math.
The typed socket system is the real star here. Most frameworks pass data around implicitly and you end up debugging weird runtime errors when something unexpected flows through. Haystack says “no, declare your contracts up front.” That’s the kind of decision that matters in production.
In the next post, we’ll look at how Haystack handles agentic systems, the full component catalog (preprocessing, embedding, generation, routing), and practical advice for incorporating Haystack into your workflow.
This is post 7 of 24 in the Building Natural Language and LLM Pipelines series.