Hybrid RAG: Parallel Retrieval, Fusion, Reranking, and Multimodal Pipelines
In the last post we built a naive RAG pipeline. It works, but it has a blind spot: it only understands meaning, not exact words. Search for error code “ERR-4052” and the semantic retriever might miss the one document that contains that exact string. This is the vocabulary mismatch problem, and hybrid RAG is how you fix it.
Two Search Methods, One Pipeline
Hybrid retrieval combines two search paradigms that have complementary strengths:
Sparse retrieval (BM25) is keyword matching. It counts how often your query terms appear in documents, weighted by rarity. Perfect for finding specific names, jargon, product codes, or error messages. Its weakness: it has zero understanding of synonyms or concepts. Search for “car” and it won’t find documents about “automobiles.”
Dense retrieval (embeddings) is semantic matching. It understands meaning, intent, and conceptual relationships. Search for “car” and it will find documents about vehicles, transportation, driving. Its weakness: it sometimes misses the exact literal terms that matter most.
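The mismatch is easy to demonstrate with stand-in scorers. Everything below is a toy illustration, not BM25 or a real embedding model: the keyword scorer counts exact term hits, while the "semantic" scorer only knows a tiny hand-written concept map.

```python
DOCS = [
    "Troubleshooting guide for error ERR-4052 on the payment gateway",
    "How to choose a family automobile",
    "Maintenance schedule for your car",
]

# Hand-written concept map standing in for an embedding model: it relates
# "car" to synonyms but knows nothing about rare strings like ERR-4052.
RELATED = {"car": {"car", "automobile", "vehicle"}}

def keyword_score(query: str, doc: str) -> int:
    # Sparse-style matching: count exact query terms present in the document.
    return sum(term.lower() in doc.lower() for term in query.split())

def semantic_score(query: str, doc: str) -> int:
    # Dense-style matching: match only through known related concepts.
    concepts = set()
    for term in query.lower().split():
        concepts |= RELATED.get(term, set())
    return sum(c in doc.lower() for c in concepts)

# "ERR-4052": only the keyword scorer finds the right document.
# "car": only the semantic scorer also surfaces the automobile doc.
```

Each scorer finds documents the other misses, which is exactly the gap hybrid retrieval closes.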
By running both at the same time and combining results, you get the best of both worlds. Funderburk describes it as having a “semantic librarian” working alongside a “keyword expert.”
The Four Stages of Hybrid RAG
Stage 1: Parallel retrieval. The user’s question goes down two paths simultaneously. Path one converts it to a vector and runs InMemoryEmbeddingRetriever to find semantically similar documents. Path two sends the raw text to InMemoryBM25Retriever to find keyword matches. Each returns its top 3 documents.
This is where Haystack’s directed multigraph design pays off. Running parallel branches isn’t a hack or workaround. It’s a native capability of the pipeline architecture.
Stage 2: Fusion. Now you have two lists of documents, up to 6 total, that may or may not overlap. DocumentJoiner merges them into a single candidate list.
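DocumentJoiner supports several join modes; one common fusion strategy is reciprocal rank fusion, which rewards documents that rank highly in either list. A minimal pure-Python sketch of the idea (not Haystack's implementation):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # top 3 from the embedding retriever
sparse = ["doc_d", "doc_a", "doc_e"]   # top 3 from BM25
fused = reciprocal_rank_fusion([dense, sparse])
# doc_a appears in both lists, so it rises to the top of the fused ranking
```

A document that both retrievers agree on accumulates score from both lists, which is usually a strong relevance signal.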
Stage 3: Reranking. This is the step that makes hybrid RAG really work. The merged list isn’t ordered by true relevance yet. TransformersSimilarityRanker wraps a cross-encoder model that takes each document paired with the original query and does a deep relevance analysis. It is more computationally expensive than the initial retrievers, but far more accurate. It re-scores and re-orders the entire list, then returns only the top 3.
Think of it as a “head librarian” doing a final quality check on what the two search experts found.
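Mechanically, reranking is simple: score every (query, document) pair and keep the best. A sketch with a stand-in scorer, where `overlap` is a toy substitute for a real cross-encoder:

```python
def rerank(query: str, candidates: list[str], score_pair, top_k: int = 3) -> list[str]:
    # score_pair stands in for the cross-encoder: it sees the query and the
    # full document together, unlike the retrievers' independent encodings.
    ordered = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ordered[:top_k]

def overlap(query: str, doc: str) -> int:
    # Toy relevance signal: shared word count between query and document.
    return len(set(query.lower().split()) & set(doc.lower().split()))

merged = [
    "office lunch menu for friday",
    "payment gateway reset steps",
    "quarterly report summary",
    "how to reset the payment gateway password",
]
top = rerank("reset the payment gateway", merged, overlap)
```

The expensive part in practice is `score_pair`: a cross-encoder runs a full transformer forward pass per pair, which is why it only sees the handful of fused candidates rather than the whole corpus.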
Stage 4: Augment and generate. Same as naive RAG from here. PromptBuilder assembles the question and top-ranked documents into a prompt. The LLM reads the context and generates the answer. But now the context is better, because it came from two retrieval methods and got filtered by a reranker.
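The assembly step itself is just templating. PromptBuilder uses Jinja templates; a plain-string sketch of the same job:

```python
def build_prompt(question: str, documents: list[str]) -> str:
    # Stuff the top-ranked documents into the context section of the prompt.
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What does ERR-4052 mean?",
    ["ERR-4052 indicates a gateway timeout."],
)
```

The LLM never sees the retrieval machinery; it only sees this final string, which is why the quality of the top-ranked documents dominates answer quality.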
SuperComponents: Wrapping Pipelines as Components
Once you have working pipelines, Funderburk shows how to wrap them into reusable units called SuperComponents. This is a practical pattern for production code.
The simplest approach wraps an existing pipeline instance:
from haystack import SuperComponent

naive_rag_sc = SuperComponent(
    pipeline=naive_rag_pipeline,
    input_mapping={"query": ["text_embedder.text", "prompt_builder.question"]},
    output_mapping={"llm.replies": "replies", "retriever.documents": "documents"},
)
Before wrapping, calling the hybrid pipeline required passing the question to four different components manually. Apply the same pattern to the hybrid pipeline and the call collapses to:
hybrid_rag_sc.run(query=question)
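Under the hood the wrapper is a mapping exercise: fan one public input out to several component sockets, and rename internal outputs to public names. A pure-Python stand-in for those mechanics (not Haystack's implementation; `fake_pipeline` and all names here are illustrative):

```python
class MiniSuperComponent:
    """Illustrative stand-in for SuperComponent's input/output mapping."""

    def __init__(self, pipeline, input_mapping, output_mapping):
        self.pipeline = pipeline            # callable: dict -> dict
        self.input_mapping = input_mapping
        self.output_mapping = output_mapping

    def run(self, **kwargs):
        # Fan each public input out to every mapped "component.socket".
        pipeline_inputs = {}
        for name, value in kwargs.items():
            for target in self.input_mapping[name]:
                component, socket = target.split(".")
                pipeline_inputs.setdefault(component, {})[socket] = value
        raw = self.pipeline(pipeline_inputs)
        # Rename internal outputs to the wrapper's public names.
        return {public: raw[internal]
                for internal, public in self.output_mapping.items()}

def fake_pipeline(inputs):
    # Hypothetical pipeline stub: echoes the question it was routed.
    question = inputs["prompt_builder"]["question"]
    return {"llm.replies": [f"answer to: {question}"],
            "retriever.documents": []}

sc = MiniSuperComponent(
    fake_pipeline,
    input_mapping={"query": ["text_embedder.text", "prompt_builder.question"]},
    output_mapping={"llm.replies": "replies",
                    "retriever.documents": "documents"},
)
result = sc.run(query="what is ERR-4052?")
```

One `query` argument reaches both mapped sockets, and the caller only ever sees the clean public names `replies` and `documents`.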
The second method is more powerful: define a class with the @super_component decorator. This lets you accept configuration parameters like which embedding model to use, which LLM, and how many documents to retrieve. You can then instantiate the same architecture with different configurations. One version connects to Elasticsearch, another to an in-memory store. One uses OpenAI, another uses a local model via Ollama. Same blueprint, different settings.
Multimodal Pipelines: Beyond Text
Real-world knowledge is not all text; it also lives in images, diagrams, and audio recordings. Funderburk covers two strategies for handling this.
Strategy 1: Joint embeddings with CLIP. Models like CLIP can project images and text into the same vector space. You embed an image and a text description, and they end up as nearby vectors. The indexing pipeline uses FileTypeRouter to send PDFs down a text branch and images down an image branch. Both end up in the same document store, searchable with a single text query.
Strategy 2: LLM content extraction. CLIP is good for visual similarity, but it can’t read text in images or analyze charts. For that, you send images to a vision LLM (like GPT-4o) which generates a detailed text description. Then you embed that description with a regular text embedder. At query time, you search by the text description but pass the original image to the LLM for generation. Funderburk calls this “search by proxy, answer by source.” The text description is just the index key. The LLM does fresh visual reasoning on the actual image.
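The "search by proxy, answer by source" split can be sketched in a few lines. Everything here is illustrative (the file paths, descriptions, and matching logic are stand-ins, not a real embedding search or vision LLM call):

```python
# Index: each image stored alongside the text description a vision LLM wrote.
index = [
    {"image_path": "charts/q3_revenue.png",
     "description": "bar chart of quarterly revenue with q3 highest"},
    {"image_path": "diagrams/auth_flow.png",
     "description": "sequence diagram of the oauth login flow"},
]

def retrieve(query: str) -> dict:
    # Stand-in for embedding search: match the query against the PROXY
    # description, since the image itself is not directly searchable.
    words = set(query.lower().split())
    return max(index,
               key=lambda e: len(words & set(e["description"].split())))

def answer(query: str) -> str:
    hit = retrieve(query)
    # Generation receives the ORIGINAL image, not the description:
    # the description was only the index key.
    return f"sending {hit['image_path']} to the vision LLM with: {query}"
```

The asymmetry is the point: the description is written once at index time, but the vision LLM does fresh reasoning on the actual pixels at answer time.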
For audio, it is simpler: RemoteWhisperTranscriber converts speech to text, then DocumentSplitter chops the transcript into sentence-level chunks for embedding.
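A naive version of the sentence-level split is easy to sketch (DocumentSplitter does this more robustly, with overlap and length controls):

```python
import re

def split_sentences(transcript: str, sentences_per_chunk: int = 2) -> list[str]:
    # Split on whitespace that follows sentence-ending punctuation,
    # then group sentences into fixed-size chunks for embedding.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", transcript)
                 if s.strip()]
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]
```

Chunking the transcript matters because a one-hour recording embedded as a single vector would blur every topic it touches into one average.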
Async Pipelines for Production
The chapter closes with a practical note on performance. In a synchronous pipeline, if dense retrieval takes 0.5 seconds and sparse retrieval takes 0.5 seconds, the user waits 1 second. With AsyncPipeline, Haystack analyzes the graph, identifies independent branches, and runs them concurrently. Dense and sparse retrieval execute at the same time, so the total wait drops to 0.5 seconds.
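The concurrency win can be reproduced with plain asyncio; the sleeps below stand in for retrieval latency, and the branch names are illustrative:

```python
import asyncio
import time

async def dense_retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.2)   # stand-in for embedding search latency
    return ["doc_a", "doc_b"]

async def sparse_retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.2)   # stand-in for BM25 latency
    return ["doc_c", "doc_a"]

async def hybrid_retrieve(query: str):
    # Independent branches run concurrently, as AsyncPipeline schedules them.
    return await asyncio.gather(dense_retrieve(query), sparse_retrieve(query))

start = time.perf_counter()
dense, sparse = asyncio.run(hybrid_retrieve("ERR-4052"))
elapsed = time.perf_counter() - start
# elapsed is ~0.2s rather than 0.4s, because the two waits overlap
```

The total wait is governed by the slowest branch, not the sum of the branches, which is exactly the 1s-to-0.5s improvement described above.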
The API is nearly identical. You swap Pipeline for AsyncPipeline and use run_async() instead of run(). There is also run_async_generator() for streaming, which yields partial results as each component finishes. Useful for chat applications where perceived latency matters.
What Sticks
Chapter 4 takes you from understanding components to building complete systems. The progression is logical: first index your data, then query it simply (naive RAG), then query it better (hybrid RAG), then wrap it up cleanly (SuperComponents), then extend to other data types (multimodal), and finally make it fast (async).
The vocabulary mismatch problem is the kind of thing you don’t think about until it bites you in production. A user searches for a specific product code and gets back irrelevant results because the embedder found something semantically close but factually wrong. Hybrid retrieval is the fix, and the reranker is the safety net.
Next up: Chapter 5, where Funderburk shows how to build custom Haystack components from scratch.
This is post 10 of 24 in the Building Natural Language and LLM Pipelines series.
Previous: Chapter 4: Haystack Pipelines - Part 1