Knowledge Graphs, Synthetic Test Data, and Multi-Source Pipelines in Haystack

In the last post we learned the rules for building custom Haystack components. Now Funderburk puts those rules to work on a real problem: building a pipeline that creates a knowledge graph from your documents and then generates synthetic test questions from that graph.

This is the part where custom components stop being a toy example and become genuinely useful.

The KnowledgeGraphGenerator Component

This component takes a flat list of documents and turns them into a structured knowledge graph using the Ragas framework. Here’s how it works at a high level:

The __init__ stores configuration: which LLM to use, which embedding model, and whether to apply graph transforms. The run() method does the real work. It creates a KnowledgeGraph object, adds each document as a node, then calls apply_transforms to build the intelligence layer.

from ragas.testset.graph import KnowledgeGraph, Node, NodeType
from ragas.testset.transforms import apply_transforms, default_transforms

# Build an empty graph and add each document as a node
kg = KnowledgeGraph()
for doc in documents:
    kg.nodes.append(Node(
        type=NodeType.DOCUMENT,
        properties={
            "page_content": doc.page_content,
            "document_metadata": doc.metadata,
        },
    ))

# Enrich the graph with entities, relationships, and similarity links
apply_transforms(kg, default_transforms(
    documents=documents, llm=llm, embedding_model=embeddings))

The apply_transforms step is where the magic happens. The LLM scans each node to extract entities (people, companies, concepts) and identify relationships between them. The embedding model measures semantic similarity to make sure connections are accurate. The result is a web of linked information, not just a pile of text chunks.

You can save the graph as JSON using a companion KnowledgeGraphSaver component, or pass it directly to the next stage.
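The saver itself can be tiny. Here is a dependency-free sketch of the idea, using plain dicts to stand in for the Ragas node and relationship objects (the real component would just delegate to the graph's own JSON serialization):

```python
import json
from pathlib import Path

def save_knowledge_graph(kg_nodes, kg_relationships, path):
    """Serialize a node/relationship structure to JSON on disk.

    kg_nodes and kg_relationships are plain dicts here, standing in
    for the Ragas Node and Relationship objects.
    """
    payload = {"nodes": kg_nodes, "relationships": kg_relationships}
    Path(path).write_text(json.dumps(payload, indent=2))
    return path

# Usage: two document nodes linked by one similarity edge
nodes = [
    {"id": "n1", "type": "DOCUMENT",
     "properties": {"page_content": "Haystack 2.0 adds loops."}},
    {"id": "n2", "type": "DOCUMENT",
     "properties": {"page_content": "Components are customizable."}},
]
rels = [{"source": "n1", "target": "n2", "type": "cosine_similarity"}]
save_knowledge_graph(nodes, rels, "kg.json")
```

The payoff of saving the graph is that the expensive LLM-driven transform step runs once; later test-generation runs can reload the JSON instead of rebuilding.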

The SyntheticTestGenerator Component

This is the component that takes a knowledge graph and produces question-answer pairs for evaluating your RAG system. It uses Ragas’ TestsetGenerator under the hood.

The key configuration is query_distribution. This tells the component what mix of question types to generate:

# Fractions of each question type; the weights should sum to 1.0
query_distribution = [
    ("single_hop", 0.25),
    ("multi_hop_specific", 0.25),
    ("multi_hop_abstract", 0.50),
]

Single-hop questions are simple fact lookups. “Who is Christopher Ong in the context of ChatGPT research?” One piece of information, one answer.

Multi-hop specific questions require connecting facts from different parts of the text. “What are the key improvements in Haystack 2.0 regarding loops and customizable components?” You need info from multiple sections to answer this.

Multi-hop abstract questions need broad reasoning. “How does ChatGPT usage for practical guidance differ among users with varying education levels?” These are the hardest and the most valuable for testing.
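One easy mistake with this configuration is weights that don't sum to 1. A small sanity-check sketch (the tuple format mirrors the simplified distribution above; the actual Ragas API pairs synthesizer objects with weights):

```python
query_distribution = [
    ("single_hop", 0.25),
    ("multi_hop_specific", 0.25),
    ("multi_hop_abstract", 0.50),
]

def validate_distribution(dist, tolerance=1e-9):
    """Raise if the question-type weights don't form a probability mix."""
    total = sum(weight for _, weight in dist)
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return True

def expected_counts(dist, testset_size):
    """How many questions of each type a testset of a given size yields."""
    return {name: round(weight * testset_size) for name, weight in dist}

validate_distribution(query_distribution)
counts = expected_counts(query_distribution, 20)
# {'single_hop': 5, 'multi_hop_specific': 5, 'multi_hop_abstract': 10}
```

Checking this up front is cheap; finding out after a long LLM generation run that half your weights were ignored is not.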

Here’s the thing about the design. The component has a fallback mechanism. It tries to generate questions from the knowledge graph first, because that produces the best multi-hop questions. But if the graph generation fails, it falls back to generating simpler questions directly from documents:

try:
    testset = self._generate_from_knowledge_graph(knowledge_graph)
except Exception as kg_error:
    logger.warning(f"KG generation failed: {kg_error}")
    testset = self._generate_from_documents(documents)

This is solid engineering. The pipeline never crashes. It degrades gracefully.
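The fallback shape generalizes beyond Ragas. A minimal sketch of the same pattern as a reusable helper (the names here are mine, not from the book):

```python
import logging

logger = logging.getLogger(__name__)

def with_fallback(primary, fallback):
    """Try the preferred path; on any failure, log and run the fallback."""
    try:
        return primary()
    except Exception as error:
        logger.warning("primary path failed (%s); using fallback", error)
        return fallback()

# Usage mirroring the component: prefer the knowledge graph,
# fall back to raw documents
def from_graph(graph):
    if graph is None:
        raise ValueError("no knowledge graph available")
    return f"testset from graph with {len(graph)} nodes"

def from_documents(docs):
    return f"testset from {len(docs)} documents"

graph = None                     # simulate a failed graph build
docs = ["chunk one", "chunk two"]
result = with_fallback(lambda: from_graph(graph),
                       lambda: from_documents(docs))
# result == "testset from 2 documents"
```

Passing zero-argument callables (rather than one function plus shared arguments) matters here, because the preferred path and the fallback consume different inputs.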

The Bridge Component

There’s a practical problem to solve first. Haystack outputs List[HaystackDocument] objects. Ragas expects List[LangChainDocument] objects. Different frameworks, different data types.

The solution is a DocumentToLangChainConverter component. Its run() method just iterates through Haystack documents, copies the content and metadata, and creates LangChain document objects. Simple adapter pattern. But without it, nothing connects.

This is a pattern you’ll see a lot in real pipelines. When you integrate external libraries, you often need a bridge component to translate between data formats.
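As a dependency-free illustration of that adapter, here is the conversion with dataclasses standing in for the two document types (the real converter would import both libraries; Haystack 2.x documents expose `content`/`meta`, LangChain documents expose `page_content`/`metadata`):

```python
from dataclasses import dataclass, field

@dataclass
class HaystackDoc:       # stand-in for haystack.Document
    content: str
    meta: dict = field(default_factory=dict)

@dataclass
class LangChainDoc:      # stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

def to_langchain(documents):
    """Copy content and metadata across the framework boundary."""
    return [LangChainDoc(page_content=d.content, metadata=dict(d.meta))
            for d in documents]

converted = to_langchain([HaystackDoc("Haystack 2.0 adds loops.", {"page": 1})])
# converted[0].page_content == "Haystack 2.0 adds loops."
```

Note the defensive `dict(d.meta)` copy: the two frameworks shouldn't end up mutating a shared metadata object.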

Wiring It All Together

The full pipeline for processing PDFs has four stages:

Stage 1 - Ingestion: PyPDFToDocument reads the PDF. DocumentCleaner removes noise. DocumentSplitter breaks it into sentence-based chunks.

Stage 2 - Format bridge: DocumentToLangChainConverter translates the data format.

Stage 3 - Knowledge graph: KnowledgeGraphGenerator builds the structured graph.

Stage 4 - Test generation: SyntheticTestGenerator creates the question-answer pairs. TestDatasetSaver writes them to CSV.

pipeline.connect("pdf_converter.documents", "doc_cleaner.documents")
pipeline.connect("doc_cleaner.documents", "doc_splitter.documents")
pipeline.connect("doc_splitter.documents", "doc_converter.documents")
pipeline.connect("doc_converter.langchain_documents", "kg_generator.documents")
pipeline.connect("doc_converter.langchain_documents", "test_generator.documents")
pipeline.connect("kg_generator.knowledge_graph", "test_generator.knowledge_graph")
pipeline.connect("test_generator.testset", "test_saver.testset")

Notice that test_generator receives inputs from two sources: documents (for the fallback path) and the knowledge graph (for the preferred path). That dual-input design is what makes the fallback mechanism work.

Swapping Data Sources

Here’s where Haystack’s modularity pays off. Want to process a website instead of a PDF? Just swap the “head” of the pipeline. Replace PyPDFToDocument with LinkContentFetcher plus HTMLToDocument. Everything downstream stays the same.

The book goes further: building a branching pipeline that processes both PDFs and websites simultaneously. A DocumentJoiner merges the results from both branches, and the rest of the pipeline is identical. The knowledge graph and test generator components don’t know or care where the data came from.

Testing Custom Components

The chapter wraps up with testing principles. Five key ideas:

1. Mock external dependencies. Don’t call real LLMs in tests. Use @patch to simulate them. Tests should be fast, free, and network-independent.

2. Validate the lifecycle. Test that the model is None after __init__, that warm_up() loads it, and that calling warm_up() twice doesn’t reload it (idempotency).

3. Test configuration. Verify defaults work. Verify custom parameters get stored correctly.

4. Test bridge components. Make sure data format conversions are accurate. Content and metadata should survive the translation.

5. Test edge cases. What happens with an empty document list? The component should return an empty result, not crash. Graceful failure keeps the pipeline alive.
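Ideas 1 and 2 can be sketched with nothing but the standard library. Here is a toy component with a Haystack-style warm_up lifecycle, plus a test that mocks out the expensive model load (class and method names are illustrative, not from the book):

```python
from unittest.mock import patch

class ToyGenerator:
    """Minimal stand-in for a component with a lazy-loaded model."""
    def __init__(self, model_name="fake-model"):
        self.model_name = model_name
        self.model = None          # nothing expensive happens in __init__

    def warm_up(self):
        if self.model is None:     # idempotent: a second call is a no-op
            self.model = self._load_model()

    def _load_model(self):
        raise RuntimeError("would hit the network in real life")

# Test: mock the expensive load, then check the lifecycle
with patch.object(ToyGenerator, "_load_model", return_value="loaded") as load:
    gen = ToyGenerator()
    assert gen.model is None       # None after __init__
    gen.warm_up()
    assert gen.model == "loaded"   # warm_up loads it
    gen.warm_up()
    assert load.call_count == 1    # second warm_up doesn't reload
```

The same structure carries over to real components: patch the loader, never the lifecycle, so the test exercises exactly the logic you wrote.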

These aren’t Haystack-specific testing ideas. They’re standard software engineering practices applied to ML pipeline components. But that’s kind of the point. Funderburk keeps pushing the idea that building with LLMs should be regular engineering, not black-box experimentation.

The next chapter builds a full production RAG system and evaluates it with the synthetic test data we just learned to generate.


This is post 12 of 24 in the Building Natural Language and LLM Pipelines series.

Previous: Chapter 5: Custom Components - Part 1

Next: Chapter 6: Production RAG - Part 1
