Custom Haystack Components: The @component Decorator, Input/Output Contracts, and warm_up
Chapter 5 is where Funderburk says: stop being a user of Haystack. Start being an architect. Up until now, the book has been about plugging together existing components. Now you learn to build your own.
Here’s the thing. Every real project eventually hits a wall where the built-in components don’t cover your specific use case. Maybe you need to call an external API. Maybe you need a custom data transformation. Maybe you need to load a huge ML model efficiently. This chapter teaches you the rules so your custom code plays nicely inside a Haystack pipeline.
The Four Requirements
To make a Python class work as a Haystack component, you need four things:
1. The @component decorator. Slap this on your class and Haystack recognizes it as a pipeline-compatible component. Without it, the pipeline engine doesn’t know your class exists.
2. An __init__ method. Standard Python constructor. This is where you pass in configuration that stays the same between runs. API keys, model names, connection strings. Keep it lightweight.
3. A run() method. This is where the actual work happens. Every component must have one. And here’s the key rule: it must return a Python dictionary. The keys in that dictionary become your output sockets.
4. The @component.output_types decorator. This goes on your run() method and declares what types come out. The names and types must match what run() actually returns.
That’s it. Four requirements. Follow them and your custom class plugs right into any pipeline.
Input/Output Contracts: Sockets
This is actually one of the best design decisions in Haystack 2.0. In the old version (1.x), data just got passed around in dictionaries and you had to hope everything lined up. Now it uses explicit “sockets” with type checking.
Inputs are defined by the parameters of your run() method. If run() takes documents: List[Document], that creates an input socket named “documents” that expects a list of Document objects. Haystack validates the types at connection time.
Outputs are declared with @component.output_types(). If you say @component.output_types(documents=List[Document]), that creates an output socket named “documents.”
This means you get errors early, when you connect components, not late, when data is flowing and something breaks. Think of it as type-safe plumbing for your data pipeline.
The Prefixer: Your First Custom Component
The book starts simple. The Prefixer component takes a list of documents and adds a text prefix to each one. Here’s the simplified idea:
from typing import List

from haystack import component
from haystack.dataclasses import Document

@component
class Prefixer:
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document], prefix: str):
        modified = []
        for doc in documents:
            new_doc = Document(
                content=f"{prefix}{doc.content}",
                meta=doc.meta  # preserve metadata
            )
            modified.append(new_doc)
        return {"documents": modified}
Two things to notice. First, it creates new Document objects instead of modifying the originals. This is immutable processing, a best practice that avoids weird side effects in complex pipelines. Second, it copies the metadata from the original documents. Losing metadata downstream is a common bug.
You can test it standalone first, then wire it into a pipeline with a DocumentWriter:
pipeline = Pipeline()
pipeline.add_component("prefixer", Prefixer())
pipeline.add_component("writer", DocumentWriter(document_store=store))
pipeline.connect("prefixer.documents", "writer.documents")
The connection string "prefixer.documents" refers to the output socket of the prefixer component named “documents.” Clean and explicit.
The warm_up() Method: Loading Heavy Resources
Here’s a problem. Your component needs a big ML model. Where do you load it?
If you load it in __init__, creating an instance of your class triggers a massive download. That’s slow and annoying just to set up a pipeline. If you load it in run(), it reloads the model every single time data flows through. That’s catastrophically slow.
Haystack’s answer is warm_up(). It’s a lifecycle hook that the pipeline calls exactly once, right before the first run(). Perfect place for heavy initialization.
The pattern splits your component into three stages:
Configuration (__init__): Store the model name. Don’t load anything.
def __init__(self, model_name="all-MiniLM-L6-v2"):
    self.model_name = model_name
    self.model = None  # not loaded yet
Initialization (warm_up): Load the model. Check if already loaded to be safe.
def warm_up(self):
    if self.model is None:
        self.model = SentenceTransformer(self.model_name)
Processing (run): Use the pre-loaded model. Check it exists first.
def run(self, documents):
    if self.model is None:
        raise RuntimeError("warm_up() not called")
    # use self.model to process documents
The beauty is you never call warm_up() yourself. When you run pipeline.run(...), Haystack automatically calls warm_up() on every component that has one, exactly once, before any data starts flowing. After that, the model is loaded and ready for every batch.
Why This Matters: Graph RAG and Evaluation
The chapter sets up something bigger. All these custom component skills are building toward a practical goal: creating a knowledge graph from your documents and then generating synthetic test data from that graph.
Why a knowledge graph? Because standard RAG with simple vector search struggles with complex questions that need information from multiple places. A knowledge graph stores entities and their relationships explicitly. You can traverse it, reason over it, and generate multi-hop questions that actually test whether your RAG system can think, not just retrieve.
Funderburk introduces the Ragas framework here. It’s an open source library for evaluating RAG pipelines across four dimensions: faithfulness (is the answer grounded in context?), response relevancy (does it answer the question?), context precision (is the retrieved context useful?), and context recall (did it find everything needed?).
The plan is to build two custom components: KnowledgeGraphGenerator and SyntheticTestGenerator. The first builds a structured graph from your documents. The second walks that graph to create question-answer pairs of varying complexity, from simple single-hop facts to complex multi-hop reasoning questions.
That’s the next post.
This is post 11 of 24 in the Building Natural Language and LLM Pipelines series.
Previous: Chapter 4: Haystack Pipelines - Part 2