Measuring RAG Quality With RAGAS and Weights & Biases: Evaluation, Observability, and Cost-Performance Tradeoffs

In Part 1, we covered how Funderburk moves from Jupyter notebooks to a production-ready project structure. Docker, uv, SuperComponents, dual Elasticsearch. Now comes the part that actually tells you if your RAG pipeline is any good: systematic evaluation with RAGAS and continuous monitoring with Weights and Biases.

Why You Can’t Just “Spot Check” a RAG System

Here’s the problem with a RAG system. You ask it a question, it gives you an answer that sounds right, and you move on. But “sounds right” is not a quality metric. Your pipeline might be hallucinating. It might be pulling irrelevant documents. It might answer some question types well and fail completely on others.

Funderburk’s point is clear: you need numbers. Not vibes. And the way you get numbers is with a systematic evaluation framework.

RAGAS: Four Metrics That Actually Tell You Something

Remember the synthetic test dataset from Chapter 5? The knowledge graph component that generates question-answer pairs from your documents? This is where it pays off. You now have a “golden” test set of questions with known correct answers. You feed the same questions to your RAG pipeline and compare.

RAGAS is the evaluation framework that does this comparison. It measures four things:

Faithfulness checks if the generated answer actually comes from the retrieved documents. Low faithfulness means your LLM is making stuff up. It’s hallucinating. The context says one thing, the answer says something else.

Context precision measures the signal-to-noise ratio of your retrieved documents. Did the retriever grab chunks that are actually relevant to the question? Or is it pulling back a bunch of loosely related noise?

Context recall asks whether the retriever found all the information needed to answer the question fully. Even if what it found is relevant, did it miss important pieces?

Answer relevancy is the end-to-end check. Is the final answer actually useful for the question that was asked? This captures the whole pipeline, not just retrieval or generation in isolation.
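To make the first two definitions concrete, here is a deliberately simplified sketch. Real RAGAS decomposes answers into claims and uses an LLM judge to verify each one against the context; this toy version substitutes exact set overlap, and every name in it is hypothetical.

```python
# Toy illustration of what two RAGAS-style metrics measure. Real RAGAS
# uses LLM judgments to verify claims; this sketch uses simple set
# overlap just to make the definitions concrete.

def toy_faithfulness(answer_claims, context_claims):
    """Fraction of claims in the answer that are supported by the context."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for c in answer_claims if c in context_claims)
    return supported / len(answer_claims)

def toy_context_recall(needed_facts, retrieved_facts):
    """Fraction of facts needed for a full answer that retrieval found."""
    if not needed_facts:
        return 1.0
    found = sum(1 for f in needed_facts if f in retrieved_facts)
    return found / len(needed_facts)

context = {"haystack is a framework", "haystack supports rag"}
answer = {"haystack is a framework", "haystack was released in 1999"}  # second claim is hallucinated
print(toy_faithfulness(answer, context))   # half the answer's claims are unsupported
print(toy_context_recall({"haystack is a framework"}, context))
```

A faithfulness of 0.5 here means half the answer is unsupported by the retrieved context, which is exactly the failure mode the metric exists to catch.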

Naive RAG vs. Hybrid RAG: The Numbers

Funderburk runs both the NaiveRAGSuperComponent and HybridRAGSuperComponent against the same synthetic dataset. The evaluation workflow is straightforward: for each question, run both pipelines, collect answers and retrieved documents, feed everything into the RagasEvaluationComponent, store the scores.
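The loop behind that workflow can be sketched as follows. The two stub functions stand in for the real SuperComponents, and the record shape handed to the evaluator is an assumption, not the book's exact code.

```python
# Sketch of the comparison workflow: run both pipelines over the same
# golden questions and collect records for evaluation. The stub
# pipelines and record fields are hypothetical placeholders.

golden_set = [
    {"question": "What is hybrid retrieval?", "reference": "Dense plus BM25 retrieval, then reranking."},
    # ... more synthetic question/answer pairs from the knowledge graph step
]

def naive_rag(q):   # stand-in: would invoke NaiveRAGSuperComponent
    return {"answer": "...", "documents": ["..."]}

def hybrid_rag(q):  # stand-in: would invoke HybridRAGSuperComponent
    return {"answer": "...", "documents": ["..."]}

results = {"naive": [], "hybrid": []}
for item in golden_set:
    for name, pipe in (("naive", naive_rag), ("hybrid", hybrid_rag)):
        out = pipe(item["question"])
        results[name].append({
            "question": item["question"],
            "reference": item["reference"],
            "answer": out["answer"],
            "contexts": out["documents"],
        })

# Each results[...] list would then be scored by the Ragas evaluation
# step and aggregated into per-metric averages.
print(len(results["naive"]), len(results["hybrid"]))
```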

Here’s what a sample run with 10 questions looks like:

Metric                 Naive RAG    Hybrid RAG   Improvement
Faithfulness           0.64         0.96         +50%
Answer relevancy       0.67         0.74         +10%
Context recall         0.68         0.76         +12%
Factual correctness    0.36         0.41         +15%

That faithfulness jump is the headline. From 0.64 to 0.96. The naive RAG pipeline was hallucinating on about a third of its answers. The hybrid pipeline with reranking barely hallucinated at all.

Why? The architectural difference from Chapter 4. Naive RAG uses a single dense retriever. It’s good at semantic/conceptual queries but bad at keyword-specific ones. Hybrid RAG runs both a dense retriever and a sparse BM25 retriever in parallel, joins the results, and then reranks them with a cross-encoder. It handles both conceptual and keyword queries well.

The reranker is the key addition. After the hybrid retrieval pulls candidate documents, the SentenceTransformersSimilarityRanker does a final high-precision reordering. Only the most relevant documents reach the LLM. Less noise in, less hallucination out.
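The retrieve-join-rerank flow can be sketched in a few lines. The scoring functions below are toy stand-ins for the dense retriever, BM25, and the cross-encoder ranker; none of this is Haystack's actual API.

```python
# Minimal sketch of hybrid retrieval: run two retrievers, join their
# candidates, then rerank with a (stub) cross-encoder. All scorers here
# are illustrative stand-ins.

docs = ["doc about embeddings", "doc about BM25 keywords", "unrelated doc"]

def dense_scores(query):   # stand-in for embedding similarity
    return {d: 0.9 if "embeddings" in d else 0.2 for d in docs}

def sparse_scores(query):  # stand-in for BM25 keyword matching
    return {d: 0.8 if "keywords" in d else 0.1 for d in docs}

def cross_encoder(query, doc):  # stand-in for the reranker's final pass
    return max(dense_scores(query)[doc], sparse_scores(query)[doc])

def hybrid_retrieve(query, top_k=2):
    # 1. run both retrievers "in parallel" and join candidate scores
    merged = {d: dense_scores(query)[d] + sparse_scores(query)[d] for d in docs}
    candidates = sorted(merged, key=merged.get, reverse=True)
    # 2. rerank candidates with the cross-encoder, keep only top_k
    return sorted(candidates, key=lambda d: cross_encoder(query, d), reverse=True)[:top_k]

print(hybrid_retrieve("how do embeddings work"))
```

The point of the second sort is the one the text makes: the join casts a wide net, and the reranker decides which few documents actually reach the LLM.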

From Static Tests to Live Monitoring

RAGAS gives you a snapshot. It’s like a unit test. You run it before deployment, you get a quality score, you decide if you’re ready to ship.

But what happens after deployment? That’s where Weights and Biases comes in. Funderburk makes a clear distinction:

  • Evaluation (RAGAS) = static, one-off test against a known dataset
  • Observability (W&B) = continuous, real-time monitoring of the live system

Haystack integrates with Weights and Biases through the WeaveConnector component. It’s a nice pattern. The connector doesn’t need explicit .connect() calls in the pipeline. You set environment variables like HAYSTACK_CONTENT_TRACING_ENABLED and WANDB_API_KEY, and it automatically intercepts traces from all components. Inputs, outputs, metadata. Everything gets sent to the Weights and Biases dashboard.
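The setup described above is mostly configuration. A minimal sketch of the environment, with placeholder values; the exact set of variables may vary by Haystack version:

```shell
# Enable Haystack content tracing and authenticate with W&B.
# Values shown are placeholders.
export HAYSTACK_CONTENT_TRACING_ENABLED=true
export WANDB_API_KEY="your-api-key"
```

With these set, adding the WeaveConnector component to the pipeline is enough; there is no `.connect()` wiring to maintain.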

The FinOps Dashboard: Tying Performance to Cost

Here’s the part I think is most practical for production teams. Funderburk includes a rag_analytics.py script that goes beyond basic monitoring. It tracks token counts and calculates the actual dollar cost of each pipeline run based on model pricing.
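The core of that calculation is simple enough to sketch: token counts multiplied by per-token model pricing, summed across pipeline steps. The prices and structure below are illustrative assumptions, not the real rag_analytics.py.

```python
# Sketch of the token-cost accounting a script like rag_analytics.py
# performs. Prices are illustrative (USD per 1M tokens) and may not
# match current model pricing.

PRICING = {  # model: ($ per 1M input tokens, $ per 1M output tokens)
    "gpt-4o-mini": (0.15, 0.60),
    "text-embedding-3-small": (0.02, 0.0),
}

def run_cost(usage):
    """usage: list of (model, input_tokens, output_tokens) per pipeline step."""
    total = 0.0
    for model, tokens_in, tokens_out in usage:
        price_in, price_out = PRICING[model]
        total += tokens_in / 1_000_000 * price_in + tokens_out / 1_000_000 * price_out
    return total

# one query: embed the question, then generate an answer
cost = run_cost([
    ("text-embedding-3-small", 40, 0),
    ("gpt-4o-mini", 1_200, 300),
])
print(f"${cost:.6f}")
```

Log that number per run to W&B and the aggregate views fall out for free: total cost per day, average cost per query, cost by pipeline variant.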

This turns Weights and Biases from a generic ML monitoring tool into what she calls a FinOps dashboard. Your product owner can log in and answer real business questions:

  • What was our total RAG pipeline cost yesterday?
  • What’s the average cost per query?
  • Which pipeline (small or large embedding) is more cost-effective?
  • Are we seeing cost spikes on certain query types?

That’s the kind of visibility that makes the difference between “we have AI in production” and “we have AI in production and we understand what it costs.”

Small vs. Large Embeddings: The Cost-Performance Verdict

The dual Elasticsearch architecture from Part 1 enables a direct comparison between text-embedding-3-small and text-embedding-3-large.

The cost difference is stark. The small model costs $0.02 per million tokens. The large model costs $0.13 per million tokens. That’s 6.5x the price.

And the performance difference? On the MTEB benchmark, the large model scores 64.6% versus 62.3% for the small model. A 2.3-point improvement. About 3.7% relative gain.
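The arithmetic behind those two claims is worth seeing side by side, since it frames the whole verdict:

```python
# Price ratio and relative quality gain, from the figures quoted above.
small_price, large_price = 0.02, 0.13   # $ per 1M tokens
print(round(large_price / small_price, 1))   # price multiple of the large model

small_mteb, large_mteb = 62.3, 64.6     # MTEB benchmark scores (%)
gain = (large_mteb - small_mteb) / small_mteb * 100
print(round(gain, 1))                   # relative quality gain, in percent
```

Paying 6.5x the price for a ~3.7% relative benchmark gain is the tradeoff the rest of this section interrogates.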

The numbers from Funderburk’s actual pipeline runs tell a similar story:

Metric                 Small Embedding   Large Embedding
Faithfulness           0.80              0.75
Context recall         0.93              0.95
Factual correctness    0.60              0.58
Response relevancy     0.87              0.87
Avg cost/query         $0.00094          $0.00143

The large model is slightly better at context recall and entity recall. But it’s actually worse at faithfulness and factual correctness in this run. And it costs 52% more per query.

Funderburk’s conclusion is practical: for most general-purpose RAG applications, text-embedding-3-small is the clear winner. The large model should be reserved for high-stakes domains like legal, medical, or financial RAG, where the extra nuance in semantic understanding justifies the 6.5x cost premium.

The Architect’s Toolkit Is Complete

By the end of Chapter 6, you have a complete production system. Dependencies locked with uv. Infrastructure containerized with Docker. Pipelines abstracted as SuperComponents. Quality measured with RAGAS. Costs tracked with Weights and Biases. And a framework for comparing different models and architectures with real data.

Funderburk puts it well: you’re no longer just a developer. You’re an architect. And with this foundation, the next step is deployment. Chapter 7 takes the whole thing and turns it into a scalable API.


This is post 14 of 24 in the Building Natural Language and LLM Pipelines series.

Based on Chapter 6 of “Building Natural Language and LLM Pipelines” by Laura Funderburk (ISBN: 978-1-83546-799-2, Packt Publishing, 2025).

Previous: Chapter 6, Part 1: From Notebooks to Production RAG

Next: Chapter 7, Part 1: Deploying Haystack Applications

About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
