Token Economics, System Integrity Under Failure, and the Sovereign Agent Stack

In the previous post, we looked at how agentic architectures evolved from brittle sequential chains (V1) through router patterns (V2) to resilient supervisors (V3). Now Funderburk puts those architectures under stress. The results are honestly a little scary.

The Token Economics

Talk is cheap. Architecture debates need numbers. So the book runs controlled experiments comparing V1 and V2 token usage across different query types. The results are dramatic.

For simple queries like “Italian restaurants in Boston,” V2 uses about 33% fewer tokens than V1. That’s already significant. But for complex queries that involve reviews and sentiment analysis, V2 achieves a 75% reduction. V1 consumed nearly 6,000 tokens for a single interaction. V2 did the same job with about 1,500.

Query Type         V1 Tokens   V2 Tokens   Reduction
General search     ~1,650      ~1,105      33%
Detailed search    ~3,695      ~1,485      60%
With reviews       ~5,880      ~1,490      75%

Here’s how it works. In V1, every node reads the entire accumulated state, including all the raw JSON from previous tool calls. The supervisor approval node has to re-read everything to check whether the work is done. In V2, the supervisor just checks Boolean flags. Did search return data? True or false. The actual data stays hidden from the supervisor and is only passed between worker nodes.

The key technique is summarization. Raw API JSON gets compressed into natural language by the summary node before entering the conversation history. The data passes between workers via a pipeline_data field that the supervisor never sees. This keeps the context window clean.
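The flag-checking and pipeline_data pattern can be sketched in a few lines of Python. This is an illustrative reconstruction, not the book's actual code: the AgentState fields, node names, and the stand-in search result are all assumptions.

```python
# Sketch of the V2 pattern: the supervisor reads Boolean flags only,
# while raw tool output travels worker-to-worker in pipeline_data.
from typing import TypedDict

class AgentState(TypedDict):
    messages: list[str]         # conversation history the supervisor sees
    pipeline_data: dict         # raw JSON, hidden from the supervisor
    search_returned_data: bool

def search_node(state: AgentState) -> AgentState:
    raw = {"results": [{"name": "Trattoria Il Panino", "rating": 4.5}]}  # stand-in for an API call
    state["pipeline_data"] = raw                         # never enters history
    state["search_returned_data"] = bool(raw["results"])
    return state

def summary_node(state: AgentState) -> AgentState:
    # Compress raw JSON into natural language BEFORE it enters history.
    names = ", ".join(r["name"] for r in state["pipeline_data"]["results"])
    state["messages"].append(f"Found: {names}")
    return state

def supervisor(state: AgentState) -> str:
    # Checks flags only; the raw data stays out of its context window.
    return "summarize" if state["search_returned_data"] else "retry"
```

The supervisor's context never grows with the size of the search results, which is where the token savings in the table come from.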

And here’s the thing: this context bloat isn’t just an orchestration problem. The V1 tool layer is also naive. The sentiment and detail microservices take entire search results as input when they only need a subset. Funderburk left this unaddressed on purpose to make the point that context engineering isn’t just about your agent architecture. It reaches all the way down into how your tools handle data too.

When Microservices Go Down

Token efficiency makes the economic argument. But the real reason to separate tools from orchestration is integrity. System integrity means: what does your agent do when things break?

The book runs a brutal experiment. Turn off all the Hayhooks microservices. Then throw the same questions at all three architectures using three different open-weight models: GPT-OSS 20B, DeepSeek-R1, and Qwen 3. See what happens.

V1 and V2 are a disaster. Some models hit recursion limits and crash. Others enter “deep retry” loops where the search node keeps telling the supervisor it failed, and the supervisor keeps sending it back. But here’s the worst part: in V1 and V2, when the timeout or retry limit is finally reached, the summary node receives “no data was found” and then hallucinates a complete response. Full business listings. Phone numbers. Websites. Star ratings. All completely fabricated. And in V1, the supervisor approved the hallucinated output.

Every time. Regardless of the model. Regardless of temperature setting.

V3 behaves completely differently. Every model, every query, every time: the agent exits gracefully with a message saying the service is unavailable, please try again later. That’s it. No hallucinated businesses. No fake phone numbers. No approved garbage.

Here’s the execution path that V3 takes:

  1. Guardrails node scans for PII/injection. Pass.
  2. Clarify node determines user intent. Pass.
  3. Supervisor plans to call search tool. Pass.
  4. Search node makes HTTP request. FAIL (503). Retries 3 times with backoff. Still fails. Returns structured error signal.
  5. Supervisor reads the failure flag. Triggers circuit breaker. Routes to safe exit.
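Step 4 can be sketched as a plain retry loop with exponential backoff that returns a structured error instead of raising. The URL, field names, and retry parameters below are illustrative assumptions, not the book's implementation:

```python
# Retry an HTTP call with exponential backoff; on exhaustion, return a
# deterministic error signal the supervisor can read as a flag.
import time
import urllib.error
import urllib.request

def call_search_service(url: str, max_retries: int = 3, base_delay: float = 1.0) -> dict:
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return {"ok": True, "data": resp.read().decode()}
        except (urllib.error.URLError, TimeoutError):
            if attempt < max_retries:
                time.sleep(delay)
                delay *= 2        # backoff: base, 2x, 4x, ...
    # All retries exhausted: no exception, just a structured failure.
    return {"ok": False, "error": "search service unavailable"}
```

The important design choice is that failure is data, not an exception: the node always returns something the routing code can inspect.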

The critical difference is step 5. V3’s supervisor doesn’t rely on the LLM’s judgment about whether to keep trying. It reads deterministic state flags. If consecutive_failures exceeds the threshold, the code routes to finalize. No prompt involved. No reasoning. Pure if-else logic.
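The routing logic in step 5 can be sketched as pure Python. The threshold value and node names here are assumptions for illustration; the point is that no LLM call appears anywhere in the function:

```python
# Deterministic circuit breaker: routing depends on state flags alone.
FAILURE_THRESHOLD = 3   # assumed value, not the book's exact setting

def route_after_search(state: dict) -> str:
    if state.get("consecutive_failures", 0) >= FAILURE_THRESHOLD:
        return "finalize"          # circuit breaker: safe exit, no more retries
    if state.get("search_returned_data"):
        return "summarize"         # happy path
    return "search"                # bounded retry
```

Because this is plain if-else logic, its behavior is identical across GPT-OSS, DeepSeek-R1, and Qwen 3: the model never gets a vote on whether to keep trying.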

Why LLMs Hallucinate Under Pressure

Funderburk makes a really honest observation here. She says she designed the experiment specifically to frustrate the language model. And she explains exactly why hallucination happens.

The nodes constrained by deterministic code (Python commands, Haystack pipelines) didn’t hallucinate. They returned clean error messages. The node that “caved” was the one constrained only by a subjective prompt. The summary node was told it must answer the user’s question. When it received “no data found” but its instructions said “be helpful,” the model’s reinforcement learning kicked in. Being helpful was prioritized over being honest.

This maps to real-world disasters. The NYC lawyer who used ChatGPT to find legal precedents and got fabricated court cases. The Air Canada chatbot that hallucinated a refund policy that didn’t exist, and the airline had to pay up. A multi-agent system in 2025 where two agents got stuck in a conversation loop for 11 days, racking up $47,000 in API costs before anyone noticed.

The lesson: don’t depend on an LLM’s probabilistic honesty. Use deterministic state checks instead.
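One concrete way to apply that lesson is to route the "no data" case around the LLM entirely, so the be-helpful summarizer is never asked to describe data that does not exist. This is a minimal sketch; the summarizer stand-in and message text are assumptions:

```python
# Guard the summarizer with a deterministic check: if the data flag is
# down, return a fixed safe message instead of invoking the model.
SAFE_EXIT = "The search service is unavailable. Please try again later."

def finalize(state: dict, summarize_with_llm=None) -> str:
    if not state.get("search_returned_data"):
        return SAFE_EXIT           # no LLM call, so nothing to hallucinate
    if summarize_with_llm is None:
        # Stand-in for the real LLM summarization call.
        summarize_with_llm = lambda s: f"Summary of {len(s['pipeline_data'])} result set(s)"
    return summarize_with_llm(state)
```

The model only ever sees states in which real data exists, which removes the helpful-versus-honest conflict at its source.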

Context Engineering: The Four Strategies in Action

The V3 architecture demonstrates all four context engineering strategies from earlier in the book:

Write. Critical facts (search query, error counts, retry states) go into dedicated state fields. Not buried in chat history. This is what enabled V3’s “self-awareness” about its own failures.

Select. The supervisor uses LangGraph’s Command feature to route to specific tools instead of running a monolithic chain. Only the relevant nodes execute. This drove the 33% token reduction on general queries.

Compress. Raw API JSON gets distilled into natural language summaries before entering conversation history. This is where the 75% token reduction on review queries comes from. It preserves the model’s “cognitive bandwidth.”

Isolate. The clarify intent node triggers a hard reset of pipeline data when it detects a topic switch. Every new search starts with a clean slate. No context contamination from previous queries.
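The Isolate strategy can be sketched as a hard reset on topic switch. Topic detection below is a naive string comparison for illustration; in the book, the clarify node uses the LLM to determine intent:

```python
# Hard reset of pipeline state when the topic changes: every new search
# starts with a clean slate and a cleared failure counter.
def clarify_intent(state: dict, new_topic: str) -> dict:
    if state.get("topic") is not None and state["topic"] != new_topic:
        state["pipeline_data"] = {}         # no contamination from old queries
        state["consecutive_failures"] = 0   # old failures don't carry over
    state["topic"] = new_topic
    return state
```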

The Sovereign Stack

Everything above leads to one final idea: sovereign agents. A sovereign agent runs fully within your infrastructure. No centralized API dependencies. No data leaving your network.

The economics are straightforward. With paid APIs, every reasoning step costs money. A single user request in a multi-agent system might trigger 50 internal loops. At per-token pricing, that adds up fast. On local hardware, the marginal cost of the 50th loop is basically zero. The V3 architecture becomes economically viable for high-volume applications.
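A back-of-envelope calculation makes the point concrete. The per-token price below is a hypothetical figure, not one from the book:

```python
# Rough cost of one multi-agent request against a paid API.
loops_per_request = 50
tokens_per_loop = 1_500            # roughly V2's efficient footprint per step
usd_per_million_tokens = 3.00      # assumed API pricing, for illustration

cost_per_request = loops_per_request * tokens_per_loop * usd_per_million_tokens / 1_000_000
print(f"${cost_per_request:.3f} per request")   # $0.225
```

At that rate, 100,000 requests a day is over $22,000 a day in API spend, while the same 50th loop on a local GPU costs only electricity.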

For enterprises dealing with sensitive data (healthcare, finance), sending anything to a public API is often a non-starter. The sovereign stack keeps everything local: data ingestion via Haystack pipelines, inference via Ollama or vLLM on local GPUs, and state management via LangGraph.

The entire stress test in the book was run using open-weight models on a laptop via Ollama. No API costs. No data leaving the machine.

The Final Takeaway

Funderburk ends the book with four principles. They’re worth keeping:

  1. Tool vs orchestration pattern. Don’t ask the brain to do the heavy lifting. Use Haystack for deterministic data processing. Use LangGraph for stateful reasoning.
  2. Data-centricity over prompting. Reliability comes from better data, not just better prompts.
  3. The sovereign stack. The era of relying solely on centralized paid APIs is ending. Combine open-weight models with robust architectures.
  4. Engineering integrity. Treat LLMs as Stochastic Processing Units (SPUs). They’re powerful but unreliable components that must be wrapped in deterministic code, guardrails, and retry policies.

The whole book has been building to this point. Reliability is not a property of the model. It’s a property of the system around it. The architecture matters more than the intelligence. That’s the argument, and the data backs it up.


This is post 23 of 24 in the Building Natural Language and LLM Pipelines series.

Book series page

Previous: Agentic AI Architecture: From Monolithic Scripts to Resilient Supervisors

Next: Closing Thoughts on Building NLP and LLM Pipelines
