Vector Stores, Agentic Memory, and the Economics of LLMs - Chapter 2 Part 3

Parts 1 and 2 of this chapter covered transformer architecture, the SLM/RLM split, context engineering strategies, and the Haystack + LangGraph hybrid architecture. Now Funderburk closes the chapter with two topics that every developer building LLM applications needs to understand: vector stores and the economics of inference.

Vector Stores: Not Just a Database

A vector store (or vector database) is a specialized database that stores data as mathematical vectors. Unlike regular databases that match on exact keywords, vector stores find things by semantic similarity. You give it a query; it converts the query to a vector and finds the stored vectors that are closest in meaning.

These vectors are called embeddings. An embedding model converts text (or images, or audio) into a numerical representation that captures its meaning. Similar concepts end up near each other in this vector space.
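"Closeness in meaning" is usually measured with cosine similarity between embedding vectors. Here is a minimal sketch using made-up three-dimensional toy vectors (real embedding models emit hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" invented for illustration.
cat    = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
truck  = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))  # high: similar meaning
print(cosine_similarity(cat, truck))   # low: unrelated
```

A nearest-neighbor search in a vector store is, conceptually, this comparison run against millions of stored vectors with an index that avoids the brute-force scan.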

By 2025, vector stores are not niche technology anymore. Gartner predicted that by 2026, over 30% of enterprises will use them. They are the engine behind modern semantic search.

The RAG Ingestion Pipeline

Before you can search a vector store, you need to fill it with data. This ingestion pipeline is where most RAG implementations fail. Funderburk walks through each step:

  1. Partitioning: Take raw documents (PDFs, HTML, text files) and extract clean text.
  2. Chunking: Break the text into pieces. This is the most critical decision. Too large and you get noisy context that pollutes the model’s window. Too small and each chunk lacks enough context to be useful.
  3. Embedding: Run each chunk through an embedding model to get its vector representation.
  4. Indexing: Store the vectors and their original text in the vector database.

Then for retrieval:

  1. Embed the query: Convert the user’s question into a vector using the same embedding model.
  2. Retrieve: Find the closest vectors in the database.
  3. Augment: Pass the retrieved chunks plus the original question to an LLM for a grounded answer.
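The whole loop can be sketched end-to-end in a few lines. The embedder below is a stub (it just hashes words into buckets), and the example chunks and query are invented; a real pipeline would call an embedding model and a production vector store. The key detail the sketch preserves: the same embedder must be used for both ingestion and querying.

```python
import hashlib
import math

DIMS = 4096  # real embedding models use hundreds to thousands of dimensions

def embed(text: str) -> list[float]:
    """Stub embedder: hashes each word into a bucket. A real pipeline
    would call an embedding model (the same one for chunks and queries)."""
    vec = [0.0] * DIMS
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.strip(".,?!").encode()).hexdigest(), 16) % DIMS
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Ingestion: embed each chunk and index it alongside its original text.
chunks = [
    "Refunds are processed within 5 business days.",
    "Shipping to Europe takes one to two weeks.",
]
index = [(embed(c), c) for c in chunks]

# Retrieval: embed the query with the same model, rank chunks by similarity.
query = "How many days until refunds are processed?"
qvec = embed(query)
top_chunk = max(index, key=lambda item: cosine(qvec, item[0]))[1]

# Augment: the retrieved chunk grounds the LLM's answer.
prompt = f"Context: {top_chunk}\n\nQuestion: {query}"
print(prompt)
```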

Here’s the thing about chunking that Funderburk emphasizes. Your chunk size and embedding model choice are permanent architectural decisions. They define the “semantic resolution” of your entire knowledge base. All future retrieval is constrained by this joint decision. Get it wrong and no amount of prompt tuning will fix it.
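To make the trade-off concrete, here is one simple chunking approach, fixed-size windows with overlap, sketched in plain Python. The sizes are arbitrary, and production systems often split on sentence or section boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap, so context that straddles
    a boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 200  # stand-in for a real document (1000 characters)
pieces = chunk_text(doc, chunk_size=100, overlap=20)
print(len(pieces), "chunks; each overlaps its neighbor by 20 characters")
```

Re-chunking later means re-embedding and re-indexing the entire corpus, which is why the book treats this as an architectural decision rather than a tuning knob.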

Hybrid Search: Combining Dense and Sparse

Pure semantic search has a well-known weakness. It cannot find exact keywords reliably. If someone searches for a specific product ID and the surrounding text is not a strong semantic match, the embedding search will miss it.

Traditional keyword search (BM25/TF-IDF) has the opposite problem. Great at exact matches, zero understanding of meaning.

The solution is hybrid search. Run both queries in parallel, one against the dense vector index and one against the sparse keyword index, then fuse the results. The most common fusion algorithm is Reciprocal Rank Fusion (RRF), which re-ranks combined results based on their positions in the original lists.

User query: "Show me order #ABC-12345"

Dense search (semantic) --> maybe finds relevant context, maybe not
Sparse search (BM25)   --> finds exact match on "ABC-12345"

RRF fusion --> combines both result sets into one ranked list
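RRF itself is small enough to sketch directly. Each document scores the sum of 1/(k + rank) across every list it appears in, so documents ranked well by both searches float to the top. The document IDs below are invented; k=60 is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_7", "doc_2", "doc_9"]   # semantic search results
sparse = ["doc_4", "doc_7", "doc_2"]   # BM25 results (found the exact ID)

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # → ['doc_7', 'doc_2', 'doc_4', 'doc_9']
```

doc_7 wins because both searches ranked it highly, even though neither put it in a unanimous first place; that is exactly the behavior hybrid search wants.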

Haystack 2.0 provides components for both search types and handles the full data pipeline (ingestion, chunking, embedding, indexing, retrieval) as two deployable tools: an indexing pipeline and a retrieval pipeline. Each can be deployed as an independent microservice through Hayhooks.

From RAG Index to Agentic Memory

Here’s where it gets really interesting. In 2023, vector stores were passive and read-only. You loaded documents once, and the model read from them. Write once, read many.

In 2025, agents are stateful. They run in loops and generate new data every cycle. But LLMs are inherently stateless and their context windows are finite. So how does an agent remember anything?

The answer is to use the vector store as a read-write memory layer. Instead of just retrieving external facts, the agent retrieves its own past experiences. Conversation history, learned preferences, and task summaries become searchable vectors.

Funderburk describes three memory layers:

  • Short-term memory: The immediate context window.
  • Working memory: A scratchpad for intermediate thoughts.
  • Long-term memory: A persistent vector store the agent writes to and reads from.

But here’s the problem. If the agent writes constantly without any curation, the memory fills with redundant, conflicting, and trivial junk. The select strategy starts returning garbage, and you get the exact context pollution problem you were trying to avoid.

Memory Consolidation

The book describes an advanced pattern from AWS (Amazon’s AgentCore architecture). Instead of blindly writing every new memory, the system runs a consolidation process:

  1. A new memory is generated.
  2. Before committing it, the system queries the vector store for the most similar existing memories.
  3. Both old and new memories go to an LLM with a consolidation prompt.
  4. The LLM decides: ADD (new information), UPDATE (refine existing memory), or NO-OP (redundant, skip it).
  5. Only then does the write actually happen.
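The control flow of that consolidation loop can be sketched as follows. Both the similarity search and the LLM call are stubbed here (word overlap and an exact-duplicate check stand in for them), and the memory strings are invented; this is the shape of the loop, not AWS's implementation:

```python
def retrieve_similar(store: list[str], new_memory: str, top_k: int = 3) -> list[str]:
    """Stub: a real system would run a vector similarity search here."""
    def shared_words(m: str) -> int:
        return len(set(m.lower().split()) & set(new_memory.lower().split()))
    return sorted(store, key=shared_words, reverse=True)[:top_k]

def decide(new_memory: str, similar: list[str]) -> str:
    """Stub for the LLM consolidation prompt: returns ADD, UPDATE, or NO-OP."""
    if new_memory in similar:
        return "NO-OP"  # exact duplicate: skip the write
    return "ADD"        # a real LLM would also detect refinements (UPDATE)

def consolidate_write(store: list[str], new_memory: str) -> str:
    similar = retrieve_similar(store, new_memory)
    verdict = decide(new_memory, similar)
    if verdict == "ADD":
        store.append(new_memory)
    # on UPDATE, a real system would replace the stale memory instead
    return verdict

memory = ["User prefers dark mode.", "User is based in Lisbon."]
print(consolidate_write(memory, "User prefers dark mode."))       # NO-OP: redundant
print(consolidate_write(memory, "User's team ships on Fridays.")) # ADD: new info
```

The gate sits before the write, which is the whole point: nothing lands in long-term memory without first being compared against what is already there.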

This is recursive. The system uses an LLM and RAG-like retrieval to manage the contents of its own memory. It is a self-curating loop, and it is a hallmark of 2025 agent design.

The Economics: Inference Is the Real Cost

The last section of Chapter 2 tackles money. And there is a common misconception Funderburk wants to correct: the primary cost of LLMs is NOT training. Training is a massive one-time capital expenditure done by AI companies like OpenAI, Google, and Anthropic. For everyone else building applications, the real cost is inference, the ongoing operational expense of actually using the model.

The analogy from the book: training is everything that goes into building a car. Inference is the cost of gasoline.

The 2025 market has an interesting paradox. Training costs keep going up, but inference costs are in freefall due to the API price war between incumbents and new startups. Funderburk calls this “The Great AI Cost Compression of 2025.”

But cheap API calls can be a trap. For system architects, what matters is Total Cost of Ownership (TCO):

  • For API models: TCO is straightforward. Input tokens + output tokens. But costs scale linearly and can become unpredictable at high volume.
  • For self-hosted models: Higher upfront cost but steady and predictable long-term expenses.

There is a crossover point where cumulative API costs exceed the cost of running your own infrastructure. SLMs play a big role here because a 7B parameter model is 10-30x cheaper to serve than a 70-175B parameter model.
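The crossover is simple arithmetic once you fix the inputs. Every number below is a made-up placeholder for illustration; substitute your actual API pricing, volume, and infrastructure quotes:

```python
# Illustrative break-even calculation. All figures are assumptions.
api_cost_per_1m_tokens = 2.00      # USD, blended input+output rate (assumed)
tokens_per_month = 2_000_000_000   # 2B tokens/month at high volume (assumed)

hosting_monthly = 1_500.00         # GPU node for a self-hosted SLM (assumed)
setup_cost = 10_000.00             # one-time engineering and setup (assumed)

api_monthly = tokens_per_month / 1_000_000 * api_cost_per_1m_tokens

if api_monthly <= hosting_monthly:
    print("At this volume the API is always cheaper; stay on the API.")
else:
    breakeven_months = setup_cost / (api_monthly - hosting_monthly)
    print(f"API spend: ${api_monthly:,.0f}/month; "
          f"self-hosting breaks even after ~{breakeven_months:.1f} months")
```

With these placeholder numbers the API bill runs $4,000/month against $1,500/month for hosting, so the one-time setup cost is recovered in four months. At low volume the inequality flips and the API never stops being cheaper, which is the whole point of running the calculation before committing either way.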

The final decision matrix comes down to four pillars: capability, cost, latency, and privacy. Self-hosted wins for regulated data, low latency, and high volume. APIs win for prototyping, low volume, and when you do not want to manage infrastructure.

Chapter 2 Wrap-Up

This was a dense chapter. Funderburk covered the transformer architecture, the SLM/RLM split, context engineering (write, select, compress, isolate), the LangGraph + Haystack hybrid architecture, vector stores as both RAG foundation and agentic memory, and the economics of inference.

The big ideas to carry forward:

  1. The one-size-fits-all model era is over. Choose models based on cost, latency, and reasoning depth.
  2. Prompt engineering has evolved into context engineering. Manage the full information flow, not just the prompt text.
  3. Use each framework for what it does best. LangGraph for orchestration, Haystack for data pipelines.
  4. Vector stores are no longer passive. They are active read-write memory for agents.
  5. Inference cost, not training cost, is what determines if your application is economically viable.

Next up: Chapter 3, where we get hands-on with Haystack and start building the tool layer.


This is post 6 of 24 in the Building Natural Language and LLM Pipelines series.

Previous: Chapter 2: Large Language Models - Part 2

Next: Chapter 3: Introduction to Haystack - Part 1
