Hardware Limits, NVIDIA NIMs, Edge Deployment, and Why LLMs Still Struggle

Chapter 9 of Laura Funderburk’s book takes a step back from building things and looks forward. What’s coming next for NLP and LLM systems? Where are the bottlenecks? What’s changing?

This first half covers two big topics: the hardware problem that limits how we deploy models, and the four fundamental limitations of LLMs that drive everything else in the chapter.

The Hardware Problem Is Real

Here’s the thing about LLMs that people tend to forget. They run on actual hardware. And that hardware has real constraints.

Funderburk breaks it down into three areas:

Computational demands. Training LLMs means pushing billions of parameters through massive datasets. Regular CPUs can’t handle it. You need GPUs or TPUs, hardware designed for parallel processing. This isn’t optional. It’s table stakes.

Memory constraints. Large models can be bigger than the memory on standard GPUs. During both training and inference, you hit a wall. You either need expensive high-memory GPUs or you distribute the model across multiple machines. Neither option is cheap.
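To make the wall concrete, here is a back-of-envelope sketch of how much GPU memory the weights alone consume at different precisions. The numbers are illustrative, not from the book, and real usage adds activations, KV cache, and (for training) optimizer state on top.

```python
# Rough GPU memory needed just to hold model weights.
# Illustrative only: inference also needs activations and KV cache,
# and training can need several times the weight footprint.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory in GB to store the weights alone at a given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

# A 70B-parameter model doesn't fit on a single 80 GB GPU at fp16:
print(weight_memory_gb(70e9, "fp16"))  # -> 140.0
print(weight_memory_gb(70e9, "int4"))  # -> 35.0, which is why quantization matters
```

This is exactly the arithmetic that pushes you toward expensive high-memory GPUs, multi-machine sharding, or quantization.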

Energy consumption. Running these models takes a lot of power. Like, a concerning amount. This creates both environmental and financial pressure, especially at scale.

NVIDIA’s Answer: NIMs and the Ecosystem

Funderburk points to NVIDIA’s ecosystem as the main path forward for dealing with hardware limitations. Three pieces matter here:

NVIDIA AI Foundation Models are pretrained models that run on NVIDIA’s accelerated infrastructure. Think of them as a curated catalog: community models and NVIDIA-built models, all tuned to run well on their hardware.

NVIDIA NeMo is a framework for training and customizing generative AI models. If you need to fine-tune or build your own model, this is the tool.

NVIDIA NIMs (Inference Microservices) are the really interesting part for production work. These are prebuilt, containerized microservices designed specifically for running model inference.

Here’s why NIMs matter. They use TensorRT under the hood, which applies techniques like quantization, layer fusion, and kernel tuning to squeeze maximum performance out of NVIDIA GPUs. From edge devices to data center clusters.

But here’s what connects it to the rest of the book. NIMs follow the exact same deployment pattern from Chapters 7 and 8. Remember packaging Haystack pipelines as microservices using Hayhooks? NIMs are the next logical step: deploying the foundation models themselves as high-performance, callable microservices.

In a future agentic system, the orchestrator wouldn’t care whether it’s calling a Haystack tool for RAG or a NIM for raw generation. They’re both just endpoints.
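NIMs expose an OpenAI-compatible chat API, which is what makes the "they're both just endpoints" point concrete. A minimal sketch, assuming a NIM container running at a placeholder URL with a placeholder model name (both are whatever you actually deploy):

```python
import json
import urllib.request

# Because NIMs speak the OpenAI-compatible /v1/chat/completions protocol,
# an orchestrator can build the same request whether the endpoint is a
# local NIM container or any other microservice. URL and model name are
# placeholders here.

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "meta/llama3-8b-instruct", "Hello")
print(req.full_url)  # -> http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req) would perform the call against a running NIM.
```

The orchestrator never needs to know what's behind the URL, which is the whole point.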

Haystack already has NVIDIA integration built in, with components like NvidiaTextEmbedder, NvidiaDocumentEmbedder, NvidiaGenerator, and NvidiaRanker. You can drop these into your existing pipelines.
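The drop-in works like any other Haystack component. A hedged sketch, assuming the `nvidia-haystack` integration package is installed and an `NVIDIA_API_KEY` is configured (the import path and model name follow that integration's conventions; the code simply reports availability if the package is missing):

```python
# Sketch: swapping an NVIDIA-backed generator into a Haystack pipeline.
# Assumes the nvidia-haystack integration package and an NVIDIA_API_KEY;
# if either is missing, we just note that instead of failing.
try:
    from haystack import Pipeline
    from haystack_integrations.components.generators.nvidia import NvidiaGenerator

    pipeline = Pipeline()
    pipeline.add_component("generator", NvidiaGenerator(model="meta/llama3-8b-instruct"))
    have_nvidia = True
except Exception:
    have_nvidia = False

print("NVIDIA integration available:", have_nvidia)
```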

Running Models Locally

Not everyone can or wants to use cloud APIs. Some companies have data sovereignty requirements. Others just want to avoid the costs. Funderburk covers the local deployment path too.

The main tools: Ollama, vLLM, and LM Studio. All of them let you pull and run open models on your own hardware with minimal setup. And because Haystack and LangGraph have integrations for these tools, you can swap out a cloud-based LLM generator component for a local one without rewriting your pipeline.
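The swap is easy because both backends hide behind the same call shape. A toy illustration (not any framework's actual API): Ollama's local server listens on port 11434 by default, and the model names below are placeholders.

```python
# Minimal illustration of the cloud-to-local swap: both backends are
# reached through one function, so the rest of the pipeline never changes.
# Model names are placeholders; Ollama serves on localhost:11434 by default.

def generation_request(backend: str, prompt: str) -> dict:
    """Return the endpoint URL and JSON body for a generation backend."""
    if backend == "ollama":
        return {
            "url": "http://localhost:11434/api/generate",
            "body": {"model": "llama3", "prompt": prompt, "stream": False},
        }
    if backend == "openai":
        return {
            "url": "https://api.openai.com/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }
    raise ValueError(f"unknown backend: {backend}")

local = generation_request("ollama", "Summarize chapter 9")
print(local["url"])  # -> http://localhost:11434/api/generate
```

Frameworks like Haystack formalize this as interchangeable generator components; this is just the idea with the wrapping removed.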

LangChain’s Local Deep Researcher is a good example. It’s a full agentic system that can gather information, organize it, and write reports, all running locally.

Funderburk also highlights DeepSeek-R1 as a milestone. The DeepSeek team released models in early 2025 that matched OpenAI’s o1 on reasoning tasks at a fraction of the cost. They did it through innovations like Group Relative Policy Optimization (GRPO), which eliminates the need for a large critic model during reinforcement learning, and knowledge distillation, which compresses a big model’s capabilities into a smaller one. The business case is clear: cheaper models, simpler infrastructure, shorter development cycles.
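The "no critic model" idea at the heart of GRPO fits in a few lines. A heavy simplification of the actual algorithm: sample several completions for the same prompt, score them, and use the group's own statistics as the baseline instead of a learned value function.

```python
from statistics import mean, pstdev

# Simplified core of Group Relative Policy Optimization: the advantage of
# each sampled completion is measured against its own group's mean and
# standard deviation, so no separate critic network is trained.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """(reward - group mean) / group std for each completion in a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid dividing by zero for uniform groups
    return [(r - mu) / sigma for r in rewards]

# Four completions for the same prompt, scored by some reward function:
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Completions above the group mean get positive advantages and are reinforced; the full method wraps this in a clipped policy-gradient update, which is omitted here.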

The Four Big Limitations of LLMs

With hardware context out of the way, Funderburk lays out why LLMs still aren’t good enough on their own. She identifies four problems, and each one maps directly to a future trend or protocol covered later in the chapter.

The truthfulness problem. LLMs are probabilistic systems, not factual databases. They predict the next word based on patterns, not facts. This is why they hallucinate. RAG (from Chapters 4-6) was designed to fix this by grounding responses in real documents. But static RAG is only a partial solution. The next step is Agentic Context Engineering (ACE), which creates dynamic, self-correcting context that continuously ensures factual grounding.

The context problem. LLMs have token limits. They can’t process infinite input, they struggle with long conversations, and they have no real long-term memory between sessions. This is why agentic architectures from Chapter 8 matter. Instead of cramming everything into one prompt, an orchestrator like LangGraph breaks the problem into steps, calling tools sequentially and managing state. ACE takes this further by treating context as an “evolving playbook” rather than a fixed prompt.
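The shape of that orchestration idea can be sketched without any framework. This is a toy version, not LangGraph itself: a shared state dict flows through a sequence of steps, and each step only reads and adds what it needs instead of one giant prompt carrying everything.

```python
# Toy orchestrator (not LangGraph): state flows step to step, so no single
# prompt has to hold the whole problem. Each step is a stand-in for a real
# tool call.

def retrieve(state: dict) -> dict:
    state["documents"] = ["doc about " + state["question"]]  # stand-in for real retrieval
    return state

def generate(state: dict) -> dict:
    state["answer"] = f"Answer to {state['question']!r} using {len(state['documents'])} doc(s)"
    return state

def run_graph(question: str) -> dict:
    state = {"question": question}
    for step in (retrieve, generate):  # LangGraph adds branching, retries, persistence
        state = step(state)
    return state

result = run_graph("What limits LLM context?")
print(result["answer"])
```

What LangGraph adds on top of this loop is exactly the hard part: conditional edges, retries, and persistent state between sessions.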

The black box problem. You can’t see how an LLM reached its answer. The internal reasoning is opaque. This is a huge issue in high-stakes domains like law, medicine, or finance. The answer coming down the pipeline is the Agent-to-Agent (A2A) protocol. A2A’s “Structured Task Execution” creates an auditable trace: which agent did what, in what order, with what result. That’s far more transparent than a single monolithic model.

The integration problem. Even the best LLM is useless if it can’t access the data it needs. Right now, data is trapped in silos and legacy systems. Every new tool or API needs custom connector code. This is exactly why Anthropic created the Model Context Protocol (MCP). A universal, open standard to replace one-off integrations with a plug-and-play ecosystem.

Why This Matters

Each limitation points to a specific solution. Truthfulness leads to ACE. Context leads to agentic orchestration. Black box leads to A2A. Integration leads to MCP.

Funderburk sets this up nicely. She doesn’t just list problems. She connects each problem to the tool or protocol that addresses it. The rest of Chapter 9 walks through MCP, A2A, and ACE in detail, which we’ll cover in the next post.

The takeaway from this first half: the hardware is getting better (NIMs, local deployment, cheaper models), but the fundamental limitations of LLMs aren’t going away. They’re being managed through better architecture, better protocols, and better context management. That’s where the field is heading.


This is post 20 of 24 in the Building Natural Language and LLM Pipelines series.

Previous: Chapter 8: Hands-On Projects - Part 3

Next: Chapter 9: Future Trends - Part 2

About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
