NVIDIA says you’re overpaying for your AI agents. Here’s the data.

Bhupin Baral

May 5, 2026

A team of NVIDIA researchers just published a position paper that most AI vendors will quietly hope you never read. The argument is simple, technical, and uncomfortable for anyone selling GPT-4-class API tokens by the million:

Small language models — not LLMs — are the future of agentic AI.

The paper is titled *Small Language Models are the Future of Agentic AI* (Belcak et al., NVIDIA Research, arXiv:2506.02153). It’s a position paper, not a benchmark study, but it’s grounded in concrete numbers, real model comparisons, and a step-by-step migration algorithm. If you’re a founder running an AI feature in production, the implications for your cost structure are immediate.

This post walks through what the paper actually says, why it matters for anyone running agents in production, and what the migration path from LLMs to SLMs looks like in practice.

The core claim, in one sentence

The vast majority of agent invocations are repetitive, narrowly scoped, and non-conversational. A 7B fine-tuned model is sufficient for most of them. Continuing to route those calls through a 70B–175B generalist is — to use the paper’s framing — a misallocation of computational resources.

That last phrase is doing a lot of work. The authors are not saying LLMs are useless. They’re saying you’ve been treating every call as if it needed open-domain reasoning, when most of your calls are doing one of about ten things on repeat.

The paper’s working definition of an SLM is pragmatic rather than parameter-bound: a model that can fit on a common consumer device and serve one user with low enough latency to be useful. As of 2025, the authors say they’re comfortable treating most models below ~10B parameters as SLMs.

Why this is true: agents don’t actually need general intelligence

Spend an hour reading any production agent’s logs and a pattern emerges. The same kinds of calls show up over and over:

- Parse user intent from a sentence

- Decide which of N tools to invoke

- Convert a blob of text to JSON in a specific schema

- Summarize a document to a fixed-length brief

- Generate a status message from a template

- Extract entities from a paragraph

- Validate or critique another model’s output

Each of these is a narrow, predictable task. Each gets handled by a generalist LLM today not because it needs one, but because LLM API endpoints are the default plumbing the industry has built. The paper makes this point bluntly: LLM API endpoints are designed to serve one generalist model to a million different requests. Your agent isn’t a million different requests. It’s the same five to ten prompts, repeated forever.
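To see why, strip one of these calls down to its essentials. The tool names and output schema below are made up for illustration; the point is that everything except the user message is fixed, call after call:

```python
import json

# Hypothetical tool list for a support agent. Only the user message varies
# between calls; the instructions and output schema never change.
TOOLS = ["search_orders", "issue_refund", "escalate_to_human"]

def build_tool_routing_prompt(user_message: str) -> str:
    """Builds the fixed 'which tool?' prompt around a variable user message."""
    return (
        "Pick exactly one tool for this request.\n"
        f"Tools: {json.dumps(TOOLS)}\n"
        f"Request: {user_message}\n"
        'Answer as JSON: {"tool": "<name>"}'
    )

print(build_tool_routing_prompt("I was charged twice for my last order."))
```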

When you write it down that way, the economic case writes itself.

The actual numbers

This is where the paper stops feeling like opinion and starts feeling like a spreadsheet you should have built six months ago.

On cost. Serving a 7B SLM is roughly 10–30× cheaper than serving a 70B–175B LLM in latency, energy consumption, and FLOPs. That’s per-call, every call, forever. If your agent makes a million LLM calls a month, the difference is not a rounding error — it’s the difference between viable unit economics and a fundraising story.
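A back-of-envelope version of that math, where the 10–30× ratio is the paper’s and the per-call dollar figure is an assumption for the sake of the example:

```python
# Illustrative unit-economics math. The 10-30x ratio comes from the paper;
# the per-call price is an assumed placeholder, not a quoted rate.
CALLS_PER_MONTH = 1_000_000
LLM_COST_PER_CALL = 0.01    # assumed: ~$0.01/call on a hosted 70B+ model
SLM_ADVANTAGE = 20          # midpoint of the paper's 10-30x range

llm_bill = CALLS_PER_MONTH * LLM_COST_PER_CALL
slm_bill = llm_bill / SLM_ADVANTAGE
print(f"LLM: ${llm_bill:,.0f}/mo, SLM: ${slm_bill:,.0f}/mo, "
      f"saved: ${llm_bill - slm_bill:,.0f}/mo")
# LLM: $10,000/mo, SLM: $500/mo, saved: $9,500/mo
```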

On capability. The paper compiles an uncomfortable list of small models that already match or beat much larger ones on the specific tasks agents care about:

- Salesforce xLAM-2-8B achieves state-of-the-art performance on tool calling, surpassing GPT-4o and Claude 3.5 despite being a fraction of their size.

- DeepSeek-R1-Distill-Qwen-7B outperforms Claude-3.5-Sonnet-1022 and GPT-4o-0513 on reasoning benchmarks.

- Microsoft Phi-2 (2.7B parameters) achieves commonsense reasoning and code generation on par with 30B models while running approximately 15× faster.

- NVIDIA Nemotron-H (2/4.8/9B hybrid Mamba-Transformer) achieves instruction-following and code-generation accuracy comparable to dense 30B LLMs while using an order of magnitude fewer inference FLOPs.

- DeepMind’s RETRO-7.5B, augmented with a retrieval database, matches GPT-3 (175B) on language modeling — using 25× fewer parameters.

The pattern is consistent. On the narrow capabilities that actually matter for agents — tool calling, instruction following, structured output, code generation — the gap between small and large models has collapsed. The “you need a frontier model” assumption is already outdated, and most teams haven’t noticed yet.

The hidden tax of the LLM-default architecture

There’s a second-order argument in the paper that deserves more attention than it gets.

When your agent calls an LLM API, you’re paying for capabilities you’re not using. The model is reasoning across humanity’s accumulated text in order to answer “is this email about a refund request?” That’s a 7B-tractable classification problem dressed up in a 175B-token-priced wrapper.

Worse, you’re paying that tax on every call, and you have no leverage to bring the cost down. Your fine-tuning options are limited. Your latency is bounded by someone else’s serving infrastructure. Your data leaves your systems. Your roadmap depends on a vendor’s pricing decisions.

The paper highlights a quieter cost: agentic interactions need close behavioral alignment. Tool calls have to conform to strict schemas. Outputs have to parse cleanly. A generalist LLM that occasionally hallucinates the wrong JSON format or substitutes XML for the YAML you asked for is a reliability liability — one a small model fine-tuned on your specific format will not have. Smaller, narrower models are not just cheaper; they’re often more reliable for the specific thing you need them to do.
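One cheap way to measure this in your own system is to validate every output against the schema your agent actually requires and track the failure rate per model. A minimal sketch using the jsonschema library, with a made-up tool-call schema:

```python
import json
from jsonschema import ValidationError, validate

# A made-up tool-call schema; substitute whatever your agent actually expects.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {"tool": {"type": "string"}, "args": {"type": "object"}},
    "required": ["tool", "args"],
    "additionalProperties": False,
}

def parses_cleanly(raw_output: str) -> bool:
    """True if the model's raw output is valid JSON matching the schema."""
    try:
        validate(json.loads(raw_output), TOOL_CALL_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Run this over a sample of logged outputs per model and compare failure rates.
```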

---

The migration path is already documented

The paper doesn’t just argue the position — it provides a step-by-step LLM-to-SLM conversion algorithm. The shape of it is straightforward:

1. Log every non-conversational LLM call. Set up encrypted, role-controlled logging on your agent’s tool-call and model-call interfaces. Capture inputs, outputs, latencies. Anonymize as you go.

2. Curate and filter the data. Once you have ~10k–100k examples (the paper notes this range is sufficient for fine-tuning small models), strip PII, PHI, and any application-specific sensitive data. Paraphrase or mask where needed.

3. Cluster the prompts. Run unsupervised clustering on the logged prompts and outputs. Most production agents collapse into five to ten recurring patterns — intent recognition, structured extraction, summarization, tool routing, and a few others. Each cluster is a candidate for specialization (see the sketch after this list).

4. Pick a candidate SLM per cluster. Choose based on the task type, the SLM’s relevant benchmarks, licensing, and your deployment footprint. The model lineup above is a reasonable starting point.

5. Fine-tune with PEFT. Techniques like LoRA and QLoRA make this affordable — typically a few GPU-hours per specialist (sketched below). Knowledge distillation from your existing LLM can transfer nuance.

6. Route, measure, replace. Send the appropriate cluster’s traffic to the new specialist. Measure quality and cost. Iterate.
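For steps 3 and 6, here is a minimal sketch of the clustering-and-routing loop. Everything in it is an assumption for illustration: the embedding model, the cluster count, and the specialist names are placeholders, and the logged prompts would come out of step 1:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Stand-in for step 1's output; in practice, load your scrubbed logs here.
logged_prompts = [
    "Which tool handles a refund for order #1132?",
    "Extract the company names from this paragraph: ...",
    "Summarize this ticket in two sentences: ...",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly
X = embedder.encode(logged_prompts)

# The paper observes most agents collapse into 5-10 recurring patterns;
# start around k=8 (capped here so the toy data still runs) and inspect
# the clusters by hand.
clusterer = KMeans(n_clusters=min(8, len(logged_prompts)), random_state=0).fit(X)

# Step 6: route each cluster's traffic to its specialist once one exists,
# and fall back to the incumbent LLM for everything else.
SPECIALISTS = {0: "intent-slm-7b", 1: "extractor-slm-3b"}  # placeholder names

def route(prompt: str) -> str:
    """Returns the model that should serve this prompt."""
    cluster = int(clusterer.predict(embedder.encode([prompt]))[0])
    return SPECIALISTS.get(cluster, "incumbent-llm")
```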

This is not a research program. It’s a quarter of engineering work for a team that already has logging infrastructure.
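To make “a quarter of engineering work” concrete, here is roughly what the fine-tuning step (5) can look like with Hugging Face’s peft library. The base model, data path, and hyperparameters are illustrative assumptions, not the paper’s recipe:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "microsoft/phi-2"  # one of the SLMs named earlier; any ~3-9B works
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(BASE)
# LoRA trains a small set of adapter weights instead of the full model,
# which is what keeps this to "a few GPU-hours per specialist".
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Assumed file: one cluster's logged examples, one {"text": ...} per line.
data = load_dataset("json", data_files="cluster_3.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialist-cluster-3",
                           per_device_train_batch_size=4,
                           num_train_epochs=2),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```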

How much of your stack is actually replaceable?

The paper includes case studies on three popular open-source agents and estimates the percentage of LLM calls that could be reliably handled by appropriately specialized SLMs:

- MetaGPT (multi-agent software development framework): roughly 60% of its LLM queries are SLM-replaceable. Routine code generation, boilerplate, and structured template responses don’t need a frontier model. Architectural reasoning and adaptive debugging still benefit from LLMs.

- Open Operator (workflow automation): roughly 40%. Simple command parsing, intent routing, and templated message generation are clean SLM targets. Multi-step reasoning and long-context coordination still favor LLMs.

- Cradle (general computer control via GUI interaction): roughly 70%. Repetitive GUI workflows and learned click sequences are highly specializable; dynamic adaptation and unstructured error recovery still need LLM-grade context.

Forty to seventy percent. That’s the realistic range of LLM calls in already-deployed open-source agents that could be replaced today, with the techniques the paper describes, by teams that already exist. If your agent is a typical production system, your number is somewhere in that band.

The argument against, addressed

The paper anticipates the obvious pushback and addresses it directly.

“LLMs will always be better at general language understanding.” True, in a vacuum. But agents don’t need general language understanding — they need narrow language understanding, on the specific tasks the agent’s prompts and tools have already constrained. The scaling laws that favor larger models assume constant architecture and untargeted tasks. SLMs trained with newer architectures (hybrid Mamba-Transformer, attention variants) and fine-tuned on your specific traffic break those assumptions.

“LLM inference is cheaper because of centralization and scale.” Possibly today, in some scenarios. But inference scheduling systems like NVIDIA Dynamo are explicitly built for high-throughput, low-latency SLM serving. Setup costs for inference infrastructure are trending downward. And serving an 8B model requires no parallelization across GPUs — the operational simplicity itself is a cost lever.

“The industry has already invested in LLM infrastructure, so that’s where the innovation will go.” This is the most honest counter. The paper acknowledges it explicitly. There is roughly $57B in committed cloud infrastructure betting on LLM-centric serving. Inertia is real. But the authors note that this is a barrier, not a technical limitation — and barriers fall when the economic argument becomes obvious enough.

What this means if you’re building right now

If you’re a founder or technical lead running an AI feature in production, three things follow from this paper.

Your token bill probably has a lot of fat in it. Not because you’re doing anything wrong, but because the default architecture overpays for capability you’re not using. The first audit — log your calls, cluster your prompts — costs you a sprint and tells you exactly where the fat is.

Your competitive moat is partly in your stack, not just your product. Teams that fine-tune SLMs on their own traffic will run the same product at a fraction of the cost, with stronger data control, with fewer vendor dependencies, and with the option to deploy on their own infrastructure. The teams that figure out which calls are actually small will eat the teams that don’t.

The “we’ll switch later when models get cheap” plan was already obsolete when you wrote it. The models are already cheap. The migration path is already documented. The case studies are already published. The only question is whether you do the work now or do it twelve months from now after a competitor has.

The expensive part of your AI stack isn’t the model. It’s the assumption that every task needs the biggest one.

---

Source

Belcak, P., Heinrich, G., Diao, S., Fu, Y., Dong, X., Muralidharan, S., Lin, Y. C., & Molchanov, P. (2025). *Small Language Models are the Future of Agentic AI.* NVIDIA Research. arXiv:2506.02153. [https://arxiv.org/pdf/2506.02153](https://arxiv.org/pdf/2506.02153)

— deploy.real