If you take one thing from this post: the published latency benchmark for your favorite model on your favorite GPU is, at best, half the story for an agent. A real agent makes 10 to 200 calls per task — across LLMs of different sizes, embedding models, classifiers, code interpreters, web fetches, database queries and tool invocations. The optimal chip changes call by call.
Why one chip can't win
Take a credit-decisioning agent. It loads the applicant context (CPU work — JSON parsing, joins). It calls a 70B reasoning model (GPU). It validates the output against a small classifier (TPU or even CPU). It hits a few APIs (network-bound, CPU). It generates the final letter (medium GPU). And it logs everything (CPU, IO). Stuff all of that onto H100s and you pay GPU prices for orchestration. Stuff it all onto CPUs and the 70B call takes a minute. The optimal cost-latency point is heterogeneous.
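To make the shape of the problem concrete, here is a minimal sketch of that flow as a step list, with each step tagged by the silicon class it actually needs. The step names, target labels, and stub functions are hypothetical illustrations of the pipeline's heterogeneity, not anyone's production implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Step:
    name: str
    target: str                                   # which silicon class this step wants
    fn: Callable[[Dict[str, Any]], Dict[str, Any]]

def load_context(state):
    # CPU work: JSON parsing and joins over applicant records.
    return {**state, "context": "parsed-applicant-record"}

# Stubs standing in for the model calls and tool invocations described above.
credit_decision_pipeline: List[Step] = [
    Step("load_context",      "cpu",        load_context),
    Step("reason_70b",        "gpu-large",  lambda s: {**s, "decision": "..."}),  # 70B reasoning call
    Step("validate_decision", "cpu",        lambda s: {**s, "valid": True}),      # small classifier
    Step("call_bureau_apis",  "cpu",        lambda s: s),                         # network-bound
    Step("draft_letter",      "gpu-medium", lambda s: {**s, "letter": "..."}),    # final letter
    Step("log_task",          "cpu",        lambda s: s),                         # IO-bound logging
]

def run(pipeline: List[Step], state: Dict[str, Any] = None) -> Dict[str, Any]:
    state = state or {}
    for step in pipeline:
        # In a real system, step.target would drive per-call placement; here we just execute.
        state = step.fn(state)
    return state
```

Three of the six steps never need a GPU at all; that ratio is what makes single-pool placement so expensive.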
The four silicon classes that matter
- CPUs — orchestration, parsing, retrieval, tool I/O. The connective tissue of every agent graph.
- GPUs (H100/H200/B200/MI300) — large-model generation, long context, multimodal.
- TPUs — high-throughput batched work: embeddings, post-training, evals.
- Non-GPU accelerators (Groq-class, custom ASICs) — latency-critical decoding paths where p95 matters more than peak throughput.
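Encoded as data, that taxonomy becomes a placement policy that can start out as a plain lookup table. A minimal sketch with hypothetical workload-kind names; the mapping is illustrative, not prescriptive:

```python
# Hypothetical mapping from workload kind to preferred silicon class.
# A real policy would also weigh queue depth, cost, and p95 targets.
PLACEMENT_POLICY = {
    "orchestration":             "cpu",
    "parsing":                   "cpu",
    "retrieval":                 "cpu",
    "tool_io":                   "cpu",
    "large_generation":          "gpu",   # H100/H200/B200/MI300-class
    "long_context":              "gpu",
    "multimodal":                "gpu",
    "embeddings":                "tpu",
    "post_training":             "tpu",
    "evals":                     "tpu",
    "latency_critical_decoding": "asic",  # Groq-class / custom silicon
}

def place(workload_kind: str) -> str:
    # Fall back to CPU for anything unclassified.
    return PLACEMENT_POLICY.get(workload_kind, "cpu")
```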
Why most clouds can't do this
Hyperscalers sell instances. To exploit heterogeneity at the granularity an agent needs, you'd have to provision four separate pools, write your own placement logic, manage four sets of failure modes, and pay for idle capacity in each. In practice, nobody does this, so nobody captures the benefit.
An inference fabric — a multi-vendor fleet behind a single scheduler, with a smart router making per-call placement decisions — is the only way to capture the heterogeneous advantage without forcing every customer to become a distributed-systems team.
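What the router inside such a fabric does per call can be sketched in a few lines: given the call's silicon requirements and latency budget, pick the cheapest backend expected to meet both. The backend names, costs, and latency figures below are made up; the point is the shape of the decision, not the numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Backend:
    name: str
    silicon: str            # "cpu" | "gpu" | "tpu" | "asic"
    cost_per_call: float    # dollars, illustrative
    p95_latency_ms: float   # observed, illustrative

# A toy fleet; a real fabric would refresh these numbers from live telemetry.
FLEET: List[Backend] = [
    Backend("cpu-pool",    "cpu",  0.0002,   40),
    Backend("h100-pool",   "gpu",  0.0300,  900),
    Backend("tpu-pool",    "tpu",  0.0040,  300),
    Backend("fast-decode", "asic", 0.0120,  120),
]

def route(required_silicon: Set[str], latency_budget_ms: float) -> Optional[Backend]:
    """Cheapest backend that matches the call's silicon needs and latency budget."""
    candidates = [
        b for b in FLEET
        if b.silicon in required_silicon and b.p95_latency_ms <= latency_budget_ms
    ]
    return min(candidates, key=lambda b: b.cost_per_call, default=None)

# Example: an orchestration call can run on CPU or an ASIC, but must return in under 200 ms.
choice = route({"cpu", "asic"}, latency_budget_ms=200)
```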
"We measured a 4.7× cost reduction and a 3.2× p95 improvement just from moving orchestration off GPUs. We hadn't changed a single model."
What this means for your benchmarking
If you're evaluating inference vendors for an agent program, single-call benchmarks (time to first token, tokens per second on one model) will mislead you. Build a representative agent. Measure end-to-end task latency, p95 across realistic concurrency, and total cost per completed task. The vendor that wins on chat completions will not always be the vendor that wins on agents.
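The harness for that can be small: run a representative task N times at realistic concurrency and report end-to-end latency, p95, and cost per completed task. Everything below (the run_task stub, the concurrency level, the cost accounting) is a placeholder you would swap for your own agent and your vendor's actual pricing.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_task() -> float:
    """Placeholder for one end-to-end agent task; returns its cost in dollars."""
    time.sleep(0.05)   # stand-in for the 10 to 200 calls a real task makes
    return 0.012       # stand-in for the task's summed per-call cost

def benchmark(n_tasks: int = 200, concurrency: int = 16) -> dict:
    latencies, costs = [], []

    def timed(_):
        start = time.perf_counter()
        cost = run_task()
        return time.perf_counter() - start, cost

    # Realistic concurrency matters: p95 under load is what the agent's users feel.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for latency, cost in pool.map(timed, range(n_tasks)):
            latencies.append(latency)
            costs.append(cost)

    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "median_task_latency_s": statistics.median(latencies),
        "p95_task_latency_s": p95,
        "cost_per_completed_task": sum(costs) / len(costs),
    }

print(benchmark())
```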