Open the marketing page of any inference vendor and you'll find some version of the same chart: tokens-per-second on Llama-70B, time-to-first-token (TTFT) on a single chat completion. These numbers are real, and they are almost completely useless for predicting how an agent will behave in production.
What single-call benchmarks miss
- Concurrency. A single-stream benchmark hides queueing effects. Real agent workloads run 10–100 requests in parallel.
- Multi-call sequencing. Agents make tens of model calls per task, and every handoff between steps adds overhead a single-call benchmark never measures.
- Tool use and retrieval. In most agents the slow steps are not the model; they're the tool and retrieval calls around it.
- Tail behavior. p50 looks fine. p95 and p99 are where users feel the system. Most benchmarks publish averages, which hide exactly that tail (see the sketch after this list).
- Mixed workload. Real fleets have small models, large models, embeddings and reranking. A 70B benchmark says nothing about that.
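To make the tail-behavior point concrete, here is a minimal sketch in Python. The latency distribution is synthetic and purely illustrative (a lognormal, a common shape for queue-inflated latencies), not data from any vendor; it just shows how a healthy-looking mean can coexist with a painful p99.

```python
import random
import statistics

# Synthetic, illustrative latencies: most requests are quick, a few queue
# behind long decodes. Lognormal is a stand-in shape, not measured data.
random.seed(0)
latencies_s = [random.lognormvariate(0.0, 0.6) for _ in range(1000)]

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, good enough for a benchmark report."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))]

print(f"mean {statistics.mean(latencies_s):.2f}s")  # looks fine in a report
print(f"p50  {percentile(latencies_s, 50):.2f}s")
print(f"p95  {percentile(latencies_s, 95):.2f}s")   # what users actually feel
print(f"p99  {percentile(latencies_s, 99):.2f}s")
```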
A better methodology
We propose four metrics that, together, predict agent UX (a sketch of computing them from per-task traces follows the list):
1. End-to-end task latency (p50/p95/p99): wall clock from user request to final answer, including tools.
2. Steps-per-second: how many model+tool steps the system can sustain under realistic concurrency.
3. Cost per completed task: total tokens × prices + tool invocation costs, attributed per task.
4. Quality-conditioned latency: latency counted only on tasks that passed your eval suite. Slow correct beats fast wrong.
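As a rough illustration of how these four numbers roll up from raw traces, here is a minimal sketch. The `Step`/`TaskTrace` schema, the token prices, and the choice to attribute total spend to tasks that passed are all assumptions made for the example, not a prescribed format:

```python
from dataclasses import dataclass, field

# Hypothetical per-task trace schema; field names are assumptions,
# not any vendor's API.
@dataclass
class Step:
    kind: str                  # "model" or "tool"
    duration_s: float
    prompt_tokens: int = 0
    completion_tokens: int = 0
    tool_cost_usd: float = 0.0

@dataclass
class TaskTrace:
    wall_clock_s: float        # user request -> final answer, tools included
    passed_eval: bool          # did this task pass your eval suite?
    steps: list[Step] = field(default_factory=list)

# Illustrative prices in USD per million tokens; substitute your real rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def percentile(values, p):
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))]

def task_cost_usd(trace: TaskTrace) -> float:
    return sum(
        s.prompt_tokens / 1e6 * PRICE_IN
        + s.completion_tokens / 1e6 * PRICE_OUT
        + s.tool_cost_usd
        for s in trace.steps
    )

def report(traces: list[TaskTrace], run_seconds: float) -> dict:
    latencies = [t.wall_clock_s for t in traces]
    passed = [t for t in traces if t.passed_eval]
    return {
        # 1. End-to-end task latency, tools included
        "task_latency_s": {p: percentile(latencies, p) for p in (50, 95, 99)},
        # 2. Model+tool steps sustained per second over the whole run
        "steps_per_second": sum(len(t.steps) for t in traces) / run_seconds,
        # 3. Total spend attributed to tasks that actually passed
        "cost_per_completed_task": sum(map(task_cost_usd, traces)) / max(1, len(passed)),
        # 4. Latency counted only on tasks that passed the eval suite
        "quality_conditioned_p95": percentile([t.wall_clock_s for t in passed], 95)
        if passed else None,
    }
```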
Build a representative harness
Pick three to five tasks that look like your real workload, not synthetic Q&A. Wire them up with the actual tools they'll use in production. Run at three concurrency levels (1, 16, 64) and record all four metrics. Re-run weekly; the metric that degrades first is your bottleneck.
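One possible shape for that harness, sketched below under assumptions: `run_task` is a hypothetical stand-in for executing one real task end-to-end (model calls, tools, retrieval) and returning its trace, and the concurrency sweep uses a plain asyncio semaphore rather than whatever load generator you already run.

```python
import asyncio
import time

async def run_task(task_id: int) -> dict:
    """Hypothetical stand-in: run one agent task end-to-end, return its trace."""
    start = time.monotonic()
    await asyncio.sleep(0.1)  # placeholder for real model + tool calls
    return {"wall_clock_s": time.monotonic() - start, "passed_eval": True}

async def run_at_concurrency(task_ids: list[int], concurrency: int) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(task_id: int) -> dict:
        async with sem:
            return await run_task(task_id)

    return await asyncio.gather(*(bounded(t) for t in task_ids))

async def main() -> None:
    task_ids = list(range(200))        # replay the same task set at each level
    for concurrency in (1, 16, 64):
        t0 = time.monotonic()
        traces = await run_at_concurrency(task_ids, concurrency)
        run_seconds = time.monotonic() - t0
        # Feed `traces` and `run_seconds` into the metric report above and
        # persist the output so weekly re-runs stay comparable.
        print(concurrency, run_seconds, len(traces))

asyncio.run(main())
```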
"We rewrote our vendor scorecard around end-to-end task latency at p95 and the rankings completely changed."
Why heterogeneous fabrics win these benchmarks
Once you measure end-to-end, the value of moving orchestration off GPUs and running decode on specialized accelerators becomes obvious. Single-silicon clouds can't beat a fabric that places each step on the right hardware. The benchmark just has to be designed to see it.