Open the marketing page of any inference vendor and you'll find some version of the same chart: tokens-per-second on Llama-70B, time-to-first-token (TTFT) on a single chat completion. These numbers are real, and they are almost completely useless for predicting how an agent will behave in production.
What single-call benchmarks miss
- Concurrency. A single-stream benchmark hides queueing effects. Real agent workloads run 10–100 requests in parallel.
- Multi-call sequencing. Agents make tens of model calls per task, and every handoff between steps adds overhead a single-call benchmark never measures.
- Tool use and retrieval. In most agents the slow steps are not the model; they're the tool and retrieval calls around it.
- Tail behavior. p50 looks fine. p95 and p99 are where users feel the system. Most benchmarks publish averages, which hide exactly that tail (see the sketch after this list).
- Mixed workload. Real fleets have small models, large models, embeddings and reranking. A 70B benchmark says nothing about that.
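To make the tail-behavior point concrete, here is a minimal sketch in Python. The latency distribution is synthetic and purely illustrative (a lognormal, a common shape for queue-inflated latencies), not data from any vendor; it just shows how a healthy-looking mean can coexist with a painful p99.

```python
import random
import statistics

# Synthetic, illustrative latencies: most requests are quick, a few queue
# behind long decodes. Lognormal is a stand-in shape, not measured data.
random.seed(0)
latencies_s = [random.lognormvariate(0.0, 0.6) for _ in range(1000)]

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, good enough for a benchmark report."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))]

print(f"mean {statistics.mean(latencies_s):.2f}s")  # looks fine in a report
print(f"p50  {percentile(latencies_s, 50):.2f}s")
print(f"p95  {percentile(latencies_s, 95):.2f}s")   # what users actually feel
print(f"p99  {percentile(latencies_s, 99):.2f}s")
```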
A better methodology
We propose four metrics that, together, predict agent UX (a sketch of computing them from per-task traces follows the list):
1. End-to-end task latency (p50/p95/p99): wall clock from user request to final answer, including tools.
2. Steps-per-second: how many model+tool steps the system can sustain under realistic concurrency.
3. Cost per completed task: total tokens × prices + tool invocation costs, attributed per task.
4. Quality-conditioned latency: latency counted only on tasks that passed your eval suite. Slow correct beats fast wrong.
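As a rough illustration of how these four numbers roll up from raw traces, here is a minimal sketch. The `Step`/`TaskTrace` schema, the token prices, and the choice to attribute total spend to tasks that passed are all assumptions made for the example, not a prescribed format:

```python
from dataclasses import dataclass, field

# Hypothetical per-task trace schema; field names are assumptions,
# not any vendor's API.
@dataclass
class Step:
    kind: str                  # "model" or "tool"
    duration_s: float
    prompt_tokens: int = 0
    completion_tokens: int = 0
    tool_cost_usd: float = 0.0

@dataclass
class TaskTrace:
    wall_clock_s: float        # user request -> final answer, tools included
    passed_eval: bool          # did this task pass your eval suite?
    steps: list[Step] = field(default_factory=list)

# Illustrative prices in USD per million tokens; substitute your real rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def percentile(values, p):
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))]

def task_cost_usd(trace: TaskTrace) -> float:
    return sum(
        s.prompt_tokens / 1e6 * PRICE_IN
        + s.completion_tokens / 1e6 * PRICE_OUT
        + s.tool_cost_usd
        for s in trace.steps
    )

def report(traces: list[TaskTrace], run_seconds: float) -> dict:
    latencies = [t.wall_clock_s for t in traces]
    passed = [t for t in traces if t.passed_eval]
    return {
        # 1. End-to-end task latency, tools included
        "task_latency_s": {p: percentile(latencies, p) for p in (50, 95, 99)},
        # 2. Model+tool steps sustained per second over the whole run
        "steps_per_second": sum(len(t.steps) for t in traces) / run_seconds,
        # 3. Total spend attributed to tasks that actually passed
        "cost_per_completed_task": sum(map(task_cost_usd, traces)) / max(1, len(passed)),
        # 4. Latency counted only on tasks that passed the eval suite
        "quality_conditioned_p95": percentile([t.wall_clock_s for t in passed], 95)
        if passed else None,
    }
```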
Build a representative harness
Pick three to five tasks that look like your real workload, not synthetic Q&A. Wire them up with the actual tools they'll use in production. Run at three concurrency levels (1, 16, 64) and record all four metrics. Re-run weekly; the metric that degrades first is your bottleneck.
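One possible shape for that harness, sketched below under assumptions: `run_task` is a hypothetical stand-in for executing one real task end-to-end (model calls, tools, retrieval) and returning its trace, and the concurrency sweep uses a plain asyncio semaphore rather than whatever load generator you already run.

```python
import asyncio
import time

async def run_task(task_id: int) -> dict:
    """Hypothetical stand-in: run one agent task end-to-end, return its trace."""
    start = time.monotonic()
    await asyncio.sleep(0.1)  # placeholder for real model + tool calls
    return {"wall_clock_s": time.monotonic() - start, "passed_eval": True}

async def run_at_concurrency(task_ids: list[int], concurrency: int) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(task_id: int) -> dict:
        async with sem:
            return await run_task(task_id)

    return await asyncio.gather(*(bounded(t) for t in task_ids))

async def main() -> None:
    task_ids = list(range(200))        # replay the same task set at each level
    for concurrency in (1, 16, 64):
        t0 = time.monotonic()
        traces = await run_at_concurrency(task_ids, concurrency)
        run_seconds = time.monotonic() - t0
        # Feed `traces` and `run_seconds` into the metric report above and
        # persist the output so weekly re-runs stay comparable.
        print(concurrency, run_seconds, len(traces))

asyncio.run(main())
```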
"We rewrote our vendor scorecard around end-to-end task latency at p95 and the rankings completely changed."
Why heterogeneous fabrics win these benchmarks
Once you measure end-to-end, the value of moving orchestration off GPUs and running decode on specialized accelerators becomes obvious. Single-silicon clouds can't beat a fabric that places each step on the right hardware. The benchmark just has to be designed to see it.