Open the marketing page of any inference vendor and you'll find some version of the same chart: tokens-per-second on Llama-70B, time to first token (TTFT) on a single chat completion.
What single-call benchmarks miss
- Concurrency.
- Multi-call sequencing.
- Tool use and retrieval.
A better methodology
1. End-to-end task latency (p50/p95/p99).
2. Steps-per-second.
3. Cost per completed task.
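As a rough illustration, the three metrics can be computed from a list of recorded task runs. This is a minimal sketch, not a standard definition: the `TaskRun` record, its field names, and the `summarize` helper are hypothetical, the percentile uses the nearest-rank method, steps-per-second is taken as total steps over total task wall-clock time, and cost per completed task charges the spend from failed runs to the ones that finished.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRun:
    # Hypothetical record of one end-to-end task execution.
    latency_s: float  # wall-clock seconds for the whole task
    steps: int        # model calls plus tool calls made during the task
    cost_usd: float   # spend attributed to the task
    completed: bool   # whether the task finished successfully


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs):
    """Compute task-level metrics from a non-empty list of TaskRun."""
    latencies = [r.latency_s for r in runs]
    completed = sum(1 for r in runs if r.completed)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        # Assumption: aggregate steps divided by aggregate task time.
        "steps_per_second": sum(r.steps for r in runs) / sum(latencies),
        # Assumption: failed runs still cost money, so their spend is included.
        "cost_per_completed_task": (
            total_cost / completed if completed else float("inf")
        ),
    }
```

With only a handful of runs, p95 and p99 collapse onto the slowest observation, so the tail percentiles only become meaningful once each task has been repeated many times.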
Build a representative harness
Pick three to five tasks that look like your real workload — not synthetic Q&A.
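One possible shape for such a harness is sketched below, assuming each task is an async callable that runs the full multi-step workflow and returns the number of steps it took. The names `run_task`, `run_harness`, and `fake_agent_task` are hypothetical, and the fake task is only a stand-in for a real workflow that would call a vendor's API and your tools.

```python
import asyncio
import time


async def run_task(name, task_fn, semaphore):
    """Run one task and record its wall-clock latency and outcome."""
    async with semaphore:
        # Timing starts once a concurrency slot is acquired, so it
        # excludes time spent queueing inside the harness itself.
        start = time.perf_counter()
        try:
            steps = await task_fn()
            completed = True
        except Exception:
            steps, completed = 0, False
        return {
            "task": name,
            "latency_s": time.perf_counter() - start,
            "steps": steps,
            "completed": completed,
        }


async def run_harness(tasks, repeats=3, concurrency=4):
    """Run every task `repeats` times with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    jobs = [
        run_task(name, fn, semaphore)
        for name, fn in tasks.items()
        for _ in range(repeats)
    ]
    return await asyncio.gather(*jobs)


async def fake_agent_task():
    # Stand-in for a real workflow: three sequential "steps".
    steps = 0
    for _ in range(3):
        await asyncio.sleep(0)  # placeholder for a model or tool call
        steps += 1
    return steps


if __name__ == "__main__":
    results = asyncio.run(run_harness({"demo": fake_agent_task}))
    print(results)
```

Raising `concurrency` toward production levels is what exposes the queueing and rate-limit behavior that a single-call benchmark never exercises.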
"We rewrote our vendor scorecard around task latency at p95 and the rankings completely changed."