Engineering · For ML engineers

Benchmarking agent latency: TTFT, p95, and why your single-model benchmarks lie

Most published inference benchmarks measure the wrong thing for agents.

Tomás Aguilar · Performance Engineering Lead · April 1, 2026 · 9 min

Open the marketing page of any inference vendor and you'll find some version of the same chart: tokens-per-second on Llama-70B, TTFT on a single chat completion.

What single-call benchmarks miss

  • Concurrency.
  • Multi-call sequencing.
  • Tool use and retrieval.
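A rough way to see why multi-call sequencing matters is simple arithmetic. Assuming each call's latency is independent of the others (a simplification, since calls under shared load are often correlated), the chance that a task with n sequential calls hits at least one call at or beyond the single-call p95 is 1 - 0.95^n. The function name below is illustrative, not part of any tool.

```python
def prob_any_slow_call(steps: int, tail_fraction: float = 0.05) -> float:
    """Chance that at least one of `steps` calls lands in the slowest tail.

    Assumes call latencies are independent, which is a simplification.
    tail_fraction=0.05 means "at or beyond the single-call p95".
    """
    return 1.0 - (1.0 - tail_fraction) ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        chance = prob_any_slow_call(steps)
        print(f"{steps:>2} sequential calls: {chance:.1%} chance of a p95-or-worse call")
```

Under that independence assumption, a ten-step task runs into a tail-latency call roughly 40% of the time, which is why a healthy single-call p95 says little about task-level p95.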

A better methodology

  1. End-to-end task latency (p50/p95/p99).
  2. Steps-per-second.
  3. Cost per completed task.
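The three metrics above can be computed from per-run records along these lines. This is a minimal sketch, not a definitive implementation: the TaskRun fields, the nearest-rank percentile method, and the choice to charge failed runs to the cost of completed ones are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRun:
    latency_s: float  # wall-clock seconds from task start to final answer
    steps: int        # model calls plus tool calls made during the run
    cost_usd: float   # spend attributed to this run
    completed: bool   # whether the task finished successfully


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer such as 50, 95, or 99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    completed = [r for r in runs if r.completed]
    if not completed:
        raise ValueError("no completed runs to summarize")
    latencies = [r.latency_s for r in completed]
    total_task_seconds = sum(r.latency_s for r in runs)
    return {
        # Latency percentiles over completed tasks only.
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        # Assumption: steps divided by summed task time (a per-task rate),
        # not aggregate throughput across concurrent tasks.
        "steps_per_second": sum(r.steps for r in runs) / total_task_seconds,
        # Assumption: failed runs still cost money, so their spend is
        # spread over the tasks that actually completed.
        "cost_per_completed_task_usd": sum(r.cost_usd for r in runs) / len(completed),
    }
```

Calling summarize on a list of TaskRun records returns one dictionary per vendor or configuration, which makes side-by-side comparison straightforward. With small sample counts the p99 collapses to the slowest run, so treat it with caution until each task has enough repeats.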

Build a representative harness

Pick three to five tasks that look like your real workload — not synthetic Q&A.
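A harness of this kind might be sketched as follows. Everything here is a placeholder under stated assumptions: fake_agent_task stands in for a call into your own agent, the task names are hypothetical, and the concurrency and repeat counts are arbitrary starting points.

```python
import asyncio
import time
from typing import Awaitable, Callable

AgentTask = Callable[[str], Awaitable[int]]


async def fake_agent_task(name: str) -> int:
    """Hypothetical stand-in for a real agent task; returns the step count."""
    steps = 3
    for _ in range(steps):
        await asyncio.sleep(0.01)  # pretend this is a model or tool call
    return steps


async def run_harness(
    tasks: dict[str, AgentTask], concurrency: int, repeats: int
) -> list[dict]:
    limiter = asyncio.Semaphore(concurrency)  # cap on in-flight tasks

    async def timed(name: str, task: AgentTask) -> dict:
        async with limiter:
            # Timer starts once the task is admitted, so client-side
            # queueing behind the semaphore is not counted as latency.
            start = time.perf_counter()
            steps = await task(name)
            elapsed = time.perf_counter() - start
        return {"task": name, "latency_s": elapsed, "steps": steps}

    jobs = [timed(name, fn) for name, fn in tasks.items() for _ in range(repeats)]
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    # Hypothetical task names; substitute tasks drawn from your real workload.
    workload = {
        "task_a": fake_agent_task,
        "task_b": fake_agent_task,
        "task_c": fake_agent_task,
    }
    results = asyncio.run(run_harness(workload, concurrency=4, repeats=5))
    for row in results:
        print(f'{row["task"]}: {row["latency_s"]:.3f}s over {row["steps"]} steps')
```

Running every task several times at a realistic concurrency level is what makes the p95 and p99 figures meaningful; a single run per task only gives you a point estimate.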

"We rewrote our vendor scorecard around task latency at p95 and the rankings completely changed."

Eng Director, Fortune-100 retailer

Why heterogeneous fabrics win these benchmarks
