If you take one thing from this post: the published latency benchmark for your favorite model on your favorite GPU is, at best, half the story for an agent.
Why one chip can't win
The four silicon classes that matter
- CPUs.
- GPUs (H100/H200/B200/MI300).
- TPUs.
Why most clouds can't do this
"We measured a 4.7× cost reduction and a 3.2× p95 improvement just from moving orchestration off GPUs."
What this means for your benchmarking
If you're evaluating inference vendors for an agent program, single-call benchmarks (TTFT, tokens-per-second on one model) will mislead you.
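To make that concrete, here is a minimal sketch of measuring a whole agent task instead of a single call. Everything in it is an assumption for illustration: the `Step` type, the "model" / "other" split, and the timings are hypothetical placeholders, not measurements from any vendor.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Step:
    kind: str        # "model" for an inference call, "other" for tools/orchestration (hypothetical labels)
    seconds: float   # wall-clock time recorded for this step


def task_latency(steps):
    """End-to-end latency of one agent task: the sum of all its steps."""
    return sum(s.seconds for s in steps)


def model_share(steps):
    """Fraction of end-to-end time spent inside model calls."""
    total = task_latency(steps)
    model = sum(s.seconds for s in steps if s.kind == "model")
    return model / total if total else 0.0


def p95(values):
    """95th percentile across many task runs (needs at least two values)."""
    return quantiles(values, n=100, method="inclusive")[94]


# Placeholder timings chosen for easy arithmetic, not real measurements.
task = [
    Step("model", 0.5),
    Step("other", 0.25),
    Step("model", 0.5),
    Step("other", 0.75),
]

print(task_latency(task))  # 2.0
print(model_share(task))   # 0.5
```

In practice you would record one `task_latency` value per completed agent run and report `p95` over those, alongside `model_share`, so the portion a single-call benchmark never sees is visible in the same table.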