Engineering · For Heads of AI / ML Platform

Heterogeneous inference, explained: why no single chip is best at agents

Single-silicon inference clouds optimize for single-call benchmarks. Real agents are graphs — and the only way to win on latency and cost simultaneously is to run each step on the chip that suits it best.

Marcus Liang · Head of Inference Engineering · April 15, 2026 · 8 min read

If you take one thing from this post: the published latency benchmark for your favorite model on your favorite GPU is, at best, half the story for an agent. A real agent makes 10 to 200 calls per task — across LLMs of different sizes, embedding models, classifiers, code interpreters, web fetches, database queries and tool invocations. The optimal chip changes call by call.

Why one chip can't win

Take a credit-decisioning agent. It loads the applicant context (CPU work — JSON parsing, joins). It calls a 70B reasoning model (GPU). It validates the output against a small classifier (TPU or even CPU). It hits a few APIs (network-bound, CPU). It generates the final letter (medium GPU). And it logs everything (CPU, IO). Stuff all of that onto H100s and you pay GPU prices for orchestration. Stuff it all onto CPUs and the 70B call takes a minute. The optimal cost-latency point is heterogeneous.
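As a sketch, the decisioning flow above is really a placement table. The step names and silicon labels below are illustrative, not a real API:

```python
# Hypothetical per-step placement for the credit-decisioning agent.
# Step names and silicon classes are illustrative only.
PLACEMENT = [
    ("load_context",    "cpu"),  # JSON parsing, joins
    ("reason_70b",      "gpu"),  # large-model generation
    ("validate_output", "tpu"),  # small classifier, batched
    ("call_apis",       "cpu"),  # network-bound tool I/O
    ("draft_letter",    "gpu"),  # medium-model generation
    ("log_trace",       "cpu"),  # IO-bound logging
]

def placements_for(silicon: str) -> list[str]:
    """Return the steps that land on a given silicon class."""
    return [step for step, chip in PLACEMENT if chip == silicon]
```

Half of this graph never needs a GPU at all; that is the whole argument in six lines.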

The four silicon classes that matter

  • CPUs — orchestration, parsing, retrieval, tool I/O. The connective tissue of every agent graph.
  • GPUs (H100/H200/B200/MI300) — large-model generation, long context, multimodal.
  • TPUs — high-throughput batched work: embeddings, post-training, evals.
  • Non-GPU accelerators (Groq-class, custom ASICs) — latency-critical decoding paths where p95 matters more than peak throughput.

Why most clouds can't do this

Hyperscalers sell instances. To exploit heterogeneity at the granularity an agent needs, you'd have to provision four pools, write your own placement logic, manage four sets of failure modes, and pay for idle time on each. Almost nobody does that work, so almost nobody captures the benefit.

An inference fabric — a multi-vendor fleet behind a single scheduler, with a smart router making per-call placement decisions — is the only way to capture the heterogeneous advantage without forcing every customer to become a distributed-systems team.
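What a per-call placement decision might look like, as a minimal sketch: the `Call` shape, the thresholds, and the silicon labels are all assumptions for illustration, not a real Synaptix interface.

```python
from dataclasses import dataclass

@dataclass
class Call:
    kind: str                    # "generate", "embed", "classify", "tool"
    model_params_b: float = 0.0  # model size in billions, if any
    latency_critical: bool = False

def route(call: Call) -> str:
    """Pick a silicon class for one call. Thresholds are illustrative."""
    if call.kind == "tool":
        return "cpu"    # never pay GPU prices for glue code
    if call.kind == "embed":
        return "tpu"    # high-throughput batched work
    if call.latency_critical and call.model_params_b <= 8:
        return "asic"   # Groq-class decode path, p95 over throughput
    if call.model_params_b >= 30:
        return "gpu"    # large-model generation
    return "cpu"        # small classifiers run fine here
```

A production router would also weigh queue depth, current pool prices, and context length, but the shape is the same: placement is a per-call function, not a per-deployment one.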

"We measured a 4.7× cost reduction and a 3.2× p95 improvement just from moving orchestration off GPUs. We hadn't changed a single model."

Engineering blog, Synaptix Labs

What this means for your benchmarking

If you're evaluating inference vendors for an agent program, single-call benchmarks (TTFT, tokens-per-second on one model) will mislead you. Build a representative agent. Measure end-to-end task latency, p95 across realistic concurrency, and total cost per completed task. The vendor that wins on chat completions will not always be the vendor that wins on agents.
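A minimal harness for that kind of measurement might look like the following; `run_task` and `cost_of` are stand-ins for your own agent entry point and billing data:

```python
import time

def benchmark(run_task, tasks, cost_of):
    """Measure end-to-end agent performance: p95 task latency and
    total cost per *completed* task. run_task(task) -> bool (success);
    cost_of(task) -> float. Both are stand-ins for your own stack."""
    latencies, total_cost, completed = [], 0.0, 0
    for task in tasks:
        start = time.perf_counter()
        ok = run_task(task)
        latencies.append(time.perf_counter() - start)
        if ok:
            completed += 1
            total_cost += cost_of(task)
    latencies.sort()
    p95 = latencies[max(0, int(0.95 * len(latencies)) - 1)]
    return {"p95_s": p95, "cost_per_completed": total_cost / max(completed, 1)}
```

Run it under realistic concurrency, not a single warm client, and compare vendors on these two numbers rather than on tokens per second.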


Bring this to your enterprise.

Talk to our team about how Synaptix would map to your stack and your roadmap.