Twelve months ago, the open-source frontier was clearly a step behind. Today, gpt-oss-120B holds its own on reasoning, Kimi-K2.5 stretches context to 2M tokens, Qwen3-Coder dominates code, GLM-5 leads multilingual, and DeepSeek V3.2 sets a new bar on cost-per-quality. The open frontier is no longer the catch-up frontier.
What changed
- Architectures matured. Mixture-of-experts, sparse attention and speculative decoding moved from papers to default choices.
- Post-training is open. RLHF, DPO, RLAIF and on-policy distillation are no longer secret sauce.
- Synthetic data scaled. The proprietary-data moat shrank with it.
- Operational tooling caught up. vLLM, TensorRT-LLM and SGLang made open-model serving genuinely production-grade.
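To make the last point concrete: serving an open model is now a few lines of Python. A minimal sketch using vLLM's offline API, where the model identifier is illustrative and a production deployment would put a server in front of it:

```python
# Minimal vLLM offline-inference sketch. The model id is illustrative;
# any open-weight model hosted on Hugging Face works the same way.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["Write a SQL query that deduplicates users by email."], params)
print(outputs[0].outputs[0].text)
```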
Why operations is the remaining moat
Open weights aren't a service. Running them at production speed, keeping the catalog current with releases, handling fine-tuning safely, providing OpenAI-compatible APIs and meeting enterprise SLAs — that's where most teams stall. TokenFactory exists to close that gap.
What we ship
1. Inference service — OpenAI-compatible API across the entire open frontier, sub-second latency, 99.9% uptime (client sketch below).
2. Batch inference — asynchronous processing at up to 50% lower cost, built for evals, embeddings, and document pipelines (batch sketch below).
3. Post-training — SFT, LoRA, DPO, and RL on your data, no GPU management (fine-tuning sketch below).
4. Dedicated deployments — reserved capacity with a 99.99% SLA, private networking, custom regions.
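An OpenAI-compatible API means existing client code ports with a one-line change. A minimal sketch, assuming a hypothetical base URL, API-key environment variable, and catalog model name (none of these are documented values here):

```python
import os
from openai import OpenAI

# The standard OpenAI SDK, pointed at a TokenFactory endpoint.
# Base URL, env var, and model id below are assumptions for illustration.
client = OpenAI(
    base_url="https://api.tokenfactory.example/v1",
    api_key=os.environ["TOKENFACTORY_API_KEY"],
)

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # catalog name as it might appear; an assumption
    messages=[{"role": "user", "content": "Summarize this incident report: ..."}],
)
print(resp.choices[0].message.content)
```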
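If batch jobs mirror the OpenAI Batch API shape (an assumption, not a documented guarantee), submission looks like this: requests go in a JSONL file, and the long completion window is what buys the discount.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.example/v1",  # hypothetical, as above
    api_key=os.environ["TOKENFACTORY_API_KEY"],
)

# requests.jsonl: one request per line, e.g.
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "qwen3-coder", "messages": [...]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # async window; results are fetched later
)
print(batch.id, batch.status)
```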
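Post-training likewise needs no GPU wrangling if it follows the OpenAI fine-tuning job shape (again an assumption about the surface). The sketch covers SFT; LoRA, DPO, and RL would presumably be configured on the same job object.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.example/v1",  # hypothetical, as above
    api_key=os.environ["TOKENFACTORY_API_KEY"],
)

# train.jsonl: chat-formatted training examples, one JSON object per line.
train = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="qwen3-coder",  # base model name is an assumption
    training_file=train.id,
)
print(job.id, job.status)
```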
"We swapped four vendors and a homegrown serving stack for one TokenFactory endpoint. Latency improved, costs dropped, and our team got back to building."
What to use when
- Reasoning-heavy: gpt-oss-120B or DeepSeek V3.2.
- Long-context retrieval: Kimi-K2.5.
- Code generation and review: Qwen3-Coder.
- Multilingual, customer-facing: GLM-5.
- Agent loops and tool use: MiniMax M2.1 or Nemotron 3 Super.

Pick by workload, route by policy, and let TokenFactory handle the rest. A minimal routing sketch follows.
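In code, "route by policy" can start as a lookup from workload tag to catalog model. The tags, client wiring, and model identifiers below are assumptions; the mapping mirrors the guidance above.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.example/v1",  # hypothetical, as above
    api_key=os.environ["TOKENFACTORY_API_KEY"],
)

# Workload tag -> catalog model. Identifiers are assumptions that mirror
# the recommendations above.
POLICY = {
    "reasoning":    "gpt-oss-120b",
    "long_context": "kimi-k2.5",
    "code":         "qwen3-coder",
    "multilingual": "glm-5",
    "agent":        "minimax-m2.1",
}

def route(workload: str, prompt: str) -> str:
    """Send the prompt to whichever model the policy picks for this workload."""
    model = POLICY.get(workload, "deepseek-v3.2")  # cost-efficient default
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("code", "Review this function for race conditions: ..."))
```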