Leading open-source models, served at record speed.
One API for every open model — running on the same agent-native inference cloud that powers Synaptix. Best-in-class latency and throughput, pay-as-you-go.
Four ways to run open models in production.
From a single API call to dedicated multi-region deployments — TokenFactory meets you wherever you are.
Inference service
Access and run powerful open-source AI models through a single OpenAI-compatible API. Sub-second latency, 99.9% uptime, pay per token.
Batch inference
Process millions of requests asynchronously at up to 50% lower cost. Ideal for evaluations, embeddings, document pipelines and offline workflows.
Post-training service
Fine-tune, distill and align open models on your proprietary data. LoRA, full SFT, DPO and RL — without managing GPUs.
Enterprise-grade inference
Deploy and scale models on dedicated infrastructure with guaranteed uptime, private networking and custom SLAs.
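For the batch service, asynchronous jobs are typically submitted as a JSONL file with one request per line. Here is a minimal sketch of building that input, assuming TokenFactory mirrors the OpenAI batch input format — the `custom_id`, `method`, `url`, and `body` fields follow that convention and are an assumption here, not a confirmed TokenFactory schema:

```python
import json

# Build a JSONL batch input: one chat-completion request per line.
# Field layout mirrors the OpenAI batch format; TokenFactory's actual
# batch schema may differ — treat this as an illustrative sketch.
documents = ["Q3 revenue report", "Incident postmortem", "Churn analysis"]

lines = []
for i, doc in enumerate(documents):
    request = {
        "custom_id": f"req-{i}",              # your own correlation ID
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-oss-120b",
            "messages": [
                {"role": "user", "content": f"Summarize: {doc}"}
            ],
        },
    }
    lines.append(json.dumps(request))

batch_input = "\n".join(lines)
print(batch_input.splitlines()[0])
```

The `custom_id` lets you match results back to inputs once the job completes, since batch responses generally return out of order.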
A catalog that ships with the frontier.
New open releases are evaluated, optimized and added within days — same API, no migration.
OpenAI-compatible. Drop-in in 3 lines.
from openai import OpenAI
client = OpenAI(
base_url="https://api.tokenfactory.ai/v1",
api_key="tf_live_…",
)
resp = client.chat.completions.create(
model="gpt-oss-120b",
messages=[{"role": "user", "content": "Summarize this report."}],
)
print(resp.choices[0].message.content)

curl https://api.tokenfactory.ai/v1/chat/completions \
-H "Authorization: Bearer $TF_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v3.2",
"messages": [{"role":"user","content":"Hello"}]
}'

Pay only for the tokens you use.
Start with a free API key. Pay only for tokens consumed.
- OpenAI-compatible API
- All open models
- Batch + real-time
- Community support
Higher rate limits, priority routing and analytics for production workloads.
- 10× higher rate limits
- Priority queue
- Usage analytics & budgets
- Email support · 24h SLA
Dedicated capacity, private networking and a 99.99% uptime SLA.
- Dedicated GPUs · reserved
- VPC peering · BYOC
- Fine-tuning included
- 99.99% SLA · 24/7 support
Token prices vary by model. Batch inference is up to 50% lower than real-time.
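As a back-of-envelope illustration of the batch discount, assume a placeholder real-time rate of $0.50 per million tokens — actual prices vary by model, and this number is invented purely for the arithmetic:

```python
# Illustrative cost comparison: real-time vs. batch at a 50% discount.
# The $0.50/M-token rate is a placeholder, not an actual TokenFactory price.
REALTIME_PRICE_PER_M = 0.50    # $ per million tokens (assumed)
BATCH_DISCOUNT = 0.50          # batch runs at up to 50% lower cost

tokens = 2_000_000_000         # e.g. a 2B-token embedding/eval job

realtime_cost = tokens / 1_000_000 * REALTIME_PRICE_PER_M
batch_cost = realtime_cost * (1 - BATCH_DISCOUNT)

print(f"real-time: ${realtime_cost:,.2f}  batch: ${batch_cost:,.2f}")
# real-time: $1,000.00  batch: $500.00
```

At scale, the discount compounds with job size, which is why evaluations, embeddings, and offline pipelines are the natural fit for batch.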
Go deeper on TokenFactory
Benchmarks, technical posts and a printable product brief.
TTFT, throughput and p95 vs. Bedrock, Vertex and Together
How TokenFactory compares across the open-model frontier.
TokenFactory and the open-model frontier
gpt-oss, Kimi-K2.5, Qwen3-Coder and GLM-5 in production.
Benchmarking agent latency: TTFT, p95 and what single-model benchmarks miss
Why agent workloads need a different benchmark methodology.
TokenFactory — product brief
Architecture, model catalog, pricing and SLA in one PDF.
Heterogeneous inference — product brief
Routing across NVIDIA, AMD, Cerebras, Groq and TPU.
Start shipping with open models today.
Spin up an API key in minutes, or talk to us about dedicated capacity and fine-tuning.