LLMCloud · Open-model inference
Every open model. One API. Built for production.
Serving the leading open-source LLMs: gpt-oss, Kimi, Qwen, GLM, DeepSeek, Llama. Pay as you go, run batch jobs, fine-tune, or reserve dedicated capacity.
- 40+ open models
- 2M max context
- 99.99% enterprise SLA
- Up to 50% batch savings
Services
Four ways to run open models in production.
Real-time inference
OpenAI-compatible API. Sub-second latency, 99.9% uptime, pay per token.
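Since the API is described as OpenAI-compatible, a streaming call might look like the sketch below. It assumes the service honors the OpenAI `stream=True` option, which this page does not confirm. `join_stream` and `stream_reply` are hypothetical helper names; the base URL, model id, and `LC_API_KEY` variable are taken from the Quickstart.

```python
import os
from types import SimpleNamespace


def join_stream(chunks):
    """Concatenate the text deltas of an OpenAI-style streaming response."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # some chunks may carry no choices at all
            continue
        delta = chunk.choices[0].delta.content
        if delta:                  # role-only and final chunks have no text
            parts.append(delta)
    return "".join(parts)


def stream_reply(prompt, model="gpt-oss-120b"):
    """Request a streamed completion (assumes stream=True is supported)."""
    from openai import OpenAI      # imported lazily so join_stream works offline

    client = OpenAI(
        base_url="https://api.llmcloud.ai/v1",
        api_key=os.environ["LC_API_KEY"],
    )
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_stream(stream)


def fake_chunk(text):
    """Build a stand-in chunk so the helper can be tried without a network call."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    print(join_stream([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]))
```

Keeping the chunk handling in its own function lets it be checked offline before any tokens are billed.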
Batch inference
Asynchronous processing at up to 50% lower cost for evals, embeddings and pipelines.
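A batch job typically starts from a file of requests. The sketch below builds one in the JSONL layout used by the OpenAI Batch API (`custom_id`, `method`, `url`, `body`). Whether LLMCloud accepts this exact layout is an assumption; `build_batch_lines` is a hypothetical helper name.

```python
import json


def build_batch_lines(prompts, model="gpt-oss-120b"):
    """Return one JSON line per prompt in the OpenAI Batch input layout."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"req-{i}",          # lets results be matched to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines


if __name__ == "__main__":
    for line in build_batch_lines(["Summarize report A.", "Summarize report B."]):
        print(line)
```

With the OpenAI Python SDK, such a file would normally be uploaded via `client.files.create(..., purpose="batch")` and submitted via `client.batches.create(...)`, provided the service implements those endpoints.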
Fine-tuning
LoRA, SFT, DPO and RL on your data. Bring a dataset, get a model.
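Before uploading a dataset for SFT, it can help to check each record locally. The sketch below assumes the common chat-style JSONL layout (one object per line with a `messages` list of `role`/`content` pairs). This page does not specify the expected format, so treat the layout and the `validate_sft_record` helper as illustrative only.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sft_record(line):
    """Return a list of problems found in one JSONL record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing non-empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")

    # A supervised example needs a target, so the last turn should be the assistant's.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems
```

Running this over every line of a dataset and rejecting the file on the first non-empty result is usually cheaper than a failed training job.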
Dedicated deployments
Reserved GPUs, private networking, 99.99% SLA.
Models
A catalog that ships with the frontier.
New open releases evaluated, optimized and added within days. Same API.
- gpt-oss-120B: General · Reasoning, 128K context
- Kimi-K2.5: Long context · Agents, 2M context
- Qwen3-Coder-480B: Code · MoE, 256K context
- GLM-5: Multilingual · Reasoning, 128K context
- DeepSeek V3.2: Reasoning · MoE, 128K context
- MiniMax M2.1: Multimodal · Agents, 1M context
- Llama 4 405B: General purpose, 256K context
- Mistral Large 3: European · Tooling, 128K context
- Nemotron 3 Super: Enterprise · Reasoning, 128K context
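The context windows listed above can drive model selection. The sketch below copies those figures into a lookup table and lists the models whose advertised window covers a given prompt. It assumes "K" means 1,000 tokens and "M" means 1,000,000; the output reserve is a hypothetical default, and the keys are display names from the catalog, not confirmed API model ids.

```python
# Advertised context windows from the catalog, in thousands of tokens.
CATALOG = {
    "gpt-oss-120B": 128,
    "Kimi-K2.5": 2000,
    "Qwen3-Coder-480B": 256,
    "GLM-5": 128,
    "DeepSeek V3.2": 128,
    "MiniMax M2.1": 1000,
    "Llama 4 405B": 256,
    "Mistral Large 3": 128,
    "Nemotron 3 Super": 128,
}


def models_fitting(prompt_tokens, reserve_for_output=4_000):
    """List models whose window covers the prompt plus an output reserve."""
    needed = prompt_tokens + reserve_for_output
    return sorted(
        name for name, window_k in CATALOG.items() if window_k * 1000 >= needed
    )


if __name__ == "__main__":
    print(models_fitting(300_000))  # ['Kimi-K2.5', 'MiniMax M2.1']
```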
Quickstart
OpenAI-compatible. Drop in by changing three lines: base URL, API key, and model name.
Python · api.llmcloud.ai
from openai import OpenAI

# Point the standard OpenAI client at the LLMCloud endpoint.
client = OpenAI(
    base_url="https://api.llmcloud.ai/v1",
    api_key="lc_live_…",
)

# Send a single chat request and print the model's reply.
resp = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize this report."}],
)
print(resp.choices[0].message.content)
cURL · api.llmcloud.ai
curl https://api.llmcloud.ai/v1/chat/completions \
  -H "Authorization: Bearer $LC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3.2",
    "messages": [{"role":"user","content":"Hello"}]
  }'
Pricing
Pay only for the tokens you use.
Free
$0 to start
Free API key. Pay only for tokens consumed.
- OpenAI-compatible API
- All open models
- Batch + real-time
- Community support
Scale
$500 / month minimum
Higher rate limits and analytics for production.
- 10× rate limits
- Priority queue
- Usage analytics
- 24h email SLA
Enterprise
Custom · annual
Dedicated capacity, VPC peering, fine-tuning included.
- Reserved GPUs
- VPC / BYOC
- Fine-tuning included
- 99.99% SLA · 24/7
Token prices vary by model. Batch inference up to 50% lower than real-time.
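To make the arithmetic concrete, the sketch below estimates a monthly bill from token volume. Per-token prices are not listed on this page, so the $1.00 per million tokens used here is purely hypothetical. The 50% figure is the stated upper bound on batch savings, not a guaranteed rate, and reading the Scale plan's "$500 / month min" as a minimum monthly spend is an assumption.

```python
def estimate_monthly_bill(tokens, price_per_million, batch_share=0.0,
                          batch_discount=0.5, monthly_minimum=0.0):
    """Estimate a monthly bill in dollars.

    tokens            total tokens processed in the month
    price_per_million real-time price per 1M tokens (hypothetical input)
    batch_share       fraction of tokens sent through batch (0.0 to 1.0)
    batch_discount    discount applied to batch tokens (0.5 is the stated maximum)
    monthly_minimum   plan floor, e.g. 500 under the assumed reading of Scale
    """
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    realtime_tokens = tokens * (1.0 - batch_share)
    batch_tokens = tokens * batch_share
    billable = realtime_tokens + batch_tokens * (1.0 - batch_discount)
    usage_cost = billable * price_per_million / 1_000_000
    return max(usage_cost, monthly_minimum)


if __name__ == "__main__":
    # 1B tokens at a hypothetical $1.00 per million, all real-time.
    print(estimate_monthly_bill(1_000_000_000, 1.00))
    # Same volume with half of it moved to batch at the full 50% discount.
    print(estimate_monthly_bill(1_000_000_000, 1.00, batch_share=0.5))
    # Low usage on a plan with a $500 floor.
    print(estimate_monthly_bill(200_000_000, 1.00, monthly_minimum=500))
```

Under these assumptions, moving half of a 1B-token workload to batch lowers the estimate from $1,000 to $750, while a 200M-token month on a plan with a $500 floor is billed at the floor.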