LLMCloud · Open-model inference

Every open model. One API. Built for production.

Serving the leading open-source LLMs: gpt-oss, Kimi, Qwen, GLM, DeepSeek, Llama. Pay-as-you-go, batch, fine-tuning or dedicated.

  • 40+ open models
  • 2M max context
  • 99.99% enterprise SLA
  • Up to 50% batch savings

Services

Four ways to run open models in production.

Real-time inference

OpenAI-compatible API. Sub-second latency, 99.9% uptime, pay per token.

Batch inference

Async processing at up to 50% lower cost — evals, embeddings, pipelines.

Fine-tuning

LoRA, SFT, DPO and RL on your data. Bring a dataset, get a model.

Dedicated deployments

Reserved GPUs, private networking, 99.99% SLA.

Models

A catalog that ships with the frontier.

New open releases evaluated, optimized and added within days. Same API.

  • gpt-oss-120B: General · Reasoning, 128K context
  • Kimi-K2.5: Long context · Agents, 2M context
  • Qwen3-Coder-480B: Code · MoE, 256K context
  • GLM-5: Multilingual · Reasoning, 128K context
  • DeepSeek V3.2: Reasoning · MoE, 128K context
  • MiniMax M2.1: Multimodal · Agents, 1M context
  • Llama 4 405B: General purpose, 256K context
  • Mistral Large 3: European · Tooling, 128K context
  • Nemotron 3 Super: Enterprise · Reasoning, 128K context

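When a prompt is long, the context column decides which models are usable at all. The sketch below is illustrative only: it copies the context windows from the catalog, assumes "K" means 1,000 tokens and "M" means 1,000,000 (the exact limits may differ), and uses the display names rather than API model IDs. `CATALOG` and `models_fitting` are hypothetical names, not part of the LLMCloud API.

```python
# Context windows copied from the catalog listing.
# Assumption: "K" = 1,000 tokens and "M" = 1,000,000 tokens.
# Keys are display names, not confirmed API model IDs.
CATALOG = {
    "gpt-oss-120B": 128_000,
    "Kimi-K2.5": 2_000_000,
    "Qwen3-Coder-480B": 256_000,
    "GLM-5": 128_000,
    "DeepSeek V3.2": 128_000,
    "MiniMax M2.1": 1_000_000,
    "Llama 4 405B": 256_000,
    "Mistral Large 3": 128_000,
    "Nemotron 3 Super": 128_000,
}


def models_fitting(prompt_tokens: int, reply_tokens: int = 0) -> list[str]:
    """Return catalog names whose context window holds prompt + reply."""
    needed = prompt_tokens + reply_tokens
    return sorted(name for name, window in CATALOG.items() if window >= needed)


if __name__ == "__main__":
    # A 300K-token prompt plus room for a 4K-token reply.
    print(models_fitting(300_000, reply_tokens=4_000))
```

Leaving headroom for the reply matters because the prompt and the generated tokens typically share the same window.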
Quickstart

OpenAI-compatible. A drop-in replacement: change the base URL, API key and model name.

Python · api.llmcloud.ai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llmcloud.ai/v1",
    api_key="lc_live_…",
)

resp = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize this report."}],
)
print(resp.choices[0].message.content)
cURL · api.llmcloud.ai
curl https://api.llmcloud.ai/v1/chat/completions \
  -H "Authorization: Bearer $LC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3.2",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

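For environments without the `openai` package, the same request as the cURL example can be sketched with only the Python standard library. This is a minimal sketch, not an official client: `build_chat_request` is a hypothetical helper, and the response parsing assumes the endpoint returns the usual OpenAI chat-completions shape.

```python
import json
import os
import urllib.request

# Endpoint taken from the cURL example.
API_URL = "https://api.llmcloud.ai/v1/chat/completions"


def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("LC_API_KEY")
    if key:
        req = build_chat_request("deepseek-v3.2", "Hello", key)
        with urllib.request.urlopen(req, timeout=30) as resp:
            reply = json.load(resp)
        # Assumption: the response follows the OpenAI chat-completions shape.
        print(reply["choices"][0]["message"]["content"])
    else:
        print("Set LC_API_KEY to send the request.")
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any tokens are billed.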
Pricing

Pay only for the tokens you use.

Free
$0 to start

Free API key. Pay only for tokens consumed.

  • OpenAI-compatible API
  • All open models
  • Batch + real-time
  • Community support
Get an API key
Scale
$500 / month minimum

Higher rate limits and analytics for production.

  • 10× rate limits
  • Priority queue
  • Usage analytics
  • 24h email SLA
Start scaling
Enterprise
Custom · annual

Dedicated capacity, VPC peering, fine-tuning included.

  • Reserved GPUs
  • VPC / BYOC
  • Fine-tuning included
  • 99.99% SLA · 24/7
Talk to sales

Token prices vary by model. Batch inference up to 50% lower than real-time.
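
To see what "up to 50% lower" means for a bill, the arithmetic can be sketched as below. The per-token price here is made up purely for illustration, since actual prices vary by model, and 50% is the best case rather than a guaranteed discount. `token_cost` and `batch_cost` are hypothetical helper names.

```python
def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a token count at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_million


def batch_cost(tokens: int, price_per_million: float, discount: float = 0.50) -> float:
    """Batch cost, applying a discount between 0.0 and 0.50 (best case)."""
    return token_cost(tokens, price_per_million) * (1.0 - discount)


if __name__ == "__main__":
    HYPOTHETICAL_PRICE = 2.00  # USD per 1M tokens; invented, not a real price
    tokens = 250_000_000       # e.g. a large nightly evaluation run

    realtime = token_cost(tokens, HYPOTHETICAL_PRICE)
    batch = batch_cost(tokens, HYPOTHETICAL_PRICE)
    print(f"real-time: ${realtime:,.2f}  batch (best case): ${batch:,.2f}")
```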