LLM Inference Hosting

Compare AI model serving platforms by price, strengths, API style, and best use case. Updated every 6 hours.

Inference Providers

Eight platforms for AI model serving, structured for quick comparison (the capability table below adds OpenAI and Anthropic for reference).

Together AI

www.together.ai/pricing
Serverless open models · OpenAI-compatible API: yes

Best for: Cost-sensitive open-source model inference

Open models: $0.06–0.20 per 1M tokens (Gemma 4, Llama 4 Scout)
Frontier models: $0.50–4.50 per 1M tokens (Kimi K2, DeepSeek V4)
Open Source

Source: pricing page · Status: ok · Checked: 2026-05-03
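
Since Together exposes an OpenAI-compatible endpoint, the stock OpenAI SDK works with a swapped base URL. A minimal sketch, assuming a TOGETHER_API_KEY environment variable; the base URL and model ID are assumptions to verify against Together's docs:

```python
import os
from openai import OpenAI

# Point the OpenAI SDK at Together's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # assumed endpoint; check Together's docs
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model ID
    messages=[{"role": "user", "content": "Summarize per-token pricing in one sentence."}],
)
print(resp.choices[0].message.content)
```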

Groq

groq.com/pricing
Ultra-fast inference · OpenAI-compatible API: yes

Best for: Very low latency chat and real-time UX

Llama 3.1 8B: $0.05–0.08 per 1M tokens · 840 TPS
Llama 3.3 70B: $0.59–0.79 per 1M tokens · 394 TPS
Free tier: free, with a rate-limited daily quota
Sub-second Latency

Source: pricing page · Status: ok · Checked: 2026-05-03
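
Groq's headline feature is latency, which only shows up when you stream. A sketch that measures time-to-first-token over Groq's OpenAI-compatible endpoint, assuming a GROQ_API_KEY variable; the model ID for the Llama 3.1 8B tier is an assumption:

```python
import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # assumed ID for the Llama 3.1 8B tier
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    stream=True,
)

# Stop at the first content token and report time-to-first-token.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```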

Cohere

cohere.com/pricing
Enterprise API · OpenAI-compatible API: no

Best for: Enterprise RAG, retrieval, embeddings

Command A: $2.50–10.00 per 1M input/output tokens (flagship)
Command R7B: contact sales (efficient small model, enterprise pricing)
Enterprise

Source: pricing page · Status: ok · Checked: 2026-05-03
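
Because there is no OpenAI-compatible endpoint, Cohere is called through its own SDK. A minimal sketch with the v1 chat API; the model alias is an assumption, so match it to the models on your plan:

```python
import cohere

co = cohere.Client(api_key="YOUR_COHERE_KEY")  # v1 client

# Single-turn chat; the v1 API takes a plain `message` string.
resp = co.chat(
    model="command-r",  # assumed alias; Command A / R7B names vary by account
    message="List three evaluation criteria for a RAG pipeline.",
)
print(resp.text)
```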

Replicate

replicate.com/pricing
Model marketplace · OpenAI-compatible API: no

Best for: Trying many open models quickly

Hardware: $0.000025–0.0122/sec, from CPU ($0.09/hr) to 8×H100 ($43.92/hr)
Official models: priced per output, e.g. $0.04/image (FLUX), $0.01/1K tokens (DeepSeek R1)
Marketplace

Source: pricing page · Status: changed · Checked: 2026-05-03
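
Replicate bills by hardware-seconds or per output, and its Python client wraps both behind replicate.run. A sketch assuming REPLICATE_API_TOKEN is exported; the model slug is illustrative:

```python
import replicate

# Runs a marketplace model and blocks until output is ready.
# Billing for this one is per output image rather than per token.
output = replicate.run(
    "black-forest-labs/flux-schnell",  # illustrative model slug
    input={"prompt": "a server rack floating in space"},
)
print(output)
```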

AWS Bedrock

aws.amazon.com/bedrock/pricing/
Cloud platform · OpenAI-compatible API: no

Best for: AWS-integrated enterprise workloads

On-demand: $0.00005–0.02 per 1K tokens (varies by model)
Flex / Batch: 50% off for batch processing or the Flex tier
Priority: +75% premium for the provisioned/Priority low-latency tier
AWS Native

Source: pricing page · Status: changed · Checked: 2026-05-03
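
Bedrock is called through the AWS SDK rather than a vendor-specific client, so IAM credentials and regions apply. A sketch using boto3's Converse API; the model ID is an assumption and must be enabled in your account:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Converse is Bedrock's model-agnostic chat API.
resp = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed; check your model access
    messages=[{"role": "user", "content": [{"text": "Name one Bedrock pricing tier."}]}],
)
print(resp["output"]["message"]["content"][0]["text"])
```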

Google Vertex AI

cloud.google.com/vertex-ai/generative-ai/pricing
Cloud platform · OpenAI-compatible API: no

Best for: Gemini/GCP-integrated workloads

Gemini 2.0 Flash: $0.10–0.40 per 1M input/output tokens
Gemini 2.5 Pro: $1.25–10.00 per 1M tokens (context-dependent)
GCP Native

Source: pricing page · Status: changed · Checked: 2026-05-03
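
Vertex AI authenticates through GCP project credentials rather than a bare API key. A minimal sketch with the vertexai SDK, assuming an existing project with the Vertex AI API enabled; the project ID and region are placeholders:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Uses Application Default Credentials (set up via `gcloud auth`).
vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = GenerativeModel("gemini-2.0-flash")
resp = model.generate_content("Explain context caching in one sentence.")
print(resp.text)
```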

Hugging Face

huggingface.co/pricing
Open model platform · OpenAI-compatible API: no

Best for: Custom models, endpoints, open ML ecosystem

Serverless API: free tier plus paid plans (PRO $9/month; Team $20/user/month)
Dedicated endpoints: from $0.75/hour for custom GPU deployments (Neuron x1 to A100)
Community

Source: pricing page · Status: changed · Checked: 2026-05-03
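
The serverless tier is reachable through huggingface_hub's InferenceClient, which also fronts dedicated endpoints. A sketch assuming a valid HF access token; the model ID is illustrative:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your HF access token

# Serverless text generation against a hosted open model.
out = client.text_generation(
    "The main tradeoff between serverless and dedicated inference is",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
    max_new_tokens=40,
)
print(out)
```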

Mistral AI

mistral.ai/pricing
Model API · OpenAI-compatible API: yes

Best for: European AI stack and efficient model APIs

Mistral Large 3: $0.50–1.50 per 1M input/output tokens
Mistral Small 3.2: $0.10–0.30 per 1M input/output tokens
Codestral 2508: $0.30–0.90 per 1M tokens (code specialist)
European

Source: pricing page · Status: ok · Checked: 2026-05-03
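
Mistral ships its own v1 SDK and also answers OpenAI-style requests, so either client works. A sketch with the native mistralai package; the model alias is an assumption:

```python
from mistralai import Mistral

client = Mistral(api_key="YOUR_MISTRAL_KEY")

resp = client.chat.complete(
    model="mistral-small-latest",  # assumed alias; see the pricing page for current names
    messages=[{"role": "user", "content": "One sentence on why small models are cheap."}],
)
print(resp.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, pointing the OpenAI SDK at https://api.mistral.ai/v1 should work as well.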

Capability Comparison

Provider | Best Value | Most Capable | Use Case
OpenAI | GPT-3.5 Turbo ($0.0005/1K) | GPT-4 / GPT-4o family | General purpose, production apps
Anthropic | Claude Haiku tier | Claude Opus/Sonnet family | Long-context analysis, research
Together AI | Mistral 7B / small open models | Llama large models | Cost-sensitive, open-source
Groq | Llama 3 8B | Larger supported open models | Real-time, sub-second latency
Cohere | Command R | Command R+ | Enterprise RAG and retrieval
Replicate | Open-source/free examples | Proprietary fine-tunes | Experimentation, prototyping
AWS Bedrock | Budget model tiers | Claude and other frontier models via API | AWS ecosystem integration
Google Vertex AI | Gemini/PaLM smaller tiers | Gemini family | GCP-integrated workloads
Hugging Face | Free/serverless inference options | Custom fine-tunes/endpoints | Custom models, community
Mistral AI | Mistral Small | Mistral Large | European/provider-diverse deployments

Guides

Practical walkthroughs for running and benchmarking LLM infrastructure.

Training

Running an 8xH100/H200 GPU cluster

How to provision and use a multi-GPU cluster for LLM fine-tuning -- from NCCL setup to distributed training strategies.

Coming soon →
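
Until the full guide lands, the first sanity check on any new cluster is confirming NCCL can all-reduce across every GPU. A minimal sketch, assuming PyTorch and a launch via torchrun:

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=8 nccl_check.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Sum a one-element tensor across all ranks; result should equal world size.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
if rank == 0:
    print(f"all_reduce result: {x.item()} (expected {dist.get_world_size()})")
dist.destroy_process_group()
```
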
Architecture

Serverless vs. dedicated endpoints

When to pay per-token on a shared API vs. reserving dedicated hardware. Covers cold-start latency, throughput guarantees, and cost crossover points.

Coming soon →
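
The crossover logic previews well in a few lines: dedicated hardware wins once utilization pushes its effective per-token cost below the shared-API rate. All numbers below are illustrative placeholders, not quotes from any provider above:

```python
# Back-of-envelope crossover between per-token and dedicated pricing.
serverless_per_m = 0.79    # $ per 1M tokens on a shared API (placeholder)
dedicated_per_hr = 2.00    # $ per hour for a reserved endpoint (placeholder)
peak_tps = 1500            # sustained tokens/sec the dedicated box can serve

peak_m_per_hr = peak_tps * 3600 / 1e6                  # 5.4M tokens/hour at full load
breakeven = dedicated_per_hr / (serverless_per_m * peak_m_per_hr)
print(f"dedicated is cheaper above {breakeven:.0%} utilization")  # ~47% here
```
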
Benchmarking

Benchmarking inference: TTFT, throughput and context

How to measure time-to-first-token, tokens-per-second, and latency under load. Includes a reference test script and interpretation guide.

Coming soon →
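
In the same spirit as the forthcoming guide, TTFT and throughput both fall out of a single streamed request. A sketch against any OpenAI-compatible endpoint; the base URL, API key variable, and model ID are placeholders:

```python
import os
import time
from openai import OpenAI

# Placeholders: point these at the endpoint you want to test.
client = OpenAI(api_key=os.environ["API_KEY"], base_url="https://example.com/v1")

start = time.perf_counter()
first_token = None
n_chunks = 0

stream = client.chat.completions.create(
    model="your-model-id",  # placeholder
    messages=[{"role": "user", "content": "Write about 200 words on GPU memory."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token is None:
            first_token = time.perf_counter()
        n_chunks += 1  # stream chunks are a rough proxy for tokens

if first_token is not None:
    gen_time = time.perf_counter() - first_token
    print(f"TTFT: {(first_token - start) * 1000:.0f} ms")
    if gen_time > 0:
        print(f"throughput: ~{n_chunks / gen_time:.0f} chunks/s after first token")
```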