LLM Inference Hosting

Compare AI model serving platforms by price, strengths, API style, and best use case. Updated every 6 hours.

Inference Providers

Eight platforms for AI model serving, structured for quick comparison (the capability table below adds OpenAI and Anthropic for reference).

Together AI

www.together.ai/pricing
Serverless open models · OpenAI-compatible API: yes

Best for: Cost-sensitive open-source model inference

Open models: $0.06–0.20 per 1M tokens (Gemma 4, Llama 4 Scout)
Frontier models: $0.50–4.50 per 1M tokens (Kimi K2, DeepSeek V4)
Open Source

Source: pricing page · Status: ok · Checked: 2026-05-03
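
Since Together exposes an OpenAI-compatible endpoint, the stock OpenAI SDK works with a swapped base URL. A minimal sketch, assuming a TOGETHER_API_KEY environment variable; the base URL and model ID are assumptions to verify against Together's docs:

```python
import os
from openai import OpenAI

# Point the OpenAI SDK at Together's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # assumed endpoint; check Together's docs
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model ID
    messages=[{"role": "user", "content": "Summarize per-token pricing in one sentence."}],
)
print(resp.choices[0].message.content)
```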

Groq

groq.com/pricing
Ultra-fast inference · OpenAI-compatible API: yes

Best for: Very low latency chat and real-time UX

Llama 3.1 8B: $0.05–0.08 per 1M tokens · 840 TPS
Llama 3.3 70B: $0.59–0.79 per 1M tokens · 394 TPS
Free tier: free, with a rate-limited daily quota
Sub-second Latency

Source: pricing page · Status: ok · Checked: 2026-05-03
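
Groq's headline feature is latency, which only shows up when you stream. A sketch that measures time-to-first-token over Groq's OpenAI-compatible endpoint, assuming a GROQ_API_KEY variable; the model ID for the Llama 3.1 8B tier is an assumption:

```python
import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # assumed ID for the Llama 3.1 8B tier
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    stream=True,
)

# Stop at the first content token and report time-to-first-token.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```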

Cohere

cohere.com/pricing
Enterprise API · OpenAI-compatible API: no

Best for: Enterprise RAG, retrieval, embeddings

Command A: $2.50–10.00 per 1M input/output tokens (flagship)
Command R7B: contact sales (efficient small model, enterprise pricing)
Enterprise

Source: pricing page · Status: ok · Checked: 2026-05-03
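
Because there is no OpenAI-compatible endpoint, Cohere is called through its own SDK. A minimal sketch with the v1 chat API; the model alias is an assumption, so match it to the models on your plan:

```python
import cohere

co = cohere.Client(api_key="YOUR_COHERE_KEY")  # v1 client

# Single-turn chat; the v1 API takes a plain `message` string.
resp = co.chat(
    model="command-r",  # assumed alias; Command A / R7B names vary by account
    message="List three evaluation criteria for a RAG pipeline.",
)
print(resp.text)
```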

Replicate

replicate.com/pricing
Model marketplace · OpenAI-compatible API: no

Best for: Trying many open models quickly

Hardware: $0.000025–0.0122/sec, from CPU ($0.09/hr) to 8×H100 ($43.92/hr)
Official models: priced per output, e.g. $0.04/image (FLUX), $0.01/1K tokens (DeepSeek R1)
Marketplace

Source: pricing page · Status: changed · Checked: 2026-05-03
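
Replicate bills by hardware-seconds or per output, and its Python client wraps both behind replicate.run. A sketch assuming REPLICATE_API_TOKEN is exported; the model slug is illustrative:

```python
import replicate

# Runs a marketplace model and blocks until output is ready.
# Billing for this one is per output image rather than per token.
output = replicate.run(
    "black-forest-labs/flux-schnell",  # illustrative model slug
    input={"prompt": "a server rack floating in space"},
)
print(output)
```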

AWS Bedrock

aws.amazon.com/bedrock/pricing/
Cloud platform · OpenAI-compatible API: no

Best for: AWS-integrated enterprise workloads

On-demand: $0.00005–0.02 per 1K tokens (varies by model)
Flex / Batch: 50% off for batch processing or the Flex tier
Priority: +75% premium for the provisioned/Priority low-latency tier
AWS Native

Source: pricing page · Status: changed · Checked: 2026-05-03
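
Bedrock is called through the AWS SDK rather than a vendor-specific client, so IAM credentials and regions apply. A sketch using boto3's Converse API; the model ID is an assumption and must be enabled in your account:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Converse is Bedrock's model-agnostic chat API.
resp = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed; check your model access
    messages=[{"role": "user", "content": [{"text": "Name one Bedrock pricing tier."}]}],
)
print(resp["output"]["message"]["content"][0]["text"])
```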

Google Vertex AI

cloud.google.com/vertex-ai/generative-ai/pricing
Cloud platform · OpenAI-compatible API: no

Best for: Gemini/GCP-integrated workloads

Gemini 2.0 Flash: $0.10–0.40 per 1M input/output tokens
Gemini 2.5 Pro: $1.25–10.00 per 1M tokens (context-dependent)
GCP Native

Source: pricing page · Status: changed · Checked: 2026-05-03
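
Vertex AI authenticates through GCP project credentials rather than a bare API key. A minimal sketch with the vertexai SDK, assuming an existing project with the Vertex AI API enabled; the project ID and region are placeholders:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Uses Application Default Credentials (set up via `gcloud auth`).
vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = GenerativeModel("gemini-2.0-flash")
resp = model.generate_content("Explain context caching in one sentence.")
print(resp.text)
```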

Hugging Face

huggingface.co/pricing
Open model platform · OpenAI-compatible API: no

Best for: Custom models, endpoints, open ML ecosystem

Serverless API: free tier plus paid plans (PRO $9/month; Team $20/user/month)
Dedicated endpoints: from $0.75/hour for custom GPU deployments (Neuron x1 to A100)
Community

Source: pricing page · Status: changed · Checked: 2026-05-03
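
The serverless tier is reachable through huggingface_hub's InferenceClient, which also fronts dedicated endpoints. A sketch assuming a valid HF access token; the model ID is illustrative:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your HF access token

# Serverless text generation against a hosted open model.
out = client.text_generation(
    "The main tradeoff between serverless and dedicated inference is",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # illustrative model ID
    max_new_tokens=40,
)
print(out)
```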

Mistral AI

mistral.ai/pricing
Model API · OpenAI-compatible API: yes

Best for: European AI stack and efficient model APIs

Mistral Large 3: $0.50–1.50 per 1M input/output tokens
Mistral Small 3.2: $0.10–0.30 per 1M input/output tokens
Codestral 2508: $0.30–0.90 per 1M tokens (code specialist)
European

Source: pricing page · Status: ok · Checked: 2026-05-03
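
Mistral ships its own v1 SDK and also answers OpenAI-style requests, so either client works. A sketch with the native mistralai package; the model alias is an assumption:

```python
from mistralai import Mistral

client = Mistral(api_key="YOUR_MISTRAL_KEY")

resp = client.chat.complete(
    model="mistral-small-latest",  # assumed alias; see the pricing page for current names
    messages=[{"role": "user", "content": "One sentence on why small models are cheap."}],
)
print(resp.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, pointing the OpenAI SDK at https://api.mistral.ai/v1 should work as well.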

Capability Comparison

Provider | Best Value | Most Capable | Use Case
OpenAI | GPT-3.5 Turbo ($0.0005/1K) | GPT-4 / GPT-4o family | General purpose, production apps
Anthropic | Claude Haiku tier | Claude Opus/Sonnet family | Long-context analysis, research
Together AI | Mistral 7B / small open models | Llama large models | Cost-sensitive, open-source
Groq | Llama 3 8B | Larger supported open models | Real-time, sub-second latency
Cohere | Command R | Command R+ | Enterprise RAG and retrieval
Replicate | Open-source/free examples | Proprietary fine-tunes | Experimentation, prototyping
AWS Bedrock | Budget model tiers | Claude and other frontier models via API | AWS ecosystem integration
Google Vertex AI | Gemini/PaLM smaller tiers | Gemini family | GCP-integrated workloads
Hugging Face | Free/serverless inference options | Custom fine-tunes/endpoints | Custom models, community
Mistral AI | Mistral Small | Mistral Large | European/provider-diverse deployments

Guides

Practical walkthroughs for running and benchmarking LLM infrastructure.

Training

Running an 8xH100/H200 GPU cluster

How to provision and use a multi-GPU cluster for LLM fine-tuning -- from NCCL setup to distributed training strategies.

Coming soon →
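
Until the full guide lands, the first sanity check on any new cluster is confirming NCCL can all-reduce across every GPU. A minimal sketch, assuming PyTorch and a launch via torchrun:

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=8 nccl_check.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Sum a one-element tensor across all ranks; result should equal world size.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
if rank == 0:
    print(f"all_reduce result: {x.item()} (expected {dist.get_world_size()})")
dist.destroy_process_group()
```
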
Architecture

Serverless vs. dedicated endpoints

When to pay per-token on a shared API vs. reserving dedicated hardware. Covers cold-start latency, throughput guarantees, and cost crossover points.

Coming soon →
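
The crossover logic previews well in a few lines: dedicated hardware wins once utilization pushes its effective per-token cost below the shared-API rate. All numbers below are illustrative placeholders, not quotes from any provider above:

```python
# Back-of-envelope crossover between per-token and dedicated pricing.
serverless_per_m = 0.79    # $ per 1M tokens on a shared API (placeholder)
dedicated_per_hr = 2.00    # $ per hour for a reserved endpoint (placeholder)
peak_tps = 1500            # sustained tokens/sec the dedicated box can serve

peak_m_per_hr = peak_tps * 3600 / 1e6                  # 5.4M tokens/hour at full load
breakeven = dedicated_per_hr / (serverless_per_m * peak_m_per_hr)
print(f"dedicated is cheaper above {breakeven:.0%} utilization")  # ~47% here
```
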
Benchmarking

Benchmarking inference: TTFT, throughput and context

How to measure time-to-first-token, tokens-per-second, and latency under load. Includes a reference test script and interpretation guide.

Coming soon →
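
In the same spirit as the forthcoming guide, TTFT and throughput both fall out of a single streamed request. A sketch against any OpenAI-compatible endpoint; the base URL, API key variable, and model ID are placeholders:

```python
import os
import time
from openai import OpenAI

# Placeholders: point these at the endpoint you want to test.
client = OpenAI(api_key=os.environ["API_KEY"], base_url="https://example.com/v1")

start = time.perf_counter()
first_token = None
n_chunks = 0

stream = client.chat.completions.create(
    model="your-model-id",  # placeholder
    messages=[{"role": "user", "content": "Write about 200 words on GPU memory."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token is None:
            first_token = time.perf_counter()
        n_chunks += 1  # stream chunks are a rough proxy for tokens

if first_token is not None:
    gen_time = time.perf_counter() - first_token
    print(f"TTFT: {(first_token - start) * 1000:.0f} ms")
    if gen_time > 0:
        print(f"throughput: ~{n_chunks / gen_time:.0f} chunks/s after first token")
```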