How is LLM API pricing calculated?

Providers charge per million tokens — separately for input (your prompt) and output (the model’s response). A token is roughly 3–4 characters of English, so 1k tokens ≈ 750 words.

Which LLM is cheapest right now?

On pure $/token, Llama 3.1 8B, DeepSeek Coder V2 and Gemini 1.5 Flash are the cheapest hosted models in the comparison table, all priced at or below $0.30 per million output tokens.

Are output tokens really more expensive than input?

Yes — typically 3 to 5 times more. The fastest way to lower a bill is to cap max_tokens and ask for structured JSON rather than prose.

What is prompt caching?

Most providers cache repeated prompt prefixes (system messages, retrieved context) and replay them at roughly 10% of the normal input rate. Anyone doing RAG should turn it on.

Built for builders

LLM API Pricing Calculator

Compare token costs across every major LLM provider — OpenAI, Anthropic, Google, Mistral, Meta and more. Estimate your monthly spend in seconds.

25+ models
Live calculator
Updated quarterly
No sign-up

Estimate your costs

Pick a model, dial in your traffic, and see the monthly bill update live.

Configure usage

Model

Selected: GPT-4o mini · Context 128k tokens · Fast

Input tokens per request: 800

Average prompt + system + retrieved context per call.

Output tokens per request: 400

Average response length you ask the model to produce.

Requests per day: 500

≈ 15,000 requests/month

Estimated monthly cost

$5.40

Based on 15,000 requests · 18.00M tokens

Input 33% · Output 67%

Input tokens

12,000,000 tok @ $0.15/1M

$1.80

Output tokens

6,000,000 tok @ $0.60/1M

$3.60

Monthly total

$5.40

Annual projection

$64.80
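The breakdown above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a 30-day month (which is what 500 requests/day ≈ 15,000 requests/month implies); the function name `monthly_cost` is made up for illustration and is not part of the calculator.

```python
def monthly_cost(input_tokens, output_tokens, requests_per_day,
                 input_rate, output_rate, days=30):
    """Return (input_cost, output_cost, total) in dollars for one month.

    Rates are dollars per 1M tokens; `days=30` is an assumption.
    """
    requests = requests_per_day * days
    input_cost = requests * input_tokens * input_rate / 1_000_000
    output_cost = requests * output_tokens * output_rate / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# GPT-4o mini at $0.15 input / $0.60 output per 1M tokens
input_cost, output_cost, total = monthly_cost(800, 400, 500, 0.15, 0.60)
print(f"input ${input_cost:.2f} · output ${output_cost:.2f} · "
      f"monthly ${total:.2f} · annual ${total * 12:.2f}")
```

Swapping in another model's rates from the comparison table gives the same breakdown for that model.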

Compare

All models side-by-side

Click any column to sort. Filter by provider, tier, or keyword. Prices are quoted per 1M tokens.

Model	Provider	Input ($/1M)	Output ($/1M)	Context	Tier	Notes
Llama 3.1 8B (self-hosted) Estimate	Self-hosted	$0.05	$0.05	128k	Open-source	Single GPU (L4 / A10) is enough at moderate throughput.
Llama 3.1 8B Estimate	Meta	$0.06	$0.06	128k	Fast	Tiny but capable; serves at >500 tok/s on Groq.
Gemini 1.5 Flash	Google	$0.07	$0.30	1,000k	Fast	Cheapest 1M-context model; ideal for high-throughput pipelines.
Gemini 2.0 Flash	Google	$0.10	$0.40	1,000k	Fast	1M-token context for next to nothing; the best $/token deal right now.
DeepSeek Coder V2	DeepSeek	$0.14	$0.28	128k	Balanced	Built for code; competitive with GPT-4o on HumanEval at 1/15 the cost.
GPT-4o mini	OpenAI	$0.15	$0.60	128k	Fast	Cheapest production model from OpenAI; great for high-volume tasks.
Command R	Cohere	$0.15	$0.60	128k	Balanced	Optimised for retrieval and tool use at GPT-4o mini prices.
Mistral Small 3	Mistral	$0.20	$0.60	32k	Fast	Excellent latency / cost ratio; production-grade fast tier.
Claude 3 Haiku	Anthropic	$0.25	$1.25	200k	Fast	Cheapest Anthropic option; good for classification + extraction.
DeepSeek V3	DeepSeek	$0.27	$1.10	64k	Frontier	Frontier-class quality at fast-tier prices; open weights.
Qwen 2 72B (self-hosted) Estimate	Self-hosted	$0.40	$0.40	128k	Open-source	Self-hosted estimate on 2× A100 80GB; ignores utilisation overhead.
GPT-3.5 Turbo	OpenAI	$0.50	$1.50	16k	Fast	Older but very cheap; mostly superseded by GPT-4o mini.
Llama 3.1 70B Estimate	Meta	$0.59	$0.79	128k	Balanced	Strong open-weights alternative to Claude 3.5 Haiku / GPT-4o mini.
Claude 3.5 Haiku	Anthropic	$0.80	$4.00	200k	Fast	Fast + cheap with a huge context window; great for RAG.
Mixtral 8x22B Estimate	Mistral	$1.20	$1.20	64k	Balanced	Open-weights MoE; available on Together / Fireworks.
Gemini 1.5 Pro	Google	$1.25	$5.00	2,000k	Frontier	2M-token context — unbeatable for whole-codebase or long-doc workflows.
Mistral Large 2	Mistral	$2.00	$6.00	128k	Frontier	EU-hosted frontier; multilingual + strong on code.
Grok 2	xAI	$2.00	$10.00	128k	Frontier	Real-time data via X integration; competitive on reasoning benchmarks.
GPT-4o	OpenAI	$2.50	$10.00	128k	Frontier	Best all-rounder; strong at coding, vision, voice.
Command R+	Cohere	$2.50	$10.00	128k	Frontier	Built for RAG; native citation support, strong multilingual.
Llama 3.1 405B Estimate	Meta	$2.70	$2.70	128k	Frontier	Open weights; pricing via providers like Together / Groq / Fireworks.
OpenAI o1-mini	OpenAI	$3.00	$12.00	128k	Reasoning	Reasoning at ~1/5 the cost of o1; weaker on world knowledge.
Claude 3.5 Sonnet	Anthropic	$3.00	$15.00	200k	Frontier	Top-tier for nuanced writing, agentic tasks, and tool use.
GPT-4 Turbo	OpenAI	$10.00	$30.00	128k	Frontier	Legacy frontier; prefer GPT-4o unless you need its exact behaviour.
OpenAI o1	OpenAI	$15.00	$60.00	200k	Reasoning	Long chain-of-thought reasoning; expensive but unbeatable on hard math/code.
Claude 3 Opus	Anthropic	$15.00	$75.00	200k	Frontier	Pricey legacy flagship; pick Sonnet 3.5 unless reproducibility matters.

Prices reflect each provider's public list price. Open-weights models marked "Estimate" have no single list price; the figure varies by host (Together, Groq, Fireworks, etc.). Last refreshed June 2026.
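To compare models for a specific workload rather than by headline rate, the table's prices can be plugged into a small ranking script. The sketch below copies a handful of rows from the table; the names `PRICES`, `cost_per_request` and `rank_models` are illustrative only, and the workload (800 input tokens, 400 output tokens, 15,000 requests/month) is the calculator's default.

```python
# (input $/1M, output $/1M), copied from a few rows of the table above
PRICES = {
    "Gemini 1.5 Flash": (0.07, 0.30),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3 Haiku": (0.25, 1.25),
    "DeepSeek V3": (0.27, 1.10),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def cost_per_request(model, input_tokens, output_tokens):
    """Dollar cost of a single call at list price."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def rank_models(input_tokens, output_tokens, requests_per_month):
    """Return (model, monthly_cost) pairs sorted from cheapest to priciest."""
    rows = [
        (model, cost_per_request(model, input_tokens, output_tokens) * requests_per_month)
        for model in PRICES
    ]
    return sorted(rows, key=lambda row: row[1])


ranking = rank_models(800, 400, 15_000)
for model, cost in ranking:
    print(f"{model:<18} ${cost:8.2f}/month")
```

The ordering depends on the input/output mix: an output-heavy workload penalises models with a high output rate more than the input column alone suggests.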

How to reduce your LLM API costs

Six levers that consistently bring monthly LLM bills down by 30–70% in production.

Pick the right tier for the job
Use a cheap fast model (GPT-4o mini, Gemini Flash, Claude Haiku) for routing, classification, and most production tasks. Reserve frontier models for the hard 5–10% of calls.
Cache prompt prefixes
Anthropic, OpenAI, and Google all expose prompt caching. A long system prompt or RAG context replayed across requests can be served at ~10% of normal input price.
Cap the output, not the input
Output tokens are 3–5× more expensive than input. Set max_tokens, use stop sequences, and ask the model for structured JSON instead of prose whenever possible.
Route by complexity
Send simple queries to a fast model and only escalate to a frontier model when a confidence check or eval fails. A two-tier router can cut cost by 40–70% in production.
Compress context before sending
Summarise long histories, dedupe retrieved chunks, and strip boilerplate from system prompts. Most teams over-fill the context window by 2–3×.
Batch & stream where you can
OpenAI and Anthropic Batch APIs run within 24h at ~50% off. Streaming doesn’t change the bill but lets you cancel mid-flight when the user navigates away.
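The "route by complexity" lever can be sketched as a fast-first router plus the arithmetic for its blended cost. Everything here is a hypothetical illustration: `call_fast`, `call_frontier` and `is_good_enough` stand in for your own model clients and quality check, the escalation rates are made up, and the cost model assumes every escalated request pays for both the fast call and the frontier call.

```python
def per_request(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one call; rates are dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def route(prompt, call_fast, call_frontier, is_good_enough):
    """Try the fast model first; escalate only when the check fails."""
    answer = call_fast(prompt)
    if is_good_enough(answer):
        return answer, "fast"
    return call_frontier(prompt), "frontier"


def routed_cost(fast_cost, frontier_cost, escalation_rate):
    """Average cost per request: every call pays for the fast model,
    and the escalated fraction also pays for the frontier model."""
    return fast_cost + escalation_rate * frontier_cost


fast = per_request(800, 400, 0.15, 0.60)       # GPT-4o mini list price
frontier = per_request(800, 400, 2.50, 10.00)  # GPT-4o list price

for rate in (0.05, 0.10, 0.30):  # hypothetical escalation rates
    blended = routed_cost(fast, frontier, rate)
    saving = 1 - blended / frontier
    print(f"escalate {rate:.0%}: ${blended * 15_000:.2f}/month, "
          f"{saving:.0%} below all-frontier")
```

The saving is measured against sending every request to the frontier model, so it shrinks quickly as the escalation rate rises; measuring that rate on real traffic is the step that decides whether a router pays off.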

Understanding LLM API pricing

LLM API pricing is almost always quoted per million tokens, separately for the prompt you send in (“input”) and the text the model writes back (“output”). A token is roughly 3–4 characters of English, so 1,000 tokens ≈ 750 words. The total bill for any given call is simply: input_tokens × input_rate + output_tokens × output_rate.
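As a rough illustration of that formula, the sketch below estimates a prompt's token count from its word count and prices a single call. It leans on the 1,000 tokens ≈ 750 words rule of thumb, so treat the result as an approximation: real tokenizers differ by model and language. The names `estimate_tokens` and `call_cost` are invented for this example.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 1,000 tokens ≈ 750 English words


def estimate_tokens(text):
    """Approximate token count from whitespace-separated words."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def call_cost(input_tokens, output_tokens, input_rate, output_rate):
    """input_tokens × input_rate + output_tokens × output_rate,
    with both rates quoted in dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


prompt = "word " * 750                    # stand-in for a 750-word prompt
tokens = estimate_tokens(prompt)          # about 1,000 tokens
cost = call_cost(tokens, 400, 0.15, 0.60) # GPT-4o mini rates, 400-token reply
print(f"{tokens} input tokens, ${cost:.5f} per call")
```

Note the division by 1,000,000: because rates are quoted per million tokens, forgetting it is the most common way to end up with a figure that is off by six orders of magnitude.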

Three modifiers can change that base number meaningfully. Prompt caching lets you replay long system prompts or retrieved context at ≈10% of the normal input rate, which is enormous for RAG. Batch APIs (OpenAI, Anthropic) trade synchronous latency for a 50% discount on jobs that can wait up to 24 hours. And fine-tuning generally costs 1.5–3× the base inference rate, so the math only works if the fine-tuned model lets you shorten prompts or improves quality enough to matter at scale.
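The caching and batch modifiers can be folded into the base formula as follows. This is a simplified sketch under stated assumptions: cached input tokens are billed at 10% of the input rate, the batch discount applies to the whole bill, and any cache-write surcharge, minimum prefix length, or rule about whether the two discounts stack is ignored because those details are provider-specific. The function name and parameters are hypothetical.

```python
def effective_cost(input_tokens, output_tokens, input_rate, output_rate,
                   cached_fraction=0.0, cache_multiplier=0.10,
                   batch_discount=0.0):
    """Dollar cost of one call after caching and batch discounts.

    cached_fraction  -- share of input tokens served from the prompt cache
    cache_multiplier -- cached tokens cost this fraction of the input rate
                        (assumed 10%, per the text above)
    batch_discount   -- e.g. 0.5 for a 50% batch discount (assumed to apply
                        to both input and output)
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * cache_multiplier) * input_rate
    output_cost = output_tokens * output_rate
    return (input_cost + output_cost) * (1 - batch_discount) / 1_000_000


# RAG-style call at Claude 3.5 Sonnet list rates: 4,000 input tokens,
# 400 output tokens, with a hypothetical 75% of the input cached.
base = effective_cost(4000, 400, 3.00, 15.00)
cached = effective_cost(4000, 400, 3.00, 15.00, cached_fraction=0.75)
batched = effective_cost(4000, 400, 3.00, 15.00, batch_discount=0.5)
print(f"base ${base:.4f} · cached ${cached:.4f} · batched ${batched:.4f}")
```

Caching only touches the input side, so its benefit grows with the share of the bill that comes from long, repeated prefixes; output-heavy workloads see much less of it.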

Most teams underestimate output cost. Output tokens are typically 3–5× more expensive than input tokens, and chatty models pad responses unless you cap them. The single highest-leverage change is usually setting max_tokens aggressively and asking for structured JSON instead of prose.

Frequently asked questions

Ready to scale with confident pricing?

Plug your real traffic into the calculator above, then dig into our directory of 1,000+ AI tools and frameworks to ship your next product.

