LLM API Pricing Calculator
Compare token costs across every major LLM provider — OpenAI, Anthropic, Google, Mistral, Meta and more. Estimate your monthly spend in seconds.
- 25+ models
- Live calculator
- Updated quarterly
- No sign-up
Estimate your costs
Pick a model, dial in your traffic, and see the monthly bill update live.
Configure usage
Selected: GPT-4o mini · Context 128k tokens · Fast
Average prompt + system + retrieved context per call.
Average response length you ask the model to produce.
≈ 15,000 requests/month
Estimated monthly cost
$5.40
Based on 15,000 requests · 18.00M tokens
Input tokens
12,000,000 tok @ $0.15/1M
Output tokens
6,000,000 tok @ $0.6/1M
Monthly total
Annual projection
Compare
All models side-by-side
Click any column to sort. Filter by provider, tier, or keyword. Prices are quoted per 1M tokens.
| Notes | ||||||
|---|---|---|---|---|---|---|
Llama 3.1 8B (self-hosted) Estimate | Self-hosted | $0.05 | $0.05 | 128k | Open-source | Single GPU (L4 / A10) is enough at moderate throughput. |
Llama 3.1 8B Estimate | Meta | $0.06 | $0.06 | 128k | Fast | Tiny but capable; serves at >500 tok/s on Groq. |
Gemini 1.5 Flash | $0.07 | $0.30 | 1,000k | Fast | Cheapest 1M-context model; ideal for high-throughput pipelines. | |
Gemini 2.0 Flash | $0.10 | $0.40 | 1,000k | Fast | 1M-token context for next to nothing; the best $/token deal right now. | |
DeepSeek Coder V2 | DeepSeek | $0.14 | $0.28 | 128k | Balanced | Built for code; competitive with GPT-4o on HumanEval at 1/15 the cost. |
GPT-4o mini | OpenAI | $0.15 | $0.60 | 128k | Fast | Cheapest production model from OpenAI; great for high-volume tasks. |
Command R | Cohere | $0.15 | $0.60 | 128k | Balanced | Optimised for retrieval and tool use at GPT-4o mini prices. |
Mistral Small 3 | Mistral | $0.20 | $0.60 | 32k | Fast | Excellent latency / cost ratio; production-grade fast tier. |
Claude 3 Haiku | Anthropic | $0.25 | $1.25 | 200k | Fast | Cheapest Anthropic option; good for classification + extraction. |
DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 64k | Frontier | Frontier-class quality at fast-tier prices; open weights. |
Qwen 2 72B (self-hosted) Estimate | Self-hosted | $0.40 | $0.40 | 128k | Open-source | Self-hosted estimate on 2× A100 80GB; ignores utilisation overhead. |
GPT-3.5 Turbo | OpenAI | $0.50 | $1.50 | 16.385k | Fast | Older but very cheap; mostly superseded by GPT-4o mini. |
Llama 3.1 70B Estimate | Meta | $0.59 | $0.79 | 128k | Balanced | Strong open-weights alternative to Claude 3.5 Haiku / GPT-4o mini. |
Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | 200k | Fast | Fast + cheap with a huge context window; great for RAG. |
Mixtral 8x22B Estimate | Mistral | $1.20 | $1.20 | 64k | Balanced | Open-weights MoE; available on Together / Fireworks. |
Gemini 1.5 Pro | $1.25 | $5.00 | 2,000k | Frontier | 2M-token context — unbeatable for whole-codebase or long-doc workflows. | |
Mistral Large 2 | Mistral | $2.00 | $6.00 | 128k | Frontier | EU-hosted frontier; multilingual + strong on code. |
Grok 2 | xAI | $2.00 | $10.00 | 128k | Frontier | Real-time data via X integration; competitive on reasoning benchmarks. |
GPT-4o | OpenAI | $2.50 | $10.00 | 128k | Frontier | Best all-rounder; strong at coding, vision, voice. |
Command R+ | Cohere | $2.50 | $10.00 | 128k | Frontier | Built for RAG; native citation support, strong multilingual. |
Llama 3.1 405B Estimate | Meta | $2.70 | $2.70 | 128k | Frontier | Open weights; pricing via providers like Together / Groq / Fireworks. |
OpenAI o1-mini | OpenAI | $3.00 | $12.00 | 128k | Reasoning | Reasoning at ~1/5 the cost of o1; weaker on world knowledge. |
Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200k | Frontier | Top-tier for nuanced writing, agentic tasks, and tool use. |
GPT-4 Turbo | OpenAI | $10.00 | $30.00 | 128k | Frontier | Legacy frontier; prefer GPT-4o unless you need its exact behaviour. |
OpenAI o1 | OpenAI | $15.00 | $60.00 | 200k | Reasoning | Long chain-of-thought reasoning; expensive but unbeatable on hard math/code. |
Claude 3 Opus | Anthropic | $15.00 | $75.00 | 200k | Frontier | Pricey legacy flagship; pick Sonnet 3.5 unless reproducibility matters. |
Prices reflect each provider's public list price. Some open-weights models are marked as estimates (varies by host: Together, Groq, Fireworks, etc.). Last refreshed June 2026.
How to reduce your LLM API costs
Six levers that consistently bring monthly LLM bills down by 30–70% in production.
Pick the right tier for the job
Use a cheap fast model (GPT-4o mini, Gemini Flash, Claude Haiku) for routing, classification, and most production tasks. Reserve frontier models for the hard 5–10% of calls.
Cache prompt prefixes
Anthropic, OpenAI, and Google all expose prompt caching. A long system prompt or RAG context replayed across requests can be served at ~10% of normal input price.
Cap the output, not the input
Output tokens are 3–5× more expensive than input. Set max_tokens, use stop sequences, and ask the model for structured JSON instead of prose whenever possible.
Route by complexity
Send simple queries to a fast model and only escalate to a frontier model when a confidence check or eval fails. A two-tier router cuts cost 40–70% in production.
Compress context before sending
Summarise long histories, dedupe retrieved chunks, and strip boilerplate from system prompts. Most teams over-fill the context window by 2–3×.
Batch & stream where you can
OpenAI and Anthropic Batch APIs run within 24h at ~50% off. Streaming doesn’t change the bill but lets you cancel mid-flight when the user navigates away.
Understanding LLM API pricing
LLM API pricing is almost always quoted per million tokens, separately for the prompt you send in (“input”) and the text the model writes back (“output”). A token is roughly 3–4 characters of English, so 1,000 tokens ≈ 750 words. The total bill for any given call is simply: input_tokens × input_rate + output_tokens × output_rate.
Three modifiers can change that base number meaningfully. Prompt caching lets you replay long system prompts or retrieved context at ≈10% of the normal input rate, which is enormous for RAG. Batch APIs (OpenAI, Anthropic) trade synchronous latency for a 50% discount on jobs that can wait up to 24 hours. And fine-tuning generally costs 1.5–3× the base inference rate, so the math only works if you’re saving on prompt length or quality at scale.
Most teams underestimate output cost. Output tokens are typically 3–5× more expensive than input tokens, and chatty models pad responses unless you cap them. The single highest-leverage change is usually setting max_tokens aggressively and asking for structured JSON instead of prose.
Frequently asked questions
Ready to scale with confident pricing?
Plug your real traffic into the calculator above, then dig into our directory of 1,000+ AI tools and frameworks to ship your next product.