
Built for builders

LLM API Pricing Calculator

Compare token costs across every major LLM provider — OpenAI, Anthropic, Google, Mistral, Meta and more. Estimate your monthly spend in seconds.

  • 25+ models
  • Live calculator
  • Updated quarterly
  • No sign-up

Estimate your costs

Pick a model, dial in your traffic, and see the monthly bill update live.

Configure usage

Selected: GPT-4o mini · Context 128k tokens · Fast

Input tokens per request: 800

Average prompt + system + retrieved context per call.

Output tokens per request: 400

Average response length you ask the model to produce.

Requests per day: 500

≈ 15,000 requests/month

Estimated monthly cost

$5.40

Based on 15,000 requests · 18.00M tokens

Input 33% · Output 67%

Input tokens

12,000,000 tok @ $0.15/1M

$1.80

Output tokens

6,000,000 tok @ $0.60/1M

$3.60

Monthly total

$5.40

Annual projection

$64.80
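The breakdown above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the calculator's actual code: it reads the three inputs as 800 input tokens per request, 400 output tokens per request, and 500 requests per day, and it assumes a 30-day month (which is what 500/day ≈ 15,000/month implies). The names `Usage` and `monthly_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens_per_request: int
    output_tokens_per_request: int
    requests_per_day: int
    days_per_month: int = 30  # assumption: 500/day -> 15,000/month


def monthly_cost(usage: Usage, input_price_per_m: float, output_price_per_m: float) -> dict:
    """Return the monthly bill breakdown; prices are USD per 1M tokens."""
    requests = usage.requests_per_day * usage.days_per_month
    input_tokens = requests * usage.input_tokens_per_request
    output_tokens = requests * usage.output_tokens_per_request
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    total = input_cost + output_cost
    return {
        "requests": requests,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        # Round to cents only at the end to avoid compounding float error.
        "input_cost": round(input_cost, 2),
        "output_cost": round(output_cost, 2),
        "total": round(total, 2),
        "annual": round(total * 12, 2),
    }


# GPT-4o mini at the prices shown above: $0.15 input, $0.60 output per 1M tokens.
bill = monthly_cost(Usage(800, 400, 500), 0.15, 0.60)
print(f"Monthly total: ${bill['total']:.2f}")      # Monthly total: $5.40
print(f"Annual projection: ${bill['annual']:.2f}")  # Annual projection: $64.80
```

Swapping in another model's two prices from the comparison table is enough to re-run the estimate for that model.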

Compare

All models side-by-side

Prices are quoted per 1M tokens.

| Model | Provider | Input ($/1M) | Output ($/1M) | Context | Tier | Estimate | Notes |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B (self-hosted) | Self-hosted | $0.05 | $0.05 | 128k | Open-source | Yes | Single GPU (L4 / A10) is enough at moderate throughput. |
| Llama 3.1 8B | Meta | $0.06 | $0.06 | 128k | Fast | Yes | Tiny but capable; serves at >500 tok/s on Groq. |
| Gemini 1.5 Flash | Google | $0.07 | $0.30 | 1,000k | Fast | | Cheapest 1M-context model; ideal for high-throughput pipelines. |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1,000k | Fast | | 1M-token context for next to nothing; the best $/token deal right now. |
| DeepSeek Coder V2 | DeepSeek | $0.14 | $0.28 | 128k | Balanced | | Built for code; competitive with GPT-4o on HumanEval at 1/15 the cost. |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128k | Fast | | Cheapest production model from OpenAI; great for high-volume tasks. |
| Command R | Cohere | $0.15 | $0.60 | 128k | Balanced | | Optimised for retrieval and tool use at GPT-4o mini prices. |
| Mistral Small 3 | Mistral | $0.20 | $0.60 | 32k | Fast | | Excellent latency / cost ratio; production-grade fast tier. |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 | 200k | Fast | | Cheapest Anthropic option; good for classification + extraction. |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 64k | Frontier | | Frontier-class quality at fast-tier prices; open weights. |
| Qwen 2 72B (self-hosted) | Self-hosted | $0.40 | $0.40 | 128k | Open-source | Yes | Self-hosted estimate on 2× A100 80GB; ignores utilisation overhead. |
| GPT-3.5 Turbo | OpenAI | $0.50 | $1.50 | 16.385k | Fast | | Older but very cheap; mostly superseded by GPT-4o mini. |
| Llama 3.1 70B | Meta | $0.59 | $0.79 | 128k | Balanced | Yes | Strong open-weights alternative to Claude 3.5 Haiku / GPT-4o mini. |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | 200k | Fast | | Fast + cheap with a huge context window; great for RAG. |
| Mixtral 8x22B | Mistral | $1.20 | $1.20 | 64k | Balanced | Yes | Open-weights MoE; available on Together / Fireworks. |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | 2,000k | Frontier | | 2M-token context; unbeatable for whole-codebase or long-doc workflows. |
| Mistral Large 2 | Mistral | $2.00 | $6.00 | 128k | Frontier | | EU-hosted frontier; multilingual + strong on code. |
| Grok 2 | xAI | $2.00 | $10.00 | 128k | Frontier | | Real-time data via X integration; competitive on reasoning benchmarks. |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128k | Frontier | | Best all-rounder; strong at coding, vision, voice. |
| Command R+ | Cohere | $2.50 | $10.00 | 128k | Frontier | | Built for RAG; native citation support, strong multilingual. |
| Llama 3.1 405B | Meta | $2.70 | $2.70 | 128k | Frontier | Yes | Open weights; pricing via providers like Together / Groq / Fireworks. |
| OpenAI o1-mini | OpenAI | $3.00 | $12.00 | 128k | Reasoning | | Reasoning at ~1/5 the cost of o1; weaker on world knowledge. |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200k | Frontier | | Top-tier for nuanced writing, agentic tasks, and tool use. |
| GPT-4 Turbo | OpenAI | $10.00 | $30.00 | 128k | Frontier | | Legacy frontier; prefer GPT-4o unless you need its exact behaviour. |
| OpenAI o1 | OpenAI | $15.00 | $60.00 | 200k | Reasoning | | Long chain-of-thought reasoning; expensive but unbeatable on hard math/code. |
| Claude 3 Opus | Anthropic | $15.00 | $75.00 | 200k | Frontier | | Pricey legacy flagship; pick Sonnet 3.5 unless reproducibility matters. |

Prices reflect each provider's public list price. Open-weights models marked "Estimate" have no single list price, because the rate varies by host (Together, Groq, Fireworks, etc.). Last refreshed June 2026.
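To compare models for one workload rather than per token, the same traffic can be priced against several rows at once. The sketch below copies five rows from the table above into a hypothetical `PRICES` dictionary and ranks them for the calculator's example workload (12M input and 6M output tokens per month). The helper name `rank_by_monthly_cost` is an assumption for illustration.

```python
# (input $/1M, output $/1M), copied from the comparison table above.
PRICES = {
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def rank_by_monthly_cost(prices: dict, input_tokens: int, output_tokens: int) -> list:
    """Return (model, monthly cost in USD) pairs, cheapest first."""
    rows = []
    for model, (in_price, out_price) in prices.items():
        cost = input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price
        rows.append((model, round(cost, 2)))
    return sorted(rows, key=lambda row: row[1])


ranking = rank_by_monthly_cost(PRICES, input_tokens=12_000_000, output_tokens=6_000_000)
for model, cost in ranking:
    print(f"{model:<20} ${cost:>8.2f}")
```

Because output is priced several times higher than input, the ranking can shift for workloads that generate far more text than they send, so it is worth re-running with your own token split.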

How to reduce your LLM API costs

Six levers that consistently bring monthly LLM bills down by 30–70% in production.

  • Pick the right tier for the job

    Use a cheap fast model (GPT-4o mini, Gemini Flash, Claude Haiku) for routing, classification, and most production tasks. Reserve frontier models for the hard 5–10% of calls.

  • Cache prompt prefixes

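    Anthropic, OpenAI, and Google all expose prompt caching. A long system prompt or RAG context replayed across requests can be served at a steep discount: as little as ~10% of the normal input price for cache reads on Anthropic, with the exact discount varying by provider and model.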

  • Cap the output, not the input

    Output tokens are 3–5× more expensive than input. Set max_tokens, use stop sequences, and ask the model for structured JSON instead of prose whenever possible.

  • Route by complexity

    Send simple queries to a fast model and only escalate to a frontier model when a confidence check or eval fails. A two-tier router can cut cost by 40–70% in production, depending on how often calls escalate.

  • Compress context before sending

    Summarise long histories, dedupe retrieved chunks, and strip boilerplate from system prompts. Most teams over-fill the context window by 2–3×.

  • Batch & stream where you can

    OpenAI and Anthropic Batch APIs run within 24h at ~50% off. Streaming doesn’t change the bill but lets you cancel mid-flight when the user navigates away.
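The "route by complexity" lever above can be sketched as a small function plus the arithmetic behind its savings. This is an illustrative sketch only: the model calls are stand-in callables rather than any provider's SDK, the confidence check is a placeholder, and the 30% escalation rate is a hypothetical figure, not a measurement.

```python
from typing import Callable, Tuple


def route(prompt: str,
          cheap: Callable[[str], str],
          frontier: Callable[[str], str],
          passes_check: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when the check fails."""
    draft = cheap(prompt)
    if passes_check(prompt, draft):
        return draft, "cheap"
    return frontier(prompt), "frontier"


def blended_cost_per_request(cheap_cost: float, frontier_cost: float,
                             escalation_rate: float) -> float:
    """Every request pays for the cheap call; escalated ones also pay for the frontier call."""
    return cheap_cost + escalation_rate * frontier_cost


# Stand-in models and check (hypothetical): the cheap model "gives up" on proofs.
def cheap_stub(prompt: str) -> str:
    return "" if "prove" in prompt else "short answer"


def frontier_stub(prompt: str) -> str:
    return "detailed answer"


def non_empty(prompt: str, answer: str) -> bool:
    return bool(answer.strip())


print(route("What is the capital of France?", cheap_stub, frontier_stub, non_empty))
print(route("Please prove this lemma.", cheap_stub, frontier_stub, non_empty))

# Per-request cost at 800 input / 400 output tokens, using the table prices
# for GPT-4o mini ($0.15 / $0.60) and GPT-4o ($2.50 / $10.00).
cheap_cost = (800 * 0.15 + 400 * 0.60) / 1_000_000
frontier_cost = (800 * 2.50 + 400 * 10.00) / 1_000_000
blended = blended_cost_per_request(cheap_cost, frontier_cost, escalation_rate=0.30)
savings = 1 - blended / frontier_cost
print(f"Savings vs. all-frontier: {savings:.0%}")
```

Note that the savings depend almost entirely on the escalation rate, so the quality of the confidence check matters more than the price gap between the two tiers.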

Understanding LLM API pricing

LLM API pricing is almost always quoted per million tokens, separately for the prompt you send in (“input”) and the text the model writes back (“output”). A token is roughly 3–4 characters of English, so 1,000 tokens ≈ 750 words. The total bill for any given call is simply: input_tokens × input_rate + output_tokens × output_rate, where each rate is the per-token price (the quoted per-1M price divided by 1,000,000).
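As a rough illustration of the formula, the sketch below estimates a token count from raw text and prices a single call. The 4-characters-per-token default is only the rule of thumb quoted above, not a real tokenizer, so treat the result as an estimate; the function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rule-of-thumb token count for English text; a real tokenizer will differ."""
    return math.ceil(len(text) / chars_per_token)


def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call in USD; prices are quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


prompt = "Summarise the following support ticket in two sentences. " * 20
tokens_in = estimate_tokens(prompt)

# One GPT-4o mini call ($0.15 / $0.60 per 1M) with a 400-token reply.
cost = call_cost(tokens_in, 400, 0.15, 0.60)
print(f"~{tokens_in} input tokens, about ${cost:.6f} per call")
```

For billing-grade numbers, the token counts returned in the provider's API response are the ones to use, since tokenizers differ between model families.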

Three modifiers can change that base number meaningfully. Prompt caching lets you replay long system prompts or retrieved context at a fraction of the normal input rate (≈10% for cache reads on Anthropic; the discount varies by provider), which is enormous for RAG. Batch APIs (OpenAI, Anthropic) trade synchronous latency for a 50% discount on jobs that can wait up to 24 hours. And inference on a fine-tuned model generally costs 1.5–3× the base rate, so the math only works if fine-tuning lets you shorten prompts or improves quality enough at scale to justify the premium.
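The caching and batch modifiers can be folded into the base formula as simple multipliers. This is a simplified sketch under stated assumptions: cached input is billed at 10% of the normal rate, the batch discount is 50%, cache-write surcharges are ignored, and whether the two discounts can be combined is provider-specific and not modelled here. The function and parameter names are hypothetical.

```python
def effective_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float,
                   cached_fraction: float = 0.0,
                   cached_rate_multiplier: float = 0.10,  # assumption: cache reads at ~10%
                   batch_discount: float = 0.0) -> float:
    """Monthly cost in USD after prompt-caching and batch discounts."""
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    # Cached tokens are billed at a reduced rate; fresh tokens at full price.
    billable_input = fresh_tokens + cached_tokens * cached_rate_multiplier
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return (input_cost + output_cost) * (1 - batch_discount)


# Claude 3.5 Sonnet ($3.00 / $15.00 per 1M), 12M input and 6M output tokens.
base = effective_cost(12_000_000, 6_000_000, 3.00, 15.00)
cached = effective_cost(12_000_000, 6_000_000, 3.00, 15.00, cached_fraction=0.75)
batched = effective_cost(12_000_000, 6_000_000, 3.00, 15.00, batch_discount=0.50)
print(f"base ${base:.2f} | 75% cached ${cached:.2f} | batch ${batched:.2f}")
```

Caching only touches the input side, so its effect shrinks as the output share of the bill grows, while the batch discount applies to the whole job.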

Most teams underestimate output cost. Output tokens are typically 3–5× more expensive than input tokens, and chatty models pad responses unless you cap them. The single highest-leverage change is usually setting max_tokens aggressively and asking for structured JSON instead of prose.


Ready to scale with confident pricing?

Plug your real traffic into the calculator above, then dig into our directory of 1,000+ AI tools and frameworks to ship your next product.