When was Llama 3.1 405B released?

Llama 3.1 405B was released by Meta on 23 July 2024.

How much does Llama 3.1 405B cost?

Llama 3.1 405B costs $2.70 per 1 million input tokens and $2.70 per 1 million output tokens via Meta’s API. At a typical 3:1 input/output ratio the blended cost works out to about $2.70 per 1M tokens.

How smart is Llama 3.1 405B?

Llama 3.1 405B scores 64 out of 100 on our intelligence index, a composite of MMLU, MMLU Pro, GPQA, MATH and HumanEval benchmark scores. We class that score as capable mid-tier. It places the model #16 of 22 language models we track.

What is the context window of Llama 3.1 405B?

Llama 3.1 405B has a 128k-token context window — roughly 96,000 words of text. That’s what you can fit into a single request (your prompt plus all uploaded documents and the model’s response combined).

What is Llama 3.1 405B best for?

Llama 3.1 405B is most useful for research, distillation and custom fine-tunes. Its key strengths are open weights at frontier scale and no usage limits.

How fast is Llama 3.1 405B?

Llama 3.1 405B generates around 32 output tokens per second with a typical time-to-first-token of 700 ms on the major API providers. Reasoning models that "think" before answering will appear slower on tokens-per-second since they spend time on internal chain-of-thought.

Is Llama 3.1 405B better than Llama 3.3 70B?

Llama 3.3 70B scores higher on our intelligence index (66 vs 64), so on the hardest tasks Llama 3.3 70B typically wins. In our catalog it is also cheaper ($0.27 vs $2.70 per 1M tokens) and faster (200 vs 32 tok/s), so Llama 3.1 405B is the better pick mainly where its specific capabilities matter. See our full Llama 3.1 405B vs Llama 3.3 70B comparison at /ai-models/compare/llama-3-3-70b-vs-llama-3-1-405b/.

What is the best alternative to Llama 3.1 405B?

The closest alternatives to Llama 3.1 405B are Llama 3.3 70B, DeepSeek R1 and DeepSeek V3. Each shares most of Llama 3.1 405B’s use-cases — pick by price, context window or specific capability rather than headline intelligence.

Is Llama 3.1 405B open source?

Yes. Llama 3.1 405B is open-source — the weights are publicly available, and you can self-host it or use a hosted provider (Together, Fireworks, Groq, Replicate). Some open-source licenses include usage caveats; check the model’s license file before deploying.

Can I use Llama 3.1 405B for free?

Yes. Llama 3.1 405B is open-weights, so you can download it and run it locally for free (just your hardware cost). Hosted access is paid via providers like Together, Fireworks, Groq and Replicate.

Meta · Open source · Jul 2024

Llama 3.1 405B

The original "open-source GPT-4" — largest publicly-released weights.

Intelligence index: 64 / 100 (27th percentile vs all models; composite of MMLU, GPQA, MATH & HumanEval)
Speed: 32 tok/s (0th percentile vs all models; median across providers, steady state)
Blended price: $2.70 / 1M tokens (41st percentile vs all models; 3:1 input:output blend)

Llama 3.1 405B Overview at a Glance

Llama 3.1 405B is a large language model from Meta, first released on 23 July 2024. It is open-source (open-weights) and sits in the frontier, Meta, open-weights, and enterprise categories of our catalog. Our catalog tags it as the original "open-source GPT-4", the largest publicly-released weights. This page covers Llama 3.1 405B pricing, benchmarks, API limits, speed, modalities, best use cases, and how it compares with similar models, so you can decide whether it belongs in your stack in 2026.

As a language model, Llama 3.1 405B is evaluated on reasoning quality, coding ability, latency, context window size, and dollars-per-million-tokens. The context window is 128k tokens (about 96k words), which determines how much prompt, document, and conversation history you can send in one request. At a typical 3:1 input-to-output mix, the blended API price is about $2.70 per 1M tokens. On our intelligence index it ranks #16 of 22 language models we track with a score of 64/100 (capable mid-tier).

Teams usually shortlist Llama 3.1 405B when they need a dependable Meta option for production chat, agents, retrieval-augmented generation, or coding copilots. Common fits include research, distillation, and custom fine-tunes. Reviewers consistently call out open weights at frontier scale and no usage limits as standout strengths. The trade-offs to weigh are that it is slow and expensive to host (it needs 8×H100). The sections below break down pricing tables, benchmark charts, token limits, input/output modalities, and head-to-head comparisons, so that queries from “Llama 3.1 405B API pricing” to “Llama 3.1 405B vs Llama 3.3 70B” are answered on this page.

If you are migrating from an older Meta model or switching labs entirely, treat this page as a decision brief: skim the overview stats, confirm API pricing fits your volume, check whether the context window covers your longest documents, then validate quality on a golden set of prompts. Benchmarks and charts help shortlist; your own evals decide. We refresh catalog numbers periodically (last update 2026-06) so figures stay useful through the year.

Context window: 128k tokens
Max output: 4k tokens
Input price: $2.70 / 1M tokens
Output price: $2.70 / 1M tokens
Time to first token: 0.7s
Input modalities: text
Output modalities: text
License: Open source
Provider: Meta

Strengths

Open weights at frontier scale
No usage limits

Weaknesses

Slow
Expensive to host (needs 8×H100)

Best for

Research
Distillation
Custom fine-tunes

Llama 3.1 405B Pricing

Llama 3.1 405B uses token-based API pricing from Meta. You pay $2.70 per million input tokens and $2.70 per million output tokens. For planning budgets we quote a blended rate of $2.70 per 1M tokens at a 3:1 input-to-output ratio — the same convention used across our catalog so models are comparable. Output tokens usually dominate cost for chatty or agentic workloads, so watch generation length and system-prompt size.

When estimating production spend, multiply expected monthly tokens by the blended rate, then add a buffer for retries, tool-calling loops, and RAG context. Because Llama 3.1 405B is open-weights, self-hosting can beat API pricing at high volume once GPU utilization is solid — hosted APIs remain cheaper to start.

Also compare Llama 3.1 405B against cheaper siblings from Meta for router patterns: send easy traffic to a smaller, cheaper model (Llama 3.3 70B, for example) and reserve Llama 3.1 405B for hard reasoning. That hybrid design often cuts the tokens billed at the Llama 3.1 405B rate by 30–70% without users noticing quality drops on simple turns.
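The router pattern above can be sized with simple arithmetic. The sketch below is illustrative only: it assumes the cheaper model is Llama 3.3 70B at the $0.27 blended rate from the comparison table on this page, and that a request uses the same number of tokens whichever model serves it. The function names are hypothetical.

```python
def routed_price(easy_share: float, large_price: float = 2.70,
                 small_price: float = 0.27) -> float:
    """Average blended price per 1M tokens when `easy_share` of traffic goes to the small model."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return (1.0 - easy_share) * large_price + easy_share * small_price


def savings_fraction(easy_share: float, large_price: float = 2.70,
                     small_price: float = 0.27) -> float:
    """Fraction of spend saved compared with sending everything to the large model."""
    return 1.0 - routed_price(easy_share, large_price, small_price) / large_price


# Assumption: half of all traffic is "easy" and goes to the cheaper model.
print(f"{savings_fraction(0.5):.0%} saved")
```

Under these assumptions the saving scales linearly with the share of traffic you can route away, so the result depends almost entirely on how well your router classifies easy requests.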

Input price: $2.70 / 1M tokens
Output price: $2.70 / 1M tokens
Blended (3:1): $2.70 / 1M tokens
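For reference, here is a minimal sketch of how a 3:1 blended rate can be computed. It assumes the blend is a token-weighted average of the input and output prices, which is the usual reading of a "3:1 input:output blend" but is not spelled out in the catalog; the function name is hypothetical.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Token-weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Llama 3.1 405B: input and output are both $2.70, so any blend is also $2.70.
print(f"${blended_price(2.70, 2.70):.2f} per 1M tokens")
```

Because input and output prices are identical for this model, the ratio has no effect here. It matters when comparing against models whose output tokens cost more than their input tokens.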

Llama 3.1 405B Benchmarks

Public benchmark scores help compare Llama 3.1 405B with other LLMs on knowledge, graduate-level science, competition math, and coding. Reported figures in our catalog include MMLU 88.6, MMLU Pro 73.3, GPQA 51.1, MATH 73.8, and HumanEval 89. These are not a substitute for evals on your own prompts, but they are useful for shortlisting.

Our intelligence index (64/100) normalizes those benchmarks into a single score, which falls in our capable mid-tier band, so you can scan the leaderboard quickly. The performance chart below shows each benchmark against the current catalog leader.

Benchmark	Description	Score	Leader
MMLU	General knowledge across 57 subjects	88.6	91.8
MMLU Pro	Harder MMLU successor with more reasoning	73.3	80.0
GPQA	Graduate-level science Q&A	51.1	78.0
MATH	Competition mathematics	73.8	94.8
HumanEval	Python code generation pass@1	89.0	95.8

Llama 3.1 405B API Pricing

API pricing for Llama 3.1 405B is what you pay when calling Meta’s developer endpoint (or a marketplace such as Azure, Bedrock, or Vertex when available). Unlike consumer chat apps with flat subscriptions, API bills scale with tokens processed. Cache prompt prefixes where the provider supports it, batch non-interactive jobs, and prefer smaller sibling models for classification or routing when full Llama 3.1 405B quality is unnecessary.

To convert catalog numbers into a monthly forecast: estimate average input tokens per request (system prompt + user message + retrieved context), average output tokens, and request volume. Cost ≈ requests × ((inputTokens/1e6) × inputPrice + (outputTokens/1e6) × outputPrice). Our LLM pricing calculator can stress-test scenarios if you need a second opinion against peers.
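The forecast formula above translates directly into a few lines of Python. This is a sketch with the catalog prices as defaults; the workload figures in the example are made up for illustration, and the function name is hypothetical.

```python
def monthly_cost(requests: int, input_tokens: float, output_tokens: float,
                 input_price: float = 2.70, output_price: float = 2.70) -> float:
    """Cost = requests x ((inputTokens/1e6) x inputPrice + (outputTokens/1e6) x outputPrice)."""
    per_request = (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
    return requests * per_request


# Hypothetical workload: 100,000 requests a month,
# averaging 3,000 input tokens and 1,000 output tokens each.
estimate = monthly_cost(100_000, 3_000, 1_000)
print(f"${estimate:,.2f} per month")
```

For this made-up workload the estimate comes to about $1,080 a month before any buffer for retries or tool-calling loops.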

Always verify live rates on the official docs — our figures are refreshed periodically (last catalog update: 2026-06) and providers change list prices. Official reference: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1.

Llama 3.1 405B Context Window

Llama 3.1 405B offers a 128k-token context window, roughly 96k words of English text. Everything in a single API call counts against that budget: system instructions, chat history, retrieved documents, tool schemas, and the model’s reply. Exceeding the window truncates or errors depending on the provider.

Large windows help with long PDFs, multi-file code reviews, and multi-hour agent traces, but bigger contexts also cost more tokens and can add latency. Prefer retrieval that stuffs only relevant chunks, summarize old turns, and reserve headroom for up to 4k output tokens.
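A rough pre-flight check can catch oversized requests before they are sent. The sketch below estimates tokens from word counts using the 96k-words-per-128k-tokens ratio quoted on this page; that ratio is only an approximation for English text, and treating "128k" as exactly 128,000 tokens is an assumption. A real tokenizer will give different counts, and the function names are hypothetical.

```python
import math

CONTEXT_WINDOW = 128_000   # assumption: "128k" read as 128,000 tokens
MAX_OUTPUT = 4_000         # assumption: "4k" read as 4,000 tokens
WORDS_PER_TOKEN = 0.75     # 96k words / 128k tokens, per the catalog figures


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(parts: list[str], reserved_output: int = MAX_OUTPUT) -> bool:
    """True if all prompt parts plus reserved output headroom fit in the window."""
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_output <= CONTEXT_WINDOW


system_prompt = "You are a helpful assistant."
document = "word " * 90_000
print(fits_in_context([system_prompt, document]))
```

Pass every component of the request (system prompt, history, retrieved chunks, tool schemas) as a separate part so the check reflects the full budget.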

Llama 3.1 405B Input / Output Modalities

Llama 3.1 405B accepts text as input and produces text as output. Knowing the modality matrix matters when you design pipelines — for example, vision-capable language models can take screenshots or PDFs as images, while pure text models need an OCR or captioning step first.

If you need bidirectional voice, native video understanding, or tool-use with multimodal arguments, confirm support in Meta’s API schema rather than assuming parity with the consumer chat app. Modality support also affects pricing: image or audio inputs may be tokenized differently than plain text.

Document which of Llama 3.1 405B’s listed modalities you will actually send in production. Turning on unused multimodal features can change tokenizers, rate limits, and safety filters unexpectedly.

Inputs: text
Outputs: text

Llama 3.1 405B Token Limits

Token limits define how much Llama 3.1 405B can read and write per request. Total context is capped at 128k tokens. Maximum completion length is 4k tokens — even if context remains, the model stops generating beyond that ceiling unless you continue in a follow-up call. Providers may also enforce organization-level rate limits (RPM/TPM) separate from these per-request caps.

Practical tip: set max_tokens intentionally. Leaving it unbounded wastes budget on verbose answers; setting it too low truncates JSON or code. For structured outputs, prefer schemas/tool calls and keep completions tight.

Context window: 128k tokens
Max output: 4k tokens
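When an answer needs more than the 4k-token completion ceiling, the follow-up-call approach mentioned above can be sketched as a simple loop. Everything provider-specific here is an assumption: `generate` is a hypothetical callable that returns the text plus a flag saying whether generation stopped at `max_tokens`, and how truncation is actually reported varies by provider. The stub counts characters rather than tokens purely to keep the example runnable.

```python
from typing import Callable, Tuple

# Hypothetical provider call: returns (text, truncated), where truncated is True
# when generation stopped because it hit max_tokens.
GenerateFn = Callable[[str, int], Tuple[str, bool]]


def generate_with_continuation(prompt: str, generate: GenerateFn,
                               max_tokens: int = 4_000, max_calls: int = 4) -> str:
    """Call `generate` repeatedly, feeding back partial output, until it finishes."""
    pieces = []
    context = prompt
    for _ in range(max_calls):
        text, truncated = generate(context, max_tokens)
        pieces.append(text)
        if not truncated:
            break
        # Append what we have so the next call resumes where this one stopped.
        context += text
    return "".join(pieces)


def fake_generate(context: str, max_tokens: int) -> Tuple[str, bool]:
    """Stand-in for a real API: emits a fixed answer, max_tokens characters at a time."""
    answer = "abcdefghij"
    already_sent = len(context) - len("Q: ")
    chunk = answer[already_sent:already_sent + max_tokens]
    return chunk, already_sent + max_tokens < len(answer)


print(generate_with_continuation("Q: ", fake_generate, max_tokens=4))
```

Each continuation resends the growing context, so input cost rises with every call and the total must still fit inside the 128k-token window. The `max_calls` cap keeps a runaway generation from looping indefinitely.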

Llama 3.1 405B Speed

Speed for Llama 3.1 405B is measured two ways: time-to-first-token (how quickly streaming starts) and steady-state tokens per second. Catalog median throughput is about 32 tok/s. Typical TTFT is 700 ms. Reasoning-heavy modes that think before answering will look slower on tok/s even when quality is higher.

Interactive chat wants low TTFT; batch extraction can tolerate higher latency for cheaper regions or providers. If Llama 3.1 405B is too slow for your UX, evaluate a smaller sibling from the same lab, such as Llama 3.3 70B (200 tok/s in our catalog), before switching ecosystems.

Throughput: 32 tokens/sec
Time to first token: 700 ms
Speed percentile: Faster than ~0% of tracked LLMs
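The two speed figures combine into a simple end-to-end estimate: time to first token plus output length divided by throughput. The sketch below uses the catalog medians as defaults and ignores network overhead, queueing and provider variance, so treat it as a rough approximation; the function name is hypothetical.

```python
def response_time_seconds(output_tokens: int, ttft_s: float = 0.7,
                          tokens_per_s: float = 32.0) -> float:
    """Approximate wall-clock time to stream a full response."""
    return ttft_s + output_tokens / tokens_per_s


# A 500-token answer, and a completion at the 4k-token ceiling (read as 4,000 tokens).
for n in (500, 4_000):
    print(f"{n} tokens: about {response_time_seconds(n):.1f} s")
```

At these medians a 500-token reply takes roughly 16 seconds and a full-length completion about two minutes, which is why streaming matters for interactive use.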

Llama 3.1 405B Performance Charts

The charts on this page visualize Llama 3.1 405B against catalog peers. Benchmark bars show academic scores versus the current leader; the similar-models comparison table plots intelligence, speed, and blended price so you can see trade-offs at a glance. Use them to answer “is Llama 3.1 405B fast enough?” and “is the quality jump worth the premium?” without opening a spreadsheet.

Benchmark performance vs catalog leaders (chart): MMLU, MMLU Pro, GPQA, MATH and HumanEval scores for Llama 3.1 405B plotted against the current catalog leader. The values are listed in the Benchmarks section.

Intelligence index vs similar models

Llama 3.1 405B: 64
Llama 3.3 70B: 66
DeepSeek R1: 73
DeepSeek V3: 67

Comparison with Similar Models

Choosing an AI model is rarely absolute — it is relative to the next-best option. Llama 3.1 405B is most often weighed against Llama 3.3 70B, DeepSeek R1, and DeepSeek V3. Compare intelligence (or generation quality), latency, price, license, and modality support. A slightly weaker but much cheaper model can win for high-volume workloads; a pricier frontier model wins when a single mistake is expensive.

Use the links and table below for structured Llama 3.1 405B vs alternatives research. We also maintain dedicated head-to-head pages for popular matchups when available. If you are standardizing on Meta, check sibling models from the same lab before leaving the ecosystem.

A practical bake-off: pick 20–50 real prompts, score accuracy/style, measure p50/p95 latency, and compute cost at projected volume for Llama 3.1 405B and two peers. Ship the winner behind a feature flag so you can reverse the decision without a rewrite.

Model	Provider	Intelligence	Speed	Price
Llama 3.1 405B	Meta	64	32 t/s	$2.70/1M
Llama 3.3 70B	Meta	66	200 t/s	$0.27/1M
DeepSeek R1	DeepSeek	73	60 t/s	$0.96/1M
DeepSeek V3	DeepSeek	67	90 t/s	$0.48/1M
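The bake-off described above can be scored with a small helper like the one below. It is a sketch under stated assumptions: each trial records latency, token counts and a pass/fail judgment you supply yourself, percentiles use the nearest-rank method, and the `Trial` and `summarize` names are hypothetical. The sample data is synthetic.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    latency_s: float
    input_tokens: int
    output_tokens: int
    correct: bool


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(trials: list[Trial], input_price: float, output_price: float) -> dict:
    """Accuracy, p50/p95 latency and average cost per request for one model."""
    latencies = [t.latency_s for t in trials]
    total_cost = sum(t.input_tokens / 1e6 * input_price
                     + t.output_tokens / 1e6 * output_price for t in trials)
    return {
        "accuracy": sum(t.correct for t in trials) / len(trials),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "cost_per_request": total_cost / len(trials),
    }


# Synthetic data: 20 prompts with latencies of 1..20 s, every second answer correct.
trials = [Trial(float(i), 1_000, 500, i % 2 == 0) for i in range(1, 21)]
print(summarize(trials, input_price=2.70, output_price=2.70))
```

Run the same prompts through each candidate, call `summarize` once per model with that model's prices, and multiply `cost_per_request` by projected volume to compare spend.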

Llama 3.1 405B Best Use Cases

Best use cases for Llama 3.1 405B follow from its strengths, price point, and modality support. Match the model to the job: frontier reasoning for hard planning, fast/cheap tiers for classification, image/video/speech specialists for media pipelines.

Based on catalog notes, Llama 3.1 405B is a particularly strong fit for research, distillation, and custom fine-tunes. Validate with a short bake-off on your real prompts before a full cutover.

Anti-patterns: do not use a frontier-priced model as a generic classifier if a smaller model scores within a point on your eval; do not stuff entire corpora into context when retrieval would be cheaper; and do not skip structured outputs if you plan to parse Llama 3.1 405B responses in code.

Research
Distillation
Custom fine-tunes

Llama 3.1 405B Pros & Cons

Every model trades quality, speed, cost, and openness. Here is a concise pros and cons list for Llama 3.1 405B drawn from our catalog strengths and weaknesses — pair it with your own evals before committing.

Read pros as “reasons to shortlist” and cons as “risks to mitigate,” not as deal-breakers in isolation. A listed weakness (for example higher price or smaller context) may be irrelevant if your workload is bursty, short-context, or already standardized on Meta.

After scanning this list, jump to the comparison table and FAQ for decision support, then lock a trial window with success metrics before replacing a production model with Llama 3.1 405B.

Pros

Open weights at frontier scale
No usage limits

Cons

Slow
Expensive to host (needs 8×H100)

Llama 3.1 405B — frequently asked questions

Llama 3.1 405B is a large language model from Meta, released on 23 July 2024. The original "open-source GPT-4" — largest publicly-released weights.
