A standardized classification for AI inference by capability and speed, with current pricing.
The AI inference market we cover now spans 50 models across eight grades. The headline number: the A/B spread has compressed to 1.8×. Median A-grade input pricing is $1.75 per million tokens versus $1.00 for B-grade. Frontier-level intelligence is no longer scarce. Six providers now ship A-grade models, with GLM-5 and Kimi K2.5 delivering frontier capability at $0.60–1.00 input, prices that would have bought only B-grade six months ago. The workhorse grade, B-Fast, contains 14 models from nine providers, with input prices ranging from $0.10 (MiMo V2 Flash) to $3.00 (Claude Sonnet 4.6). A-Instant remains empty — no model yet combines frontier intelligence with frontier throughput. It is an open question whether the market will ever produce ultra-premium models that lead on both capability and speed simultaneously, or whether the smartest and the fastest will always be different models serving different buyers.
The A/B Spread: 1.8×
Median A-tier input price is 1.8× the median B-tier price — down from 4.0× in Edition 4.
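The spread is simply the ratio of median input prices across the two tiers. A minimal sketch of the calculation, using the priced A- and B-grade models from the table below (models with N/A pricing excluded; `statistics.median` is from the Python standard library):

```python
from statistics import median

# Input $/MTok for priced models in each tier, taken from the index table.
a_tier = [5.00, 1.00, 0.60, 5.00, 0.60,         # A-Fast
          2.00, 1.75, 3.00, 1.75]               # A-Bulk
b_tier = [0.25, 0.15,                           # B-Instant
          3.00, 3.00, 3.00, 0.30, 2.00, 3.00,   # B-Fast
          0.30, 0.50, 1.25, 0.28, 1.00, 0.10, 0.20,
          20.00, 1.10]                          # B-Bulk

spread = median(a_tier) / median(b_tier)
print(f"A/B spread: {spread:.1f}x")  # 1.75 / 1.00 -> 1.8x
```

Running this against the current edition's prices reproduces the headline figure: medians of $1.75 and $1.00, a 1.8× spread.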
| Model | Provider | Intelligence | Speed (tok/s) | TT500 (s) | Input $/MTok | Output $/MTok | Context |
|---|---|---|---|---|---|---|---|
| A-Fast | |||||||
| Claude Opus 4.6 (Adaptive) reasoning | Anthropic | 53 | 44.1 tok/s | 28.7s | $5.00 | $25.00 | 1M |
| GLM-5 open reasoning | Zhipu AI | 50 | 59.2 tok/s | 10.4s | $1.00 | $3.20 | 128K |
| Kimi K2.5 reasoning | Moonshot AI | 47 | 37.8 tok/s | 16.2s | $0.60 | $3.00 | 128K |
| Claude Opus 4.6 | Anthropic | 46 | 41.3 tok/s | 14.0s | $5.00 | $25.00 | 1M |
| Qwen3.5-397B-A17B open reasoning | Alibaba/Qwen | 45 | 65.5 tok/s | 10.0s | $0.60 | $3.60 | 131K |
| A-Bulk | |||||||
| Gemini 3.1 Pro Preview reasoning | Google | 57 | 99.6 tok/s | 46.5s | $2.00 | $12.00 | 1M |
| GPT-5.3 Codex reasoning | OpenAI | 54 | 63.5 tok/s | 104.9s | $1.75 | $14.00 | 200K |
| Claude Sonnet 4.6 (Adaptive) reasoning | Anthropic | 52 | 57.3 tok/s | 113.4s | $3.00 | $15.00 | 1M |
| GPT-5.2 reasoning | OpenAI | 51 | 60.0 tok/s | 81.4s | $1.75 | $14.00 | 400K |
| B-Instant | |||||||
| Gemini 3.1 Flash Lite Preview | Google | 34 | 339.1 tok/s | 7.0s | $0.25 | $1.50 | 1M |
| GPT-OSS 120B open | OpenAI | 33 | 268.4 tok/s | 2.8s | $0.15 | $0.60 | 131K |
| B-Fast | |||||||
| Claude Sonnet 4.6 | Anthropic | 44 | 42.9 tok/s | 12.6s | $3.00 | $15.00 | 1M |
| Claude Sonnet 4.5 (Thinking) reasoning | Anthropic | 43 | 45.5 tok/s | 20.5s | $3.00 | $15.00 | 1M |
| Grok 4 reasoning | xAI | 42 | 40.8 tok/s | 24.5s | $3.00 | $15.00 | 128K |
| MiniMax M2.5 | MiniMax | 42 | 48.6 tok/s | 13.6s | $0.30 | $1.20 | TBD |
| o3 reasoning | OpenAI | 38 | 55.1 tok/s | 23.0s | $2.00 | $8.00 | 200K |
| Claude Sonnet 4.5 | Anthropic | 37 | 41.8 tok/s | 13.2s | $3.00 | $15.00 | 1M |
| KAT Coder Pro V1 open | KwaiKAT | 36 | 55.8 tok/s | 10.8s | $0.30 | $1.20 | TBD |
| Nova 2.0 Pro Preview * | Amazon | 36 | 145.9 tok/s | 29.3s | N/A | N/A | 300K |
| Gemini 3 Flash | Google | 35 | 131.6 tok/s | 5.7s | $0.50 | $3.00 | 1M |
| Gemini 2.5 Pro reasoning | Google | 35 | 132.3 tok/s | 28.0s | $1.25 | $10.00 | 1M |
| DeepSeek V3.2 open | DeepSeek | 32 | 26.8 tok/s | 20.6s | $0.28 | $0.42 | 164K |
| Claude Haiku 4.5 | Anthropic | 31 | 87.9 tok/s | 6.3s | $1.00 | $5.00 | 200K |
| MiMo V2 Flash open reasoning | Xiaomi | 30 | 116.4 tok/s | 6.3s | $0.10 | $0.30 | TBD |
| Grok Code Fast 1 | xAI | 29 | 182.2 tok/s | 6.5s | $0.20 | $1.50 | 131K |
| B-Bulk | |||||||
| o3-pro reasoning | OpenAI | 41 | 15.2 tok/s | 195.6s | $20.00 | $80.00 | 200K |
| o4-mini reasoning | OpenAI | 33 | 119.5 tok/s | 39.5s | $1.10 | $4.40 | 200K |
| Qwen3-235B-A22B open reasoning * | Alibaba/Qwen | 30 | 37.4 tok/s | 70.3s | N/A | N/A | 131K |
| C-Instant | |||||||
| GPT-OSS 20B open | OpenAI | 24 | 285.8 tok/s | 2.5s | $0.06 | $0.20 | 131K |
| Gemini 2.5 Flash | Google | 21 | 220.4 tok/s | 2.8s | $0.30 | $2.50 | 1M |
| Gemini 2.5 Flash Lite | Google | 13 | 293.9 tok/s | 2.2s | $0.10 | $0.40 | 1M |
| Nova Micro | Amazon | 10 | 272.7 tok/s | 2.5s | $0.04 | $0.14 | 128K |
| C-Fast | |||||||
| Grok 4.1 Fast | xAI | 24 | 99.6 tok/s | 5.7s | $0.20 | $0.50 | 2M |
| Nemotron 3 Nano open * | NVIDIA | 24 | 146.8 tok/s | 18.0s | N/A | N/A | 128K |
| Llama 4 Maverick open | Meta | 18 | 117.8 tok/s | 5.0s | $0.31 | $0.85 | 1M |
| GPT-4o | OpenAI | 17 | 87.8 tok/s | 6.5s | $2.50 | $10.00 | 128K |
| Llama 3.1 405B open | Meta | 17 | 25.8 tok/s | 22.0s | $4.00 | $9.50 | 128K |
| ERNIE 4.5 * | Baidu | 15 | 24.4 tok/s | 24.1s | N/A | N/A | 128K |
| Llama 4 Scout open | Meta | 14 | 125.4 tok/s | 4.8s | $0.17 | $0.66 | 10M |
| Llama 3.3 70B open | Meta | 14 | 92.4 tok/s | 6.8s | $0.58 | $0.71 | 128K |
| GPT-4o Mini | OpenAI | 13 | 37.8 tok/s | 16.4s | $0.15 | $0.60 | 128K |
| Nova Lite | Amazon | 13 | 111.2 tok/s | 5.3s | $0.06 | $0.24 | 300K |
| Llama 3.1 8B open | Meta | 12 | 154.0 tok/s † | 4.2s | $0.10 † | $0.10 | 128K |
| Llama 3.1 70B open | Meta | 12 | 32.2 tok/s † | 17.1s | $0.56 † | $0.56 | 128K |
| Mistral Large open | Mistral | 10 | 51.0 tok/s | 10.8s | $4.00 | $12.00 | 128K |
| Mistral Small open | Mistral | 10 | 154.3 tok/s | 3.8s | $0.20 | $0.60 | 128K |
| Gemma 3 27B open | Google | 10 | 30.8 tok/s | 18.2s | $0.00 † | $0.00 | 128K |
| Mistral Medium | Mistral | 9 | 90.7 tok/s | 6.8s | $2.75 | $8.10 | 128K |
| C-Bulk | |||||||
| GPT-5 Nano reasoning | OpenAI | 27 | 135.8 tok/s | 111.6s | $0.05 | $0.40 | 128K |
| Phi-4 open | Microsoft | 10 | 7.2 tok/s | 73.2s | $0.13 | $0.50 | 16K |
Grade Definitions
A-Instant
Frontier capability at extreme throughput. No current model occupies this grade. It may never be occupied. Frontier intelligence and frontier speed pull in opposite directions — the path to the top of either dimension is specialization, not generalization. For a model to land here, money would have to buy a way out of that tradeoff, and there would have to be enough demand to justify the cost. Since capability thresholds rise with each vintage, a model would need to lead on both dimensions simultaneously against a moving target. The grade exists in the framework for completeness, not as a prediction.
A-Fast
Frontier capability, interactive speed. The best models available, delivering a useful response (TT500) within 30 seconds. Use when the task is genuinely hard and someone is waiting for the answer.
A-Bulk
Frontier capability, latency-tolerant. The same top-tier intelligence, but through slower endpoints — extended reasoning with high thinking overhead, batch processing, or models that take more than 30 seconds to produce 500 tokens. Use when you need the best answer and can wait for it.
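As a rough mental model (an assumption for illustration, not the index's measurement method), TT500 decomposes into first-token latency plus tokens generated at the steady-state rate; reasoning models overshoot the 30-second line because thinking tokens are produced before the visible answer. A sketch, with illustrative latency and token counts:

```python
def est_tt500(ttft_s: float, tok_per_s: float, hidden_reasoning_tok: int = 0) -> float:
    """Rough TT500 estimate: first-token latency, plus any hidden reasoning
    tokens, plus 500 visible tokens, all at the steady-state rate.
    (Illustrative model only; the index reports measured TT500.)"""
    return ttft_s + (hidden_reasoning_tok + 500) / tok_per_s

# A ~41 tok/s model with ~2s first-token latency lands comfortably in Fast:
print(est_tt500(2.0, 41.3))
# The same latency with heavy hidden reasoning pushes well past 30s -> Bulk:
print(est_tt500(2.0, 63.5, hidden_reasoning_tok=6000))
```

This is why a model can have respectable raw throughput and still grade as Bulk: the thinking overhead dominates the clock.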
B-Instant
Production capability at extreme throughput. Capable enough for most tasks, fast enough (≥ 200 tok/s) to power high-volume realtime pipelines. A single serving instance handles significantly more traffic than a standard deployment.
B-Fast
Production capability, interactive speed. The workhorse grade. These models handle the vast majority of real-world tasks competently and deliver a response within 30 seconds. The most competitive grade by provider count and the widest price range. Where most inference is purchased.
B-Bulk
Production capability, latency-tolerant. Reasoning models and batch endpoints at the production tier. Models whose extended thinking pushes TT500 above 30 seconds. Good for complex tasks that benefit from chain-of-thought but don’t need frontier intelligence.
C-Instant
Efficient models at extreme throughput. The fastest models in the market, optimized purely for speed and cost. Use for high-volume classification, routing, embedding, or any pipeline where throughput is the binding constraint.
C-Fast
Efficient models, interactive speed. Good enough and fast enough for simple interactive tasks — summarization, extraction, simple Q&A, chatbot scaffolding. The cheapest interactive inference available.
C-Bulk
Efficient models, latency-tolerant. Budget models that don’t meet the Fast speed spec. The cheapest inference available.
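The grade boundaries above reduce to two lookups: a capability threshold on the intelligence score (A ≥ 45, B ≥ 28, else C, per the Edition 5 methodology note) and a speed class (Instant at ≥ 200 tok/s output throughput, Fast at TT500 ≤ 30 s, otherwise Bulk). A minimal sketch of that mapping (the function name and signature are illustrative, not part of the index):

```python
def grade(intelligence: float, tok_per_s: float, tt500_s: float) -> str:
    """Map a model's benchmark numbers to a grade such as 'B-Fast'.

    Thresholds follow the index methodology: capability A >= 45, B >= 28;
    speed Instant >= 200 tok/s, Fast = TT500 <= 30 s, otherwise Bulk.
    """
    tier = "A" if intelligence >= 45 else "B" if intelligence >= 28 else "C"
    if tok_per_s >= 200:
        speed = "Instant"
    elif tt500_s <= 30:
        speed = "Fast"
    else:
        speed = "Bulk"
    return f"{tier}-{speed}"

print(grade(46, 41.3, 14.0))   # Claude Opus 4.6 -> A-Fast
print(grade(33, 268.4, 2.8))   # GPT-OSS 120B   -> B-Instant
print(grade(41, 15.2, 195.6))  # o3-pro         -> B-Bulk
```

Spot-checking against the table reproduces each model's published grade, including the C-Instant placement of Gemini 2.5 Flash (220.4 tok/s at Intelligence 21).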
Common Comparisons
How models compare across grades, providers, and price points. Each comparison uses the grades and data from the current edition.
Claude Opus 4.6 (A-Fast) vs Gemini 3.1 Pro (A-Bulk) for reasoning tasks
Both are frontier models, but they land in different speed classes. Gemini 3.1 Pro Preview scores highest in the index at Intelligence 57, versus 46 for Claude Opus 4.6 (non-adaptive) or 53 for the adaptive reasoning variant. However, Gemini's reasoning overhead pushes TT500 to 46.5 seconds, placing it in A-Bulk rather than A-Fast. Claude Opus 4.6 delivers a response in 14.0 seconds (28.7s in adaptive mode) — genuinely interactive. The price trade-off: Opus costs $5.00/$25.00 per MTok versus Gemini's $2.00/$12.00. Choose Opus when someone is waiting for the answer; choose Gemini when you need the highest intelligence score and can tolerate the latency.
Claude Opus 4.6 vs GPT-5.3 Codex for coding and deep reasoning
GPT-5.3 Codex scores Intelligence 54 — second-highest in the index — but its extended reasoning pushes TT500 to 104.9 seconds, making it A-Bulk. Claude Opus 4.6 (Adaptive) scores 53 with a TT500 of 28.7 seconds, keeping it in A-Fast. Codex is substantially cheaper at $1.75/$14.00 versus Opus's $5.00/$25.00. For asynchronous coding tasks and batch evaluation, Codex's higher intelligence and lower price make it compelling. For interactive coding assistants where a developer is waiting, Opus's sub-30-second response is the practical differentiator.
Claude Sonnet 4.6 (B-Fast) vs Claude Opus 4.6 (A-Fast) — when to upgrade
Both are from Anthropic. Claude Sonnet 4.6 scores Intelligence 44 (B-tier) at $3.00/$15.00 per MTok. Claude Opus 4.6 scores 46 (A-tier) at $5.00/$25.00. The intelligence gap is small — just 2 points — but it crosses the A/B tier boundary at 45. For most production workloads, Sonnet at B-Fast delivers comparable results at 40% lower cost. Opus is worth the premium for tasks where the marginal intelligence improvement matters: complex multi-step reasoning, novel problem-solving, or agentic workflows where small accuracy gains compound across steps.
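To make the 40% figure concrete, here is a quick per-request cost comparison under an assumed workload (the 2,000-input / 800-output token counts are illustrative, not from the index):

```python
def request_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Dollar cost of one request; prices are in $ per million tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Assumed workload: 2,000 input tokens, 800 output tokens per request.
sonnet = request_cost(2_000, 800, 3.00, 15.00)
opus   = request_cost(2_000, 800, 5.00, 25.00)
print(f"Sonnet ${sonnet:.4f} vs Opus ${opus:.4f}, savings {1 - sonnet/opus:.0%}")
```

Because Sonnet's input and output prices are both 40% below Opus's, the savings is 40% at any input/output mix; the token counts only scale the absolute dollars.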
o3 (B-Fast) vs Claude Opus 4.6 (A-Fast) — OpenAI reasoning vs Anthropic frontier
OpenAI's o3 is a reasoning model that scores Intelligence 38 (B-tier) with a TT500 of 23.0 seconds at $2.00/$8.00 per MTok. Claude Opus 4.6 scores 46 (A-tier) with a TT500 of 14.0 seconds at $5.00/$25.00. Opus is both smarter and faster, but costs 2.5–3× more. For tasks where B-tier intelligence is sufficient — structured extraction, content generation, standard coding — o3 offers strong value. For tasks that genuinely require frontier capability, the tier difference is real.
Qwen3.5-397B (A-Fast) vs Claude Opus 4.6 (A-Fast) — open-weight frontier vs closed
Both are A-Fast, but with dramatically different pricing. Qwen3.5-397B-A17B, Alibaba's open-weight MoE model, scores Intelligence 45 at $0.60/$3.60 per MTok. Claude Opus 4.6 scores 46 at $5.00/$25.00 — roughly 8× more expensive on input. Qwen is also faster at 65.5 tok/s versus Opus's 41.3 tok/s. The trade-off: Opus has a 1M context window versus Qwen's 131K, and Anthropic's safety and compliance infrastructure may matter for regulated use cases. For cost-sensitive frontier workloads, Qwen and the other open-weight A-tier models (GLM-5 at $1.00, Kimi K2.5 at $0.60) are reshaping the economics of the A-tier.
DeepSeek V3.2 vs Claude Haiku 4.5 — budget B-Fast showdown
Both sit in B-Fast, but DeepSeek V3.2 is the price outlier of the entire index. At $0.28/$0.42 per MTok with Intelligence 32, it delivers B-tier capability at a price below most C-tier models. Claude Haiku 4.5 scores Intelligence 31 at $1.00/$5.00 — nearly 4× more expensive on input and 12× on output. Haiku is faster (87.9 tok/s vs 26.8 tok/s, TT500 6.3s vs 20.6s) and has a wider ecosystem through Anthropic's API. DeepSeek is open-weight, enabling self-hosting. For high-volume B-tier workloads where cost dominates, DeepSeek V3.2 is hard to beat.
Gemini 3 Flash (B-Fast) vs Claude Sonnet 4.5 (B-Fast) — the workhorse tier
The B-Fast grade is the most competitive, with 14 models from nine providers. Gemini 3 Flash scores Intelligence 35 at $0.50/$3.00 with 131.6 tok/s throughput — among the fastest in B-Fast. Claude Sonnet 4.5 scores 37 at $3.00/$15.00 — 6× more expensive on input. Both have 1M context windows. For latency-sensitive agent workflows that need rapid tool calling, Gemini 3 Flash's speed advantage matters. For tasks where the 2-point intelligence edge justifies the premium, Sonnet 4.5 remains a strong choice.
o3-pro (B-Bulk) vs Claude Sonnet 4.6 Adaptive (A-Bulk) — premium reasoning
OpenAI's o3-pro is the most expensive model in the index at $20.00/$80.00 per MTok, scoring Intelligence 41 (B-tier) with a TT500 of 195.6 seconds. Claude Sonnet 4.6 (Adaptive) scores Intelligence 52 (A-tier) at $3.00/$15.00 with TT500 of 113.4 seconds. Sonnet Adaptive is both smarter and cheaper, by a wide margin: it delivers A-tier intelligence at less than a sixth of the input price and under a fifth of the output price. o3-pro's value proposition depends on whether its reasoning approach — extremely deep chain-of-thought — produces better results on specific tasks despite the lower composite score.
What Changed
- 2026-03 Revised speed methodology: TT500 for Fast/Bulk (30s), output throughput for Instant (200 tok/s). Added Instant speed class. Capability thresholds adjusted: A ≥ 45, B ≥ 28. 50 models across 8 of 9 grades. DeepSeek R1 removed — no current speed data available from Artificial Analysis.
- 2026-03 Added structured data, meta tags, sitemap, and semantic markup for search engine and AI discoverability.
- 2026-03 Published design principles. Made edition changelog visible on Index page.
- 2026-03 Added A/B spread metric. Current spread: 4.0× (median A-tier vs. B-tier input price).
- 2026-03 Initial publication. 22 models across six grades.