The Case for Inference Grades
A framework for standardizing how we talk about AI compute — and an invitation to improve it.
When oil is traded, nobody buys “a barrel of oil.” They buy West Texas Intermediate Light Sweet Crude, or Brent, or medium sour Dubai Crude. Each name communicates a precise profile — density, sulfur content, delivery point — that determines what the oil can be used for and what it’s worth. A refinery optimized for light sweet crude can’t just swap in heavy sour and expect the same output. The names exist because the differences matter, and because a market can’t function without shared vocabulary for what’s being traded.
AI inference is getting there. It already has real, consequential variation — but no shared language for describing it.
When a developer says “I need a model for this step,” they’re describing a requirement in terms that are specific to a single provider and a single moment in time. “I’ll use Sonnet 4.5 for the analysis step.” But what they actually mean is something more abstract: “I need capable, fast inference for this step — whatever model currently offers the best value at that level.” The model name encodes a bundle of properties — capability, speed, price, context window, modality — that the developer has to unbundle and evaluate manually, every time, across every provider.
This is friction. And in a market where prices dropped 60–80% in 2025 alone, where new models launch monthly and providers change pricing several times a year, that friction is increasingly expensive.
Inference Grades are a proposal for reducing it.
What a Grade Is
An Inference Grade is a minimum specification that classifies a model’s inference along two dimensions: how capable it is, and how fast it delivers a useful response. A grade is a floor, not a bucket — a model graded B-Fast meets or exceeds the B-Fast spec on every dimension. A model that exceeds a higher grade’s spec gets the higher grade. Just as there’s no such thing as “too low sulfur” for crude oil, there’s no such thing as “too fast” or “too smart” for inference.
The capability dimension captures how sophisticated the model’s output is — its reasoning depth, accuracy on hard problems, coding ability, and reliability on complex tasks. We define three tiers:
Tier A — Frontier. Models that achieve top-cluster scores on the hardest available benchmarks. These are the models you use when the task is genuinely difficult — complex multi-step reasoning, novel problem-solving, tasks where the cost of a wrong answer exceeds the cost premium of the model. Opus-class, GPT-5.2+, Gemini 3.1 Pro.
Tier B — Production. Models that handle the vast majority of real-world tasks competently. They may not match Tier A on the hardest reasoning benchmarks, but they’re reliable, fast, and cost-effective workhorses for agent workflows, content generation, structured extraction, and code generation. This is deliberately the broadest tier, because it’s where most inference is purchased. Sonnet-class, Gemini Flash, DeepSeek V3.2.
Tier C — Efficient. Models optimized for speed and cost over capability ceiling. Suitable for classification, routing, summarization, simple extraction, and high-volume batch processing — tasks where the model is good enough and the priority is throughput or cost. Flash-Lite-class, Llama-class, Nano-class.
The speed dimension captures how quickly the model delivers a useful response. We define three classes using two metrics: TT500 (time to deliver 500 tokens, combining latency and throughput) and output speed (tokens per second).
Instant — speed is the product. Output speed ≥ 200 tokens per second. These models power voice agents, realtime pipelines, and high-volume classification. The buyer is selecting for throughput.
Fast — fast enough. TT500 ≤ 30 seconds. The model delivers a substantive response while the buyer is still paying attention. Chatbots, coding assistants, agent workflows. This is what you need when a human or another agent is waiting.
Bulk — latency-tolerant. TT500 > 30 seconds. The buyer is optimizing intelligence per dollar, not speed. Batch APIs, offline analysis, complex reasoning tasks. You use Bulk when the response can wait — whether that’s 40 seconds or 24 hours.
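These thresholds can be stated mechanically. A minimal sketch, using exactly the two metrics and boundaries defined above (the function name and signature are illustrative, not part of any spec):

```python
def speed_class(tt500_seconds: float, output_tok_per_sec: float) -> str:
    """Classify a model's speed using the thresholds defined above.

    Classes are checked from fastest to slowest, because each class
    is a floor: a model that clears Instant also clears Fast.
    """
    if output_tok_per_sec >= 200:
        return "Instant"   # speed is the product
    if tt500_seconds <= 30:
        return "Fast"      # substantive response while the buyer waits
    return "Bulk"          # latency-tolerant

# A model streaming 250 tok/s classifies as Instant; one that takes
# 18 s to deliver 500 tokens at 40 tok/s classifies as Fast.
```

Note that checking Instant first encodes the floor semantics: a 250 tok/s model trivially satisfies the Fast TT500 bound as well, and gets the higher class.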
These two dimensions combine into nine grades:
| | Instant | Fast | Bulk |
|---|---|---|---|
| A | A-Instant | A-Fast | A-Bulk |
| B | B-Instant | B-Fast | B-Bulk |
| C | C-Instant | C-Fast | C-Bulk |
When a developer says “I need B-Fast for the orchestration step and C-Instant for the classification pipeline,” everyone in the conversation immediately understands what class of compute is being described — without referencing a specific provider, model name, or pricing page.
A note on the speed axis. The Instant tier was added in Edition 5 after data analysis showed a natural gap in the model population at around 200 tokens per second output speed. Models above this threshold — Gemini Flash-Lite variants, GPT-OSS, Nova Micro — are architecturally distinct from models below it: smaller, highly optimized for throughput, and deployed for use cases where speed is the primary selection criterion. The 200 tok/s boundary is reviewed with each edition and will move as serving infrastructure evolves. The notation remains forward-compatible: any existing reference to “B-Fast” is still valid and still means what it always meant — production-grade capability at interactive speed.
Why Grades Exist Separately from Price
Price is deliberately excluded from the grade definition. A grade tells you what you’re buying. The index tells you what it costs today.
This separation is important for the same reason oil grades are defined by physical properties, not by price. WTI is defined as light sweet crude from specific fields — a specification. Its price moves daily on the exchange, but the definition stays put. If WTI’s definition changed every time the price moved, it would be useless as a reference.
Inference prices are even more volatile. Major providers changed pricing two to four times in 2025. DeepSeek entered the market at prices 10–90x below incumbents. Batch discounts, caching discounts, context-length thresholds, and negotiated enterprise rates mean the “real” price of a model is different from its list price, and different for every buyer.
The grade lets you talk about what you need without coupling it to what it costs right now. “This step needs B-Fast” is a requirement that stays stable even as the cheapest B-Fast model changes from month to month. Your orchestration logic specifies grades; the market tells you which provider currently offers the best value in each grade.
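To make that concrete, here is a minimal sketch of grade-based selection. The providers, models, and prices are invented for illustration, and the floor semantics (a grade is a minimum, so an A-Instant model satisfies a B-Fast requirement) follow the definition given earlier:

```python
TIER_RANK = {"A": 2, "B": 1, "C": 0}
SPEED_RANK = {"Instant": 2, "Fast": 1, "Bulk": 0}

def satisfies(model_grade: str, required: str) -> bool:
    """Grades are floors: an A-Instant model satisfies a B-Fast requirement."""
    mt, ms = model_grade.split("-")
    rt, rs = required.split("-")
    return TIER_RANK[mt] >= TIER_RANK[rt] and SPEED_RANK[ms] >= SPEED_RANK[rs]

# (model, grade, illustrative price per million output tokens)
LISTINGS = [
    ("provider-x/model-1", "B-Fast", 3.00),
    ("provider-y/model-2", "A-Instant", 12.00),
    ("provider-z/model-3", "C-Instant", 0.40),
]

def cheapest(required: str, listings=LISTINGS):
    """Lowest-priced listing that meets or exceeds the required grade."""
    qualifying = [l for l in listings if satisfies(l[1], required)]
    if not qualifying:
        raise LookupError(f"no listing currently meets {required}")
    return min(qualifying, key=lambda l: l[2])
```

The orchestration layer calls `cheapest("B-Fast")` and never hard-codes a model name; when a cheaper B-Fast listing appears next month, the same requirement resolves to a different model.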
This separation also surfaces things that a pure price table can’t show you. When a model like DeepSeek V3.2 achieves B-tier capability at a price point below most C-tier models, that’s visible precisely because the grade (based on capability) and the price (based on the provider’s business model) are tracked independently. You can see the arbitrage.
The Intelligence Spread
Separating grade from price produces a market indicator worth calling out: the spread between grades — particularly the A/B spread, which measures how much more the market charges for frontier intelligence over production-grade capability. This isn’t just interesting for people who trade inference. It’s a macro signal.
When the A/B spread is wide — frontier inference costing 50x or 100x more than production-grade — the market is telling you that top-tier intelligence is scarce. Only one or two providers can deliver it. The capability gap between the best models and the rest is large. New capabilities exist but haven’t diffused, and the companies that can afford frontier compute have a real edge over those that can’t.
When the spread is narrow — 2x or 3x — frontier capability has been replicated. Multiple providers compete at the top tier, the economic moat around the best models has eroded, and competitive advantage shifts from having access to intelligence to what you do with it.
The trajectory of this spread over time would be, in effect, a quantitative measure of how fast AI capability is commoditizing — a question that today gets answered by vibes and anecdotes. A CEO deciding whether to build or wait, an investor valuing an AI-native company, a CTO choosing between frontier models and the production tier — each of them is implicitly betting on where the A/B spread is heading. The grade system just makes that bet visible.
Why These Two Dimensions and Not Others
A natural objection: inference varies along many more dimensions than capability and speed. Context window matters. Modality matters. Privacy and compliance guarantees matter. Availability SLAs matter. Data sovereignty matters. Why reduce all of this to a 3×3 grid?
Because we’re building a vocabulary, not a specification. The vocabulary needs to be simple enough that people actually use it — in Slack messages, in architecture docs, in pull request comments. “B-Fast” is something a person says to another person.
The two dimensions we chose are the ones that most directly determine whether a model is suitable for a given task. A high-capability slow model and a low-capability fast model are not substitutes for each other, even at the same price. Capability and speed define the fundamental classes of inference demand. Everything else — context window, modality, assurance, availability, sovereignty — varies within a grade and matters for specific use cases, but doesn’t define the class.
The strongest form of this objection isn’t “you need seven dimensions” — it’s “you need three.” The most natural candidate for a third axis is context capacity, since an 8K-context model and a 1M-context model at the same capability tier are not substitutes in any meaningful sense. We take this seriously and nearly included it. But context requirements are more task-specific than capability or speed requirements — a model that needs 200K context for one workflow might need 8K for another, and the same developer will use both in the same system. Capability tier and speed class tend to be properties of a workflow step; context window tends to be a property of a specific call. Including it in the grade would make every grade assignment conditional on the task, which defeats the purpose of a shared vocabulary. We track context capacity as a property of each model listing and expect it to influence model selection within a grade, but not to define the grade itself.
Oil grades work the same way. WTI is defined by density and sulfur content — two dimensions — not by every chemical property of the crude. Other properties (viscosity, pour point, metal content) matter for specific refining applications and are documented in the full assay, but the grade name captures the two properties that most determine what the crude can be used for and what it’s worth. The full assay exists for settlement and certification. The grade name exists so people can talk to each other.
Our full model listings include context window, modality support, and pricing details. The grade is the layer on top of that data.
The Dimensions We Track But Don’t Grade (Yet)
We document the following properties for each model in the index, but they’re not part of the grade label — at least not yet.
Reasoning mode. This is the big one we’re not sure about yet. Models like OpenAI’s o3, DeepSeek-R1, and Claude with extended thinking can allocate variable amounts of internal compute to a single request — “thinking” through a problem before responding. A model running in extended reasoning mode may consume 10–50x more tokens than the same model answering the same question without it, and that additional compute is often what separates A-tier performance from B-tier on hard tasks. The uncomfortable implication: a single model can deliver either A-Bulk or B-Fast depending on a parameter the user controls. We considered making reasoning mode part of the grade — something like B-Fast-R — but decided against it because reasoning modes are still evolving rapidly and lack standardized measurement. What we can say is that any grade assignment must specify whether the model was evaluated with reasoning enabled or disabled, and the index documents this for each listing. As reasoning modes stabilize, this is probably the first dimension to be promoted into the grade label.
Context capacity. The maximum context window available per request. This varies from under 8K tokens to over 1M tokens, and models at the same capability tier may have very different context limits. Context capacity affects pricing non-linearly — Anthropic, for instance, doubles per-token rates above 200K tokens — and it determines what tasks a model can handle in a single call. We track it as a property of each model listing rather than a grade dimension because context requirements are highly task-specific and don’t map cleanly to a small number of tiers.
Modality. What types of input and output the model supports — text only, text plus images, audio, video, tool use, code execution. Multi-modal inference requires more complex model architectures and serving infrastructure, making it more expensive. We track it as a capability property rather than a grade dimension because the current grades are specifically for text inference, which represents the vast majority of API usage. Multi-modal grades may follow.
Assurance level. The privacy, security, and compliance guarantees attached to the inference — ranging from shared infrastructure with minimal guarantees to dedicated tenancy with HIPAA/SOC2/FedRAMP certification to air-gapped sovereign execution. Assurance dramatically affects pricing: dedicated, compliant infrastructure can cost 5–10x more than standard shared endpoints. But assurance is set by the buyer’s regulatory environment, not by the model’s capability, so it belongs in the deployment spec rather than the grade.
Availability. Uptime SLAs and burst capacity. Ranges from best-effort preemptible capacity (the “spot instances” of inference) to 99.99% uptime with contractual penalties. High-uptime guarantees require reserved, dedicated capacity, which is why they cost more — but again, this is about the serving infrastructure, not the model.
Sovereignty. Where the compute physically runs. Data residency regulations (GDPR, PIPL, DPDP Act) make this critical for cross-border procurement. Anthropic already charges a 10% premium for US-only routing. Regional constraints limit competition and create zone-specific pricing, but sovereignty is a constraint on where you buy, not what you buy.
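Taken together, a model listing might look like the following sketch — the field names and values are illustrative, not a published schema:

```python
from dataclasses import dataclass

@dataclass
class ModelListing:
    """One entry in the index: the grade is the label, the rest are
    tracked properties that vary within a grade."""
    model: str
    grade: str                 # e.g. "B-Fast"
    reasoning_evaluated: bool  # was the grade assigned with reasoning enabled?
    context_tokens: int        # max context window per request
    modalities: tuple          # e.g. ("text",) or ("text", "image")
    assurance: str             # e.g. "shared", "dedicated", "sovereign"
    region: str                # where the compute physically runs

listing = ModelListing(
    model="example/model",
    grade="B-Fast",
    reasoning_evaluated=False,
    context_tokens=200_000,
    modalities=("text",),
    assurance="shared",
    region="us",
)
```

The point of the shape: everything below `grade` influences model selection within a grade, but none of it changes what the grade label means.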
Any of these could eventually become part of the grade label. But adding dimensions before people are comfortable with the basic vocabulary would kill adoption. Better to earn the complexity.
Where This Goes: From Vocabulary to Market Infrastructure
What we’ve defined so far is a vocabulary — a shared set of names for classes of inference. A developer can start using grade names in architecture discussions tomorrow without installing anything or signing up for anything.
But vocabulary is only the first layer of what a commodity market needs. The history of other commodity markets suggests several additional layers that tend to develop over time. We don’t know which of these will materialize for inference. But the vocabulary design has to be good enough to support them if they do.
The assay laboratory: who grades the grades. If grades are to be trusted, they cannot be self-reported. A provider claiming “B-Fast” when independent evaluation shows “C-Fast” would corrode the entire system — just as an oil producer misrepresenting sulfur content would undermine a commodities exchange. Commodity markets solved this with independent assay laboratories: organizations like ASTM International that define the test methods and certify the results, separate from both buyer and seller.
Inference grading needs an equivalent institution. Current capability grades are based on public benchmarks — composite scores from evaluations like GPQA Diamond, SWE-bench, MATH, LiveCodeBench, and the Artificial Analysis Intelligence Index. These are useful but imperfect: benchmarks can be gamed, they don’t reflect real-world task performance, and they’re periodically contaminated. A mature grading system would require an independent evaluation authority that continuously tests models against a standardized task suite — one that is refreshed regularly to resist contamination, that tests via black-box API calls rather than relying on provider self-reporting, and that publishes results transparently. Organizations like Artificial Analysis, LMSYS, and SEAL are early versions of this function. Whether the assay role is filled by one of these, by a standards body, or by a decentralized protocol is an open question — but the function itself is non-negotiable. Without independent measurement, grades are just opinions.
The vintage problem. Inference has a property that oil doesn’t: capability depreciates in relative terms. A model that is Tier A today — the frontier — may be Tier B in twelve months as more capable models arrive. This creates a question for any specification that references grades: does “A-Fast” mean “whatever is frontier at the time of delivery” or “the specific capability level that was frontier when the spec was written”?
The cleanest solution is a vintage suffix: a year code appended to the grade. B-Fast-26 means “B-tier capability as defined by the 2026 benchmark suite, delivered at Fast speed.” A contract can then specify either a fixed vintage (A-Fast-25, pegged to last year’s frontier — now available at a discount as capability has advanced) or a floating grade (A-Fast, always the current year’s tier, at a premium). The notation is still sayable. And the spread between A-Fast-25 and A-Fast-26 directly measures how much new capability entered the market in a year — making capability inflation visible as a price.
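The notation also stays machine-checkable. A minimal parser for the grade label with its optional vintage suffix (the regex and field names are this sketch's own, not a published grammar):

```python
import re

# Grade notation: TIER-SPEED[-VINTAGE], e.g. "B-Fast" or "A-Fast-25".
GRADE_RE = re.compile(
    r"^(?P<tier>[ABC])-(?P<speed>Instant|Fast|Bulk)(?:-(?P<vintage>\d{2}))?$"
)

def parse_grade(label: str) -> dict:
    """Split a grade label into tier, speed class, and optional vintage."""
    m = GRADE_RE.match(label)
    if not m:
        raise ValueError(f"not a valid grade label: {label!r}")
    d = m.groupdict()
    d["floating"] = d["vintage"] is None  # no vintage = current year's tiers
    return d

parse_grade("B-Fast")      # floating grade: always the current benchmark suite
parse_grade("A-Fast-25")   # fixed vintage: pegged to the 2025 frontier
```

Because the vintage is an optional suffix, every label written before the suffix existed still parses — which is what forward-compatible notation means in practice.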
This distinction doesn’t matter much today — nobody is writing inference futures. But it matters the moment any buyer signs a contract longer than six months that specifies a grade, and it matters for the framework’s credibility: a grading system that can’t account for the most obvious objection (“but A-tier keeps changing”) isn’t ready for serious use. The vintage suffix shows that the framework has a principled answer, even if the mechanism isn’t needed yet.
Fungibility and substitution. Within a grade, models from different providers should be roughly substitutable — that’s the whole point of the grading system. A buyer who needs B-Fast should be able to use Claude Sonnet 4.6 or Gemini 3 Flash interchangeably for that step, choosing based on price, and get comparable results. In practice, models at the same capability tier have different strengths and weaknesses — one may be better at coding, another at analysis, another at creative writing. The grade captures the general class; the specific model selection within a grade is still a decision the buyer makes. As measurement improves, sub-grades or specialization labels (B-Fast-Code, B-Fast-Analysis) might emerge. But premature granularity is worse than useful coarseness.
Derivatives and forward pricing. In mature commodity markets, the standardized grade enables derivatives: futures contracts, options, and swaps that let buyers hedge against price volatility. An AI-native company spending $500K/month on inference has real exposure to price changes. If B-Fast inference costs $3/MTok today but might cost $6/MTok or $1.50/MTok in six months, the ability to lock in a forward price has genuine value. This requires two preconditions: a standardized grade that defines what’s being traded, and a liquid spot market with transparent pricing. The grades are a step toward the first precondition. Whether the second develops — and whether inference derivatives ever make economic sense — is a question for the market, not for us.
None of the above is a prediction. It’s context for why the vocabulary design matters even if the higher layers never materialize.
Why We’re Publishing This and What We’re Asking For
This framework is a proposal, not a standard. Standards emerge from adoption, not declaration. We’re publishing the grades, the methodology, and the reasoning behind both because we think the inference market needs a shared vocabulary and the only way to find out if this is the right one is to put it in front of the people who would use it.
We expect to be wrong about some of the specifics. Some models are probably misgraded. Some boundaries may be in the wrong place — the tier definitions, or the speed class thresholds (TT500 ≤ 30 seconds for Fast, ≥ 200 tok/s for Instant). And the two-dimension framework may be missing something critical.
We want to hear about all of it. The framework gets better when people push back on it. If you think DeepSeek V3.2 belongs in a different grade, make the case. If you’re building voice agents and think the Instant boundary should be 150 or 250 tok/s — tell us where you’d draw the line and why. If you think we’re missing a dimension that’s critical for how you select models, tell us what it is and why.
The best outcome is not that everyone agrees with every grade assignment. It’s that people start using the vocabulary — arguing about whether a model is B-Fast or A-Bulk, debating where the tier boundaries should fall, referencing grades in their own writing and conversations. Disagreement about specifics within the framework is adoption of the framework. That’s the only consensus we need.
So try it. Use “B-Fast” in your next architecture doc. Specify “A-Bulk” in your next agent workflow. See what breaks, see what clicks, and tell us.
You can contribute to the development of Inference Grades in several ways. For grade assignment disputes, methodology discussion, and proposals for new dimensions, contact us at p@karal.no. For informal discussion, questions, and sharing, find us on X/Twitter at @infgrades.