Methodology

How grades are assigned, where the data comes from, and where the judgment calls are.

How Grades Work

An Inference Grade is a minimum specification, not a classification bucket. A model graded B-Fast is guaranteed to meet or exceed the B-Fast spec on capability, speed, and latency simultaneously. A model that exceeds the spec for a higher grade is assigned the higher grade — so a model that qualifies for both B-Fast and A-Fast is graded A-Fast. A model that fails any dimension of a grade’s spec receives the next grade down that it fully qualifies for.

This is the same logic as commodity grades. A barrel of WTI Light Sweet Crude meets a spec for density and sulfur content. Oil that exceeds the spec is still WTI — there’s no penalty for being better than the minimum. Oil that fails the spec on any dimension is a different grade. And just as there’s no such thing as “too low sulfur” for crude oil, there’s no such thing as “too fast” or “too smart” for inference. A higher-grade model could always be sold as a lower grade if the vendor chose to.

A grade has two parts: a capability tier (A, B, or C) and a speed class (Instant, Fast, or Bulk). These combine into labels like B-Fast or A-Bulk.

Capability: Tiers A, B, and C

Capability tiers are based primarily on the Artificial Analysis Intelligence Index v4.0, an independently measured composite of ten evaluations spanning agents, coding, scientific reasoning, and general knowledge. We use this index because it is run independently (not self-reported by providers), updated regularly, and designed to resist benchmark saturation — the tendency for all frontier models to converge at the top of a test, rendering it uninformative.

Each tier is defined by a minimum Intelligence Index score:

Tier A — Frontier. Intelligence Index ≥ 45. These models represent the current limit of what’s possible — the tasks they handle well are tasks that no model handled well a year ago.

Tier B — Production. Intelligence Index ≥ 28. Models that handle the vast majority of real-world workloads reliably. This is deliberately the broadest tier because it’s where most inference is purchased.

Tier C — Efficient. Intelligence Index below 28. Models optimized for throughput and cost over capability ceiling. Good enough for classification, routing, summarization, and high-volume batch work.

These thresholds are reviewed with each edition and may shift as the Intelligence Index is recalibrated or the score distribution changes. When a model falls near a tier boundary, we look at where it clusters relative to its neighbors rather than mechanically applying the cutoff. If there’s a natural gap in the score distribution — say, a jump from 44 to 46 with nothing in between — we draw the line at the gap. When no gap exists and the call is genuinely ambiguous, we err toward the lower tier. A buyer who discovers their B-graded model performs like an A is pleasantly surprised; the reverse is a worse outcome.

The reasoning mode complication

Many models can run in both reasoning (chain-of-thought) and non-reasoning modes. The same model may score differently in each mode — Claude Sonnet 4.6 scores 52 with adaptive reasoning and 44 without. We grade the model at its best available setting and note the mode. Where this distinction is significant, we list both variants separately.

Speed: Instant, Fast, and Bulk

Speed classes reflect three distinct buyer needs. Rather than thresholding raw TTFT and output speed separately — which misclassifies reasoning models, whose extended thinking time inflates TTFT even when the total time to a useful answer is interactive — we use two composite metrics that capture what the buyer actually experiences.

The metrics

TT500 — Time to 500 tokens. Calculated as TTFT + (500 ÷ output speed in tok/s). This measures how long a buyer waits for a substantive response — roughly a paragraph, a code function, a useful summary. It combines latency and throughput into a single number that reflects the actual experience. A model that thinks for 14 seconds then generates at 55 tok/s delivers 500 tokens in 23 seconds. That’s interactive — the buyer gets a useful answer while they’re still paying attention.
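The TT500 arithmetic is simple enough to sketch directly. This is an illustrative snippet, not official tooling; the numbers reuse the worked example above (14 seconds of thinking, then 55 tok/s sustained).

```python
def tt500(ttft_s: float, output_tok_per_s: float) -> float:
    """Time to 500 tokens: latency to first token plus time to stream 500 tokens."""
    return ttft_s + 500 / output_tok_per_s

# Worked example from the text: 14 s TTFT, 55 tok/s sustained output.
print(round(tt500(14.0, 55.0), 1))  # → 23.1
```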

Output speed — tokens per second, sustained. This measures raw throughput, which determines how many concurrent requests a serving instance can handle. It matters most for buyers running high-volume pipelines where per-request latency is acceptable but aggregate throughput is the economic constraint.

The speed classes

Instant. Output speed ≥ 200 tokens per second. The model is fast enough that speed is a product feature, not just acceptable. These models power voice agents, realtime translation, game NPCs, autocomplete, tool calls inside agentic chains, and high-volume classification pipelines. The threshold is set at 200 tok/s because that’s where throughput advantage begins to meaningfully change deployment economics — roughly 2× the output of a typical production model.

Fast. TT500 ≤ 30 seconds and does not meet the Instant spec. The model delivers a substantive response within a time a human would consider interactive. Use cases: chatbots, coding assistants, agent workflows, content generation, interactive analysis. This is the “fast enough” tier — the buyer isn’t agnostic to speed, but won’t go to great lengths to get more of it.

Bulk. TT500 > 30 seconds. The buyer is optimizing for intelligence per dollar, not speed. Whether the model responds in 40 seconds or 24 hours doesn’t change the buying decision. Use cases: batch processing, offline analysis, complex reasoning tasks, any workflow where the API has an SLA measured in minutes or hours. Anthropic’s batch API offers a 24-hour turnaround at 50% off — a buyer choosing that endpoint is making a Bulk decision regardless of how fast the underlying model could run in interactive mode.

The 30-second threshold for Fast reflects a natural gap in the current model data and is reviewed with each edition.
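The three class definitions above reduce to a small decision function. The sketch below is an assumption-laden illustration (the function name is ours); the 200 tok/s and 30-second constants come straight from the definitions, and the two calls reproduce the reasoning-model and slow-streaming examples discussed in the next section.

```python
def speed_class(ttft_s: float, output_tok_per_s: float) -> str:
    """Instant requires raw throughput; Fast and Bulk are split by TT500."""
    tt500 = ttft_s + 500 / output_tok_per_s
    if output_tok_per_s >= 200 and tt500 <= 30:
        return "Instant"
    if tt500 <= 30:
        return "Fast"
    return "Bulk"

print(speed_class(17.0, 44.0))  # 17 s thinking, 44 tok/s → "Fast"
print(speed_class(0.5, 7.0))    # low TTFT but slow streaming → "Bulk"
```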

Why not just TTFT and tok/s separately?

Editions 1–4 defined Fast as “TTFT < 1 second and output speed > 80 tok/s.” This broke with the rise of reasoning models. A frontier model that spends 17 seconds thinking before producing output at 44 tok/s has a TT500 of about 28.4 seconds — genuinely interactive. But it fails both the TTFT and speed thresholds of the old definition. Conversely, a model with 0.5s TTFT but 7 tok/s output meets the old TTFT threshold but delivers a painfully slow experience — a TT500 of nearly 72 seconds.

TT500 avoids both problems. It measures the thing the buyer actually cares about: how long until I have a useful answer?

The Instant threshold uses raw tok/s rather than TT500 because at the top of the speed range, TTFT differences are negligible and the buyer’s differentiator is throughput capacity, not response time.

The Grade Specifications

A model is assigned the highest grade it qualifies for on all dimensions simultaneously.

Grade | Min Intelligence Index | Min Output Speed | Max TT500
A-Instant | 45 | 200 tok/s | 30s
A-Fast | 45 | — | 30s
A-Bulk | 45 | — | —
B-Instant | 28 | 200 tok/s | 30s
B-Fast | 28 | — | 30s
B-Bulk | 28 | — | —
C-Instant | — | 200 tok/s | 30s
C-Fast | — | — | 30s
C-Bulk | — | — | —

“—” means no minimum or maximum required. Every model qualifies for at least C-Bulk.
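Put together, grade assignment is a lookup against the spec table: the highest capability tier the score clears, combined with the fastest speed class the model meets. A minimal sketch under the thresholds in the table above (the function name is ours, not official tooling):

```python
def assign_grade(intelligence_index: float, ttft_s: float, output_tok_per_s: float) -> str:
    """Return the highest grade whose spec the model meets on every dimension."""
    tt500 = ttft_s + 500 / output_tok_per_s
    # Capability tier: highest tier whose minimum Intelligence Index is met.
    tier = "A" if intelligence_index >= 45 else "B" if intelligence_index >= 28 else "C"
    # Speed class: Instant by throughput, then Fast by TT500, else Bulk.
    if output_tok_per_s >= 200 and tt500 <= 30:
        speed = "Instant"
    elif tt500 <= 30:
        speed = "Fast"
    else:
        speed = "Bulk"
    return f"{tier}-{speed}"

print(assign_grade(52, 17.0, 44.0))  # frontier reasoning model → "A-Fast"
print(assign_grade(30, 0.3, 240.0))  # production tier, very high throughput → "B-Instant"
```

Note that every input falls through to at least C-Bulk, matching the guarantee that every model qualifies for some grade.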

What We Track But Don’t Grade

Price. The grade covers capability and speed, not cost. A B-Fast model at $0.04/MTok and a B-Fast model at $3.00/MTok are the same grade — the buyer comparison within a grade is primarily on price. Prices in the index reflect standard on-demand rates from the model’s first-party API. Batch discounts, prompt caching discounts, long-context surcharges, and negotiated enterprise rates are excluded because they depend on the buyer’s usage pattern.

For open-weight models available from multiple providers, we show the first-party API price where one exists. Where no first-party API exists, we note the cheapest widely available provider.

Context window. Ranges from 8K to 10M tokens across the index. Noted per model because context requirements are task-specific — the same developer may need 1M context for one workflow step and 8K for another.

Modality. Input/output types supported: text, image, audio, video, code execution. The current grades apply to text inference.

Open weights. Whether the model weights are publicly available. Affects deployment options and pricing.

Reasoning mode. Whether the model uses extended chain-of-thought. Noted because it dramatically affects both capability and cost per request.

Data Sources

All Intelligence Index scores are independently measured by Artificial Analysis. Self-reported scores are excluded. Speed measurements (TTFT, output tokens per second) come from Artificial Analysis, measured on the model’s first-party API over a rolling 72-hour window. Pricing comes from official provider pricing pages, checked with each edition.

We measure against first-party APIs (e.g., OpenAI for GPT models, Anthropic for Claude, Google for Gemini). Third-party hosting providers may deliver different speed characteristics for the same model — typically faster for open-weight models on specialized infrastructure, sometimes slower under contention. Our speed class reflects the canonical serving environment, not the best or worst available.

Edition Cadence

The index is published in numbered editions, triggered by major model launches, significant pricing changes, or accumulated minor updates. Tier boundaries and speed thresholds are reviewed with each edition and may be adjusted. The vintage notation (e.g. B-Fast-28 for “B-Fast as defined in Edition 28”) ensures that grades remain meaningful even as thresholds evolve.

What We Get Wrong

The Intelligence Index is a generalist measure. A model that scores B-tier on the composite may be A-tier for coding specifically, or C-tier for creative writing. The grade captures general capability, not task-specific performance. Buyers with narrow use cases should look at individual benchmark scores, not just the composite.

Speed measurements are point-in-time and load-dependent. A model that’s Fast today may slow under heavy demand. We use rolling averages to smooth this, but transient congestion can affect classifications near the boundary.

TT500 assumes a 500-token response, which is a reasonable average for interactive use. For very short outputs (tool calls, classifications), the metric underweights TTFT. For very long outputs (full documents), it underweights sustained throughput. The metric is a useful proxy, not a perfect one.

We grade models at their best available speed from the first-party API. A buyer who runs a model through a batch API at half price is voluntarily operating below the model’s graded speed class — the grade tells you what the model can do, not what you choose to use it for.

Tier boundaries are fuzzy by design. Close calls are inevitable and some of them will be wrong. When in doubt, we grade lower.

If you think we’ve graded a model wrong, we want to hear about it. Contact us at p@karal.no.