DataBubble · AI Glossary · 35 terms

The vocabulary
behind the rankings.

Databubble tracks AI models across dozens of benchmarks, pricing dimensions, and adoption signals. This page collects the terms that show up most often in our model pages, comparison views, and news feed — written for engineers, researchers, and decision-makers who want a clear and accurate definition, not marketing language. Definitions are grouped into five categories below and updated as the field moves.

[10] Benchmarks & Evaluation

Standardised tests we use to compare model capability.

AlpacaEval: AlpacaEval is an instruction-following benchmark that compares a model's responses to those of a fixed reference model using an LLM-based judge. The headline number is the win rate: the fraction of prompts on which the candidate is preferred over the reference. A length-controlled variant (LC win rate) discounts the well-documented bias toward longer responses, giving a cleaner comparison. AlpacaEval scores correlate reasonably well with chat-arena rankings at a fraction of the cost, which is why many labs publish them alongside formal benchmarks.
BBH (Big Bench Hard): BBH is a curated subset of 23 tasks pulled from the broader BIG-Bench suite. The tasks were selected because frontier language models, at the time, performed at or below the human average on them. They span multi-step reasoning, planning, logical deduction, and word manipulation — areas where simple next-token prediction tends to fall apart. Strong BBH performance is now a baseline expectation for any model that markets itself as a reasoning model, and chain-of-thought prompting reliably lifts scores by 10-30 percentage points.
GPQA: GPQA stands for Graduate-level Google-Proof Q&A. It contains 448 multi-choice questions in physics, biology, and chemistry, each written and validated by domain PhDs. The questions are designed so that a non-expert with unrestricted internet access still cannot reliably answer them — which makes the benchmark resistant to retrieval shortcuts and to simple memorisation of public web data. The Diamond subset of 198 questions is where the leaderboard race actually happens; scores above 60 percent are considered frontier-class as of 2026.
IFEval: IFEval evaluates a model's ability to follow precise, machine-verifiable instructions — output exactly N words, include a specific keyword, end the response with a given phrase, format the answer as JSON, and so on. Because every constraint can be checked programmatically, IFEval scores avoid the noise of LLM-as-judge benchmarks and are reproducible to the digit. A high IFEval score is a stronger signal of agent reliability than chat-arena ELO, which is why agent-product teams weight it heavily when picking a base model.
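To illustrate what "machine-verifiable" means in the IFEval entry, here is a minimal sketch of the kind of checks it describes. The function names are hypothetical and this is not the benchmark's actual checker code; it only shows that each constraint reduces to a deterministic test.

```python
import json

def has_exact_word_count(response: str, n: int) -> bool:
    # "Output exactly N words": whitespace-separated tokens are counted as words.
    return len(response.split()) == n

def includes_keyword(response: str, keyword: str) -> bool:
    # "Include a specific keyword": case-insensitive substring match.
    return keyword.lower() in response.lower()

def ends_with_phrase(response: str, phrase: str) -> bool:
    # "End the response with a given phrase": trailing whitespace is ignored.
    return response.rstrip().endswith(phrase)

def is_valid_json(response: str) -> bool:
    # "Format the answer as JSON": the whole response must parse.
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False
```

Because every check returns a plain boolean, two runs over the same responses always produce the same score, which is the reproducibility property the entry refers to.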
Intelligence Index (Artificial Analysis): The Intelligence Index is a composite score published by Artificial Analysis that aggregates several public benchmarks — typically MMLU, GPQA, MATH, HumanEval, and a handful of reasoning evals — into a single 0-100 number. The index is normalised against a frontier reference, so a score of 70 means roughly 70 percent of frontier-tier capability across the underlying suite. It is a convenient single-number summary for buyers comparing dozens of models, but no composite captures the full picture: always check the underlying benchmark scores when the use-case is narrow.
LMSYS Chatbot Arena ELO: Chatbot Arena, run by LMSYS, is a blind pairwise preference platform where users prompt two anonymous models side by side and vote for the better answer. Votes are aggregated into ELO ratings, the same system chess uses, where a 100-point gap implies the higher model wins about 64 percent of head-to-head matchups. Because the prompts come from real users rather than fixed test sets, Arena ELO is the closest thing the industry has to a real-world quality measure — but it conflates capability, style, and verbosity, so use it alongside task-specific benchmarks.
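The 64 percent figure in the Chatbot Arena entry follows from the standard chess rating formula. The sketch below assumes the classic logistic curve with a 400-point scale; the Arena's own fitting procedure may differ in detail, and the ratings used are made up.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A against B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A hypothetical pair of models separated by 100 rating points.
gap_100 = elo_expected_score(1300, 1200)
print(f"{gap_100:.2f}")  # prints 0.64
```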
Math (V2): The MATH dataset, in its second version, is a corpus of 12,500 competition-style mathematics problems sourced from AMC, AIME, and similar high-school olympiad contests. Each problem ships with a fully worked solution, so models can be evaluated either on final-answer accuracy or on the quality of the chain of reasoning. MATH is a stronger signal than grade-school benchmarks like GSM8K because the problems require multiple proof steps and creative algebraic manipulation; reasoning-tuned models can score above 90 percent, while general chat models tend to land in the 50-70 percent range.
MMLU-Pro: MMLU-Pro is the harder, cleaned-up successor to the original Massive Multitask Language Understanding benchmark. It contains around 12,000 multiple-choice questions across 14 subjects (law, medicine, engineering, philosophy, history, and more), with the answer set expanded from four to ten options to discourage guessing. Many of the noisiest and most ambiguous questions from the original were rewritten or removed. Frontier models score in the high 70s and low 80s; a 5-10 point lead on MMLU-Pro is a meaningful capability gap.
MUSR: MUSR — Multistep Soft Reasoning — is a benchmark of narrative reasoning tasks built around murder mysteries, object placements, and team allocations. Solving each puzzle requires the model to track entities across several paragraphs of natural-language story, combine soft constraints, and rule out incorrect candidates by elimination. Unlike MATH or HumanEval, the answers cannot be derived through symbolic manipulation alone — the test is whether the model can actually reason about a story. MUSR is one of the components in the Open LLM Leaderboard composite.
SWE-bench: SWE-bench measures whether a coding agent can resolve real GitHub issues. The original suite draws from 12 popular Python repositories: each task pairs an issue description with the repository state at the time and asks the model to generate a patch that passes the project's own hidden test suite. Several variants exist — bash-only, Verified, Lite, Multilingual, Test, and Multimodal — each scoping the problem differently. Databubble tracks the resolve-rate across all six leaderboards and keeps the highest figure per model.

[10] Architecture & Modeling

How modern AI models are built and adapted.

Context window: The context window is the maximum number of tokens a model can attend to in a single forward pass — its working memory. A 128,000-token context window corresponds to roughly 96,000 English words, or several short books. Longer windows enable workflows like whole-codebase reasoning and long-document Q&A without retrieval, but the compute cost of vanilla self-attention is quadratic in sequence length, which is why long-context models invest heavily in sparse-attention, sliding-window, or memory-token tricks. Effective long-context performance can lag the nominal window: many models advertise 128k but degrade past 32k.
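As a rough illustration of the quadratic cost mentioned in the Context window entry, the sketch below compares the number of pairwise attention scores at two sequence lengths. It assumes vanilla full self-attention and ignores everything else in the model, so treat it as an order-of-magnitude guide only.

```python
def attention_score_entries(seq_len: int) -> int:
    # Vanilla self-attention scores every position against every other position.
    return seq_len * seq_len

def attention_cost_ratio(long_tokens: int, short_tokens: int) -> float:
    # Relative attention cost of the longer sequence versus the shorter one.
    return attention_score_entries(long_tokens) / attention_score_entries(short_tokens)

print(attention_cost_ratio(128_000, 32_000))  # prints 16.0
```

Quadrupling the sequence length multiplies the attention work by sixteen, which is the pressure behind the sparse and sliding-window variants the entry lists.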
Fine-tuning: Fine-tuning is supervised training on a labeled task-specific dataset performed after the base pretraining run. Full-parameter fine-tuning updates every weight in the model and is expensive at frontier scale; parameter-efficient methods like LoRA and QLoRA train only a low-rank adapter, cutting memory and storage requirements by an order of magnitude or more. Fine-tuning is appropriate when retrieval and prompt engineering have run out of headroom — typically for narrow domains, persistent style or formatting needs, or proprietary terminology that the base model never saw.
MoE (Mixture of Experts): A Mixture of Experts is a sparse architecture in which each token is routed to a small subset of expert sub-networks rather than the full model. A model with 256 experts and a top-2 router activates only two of them per token in each MoE layer, so total parameter count and inference compute decouple: an MoE can have hundreds of billions of total parameters while only spending the FLOPs of a model an order of magnitude smaller per token. The trade-offs are memory, because every expert still has to be resident in accelerator memory, and routing complexity, which can hurt latency at low batch sizes.
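The routing step in the MoE entry can be sketched as follows. This is a simplified top-k gate in pure Python, assuming the router has already produced one logit per expert; the logits are made up, and production routers add load balancing and capacity limits that are omitted here.

```python
import math

def top_k_route(router_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router output for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 1.0, 0.5, 0.0, -0.3, 0.2]
routes = top_k_route(logits, k=2)
active_fraction = len(routes) / len(logits)  # share of experts that run for this token
```

Only the selected experts run for this token, and their outputs are combined using the gate weights; the other experts cost memory but no compute.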
Parameters: Parameters are the trainable weights of a model — the numbers that get adjusted during training. Counts are reported in B for billions, occasionally in T for trillions. Memory footprint scales linearly with both parameter count and per-weight precision: a 70B model occupies about 140 GB at fp16 (two bytes per parameter), 70 GB at int8, and 35 GB at int4. Parameter count alone is a poor predictor of capability — a well-trained 30B can comfortably beat a poorly-trained 70B — but it sets a floor on how much hardware you need to host inference.
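The memory arithmetic in the Parameters entry is a single multiplication. The sketch below covers weights only, taking 1 GB as 10^9 bytes; activations, KV cache, and runtime overhead are deliberately left out, so real hosting requirements will be higher.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    return params_billions * BYTES_PER_PARAM[precision]

for precision in ("fp16", "int8", "int4"):
    print(precision, weight_memory_gb(70, precision))
```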
Quantization: Quantization is the practice of reducing the numerical precision of model weights, typically from 32-bit or 16-bit floats down to 8-bit, 4-bit, or even lower. The quality cost is small for well-trained models — int8 typically loses under one percent on most benchmarks, int4 a few percent — while the memory and bandwidth savings let large models fit on consumer hardware. Modern variants such as GPTQ, AWQ, and fp8 use calibration data and per-channel scaling to preserve accuracy further. Most open-weights releases now ship with several pre-quantized variants alongside the original.
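To make the precision trade-off in the Quantization entry concrete, here is a minimal sketch of symmetric per-tensor int8 quantization. It is the simplest possible scheme and is not what GPTQ or AWQ do; those methods add calibration data and finer-grained scaling. The weights are made-up values.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float weights from the stored integers.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error per weight is bounded by half the scale, which is why a tensor with a few large outliers quantizes worse than one with evenly spread values.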
RAG (Retrieval-Augmented Generation): RAG combines a retrieval step with generation: the system fetches relevant documents from a knowledge base — usually via vector search over embeddings — and concatenates them into the prompt before the model generates a response. This lets a frozen base model answer questions about private or post-cutoff data, and gives the application a clear point at which to enforce citations and freshness guarantees. RAG is often a better first move than fine-tuning when the knowledge changes frequently or when grounded sourcing is a product requirement.
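The retrieve-then-concatenate flow in the RAG entry can be sketched in a few lines. The three-dimensional vectors below are hand-written stand-ins for real embeddings, and the documents, function names, and prompt wording are all hypothetical; a real system would call an embedding model and a vector store instead.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    # Rank stored (text, vector) pairs by similarity to the query and keep the top k.
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    # Concatenate the retrieved passages ahead of the question, numbered for citation.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using only the sources below and cite them by number.\n"
            f"{context}\n\nQuestion: {question}")

# Toy knowledge base with made-up embedding vectors.
index = [
    ("Refunds are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 days.", [0.1, 0.9, 0.0]),
    ("Office hours are 9 to 5.", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # stand-in embedding of the question
passages = retrieve(query_vec, index, k=2)
prompt = build_prompt("How long do I have to return an item?", passages)
```

The numbered sources in the prompt are the point at which an application can enforce citations, as the entry notes.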
RLHF / DPO: RLHF — Reinforcement Learning from Human Feedback — is the alignment recipe that turned raw GPT-style language models into chat assistants. A reward model is trained on human preferences over response pairs, then the base model is optimised against that reward, classically with PPO. DPO — Direct Preference Optimisation — collapses both stages into a single supervised loss that operates directly on preference pairs, which is more stable and far cheaper to run. Many recent open models use DPO, and several use DPO augmented with on-policy or model-judged preferences.
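The "single supervised loss" in the RLHF / DPO entry can be written for one preference pair as below. This sketch assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available; the argument names and the default beta of 0.1 are illustrative choices.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair of responses."""
    # How much more the policy favours each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the loss starts at log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has raised the chosen response's likelihood: the loss drops.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

No reward model or sampling loop appears anywhere, which is why the entry describes DPO as cheaper to run than PPO-based RLHF.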
Tokenizer: A tokenizer is the deterministic algorithm that maps raw text to and from token IDs. The two dominant algorithms are Byte-Pair Encoding (BPE), used by GPT-style models and Llama, and the unigram language model; SentencePiece, a library that implements both and operates on raw text without language-specific pre-tokenisation, is used by many multilingual systems. Tokenizer choice has real downstream consequences: a vocabulary tuned for English will spend many more tokens to encode Mandarin or code than one tuned for those domains, which directly affects pricing, throughput, and effective context window for non-English workloads.
Tokens: Tokens are the sub-word units a language model actually consumes and produces. A common rule of thumb is that one token equals about 0.75 English words on average, though the ratio depends on the tokenizer and the language. A 1,000-word article will typically be encoded as about 1,300 tokens. Because every commercial API meters and prices in tokens — both input and output — token economics, not word counts, drive both cost forecasting and rate-limit planning.
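The rule of thumb in the Tokens entry converts in both directions. This sketch hard-codes the 0.75 words-per-token ratio for English; it is an estimate for planning, and running the model's actual tokenizer is the only way to get an exact count.

```python
WORDS_PER_TOKEN = 0.75  # rough average for English text

def estimate_tokens(word_count: int) -> int:
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count: int) -> int:
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_tokens(1_000))   # prints 1333
print(estimate_words(128_000))  # prints 96000
```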
Transformer: The transformer is a neural-network architecture introduced in 2017 that replaced recurrence with self-attention — a mechanism allowing every position in a sequence to attend directly to every other position. Self-attention parallelises efficiently on modern accelerators, which is why the architecture scaled from a few hundred million parameters to the trillion-parameter regime in under a decade. Almost every modern frontier model — language, vision-language, speech, code — is a transformer or a close variant such as a state-space hybrid.
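The self-attention mechanism named in the Transformer entry can be sketched in pure Python. This is single-head scaled dot-product attention over small lists, assuming the query, key, and value vectors are given; the learned projections, multiple heads, and masking of a real transformer are omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Each position attends to every position; output is a weighted mix of values."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this position's query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# Two positions with identical keys, so attention weights are uniform.
out = self_attention([[1, 0], [0, 1]], [[1, 1], [1, 1]], [[2, 0], [0, 2]])
```

Every output row depends only on the full set of keys and values, not on the previous row, which is what lets the computation run in parallel across positions.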

[05] Pricing & Performance

How model APIs are billed and how their speed is measured.

Input tokens / output tokens: Commercial APIs bill separately for tokens fed into the model (input or prompt tokens) and tokens generated by the model (output or completion tokens). Output tokens are usually three to five times more expensive than input tokens, because every output token requires its own forward pass while input is processed in parallel. For long-prompt, short-answer workloads — classification, extraction, summarisation — total cost is dominated by input. For long-form generation it flips. Get this right before extrapolating monthly bills from a back-of-the-envelope estimate.
Latency: Latency is the end-to-end response time of a request — the wall-clock duration from the moment the client sends the prompt to the moment the last token of the response arrives. It depends on prompt length, output length, the throughput of the serving infrastructure, the size and speed of the model, and any safety filtering applied to the response. Latency and throughput are coupled: a provider can serve more concurrent users by batching, but at the cost of slightly higher per-request latency. Always measure latency under realistic load, not idle.
Price per 1M tokens: Per-million-token pricing is the industry-standard quoting unit for API access. As of writing, prices range from about $0.10 per million for small open-weights models served by commodity inference providers, up to roughly $15 per million tokens of output for the largest frontier models. The same model can have an order-of-magnitude price spread across providers, depending on hardware, batching strategy, and whether the operator is subsidising adoption. Always compute blended cost with your own input/output token mix — published headline numbers can be misleading.
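To see how the input/output mix drives blended cost, the sketch below prices two workloads at the same hypothetical rates. The $15 output price echoes the upper figure quoted above; the $3 input price is an assumption chosen only to give a five-to-one ratio.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    # Prices are quoted per million tokens, so divide the weighted sum by 1,000,000.
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0  # hypothetical USD per 1M tokens

# Long prompt, short answer (for example extraction): input dominates.
extraction = request_cost_usd(8_000, 200, INPUT_PRICE, OUTPUT_PRICE)
# Short prompt, long answer (for example drafting): output dominates.
drafting = request_cost_usd(300, 2_000, INPUT_PRICE, OUTPUT_PRICE)
```

Here the extraction request costs about $0.027, most of it input, while the drafting request costs about $0.031, almost all of it output.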
TPS / TPM: TPS — tokens per second — measures generation throughput once streaming has begun. A typical chat model on a single H100 might emit 60-200 TPS for a single user; aggregate TPS for a fleet can be much higher under batching. TPM — tokens per minute — is the standard rate-limit unit on commercial APIs: a 200,000 TPM tier means your account may consume up to 200,000 tokens (input plus output) per rolling minute before requests start returning 429 errors. Both numbers are cited per model and tier.
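A client can track its own TPM budget to avoid 429 responses. The class below is a hypothetical client-side sketch of a rolling 60-second window; providers do their own accounting, which may count tokens or define the window differently, so use it as a planning aid only.

```python
from collections import deque

class RollingTokenBudget:
    """Client-side estimate of a tokens-per-minute limit over a rolling window."""

    def __init__(self, tpm_limit: int, window_seconds: float = 60.0):
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._used = 0

    def try_consume(self, tokens: int, now: float) -> bool:
        # Forget usage that has aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, old_tokens = self._events.popleft()
            self._used -= old_tokens
        if self._used + tokens > self.tpm_limit:
            return False  # sending now would likely be rejected with a 429
        self._events.append((now, tokens))
        self._used += tokens
        return True

budget = RollingTokenBudget(tpm_limit=200_000)
first = budget.try_consume(150_000, now=0.0)   # fits within the limit
second = budget.try_consume(60_000, now=10.0)  # would exceed 200,000 in the window
third = budget.try_consume(60_000, now=61.0)   # first request has aged out
```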
TTFT (Time to first token): TTFT is the latency from request submission to the moment the first token of the response arrives at the client. For interactive UX — chat, autocomplete, voice — TTFT matters more than total latency, because the user perceives the response as having started once any text appears. Sub-300ms TTFT feels instant; above one second feels sluggish. Reducing TTFT typically requires a smaller or distilled model, more aggressive prompt-caching, geographic edge deployment, or all three.
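TTFT and TPS combine into a rough estimate of total latency for a streamed response. The sketch assumes generation proceeds at a constant rate once the first token arrives, which real systems only approximate; the numbers are illustrative.

```python
def estimated_latency_s(ttft_s: float, output_tokens: int, tps: float) -> float:
    # Time until streaming starts, plus time to stream the remaining output.
    return ttft_s + output_tokens / tps

# 300 ms to first token, then 500 tokens at 100 tokens per second.
total = estimated_latency_s(ttft_s=0.3, output_tokens=500, tps=100.0)
```

The user sees text after 0.3 seconds even though the full answer takes about 5.3 seconds, which is why TTFT dominates perceived responsiveness.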

[05] Trending & Tracking

How Databubble measures and ranks models.

Arena ELO: Within Databubble, Arena ELO is the LMSYS Chatbot Arena rating used as a quality signal in rankings, comparison views, and the home-page bubble chart. Because ELO is grounded in real human preference votes rather than fixed test sets, it is harder to game and tracks lived user experience more closely than synthetic benchmarks. We re-import Arena snapshots regularly and surface the rating both in the model detail page and in the Mentioned models panel of news articles.
Databubble score: The Databubble score is our composite rank: a weighted blend of six signals (downloads, trending velocity, Arena ELO, benchmark composites, GitHub activity, and recency). We publish the exact weights and methodology so the number is auditable rather than a black box. See methodology for the full formula. The score is a useful single-number ranking when you want a quick read across categories, but for any narrow decision you should dig into the underlying components.
Day / Week / Month change: These are the percent deltas in download count compared with the value N days ago, computed from our daily snapshots of the HuggingFace counter. Day change captures launch-week heat and viral spikes; week and month change smooth out daily noise and surface durable momentum. A model with a +400 percent week change is almost always either a brand-new release or one that has been picked up by a major framework or a viral demo. Percentage changes get noisy when the absolute download base is small, so treat early-stage spikes as directional rather than precise.
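The percent delta described in the Day / Week / Month change entry can be sketched as a lookup into dated snapshots. The snapshot structure and function name are assumptions for illustration and do not describe Databubble's actual pipeline; the download figures are invented.

```python
from datetime import date, timedelta
from typing import Optional

def percent_change(snapshots: dict, today: date, days: int) -> Optional[float]:
    """Percent change in downloads versus the snapshot taken `days` ago."""
    previous = snapshots.get(today - timedelta(days=days))
    current = snapshots.get(today)
    # No delta if either snapshot is missing or the base is zero.
    if current is None or not previous:
        return None
    return (current - previous) / previous * 100.0

# Hypothetical daily snapshots of a download counter.
snapshots = {date(2026, 1, 1): 1_000, date(2026, 1, 8): 5_000}
week_change = percent_change(snapshots, date(2026, 1, 8), days=7)
day_change = percent_change(snapshots, date(2026, 1, 8), days=1)
```

Going from 1,000 to 5,000 downloads gives a week change of +400 percent, while the missing snapshot for the previous day yields no day change at all rather than a misleading number.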
Downloads: Downloads is the cumulative HuggingFace download counter for a model. The number is updated approximately hourly upstream and includes every fetch via the HuggingFace API as well as raw git clone calls against the repo. It is not a unique-user count: an automated CI pipeline that pulls the model on every test run will inflate the number. That said, downloads remain the closest single signal of real-world adoption for open-weights models, and the relative ranking across millions of models is highly informative even when absolute numbers are noisy.
Trending score: Trending score is HuggingFace's internal short-window popularity measure — a proprietary blend of likes, downloads, and discussion activity over the recent past, weighted toward velocity rather than absolute volume. We surface it on Databubble because it catches sub-week movement that the slower download counter misses: a model can become the top trending entry on HuggingFace within hours of a high-profile launch, well before its cumulative download count reflects the surge.

[05] Modalities & Capabilities

How modern models extend beyond text.

Code-specialized: Code-specialized models are language models trained or fine-tuned predominantly on source-code corpora — git-mined repositories, programming-language documentation, package registries, and curated problem sets. Examples include Code Llama and DeepSeek Coder. They tend to outperform general chat models on completion, repair, and HumanEval-style benchmarks for an equivalent parameter budget, though the gap has narrowed as frontier general models have absorbed more code into their pretraining mix. They are still the practical choice for self-hosted IDE assistants where capability per parameter matters.
Embedding model: An embedding model maps a piece of input — text, image, or audio — to a fixed-size dense vector that captures its semantic content. Inputs that mean similar things end up close together in the vector space, which is what makes nearest-neighbour search work for retrieval, clustering, and de-duplication. Embeddings power the retrieval half of any RAG pipeline and underlie most modern semantic-search products. They are dramatically smaller and cheaper to run than chat models, and a high-quality embedding stack often delivers more product value than a marginal upgrade to the generative model on top.
Multimodal: A multimodal model accepts or emits more than one type of data — combining text with images, audio, or video. The most common variant today is text-and-image input with text output. True multimodal training, where the model sees mixed-modality data during pretraining rather than bolting a vision encoder onto a frozen language model, tends to produce stronger cross-modal reasoning. The frontier is moving toward unified models that natively handle text, images, audio, and even structured outputs like 3D and code in a single architecture.
Reasoning model: Reasoning models are explicitly trained or scaffolded to allocate more inference compute to multi-step thinking before producing a final answer. Examples include OpenAI's o-series, DeepSeek R1, and the Claude Thinking variants. Internally they generate a long, often hidden chain of thought during which they explore alternatives, rule out dead-ends, and check intermediate steps. The result is significantly higher accuracy on math, coding, and competition-style benchmarks at the cost of substantially higher per-query latency and token spend. Use them where correctness matters more than time-to-first-token.
Vision-language: Vision-language models accept text and images as input and produce text as output. They are the workhorse of OCR, document understanding, screenshot parsing, chart reading, and any task where the input is too visually structured for plain text extraction. Architecturally they couple a vision encoder — frequently a ViT — to a language model via a small projection layer; some recent designs replace the discrete encoder with native patch tokenisation. GPT-4o vision and the Claude vision-capable line are typical examples.
