Methodology

·Last updated 2026-04-30
01 Overview
WHY A COMPOSITE

Every page on Databubble — the bubble chart, the rankings rail, the model detail card — is ordered by a single number we call databubble_score. It is a 0-100 composite, recomputed across the entire model universe once a day, that reflects how a given AI model is doing along six independent dimensions at once.

We built it because every single-metric leaderboard misleads. Sort by downloads and the top of the list is tiny embedding models from RAG tutorials. Sort by trending and you see whatever was released yesterday. Sort by MMLU and the front page fills with leaderboard-hunting fine-tunes nobody uses in production. Each ordering is technically correct and substantively misleading, because no single signal fully captures what people mean when they ask "is this model a big deal right now?"

The composite is opinionated. We chose six signals, fixed weights, and a uniform normalization. None of these choices are neutral, and the rest of this page exists to make them legible. We do not run our own benchmarks; we aggregate signals published by HuggingFace, LMSYS, the Open LLM Leaderboard maintainers, the SWE-bench team, Semantic Scholar and several pricing-and-throughput providers, and link back to each source so any number can be verified without trusting our pipeline.

If you only read one section, read Limitations & trade-offs — a composite hides almost as much as it reveals, and we want users to know what the score does not say.

02 The six signals
6 INPUTS

Each signal answers a different question. Together they cover novelty, adoption, human preference, academic capability, agentic skill and confidence-of-measurement. For each: what it measures, source, normalization, why this weight, and what it correlates with.

01
trending_score
weight 25%

What it measures. A short-horizon momentum signal. Combines very recent download velocity, like activity and discussion volume on the Hub.

Source. HuggingFace public Models API at https://huggingface.co/api/models, sorted by trendingScore. We pull the top 300 models every two hours.

Normalization. Raw values are heavy-tailed (a viral release can spike orders of magnitude above the median). We min-max normalize across the active universe, so the global maximum maps to 100 and the global minimum maps to 0. No percentile cap is applied in v1; that is a candidate change listed under Future signals.
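To see why a heavy tail matters under plain min-max (the mapping defined in the composite formula section), here is a minimal TypeScript sketch. The helper name `minMaxNormalize` and the sample values are illustrative assumptions, not the production code.

```typescript
// Hypothetical helper: scale raw values onto 0-100 using the observed min and max.
function minMaxNormalize(values: number[]): number[] {
  if (values.length === 0) return [];
  const min = Math.min(...values);
  const max = Math.max(...values);
  // Zero variance: every value is identical, so there is nothing to rank.
  if (max === min) return values.map(() => 0);
  return values.map((v) => ((v - min) / (max - min)) * 100);
}

// Illustrative heavy-tailed sample: one viral release dwarfs the rest.
const rawTrending = [0, 12, 30, 45, 1500];
const normalizedTrending = minMaxNormalize(rawTrending);
// The outlier (1500) maps to 100, while the runner-up (45) maps to roughly 3.
```

A single extreme value sets the top of the scale, so everything beneath it is squeezed toward 0.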

Why this weight. It is the only signal that captures novelty within hours. New releases earn most of their score here in the first 48 hours.

What it correlates with. Strongly with day_change in the very short term, but the correlation decays fast, within roughly two weeks.

02
arena_elo
weight 25%

What it measures. Pairwise human preference, expressed on the Bradley-Terry / Elo scale popularized by LMSYS Chatbot Arena. Higher Elo = users prefer it head-to-head.

Source. LMSYS Chatbot Arena leaderboard, accessed through the api.wulong.dev mirror at /arena-ai-leaderboards/v1/leaderboard?name=text. Refreshed twice daily.

Normalization. Min-max across all models with at least one Elo score on file. The Elo scale is bounded in practice (currently roughly 900-1450), so the linear mapping is faithful.

Why this weight. Quality-of-output matters as much as adoption. We want a high score to actually reflect that, not just download volume.

What it correlates with. Loosely with avg_benchmark and very loosely with downloads. Some highly-downloaded fine-tunes have weak Elo, and some niche models post strong Elo with low adoption.

03
avg_benchmark
weight 20%

What it measures. A composite quality score across six standardized academic benchmarks: MMLU-Pro, BBH, GPQA, IFEval, MATH Level 5 and MUSR.

Source. HuggingFace Open LLM Leaderboard v2 contents dataset, fetched from datasets-server.huggingface.co. Each underlying score (ifeval_score, bbh_score, gpqa_score, mmlu_pro_score, math_score_v2, musr_score) is stored individually; avg_benchmark is the arithmetic mean computed by the leaderboard maintainers.

Normalization. Min-max across all models that have a non-null avg_benchmark value (not all open-weights models are on the leaderboard, and proprietary endpoints never are).

Why this weight. Benchmarks catch problems Arena does not: multi-step reasoning (BBH), graduate-level science (GPQA), strict instruction-following (IFEval). A 20% weight differentiates quality without letting leaderboard hunters dominate.

What it correlates with. Strongly with arena_elo at the top of the distribution but the two diverge sharply in the middle: models can ace MMLU-Pro and lose pairwise comparisons because reviewers care about tone, brevity and refusal behavior that benchmarks ignore.

04
day_change
weight 15%

What it measures. 24-hour percentage change in cumulative downloads, computed against a daily snapshot table.

Source. Internal ai_model_snapshots table, written by /api/ingest/snapshot at 1 AM UTC. /api/ingest/compute-trends then computes day_change as ((today − yesterday) / yesterday) × 100.
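The day_change arithmetic can be sketched as follows. The function name `computeDayChange` and the choice to return null when there is no usable baseline are assumptions for this example; the page does not say how the production route handles a missing or zero snapshot.

```typescript
// Hypothetical helper mirroring ((today - yesterday) / yesterday) * 100.
// Assumption: with no prior snapshot, or a zero baseline, the signal is null.
function computeDayChange(today: number, yesterday: number | null): number | null {
  if (yesterday === null || yesterday <= 0) return null;
  return ((today - yesterday) / yesterday) * 100;
}

const rising = computeDayChange(10800, 10000);   // about 8 (8% more downloads)
const falling = computeDayChange(9000, 10000);   // about -10 (signed, not clamped)
const firstSeen = computeDayChange(500, null);   // null (no snapshot from yesterday)
```

Returning null rather than 0 would let the signal drop out under the weight-redistribution rule instead of counting as "no change".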

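Normalization. Min-max with the same procedure as the other signals. Because day_change is signed, zero change does not map to 0: a model losing downloads lands wherever its value falls between the universe minimum and maximum, and only the steepest decliner maps to exactly 0. With an illustrative range of −50 to 200, a day_change of 0 normalizes to 20.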

Why this weight. Trending captures attention only for the newest releases. day_change captures sustained adoption: a six-month-old model that takes off because a new tool integrates it should rise.

What it correlates with. With trending_score on day 1, then diverges. A model can post strong day_change for weeks after it stops "trending" if it has crossed an adoption threshold.

05
swe_bench_resolved
weight 10%

What it measures. Best resolve rate (% of issues fixed) achieved by the model across all six SWE-bench leaderboards: Bash-only, Verified, Lite, Multilingual, Test and Multimodal.

Source. The maintainer-published leaderboards.json at github.com/SWE-bench/swe-bench.github.io. We iterate every leaderboard and take the maximum resolve rate per model so a model is credited for its best showing.
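The "best showing per model" step can be sketched like this. The `Board` and `BoardEntry` shapes are a simplified, hypothetical stand-in; the real leaderboards.json layout may differ, and `bestResolveRate` is not the production function name.

```typescript
// Hypothetical, simplified shape of the leaderboard data.
interface BoardEntry { model: string; resolved: number } // resolved = % of issues fixed
interface Board { name: string; results: BoardEntry[] }

// Keep the highest resolve rate seen for each model across every board.
function bestResolveRate(boards: Board[]): Map<string, number> {
  const best = new Map<string, number>();
  for (const board of boards) {
    for (const entry of board.results) {
      const current = best.get(entry.model);
      if (current === undefined || entry.resolved > current) {
        best.set(entry.model, entry.resolved);
      }
    }
  }
  return best;
}

// Illustrative data, not real results.
const sampleBoards: Board[] = [
  { name: "Verified", results: [{ model: "model-a", resolved: 35 }, { model: "model-b", resolved: 41 }] },
  { name: "Lite", results: [{ model: "model-a", resolved: 28 }] },
];
const bestRates = bestResolveRate(sampleBoards);
// bestRates.get("model-a") is 35; bestRates.get("model-c") is undefined (no entry).
```

A model absent from every board has no map entry, which corresponds to a null signal rather than a zero.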

Normalization. Min-max across models with any SWE-bench entry. Many models — vision, audio, embedding — have no SWE-bench entry, in which case the signal is skipped and its weight redistributes (see the formula section below).

Why this weight. SWE-bench is the most credible "agentic coding" public benchmark. A 10% weight rewards models good at real-world software engineering without overfitting the leaderboard to coding-only models.

What it correlates with. With avg_benchmark via shared coding ability, but loosely. Tool use, file editing and test execution are skills above and beyond chat preference.

06
arena_votes
weight 5%

What it measures. Total number of pairwise comparisons cast against this model in Chatbot Arena. A confidence-volume signal, not a quality signal.

Source. Same LMSYS feed as arena_elo, taken from the n_votes field on each leaderboard entry.

Normalization. Linear min-max, the same as every other signal; no log transform is applied. The linear mapping compresses the long tail of low-vote entries toward 0, and that is intentional: we want the top-volume models to be clearly distinguished from boutique entries.

Why this weight. A model with 50,000 votes and an Elo of 1300 is a more trustworthy claim than a model with 200 votes and an Elo of 1320. Five percent is enough to break ties between similarly-rated models without overweighting popularity.

What it correlates with. Strongly with arena_elo for established flagships (more deployment → more comparisons), weakly with anything else. New entrants always start with low arena_votes regardless of quality.

03 Composite formula
WEIGHTED MIN-MAX

The score is a weighted average of min-max-normalized signals. For a single model m, with the set of signals where the value is non-null written as S(m), the formula is:

databubble_score(m) =
  ( Σ over s ∈ S(m) of  weight[s] × normalized[s] )
  ─────────────────────────────────────────────────
  ( Σ over s ∈ S(m) of  weight[s] )

where  normalized[s] = ((value[s] − min[s]) / (max[s] − min[s])) × 100
       min[s], max[s] are computed across the entire active model set
       weight[s] is the signal's published percentage (25, 25, 20, 15, 10, 5)

Two implementation details matter. First, min and max are recomputed every run from the live model set, not held as historical constants. A new flagship that pushes the maximum Elo from 1450 to 1500 implicitly rescales every other model's normalized Elo downward — a feature, since the score is meant to be relative to the current frontier.
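A quick numeric check of that rescaling effect, using the Elo bounds quoted in the paragraph above. The helper name `normalizeOne` and the sample Elo of 1320 are illustrative assumptions.

```typescript
// Min-max onto 0-100 for one value, given the current universe range.
function normalizeOne(value: number, min: number, max: number): number {
  return max === min ? 0 : ((value - min) / (max - min)) * 100;
}

// The model's raw Elo is unchanged; only the universe maximum moves.
const beforeNewFlagship = normalizeOne(1320, 900, 1450); // about 76.4
const afterNewFlagship = normalizeOne(1320, 900, 1500);  // about 70.0
```

The same raw Elo loses roughly six normalized points purely because the frontier moved.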

Second, weights are redistributed when a signal is missing. A model with no SWE-bench entry does not get a 0; instead, the 10% weight drops out of both numerator and denominator. This prevents partial-coverage bias against models not evaluated on every available benchmark.

Edge case: if every signal is null, or every available signal has min == max (zero variance), the score is reported as 0 rather than a divide-by-zero error. In practice this only happens for stub records during ingestion catch-up.
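The formula, the redistribution rule and the edge case can be sketched together in TypeScript. This is an illustration under stated assumptions, not the code in route.ts: the name `compositeScore`, the type names and the sample ranges are hypothetical, and skipping an individual zero-variance signal (rather than only the all-zero-variance case) is an assumption.

```typescript
type SignalName =
  | "trending_score" | "arena_elo" | "avg_benchmark"
  | "day_change" | "swe_bench_resolved" | "arena_votes";

type Signals = Record<SignalName, number | null>;
type Ranges = Record<SignalName, { min: number; max: number }>;

// Published weights, in percent.
const WEIGHTS: Record<SignalName, number> = {
  trending_score: 25, arena_elo: 25, avg_benchmark: 20,
  day_change: 15, swe_bench_resolved: 10, arena_votes: 5,
};

function compositeScore(signals: Signals, ranges: Ranges): number {
  let numerator = 0;
  let denominator = 0;
  for (const name of Object.keys(WEIGHTS) as SignalName[]) {
    const value = signals[name];
    const { min, max } = ranges[name];
    // Assumption: a null or zero-variance signal is skipped, so its weight
    // leaves both the numerator and the denominator.
    if (value === null || max === min) continue;
    numerator += WEIGHTS[name] * ((value - min) / (max - min)) * 100;
    denominator += WEIGHTS[name];
  }
  if (denominator === 0) return 0; // nothing usable: report 0, never NaN
  return Math.round((numerator / denominator) * 10) / 10; // one decimal place
}

// Illustrative ranges, not real data.
const sampleRanges: Ranges = {
  trending_score: { min: 0, max: 1500 },
  arena_elo: { min: 900, max: 1450 },
  avg_benchmark: { min: 0, max: 80 },
  day_change: { min: -50, max: 200 },
  swe_bench_resolved: { min: 0, max: 75 },
  arena_votes: { min: 0, max: 80000 },
};

// A sparse model: only the two Arena signals are present.
const arenaOnly: Signals = {
  trending_score: null, arena_elo: 1320, avg_benchmark: null,
  day_change: null, swe_bench_resolved: null, arena_votes: 18000,
};
const arenaOnlyScore = compositeScore(arenaOnly, sampleRanges); // 67.4
```

With only 30 points of weight in play, the Elo term dominates the sparse model's score, which is the redistribution behavior described above.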

The implementation is small enough to be read in one sitting: frontend/src/app/api/ingest/compute-trends/route.ts. The function computeRanges computes the per-signal min/max in a single pass; computeDatabubbleScore applies the weighted formula above and rounds the result to one decimal place.

04 Worked example
EXAMPLE — NOT REAL DATA

To make the formula concrete, here is a hypothetical model. All numbers are illustrative round figures, not the actual values for any real release. Read this as "what if a frontier-class release scored roughly here."

Hypothetical model: a chat-oriented LLM, a few weeks post-release, evaluated on every public leaderboard.

Raw values (example, not real data)
  trending_score      = 480       (top quartile but past the launch spike)
  arena_elo           = 1320      (strong, well below the absolute frontier)
  avg_benchmark       = 60        (open-llm-leaderboard composite, 0-100 scale)
  day_change          = 8         (8% more downloads vs 24h ago)
  swe_bench_resolved  = 35        (35% issues solved on the best of 6 boards)
  arena_votes         = 18000

Active universe ranges (example)
  trending_score      [0, 1500]
  arena_elo           [900, 1450]
  avg_benchmark       [0, 80]
  day_change          [-50, 200]
  swe_bench_resolved  [0, 75]
  arena_votes         [0, 80000]

Normalized signals (0-100)
  trending_score      = (480 − 0) / (1500 − 0)  × 100 = 32.0
  arena_elo           = (1320 − 900) / (1450 − 900) × 100 = 76.4
  avg_benchmark       = (60 − 0) / (80 − 0)   × 100 = 75.0
  day_change          = (8 − (−50)) / (200 − (−50)) × 100 = 23.2
  swe_bench_resolved  = (35 − 0) / (75 − 0)   × 100 = 46.7
  arena_votes         = (18000 − 0) / (80000 − 0) × 100 = 22.5

Weighted sum
  numerator   = 25 × 32.0 + 25 × 76.4 + 20 × 75.0 + 15 × 23.2 + 10 × 46.7 + 5 × 22.5
              = 800 + 1910 + 1500 + 348 + 467 + 112.5
              = 5,137.5
  denominator = 25 + 25 + 20 + 15 + 10 + 5 = 100

databubble_score = 5,137.5 / 100 = 51.4

Two things to notice. First, arena_elo and avg_benchmark do most of the heavy lifting: together they account for roughly (1910 + 1500) / 5137.5 ≈ 66% of the final score. That is by design. Second, despite a strong Elo, the score lands in the low 50s because trending and adoption signals are middle of the pack. A score in the 80s is reserved for models that are simultaneously high-quality and being actively adopted.

If this hypothetical model had no SWE-bench entry, the 10% weight would drop out and the denominator would be 90. The new score would be (5,137.5 − 467) / 90 ≈ 51.9. Almost identical — which is what weight redistribution is supposed to deliver.
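The arithmetic in this worked example can be re-run in a few lines. The weights and rounded normalized values below are copied from the example; the `Row` type and the helper names are hypothetical.

```typescript
interface Row { signal: string; weight: number; normalized: number }

// Weights and rounded normalized values from the worked example (not real data).
const rows: Row[] = [
  { signal: "trending_score", weight: 25, normalized: 32.0 },
  { signal: "arena_elo", weight: 25, normalized: 76.4 },
  { signal: "avg_benchmark", weight: 20, normalized: 75.0 },
  { signal: "day_change", weight: 15, normalized: 23.2 },
  { signal: "swe_bench_resolved", weight: 10, normalized: 46.7 },
  { signal: "arena_votes", weight: 5, normalized: 22.5 },
];

// Weighted average over whichever rows are supplied.
function weightedScore(input: Row[]): number {
  const numerator = input.reduce((sum, r) => sum + r.weight * r.normalized, 0);
  const denominator = input.reduce((sum, r) => sum + r.weight, 0);
  return denominator === 0 ? 0 : numerator / denominator;
}

// Fraction of the weighted sum contributed by the named signals.
function contributionShare(input: Row[], names: string[]): number {
  const total = input.reduce((sum, r) => sum + r.weight * r.normalized, 0);
  const part = input
    .filter((r) => names.indexOf(r.signal) !== -1)
    .reduce((sum, r) => sum + r.weight * r.normalized, 0);
  return total === 0 ? 0 : part / total;
}

const fullScore = weightedScore(rows); // about 51.4
const noSweScore = weightedScore(rows.filter((r) => r.signal !== "swe_bench_resolved")); // about 51.9
const qualityShare = contributionShare(rows, ["arena_elo", "avg_benchmark"]); // about 0.66
```

Filtering a row out before averaging is equivalent to dropping its weight from both numerator and denominator.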

05 Update cadence
11 SOURCES

Every signal is on its own ingestion cron, scheduled inside the production container via Coolify. The schedules below are the actual production cadence as of this writing. Source code lives under /api/ingest/*; the schedule itself is in frontend/docs/cron-schedule.md.

Source | Fields populated | Frequency
HuggingFace Models API | id, name, downloads, likes, trending_score, parameters, license, pipeline_tag, library_name, tags, gated, downloads_all_time, param_bytes | every 2h
HuggingFace cardData.tags (arxiv:NNNN.NNNNN) | arxiv_id, paper_url | every 2h (with trending pull)
ArtificialAnalysis API | intelligence_index, coding_index, math_index, speed_tps, latency_ms, price_input, price_output, context_window | 2x daily
OpenRouter API | price_input, price_output (proprietary fallback) | daily
Together AI catalog | price_input, price_output (hosted open-source fallback) | daily
LMSYS Chatbot Arena (api.wulong.dev mirror) | arena_elo, arena_votes | 2x daily
HuggingFace Open LLM Leaderboard v2 | ifeval_score, bbh_score, gpqa_score, mmlu_pro_score, math_score_v2, musr_score, avg_benchmark | 2x daily
SWE-bench leaderboards.json (6 boards) | swe_bench_resolved (max across boards) | weekly
Semantic Scholar Graph API | citation_count, influential_citations (looked up by arxiv_id) | weekly
GitHub REST API | github_stars | 2x weekly (Mon + Thu)
AlpacaEval leaderboard | alpaca_winrate, alpaca_lc_winrate | weekly

The composite itself is recomputed once a day, at 1 AM UTC, after the daily snapshot is written. That means a freshly-ingested benchmark score at 2 PM UTC will not affect databubble_score until the next overnight recompute — by design, so the score does not jitter intraday.
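To work out when a freshly ingested value first shows up in the score, a small sketch: `nextRecompute` is a hypothetical helper that assumes the recompute fires at exactly 01:00 UTC, whereas the real job runs after the snapshot is written and may land slightly later.

```typescript
// Hypothetical helper: the next 01:00 UTC strictly after the given instant.
function nextRecompute(after: Date): Date {
  const next = new Date(Date.UTC(
    after.getUTCFullYear(), after.getUTCMonth(), after.getUTCDate(), 1, 0, 0, 0,
  ));
  // If today's 01:00 UTC has already passed, roll over to tomorrow.
  if (next.getTime() <= after.getTime()) {
    next.setUTCDate(next.getUTCDate() + 1);
  }
  return next;
}

const ingestedAt = new Date("2026-04-30T14:00:00Z"); // 2 PM UTC ingest
const visibleAt = nextRecompute(ingestedAt);
// visibleAt.toISOString() is "2026-05-01T01:00:00.000Z", eleven hours later.
```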

06 Limitations & trade-offs
READ ME

No score this compact is right for everyone. The composite is meant to surface "models worth paying attention to right now," not to settle benchmark debates. Here are the trade-offs we have made and the ways the score can mislead.

Trending vs. quality, again. Even with quality signals at 45% (25% Elo + 20% benchmark), a viral release can briefly score higher than a more capable but quieter model. The score is not "best model"; it is "model that is a big deal right now," and what people are using is a legitimate part of that.

English-language Arena bias. LMSYS Chatbot Arena is overwhelmingly English-language general-chat. Models excellent at non-English, code-only or long-form structured tasks can be underrated by Elo. The benchmark composite (with MUSR and BBH) and swe_bench_resolved partially compensate, but for niche use-cases the underlying signals are more informative than the composite.

Benchmark gaming. The Open LLM Leaderboard composite is robust by construction (six contamination-resistant tasks), but no benchmark is uncheatable. Fine-tuning on a leaked answer key or training on the public dev split produces inflated scores. The 25% Arena weight is a partial defense: a model that aces benchmarks but loses pairwise comparisons rapidly falls out of the top.

Missing-data redistribution can flatter sparse models. Redistribution stops partial-coverage models from being zeroed-out, but a model with only one strong signal (say, a high trending_score and nothing else) can read higher than is warranted. We require multiple non-null signals before a model appears in headline rankings, but it is worth noting when comparing a brand-new release to an established one.
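A coverage gate of this kind might look like the sketch below. The page only says "multiple" non-null signals, so the default threshold of 2, along with the names `countNonNull` and `eligibleForHeadline`, is a hypothetical choice for illustration.

```typescript
type SignalValues = Record<string, number | null>;

// Count how many signals actually carry a value.
function countNonNull(signals: SignalValues): number {
  return Object.keys(signals).filter((key) => signals[key] !== null).length;
}

// minSignals is a hypothetical threshold; this page only says "multiple".
function eligibleForHeadline(signals: SignalValues, minSignals: number = 2): boolean {
  return countNonNull(signals) >= minSignals;
}

// Illustrative records, not real data.
const brandNew: SignalValues = {
  trending_score: 1400, arena_elo: null, avg_benchmark: null,
  day_change: null, swe_bench_resolved: null, arena_votes: null,
};
const established: SignalValues = {
  trending_score: 480, arena_elo: 1320, avg_benchmark: 60,
  day_change: 8, swe_bench_resolved: null, arena_votes: 18000,
};
const brandNewListed = eligibleForHeadline(brandNew);       // false: one signal only
const establishedListed = eligibleForHeadline(established); // true: five signals
```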

No multimodal coverage in the score. Image, audio and video models live on Databubble, but none of the six composite inputs measure multimodal quality directly. Their scores end up driven primarily by trending_score and day_change, which is honest about the data we have and dishonest about the actual quality of the model. This is the single biggest known gap.

Proprietary models lack Hub-based signals universally. Closed models (e.g. proprietary commercial APIs) have no HuggingFace trending_score, so for them the 25% weight always redistributes; as noted under avg_benchmark, proprietary endpoints are never on the Open LLM Leaderboard either, so that 20% drops out as well. Their composite is anchored more strongly on Arena Elo and, where present, SWE-bench. We think this is the right call, since a model with no public Hub presence cannot have a Hub trending value, but it is a structural asymmetry users should know about.

Min-max normalization is sensitive to outliers. A single anomaly at the top can compress the rest of the distribution. We accept this because rank- or percentile-based alternatives make the score unstable as the universe size changes, which it does daily.

07 Future signals
ROADMAP

The composite is versioned (databubble_score v1, as of this writing) and we expect to revise it. Concrete signals on the candidate list, in rough order of priority:

Multimodal evaluation. A vision-and-language counterpart to swe_bench_resolved, drawn from public multimodal leaderboards (MMMU, MathVista, MMVet, equivalents for audio and video). Without this, the composite genuinely under-serves image/audio/video models.

Agentic / tool-use evaluation. Beyond SWE-bench, results from agent benchmarks like GAIA, AgentBench and MLE-bench. Reasoning-only and tool-augmented capability are increasingly divergent and worth measuring separately.

Long-context evaluation. Standardized "needle in a haystack" and long-form-reasoning scores at 100k+ and 1M+ tokens of context. The current composite barely registers context capability, which is becoming a primary axis of differentiation.

Cost-adjusted quality. We already store pricing (price_input, price_output) and throughput (speed_tps, latency_ms). An Elo-per-dollar or benchmark-points-per-token signal would reflect a real-world decision criterion the composite ignores.

Citation velocity. We ingest citation_count and influential_citations from Semantic Scholar but do not yet use them in the score. Adding a citation-velocity signal would weight academic influence alongside community adoption.

Smarter normalization. Replacing min-max with a percentile mapping (cap at p99, floor at p1) would reduce outlier sensitivity without adding category-specific buckets.
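One way such a percentile mapping could work is sketched below. It is not part of databubble_score v1; the nearest-rank percentile definition and the names `percentile` and `percentileNormalize` are assumptions chosen for the example.

```typescript
// Nearest-rank percentile on a sorted copy (one of several common definitions).
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Clamp each value to [p1, p99], then apply ordinary min-max onto 0-100.
function percentileNormalize(values: number[]): number[] {
  if (values.length === 0) return [];
  const floor = percentile(values, 1);
  const cap = percentile(values, 99);
  if (cap === floor) return values.map(() => 0);
  return values.map((v) => {
    const clamped = Math.min(cap, Math.max(floor, v));
    return ((clamped - floor) / (cap - floor)) * 100;
  });
}

// Illustrative universe: values 1..199 plus one extreme outlier.
const universe: number[] = [];
for (let i = 1; i <= 199; i++) universe.push(i);
universe.push(100000);

const capped = percentileNormalize(universe);
// The outlier is capped at 100 and the value 100 lands at 50,
// whereas plain min-max would place the value 100 at roughly 0.1.
```

The cost is that every model above the cap ties at 100, so differences at the very top are no longer visible.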

When the formula changes meaningfully we will bump the version on this page and call out the diff. The point of a methodology page is for the reader to disagree with specific, dated artifacts.


Methodology v1 · Last reviewed 2026-04-30
