DataBubble
Tag

#benchmark

67 articles tagged #benchmark

arxiv · 1d ago · bearish

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv:2606.10254v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we

LA · 1 model · #evaluation #benchmark #mathematics · Read on arxiv →
arxiv · 1d ago · bullish

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

arXiv:2602.12424v2 Announce Type: replace-cross Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to

#evaluation #benchmark #language-models · Read on arxiv →
arxiv · 5d ago

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

arXiv:2504.10823v4 Announce Type: replace-cross Abstract: Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-base

GPCL · 2 models · #value-based decision-making #llms #benchmark · Read on arxiv →
arxiv · 5d ago · bullish

Benchmark Everything Everywhere All at Once

arXiv:2606.06462v1 Announce Type: new Abstract: Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalabili

#benchmark #llms #autonomous-systems · Read on arxiv →
arxiv · 6d ago · bullish

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

arXiv:2606.05557v1 Announce Type: new Abstract: A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AUR

AURE · 2 models · #natural-language-processing #inference #benchmark · Read on arxiv →
arxiv · Jun 2 · bullish

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

arXiv:2606.01230v1 Announce Type: new Abstract: Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and p

HOHOGP · 3 models · #smart-home #language-models #benchmark · Read on arxiv →
arxiv · Jun 2

OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models

arXiv:2606.00815v1 Announce Type: new Abstract: Electroencephalography (EEG) supports a variety of brain-computer interface (BCI) tasks ranging from brain-state monitoring to human-LLM interactions. EEG foundation models are emerging, but evaluation remains fragmented due to heterogeneous datasets a

#benchmark #machine-learning #neuroscience · Read on arxiv →
arxiv · Jun 2

Do Joint Audio-Video Generation Models Understand Physics?

arXiv:2605.07061v2 Announce Type: replace-cross Abstract: Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consisten

SE · 1 model · #benchmark #audio-video #generation · Read on arxiv →
arxiv · Jun 1 · bullish

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

arXiv:2605.31183v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs

SPLALO · 3 models · #language-models #benchmark #interpretability · Read on arxiv →
arxiv · May 29

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

arXiv:2509.23571v3 Announce Type: replace-cross Abstract: As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat an

#cybersecurity #threat-hunting #benchmark · Read on arxiv →
arxiv · May 29

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation

arXiv:2605.29741v1 Announce Type: new Abstract: The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific termin

GPGENL · 4 models · +1 · #machine translation #african languages #scientific communication · Read on arxiv →
arxiv · May 29

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

arXiv:2605.22100v2 Announce Type: replace Abstract: Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for r

#document-parsing #benchmark #information-systems · Read on arxiv →
arxiv · May 28 · bullish

A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

arXiv:2604.17943v2 Announce Type: replace Abstract: RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and ev

ME · 1 model · #benchmark #evaluation #specialist-domains · Read on arxiv →
arxiv · May 28

Revisiting Metafeatures to Explain Model Differences on Tabular Data

arXiv:2605.28418v1 Announce Type: new Abstract: With the rise of tabular foundation models alongside traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps bet

TATA · 2 models · #machine learning #benchmark #tabular data · Read on arxiv →
arxiv · May 27

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally va

#education #benchmark #video-generation · Read on arxiv →
arxiv · May 26

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

arXiv:2605.23940v1 Announce Type: new Abstract: How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent

MU · 1 model · #reasoning #benchmark #multi-turn · Read on arxiv →
arxiv · May 22

Robust Reasoning Benchmark

arXiv:2604.08571v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13

CL · 1 model · #benchmark #mathematical reasoning #large language models · Read on arxiv →
arxiv · May 21

Refining and Reusing Annotation Guidelines for LLM Annotation

arXiv:2605.20809v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelin

GPGEDE · 3 models · #research #language models #benchmark · Read on arxiv →
arxiv · May 19

Fidelity Probes for Specification--Code Alignment

arXiv:2605.17246v1 Announce Type: cross Abstract: We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into

LLANDE · 7 models · +4 · #machine learning #artificial intelligence #benchmark · Read on arxiv →
arxiv · May 19

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

arXiv:2605.14068v2 Announce Type: replace-cross Abstract: We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and

GEQWQW · 5 models · +2 · #benchmark #computer vision #topological reasoning · Read on arxiv →
arxiv · May 16

PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

arXiv:2605.14002v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long context question answering into open-ended exploration. Yet real world use requires models to discover and synthesize "long-tail" fact

#benchmark #information-retrieval #multilingual · Read on arxiv →
arxiv · May 16

FutureSim: Replaying World Events to Evaluate Adaptive Agents

arXiv:2605.15188v1 Announce Type: cross Abstract: AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay

#benchmark #adaptation #machine-learning · Read on arxiv →
arxiv · May 14 · bullish

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

arXiv:2601.20255v2 Announce Type: replace-cross Abstract: SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supe

#benchmark #software-engineering #large-language-models · Read on arxiv →
arxiv · May 11

How Value Induction Reshapes LLM Behaviour

arXiv:2605.07925v1 Announce Type: new Abstract: Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility,

#language-models #value-induction #safety · Read on arxiv →
arxiv · May 11 · bullish

End-to-end PDDL Planning with Hardcoded and Dynamic Agents

arXiv:2512.09629v2 Announce Type: replace Abstract: We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem

OPGPGP · 5 models · +2 · #planning #natural-language-processing #large-language-models · Read on arxiv →
arxiv · May 11

The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking

arXiv:2605.06707v1 Announce Type: cross Abstract: This paper presents an eight-week observational comparison of 68 single-file HTML generations collected across 17 public experiments in the "HTML AI Battle" project between December 10, 2025 and February 4, 2026. Four reasoning model families, GPT, G

GPGEGR · 4 models · +1 · #software engineering #artificial intelligence #benchmark · Read on arxiv →
arxiv · May 8

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

arXiv:2605.05726v1 Announce Type: new Abstract: As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption b

#benchmark #llm #retrieval · Read on arxiv →
arxiv · May 8

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

arXiv:2605.06136v1 Announce Type: cross Abstract: Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working

#benchmark #software-engineering #artificial-intelligence · Read on arxiv →
arxiv · May 8

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv:2605.06327v1 Announce Type: cross Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an ob

OLOLMI · 7 models · +4 · #safety #benchmark #evaluation · Read on arxiv →
arxiv · May 8

Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

arXiv:2603.18257v2 Announce Type: replace-cross Abstract: When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observational select

SA · 1 model · #machine-learning #artificial-intelligence #reinforcement-learning · Read on arxiv →
arxiv · May 8

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

arXiv:2605.06455v1 Announce Type: new Abstract: Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are bri

#language-models #monitoring #safety · Read on arxiv →
arxiv · May 5

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

arXiv:2508.07630v2 Announce Type: replace-cross Abstract: We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and publi

#benchmark #vision-language #multimodal-reasoning · Read on arxiv →
arxiv · May 4

Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

arXiv:2605.00326v1 Announce Type: new Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is cons

ZE · 1 model · #safety #benchmark #calibration · Read on arxiv →
arxiv · May 1

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment

arXiv:2506.22500v2 Announce Type: replace-cross Abstract: Automated identification of surgical safety risks is critical for improving patient outcomes; however, Multimodal Large Language Models (MLLMs) frequently suffer from Visual-Semantic Knowledge Conflicts (VS-KC), a phenomenon where models poss

#safety #medical #computer-vision · Read on arxiv →
arxiv · May 1 · bullish

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

arXiv:2604.28039v1 Announce Type: new Abstract: Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a p

#multimodal #benchmark #scientific-research · Read on arxiv →
arxiv · May 1 · bearish

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

arXiv:2604.28139v1 Announce Type: cross Abstract: LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult

#benchmark #workflow #evaluation · Read on arxiv →
arxiv · Apr 30

Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

arXiv:2505.21190v2 Announce Type: replace-cross Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-g

#radiology #benchmark #evaluation · Read on arxiv →
arxiv · Apr 29 · bullish

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

arXiv:2604.24544v1 Announce Type: new Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, re

LATG · 2 models · #benchmark #evaluation #language-models · Read on arxiv →
arxiv · Apr 27 · bullish

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

arXiv:2604.14306v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describ

LA · 1 model · #multilingual #medical-ai #benchmark · Read on arxiv →
arxiv · Apr 24 · bullish

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

arXiv:2604.19856v1 Announce Type: cross Abstract: Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches su

CHMACH · 4 models · +1 · #hardware #synthesis #generation · Read on arxiv →
arxiv · Apr 24

MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv:2604.21916v1 Announce Type: new Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. W

#benchmark #evaluation #language-models · Read on arxiv →
arxiv · Apr 24 · bullish

Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

arXiv:2506.12721v2 Announce Type: replace Abstract: Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To

#large-language-models #compute-optimization #bandit-learning · Read on arxiv →
arxiv · Apr 24

Survey on Evaluation of LLM-based Agents

arXiv:2503.16416v2 Announce Type: replace Abstract: LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasing

#evaluation #agents #benchmark · Read on arxiv →
arxiv · Apr 23 · bullish

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

arXiv:2604.19750v1 Announce Type: cross Abstract: Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and struggle

GE · 1 model · #gui #debugging #benchmark · Read on arxiv →
arxiv · Apr 23

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

arXiv:2604.16902v2 Announce Type: replace Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this g

#research #language-models #multimodal · Read on arxiv →
arxiv · Apr 22

Owner-Harm: A Missing Threat Model for AI Agent Safety

arXiv:2604.18658v1 Announce Type: cross Abstract: Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-wor

AGAGLL · 3 models · #safety #security #benchmark · Read on arxiv →
arxiv · Apr 22

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

arXiv:2604.19354v1 Announce Type: new Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agent

LL · 1 model · #cybersecurity #benchmark #open-source · Read on arxiv →
arxiv · Apr 21 · bearish

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

arXiv:2604.10577v2 Announce Type: replace-cross Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit thre

CL · 1 model · #safety #security #benchmark · Read on arxiv →
arxiv · Apr 21

Using large language models for embodied planning introduces systematic safety risks

arXiv:2604.18463v1 Announce Type: cross Abstract: Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normativ

#safety #benchmark #robotics · Read on arxiv →
arxiv · Apr 21 · bullish

Multilingual Training and Evaluation Resources for Vision-Language Models

arXiv:2604.18347v1 Announce Type: new Abstract: Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for trainin

PIPICO · 3 models · #multilingual #multimodal #benchmark · Read on arxiv →
arxiv · Apr 21

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

arXiv:2603.24621v2 Announce Type: replace Abstract: We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action

#benchmark #intelligence #research · Read on arxiv →
arxiv · Apr 17 · bullish

ExpSeek: Self-Triggered Experience Seeking for Web Agents

arXiv:2601.08605v2 Announce Type: replace-cross Abstract: Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience p

QWQW4B · 3 models · #experience-intervention #web-agents #benchmark · Read on arxiv →
arxiv · Apr 17

Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

arXiv:2604.14210v1 Announce Type: new Abstract: A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching t

MIGL · 2 models · #language-models #efficiency #benchmark · Read on arxiv →
arxiv · Apr 13 · bullish

Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

arXiv:2604.08970v1 Announce Type: cross Abstract: We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and

LI · 1 model · #multilingual #evaluation #benchmark · Read on arxiv →
arxiv · Apr 10 · bearish

Benchmarking LLM Tool-Use in the Wild

arXiv:2604.06185v1 Announce Type: cross Abstract: Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behav

LA · 1 model · #human-computer-interaction #language-models #benchmark · Read on arxiv →
arxiv · Apr 10

Matrix Profile for Anomaly Detection on Multidimensional Time Series

arXiv:2409.09298v2 Announce Type: replace-cross Abstract: The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurre

#time-series #anomaly-detection #machine-learning · Read on arxiv →
arxiv · Apr 8

PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

arXiv:2604.05775v1 Announce Type: new Abstract: Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical val

LA · 1 model · #genomics #benchmark #biological sequences · Read on arxiv →
arxiv · Apr 7 · bullish

Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark

arXiv:2604.04226v1 Announce Type: cross Abstract: Agentic Web, as a new paradigm that redefines the internet through autonomous, goal-driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive

#multiagent #artificial-intelligence #benchmark · Read on arxiv →
arxiv · Apr 7

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

arXiv:2512.03666v2 Announce Type: replace-cross Abstract: A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain large

#computer-vision #benchmark #embodied-intelligence · Read on arxiv →
arxiv · Apr 7

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

arXiv:2604.02368v2 Announce Type: replace Abstract: As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks su

#benchmark #evaluation #expert-level · Read on arxiv →
arxiv · Apr 7

TimeSeek: Temporal Reliability of Agentic Forecasters

arXiv:2604.04220v1 Announce Type: new Abstract: We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with a

TI · 1 model · #benchmark #forecasting #evaluation · Read on arxiv →
arxiv · Apr 6 · bullish

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

arXiv:2604.02869v1 Announce Type: new Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Rela

QWQWGP · 5 models · +2 · #reinforcement-learning #conversational-ai #benchmark · Read on arxiv →
arxiv · Apr 6 · bearish

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

arXiv:2604.02947v1 Announce Type: new Abstract: Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. T

CLOPIF · 7 models · +4 · #safety #benchmark #autonomous agents · Read on arxiv →
arxiv · Apr 4

From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents

arXiv:2604.01733v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strateg

BMHYHY · 3 models · #information-retrieval #benchmark #financial-qa · Read on arxiv →
arxiv · Apr 3 · bearish

Can LLMs Perceive Time? An Empirical Investigation

arXiv:2604.00010v1 Announce Type: cross Abstract: Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4–7× (p < 0.001), with mod

GP · 1 model · #language-models #benchmark #safety · Read on arxiv →
arxiv · Apr 3 · bullish

Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents

arXiv:2604.00137v1 Announce Type: new Abstract: Tool-integrated LLMs can retrieve, compute, and take real-world actions via external tools, but reliability remains a key bottleneck. We argue that failures stem from both tool-use accuracy (how well an agent invokes a tool) and intrinsic tool accuracy

#reliability #benchmark #open-source · Read on arxiv →
arxiv · Apr 2 · bullish

SkillRouter: Skill Routing for LLM Agents at Scale

arXiv:2603.22455v4 Announce Type: replace Abstract: Reusable skills let LLM agents package task-specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasi

SK · 1 model · #machine-learning #benchmark #routing · Read on arxiv →