Tag

#benchmark

100 articles tagged #benchmark

arxiv · 4d ago

Verifier-First Evaluation of Agentic LLMs for Infrastructure-as-Code Generation

arXiv:2607.20478v1 Announce Type: cross Abstract: Infrastructure-as-Code (IaC) generation from natural language requires satisfying provider schemas, dependency planning, and organizational policy constraints, not merely producing syntactically plausible configurations. We present a verifier-first e

8 models · #software engineering #infrastructure-as-code #terraform Read on arxiv →

arxiv · 4d ago

CultureTalk-ID: A Multi-Task Dialogue Benchmark for Cultural Commonsense in Indonesian Local Languages

arXiv:2607.21016v1 Announce Type: new Abstract: Culture is lived through conversation, yet existing Indonesian cultural commonsense benchmarks evaluate LLMs on short and isolated prompts, stripping away the dialogic context in which cultural nuances actually surface. We introduce CultureTalk-ID, the

1 model · #benchmark #cultural-commonsense #language-models Read on arxiv →

arxiv · 6d ago

Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents

arXiv:2506.17913v2 Announce Type: replace Abstract: Graphical User Interface (GUI) agents have made significant progress in automating digital tasks through the utilization of computer vision and language models. Nevertheless, existing agent systems encounter notable limitations. Firstly, they predo

1 model · #gui-automation #cognitive-architectures #benchmark Read on arxiv →

arxiv · 6d ago · bullish

SechKAN: Kolmogorov-Arnold Networks with Hyperbolic Secant Functions

arXiv:2607.18290v1 Announce Type: cross Abstract: In recent years, Kolmogorov-Arnold Networks (KANs) have attracted increasing attention due to their effectiveness in machine learning and scientific computing tasks, offering a new paradigm for neural network design. In this paper, we present SechKAN

2 models · #machine-learning #neural-networks #scientific-computing Read on arxiv →

arxiv · 6d ago

Towards Principled Continual Anomaly Detection: A Systematic Framework and Benchmark Scenarios

arXiv:2607.18289v1 Announce Type: cross Abstract: Continual anomaly detection (CAD) studies how models can adapt to evolving data distributions while retaining performance on previously observed regimes. CAD benchmarks, however, depend critically on how tasks are defined, filtered, ordered, and vali

#anomaly-detection #continual-learning #benchmark Read on arxiv →

arxiv · Jul 18 · bullish

VideoSEMA: a scalable and efficient Mamba-like attention for video understanding

arXiv:2607.14711v1 Announce Type: cross Abstract: We present for video understanding (classification) a split space-time attention model, VideoSEMA, consisting of a scalable and efficient Mamba-like attention (SEMA) block in space and a softmax temporal attention in time. In each frame, SEMA attenti

3 models · #video understanding #attention models #computer vision Read on arxiv →

arxiv · Jul 18

Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization

arXiv:2605.21751v2 Announce Type: replace Abstract: Text-to-optimization requires two separable capabilities: modeling -- choosing the right optimization structure -- and binding -- grounding every coefficient, index, and parameter in the concrete problem data. We study this via Text2Opt-Bench, a sc

#optimization #machine-learning #benchmark Read on arxiv →

arxiv · Jul 17 · bearish

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

arXiv:2607.00724v3 Announce Type: replace Abstract: Multilingual fluency often invites a stronger assumption: a model that can speak a user's language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption directly, we intr

1 model · #multilingual #benchmark #cultural-alignment Read on arxiv →

arxiv · Jul 16 · bullish

UESF-Bench: Benchmarking and Probing for Unified Embodied Seeking and Following

arXiv:2607.13621v1 Announce Type: new Abstract: Language-guided human following is an important capability for embodied agents, but existing benchmarks typically assume that the target person is visible at the start of an episode. This setting simplifies the problem and overlooks a more realistic re

1 model · #benchmark #embodied-agents #language-guided Read on arxiv →

arxiv · Jul 14 · bullish

Index SLM Technical Report

arXiv:2607.09885v1 Announce Type: new Abstract: We present Index-1.9B, a series of open small language models developed at Bilibili. The series comprises four models: Index-1.9B-Base, a foundation model with 1.9 billion non-embedding parameters pre-trained on 2.8 trillion predominantly Chinese and E

4 models · #open-source #language-models #pre-training Read on arxiv →

arxiv · Jul 14

LongMedBench: Benchmarking Medical Agents for Long-Horizon Clinical Decision-Making

arXiv:2607.09322v2 Announce Type: replace Abstract: In this work, we introduce LongMedBench, a real-world EHR-based benchmark for long-horizon clinical decision-making. Prior evaluations of LLM-based medical agents have largely emphasized short-context knowledge QA and tool use. However, real-world

1 model · #benchmark #medical #evaluation Read on arxiv →

arxiv · Jul 10 · bullish

Aligning Clinical Needs and AI Capabilities: A Survey on LLMs for Medical Reasoning

arXiv:2607.07761v1 Announce Type: new Abstract: Large language models (LLMs) have emerged as important tools in healthcare, showing growing potential for clinical reasoning and patient care. This survey examines recent progress in medical LLMs, focusing on reasoning applications and requirements. We

#medical-reasoning #large-language-models #benchmark Read on arxiv →

arxiv · Jul 10

Blind-Spots-Bench: Evaluating Blind Spots in Multimodal Models

arXiv:2607.08317v1 Announce Type: new Abstract: Modern AI models achieve strong performance on many established benchmarks, yet they still fail on tasks that humans find almost trivial, such as manipulating a string or drawing a dog with five legs. These examples suggest that existing benchmarks may

#benchmark #evaluation #ai-research Read on arxiv →

arxiv · Jul 10

Persona Cartography: Charting Language Model Personality Traits in Weight Space

arXiv:2607.07916v1 Announce Type: new Abstract: Large language models exhibit recurring behavioural patterns -- personas -- that shape generalisation and safety, but we lack reliable tools for decomposing, measuring, and controlling them. Our central insight is to treat personas as positions in a sp

#safety #personality #benchmark Read on arxiv →

arxiv · Jul 10

Understanding Axes of Difficulty For Long Context Tasks Via PredicateLongBench

arXiv:2607.08284v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated rapidly improving long-context capabilities, prompting a wave of benchmarks designed to evaluate them. However, existing long-context evaluations - from Needle-in-a-Haystack (NIAH) tests to more recent mul

#benchmark #long-context #evaluation Read on arxiv →

arxiv · Jul 10 · bullish

DeepPySR -- A Symbolic Regression Framework with Dynamic Pruning, Pareto Selection, and Hierarchical Composition for Real-World Scientific Discovery

arXiv:2607.08150v1 Announce Type: new Abstract: Symbolic regression (SR) discovers analytical equations from data, yielding glass-box models with directly interpretable formulas, unlike black-box methods that rely on unstable post-hoc tools such as SHAP or LIME. This transparency is crucial in clini

4 models · #symbolic regression #interpretable models #benchmark Read on arxiv →

arxiv · Jul 3

Parameter Golf: What Really Works?

arXiv:2607.01517v1 Announce Type: new Abstract: How far can a language model improve under a strict artifact budget? Parameter Golf posed this question as an open community challenge in which participants trained the best language model, with the complete artifact (training code + compressed weights

#optimization #language-models #benchmark Read on arxiv →

arxiv · Jul 2 · bullish

WorkBench Revisited: Workplace Agents Two Years On

arXiv:2606.13715v2 Announce Type: replace Abstract: The best agent on WorkBench in March 2024, GPT-4, completed just 43% of tasks. We revisit the benchmark in June 2026 and find that the best agent to date, Claude Fable 5, now completes 98%. Beyond this considerable progress in frontier agent perfor

2 models · #benchmark #safety #open-source Read on arxiv →

arxiv · Jul 1 · bullish

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

arXiv:2606.31551v1 Announce Type: new Abstract: Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that autonomous post-training is no

2 models · #autonomous training #language models #benchmark Read on arxiv →

arxiv · Jun 30 · bearish

DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

arXiv:2604.05318v2 Announce Type: replace Abstract: Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark f

3 models · #disinformation #dialects #language Read on arxiv →

arxiv · Jun 29

Foundation vs. Specialized Models: Evaluating Catastrophic Forgetting in Continual Time Series Forecasting

arXiv:2510.00809v3 Announce Type: replace Abstract: While Time Series Foundation Models (TSFMs) excel in zero-shot tasks, their behavior under continual fine tuning is poorly understood. We present the first systematic study of catastrophic forgetting in TSFMs (TimesFM-2.0, Chronos-2) versus a speci

3 models · #time-series #continual-learning #catastrophic-forgetting Read on arxiv →

arxiv · Jun 27

MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG

arXiv:2606.26793v1 Announce Type: cross Abstract: Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approac

#security #adversarial-attacks #multimodal Read on arxiv →

arxiv · Jun 27 · bullish

Boundary-Aware Context Grounding for A Low-Channel EEG Agent

arXiv:2606.26519v1 Announce Type: new Abstract: Large language models (LLMs) can make scientific software easier to use. However, a general model does not automatically know which measurements a particular sensor can support, which algorithms are implemented in the current software, or which conclus

1 model · #open-source #eeg #scientific-software Read on arxiv →

arxiv · Jun 25 · bearish

Riemann-Bench: A Benchmark for Moonshot Mathematics

arXiv:2604.06802v3 Announce Type: replace Abstract: Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of m

#mathematics #benchmark #research Read on arxiv →

arxiv · Jun 25 · bullish

2.5-D Decomposition for LLM-Based Spatial Construction

arXiv:2605.07066v3 Announce Type: replace Abstract: Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-s

3 models · #autonomous-systems #spatial-reasoning #neuro-symbolic Read on arxiv →

arxiv · Jun 20

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

arXiv:2606.15862v2 Announce Type: replace Abstract: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simula

#benchmark #evaluation #autonomy Read on arxiv →

arxiv · Jun 20

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

arXiv:2606.18191v2 Announce Type: replace Abstract: Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries. In contrast, many enterprise tasks instead require an agent to identify concrete workflows

1 model · #workflow #benchmark #personalization Read on arxiv →

arxiv · Jun 20 · bullish

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

arXiv:2506.14990v3 Announce Type: replace Abstract: Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as C

#reinforcement-learning #benchmark #multi-agent Read on arxiv →

arxiv · Jun 19 · bullish

Physics-Informed Neural Network with Squeeze-Excitation-like Attention

arXiv:2606.19853v1 Announce Type: new Abstract: We introduce SEA-PINN, a novel architecture that incorporates a Squeeze-Excitation-like attention mechanism into physics-informed neural networks to dynamically recalibrate the importance of neurons across layers. A key feature of SEA-PINN is its highl

3 models · #physics-informed-neural-networks #machine-learning #benchmark Read on arxiv →

arxiv · Jun 18

MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

arXiv:2606.18640v1 Announce Type: new Abstract: Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized that the lack of stand

1 model · #glucose-forecasting #benchmark #multimodal Read on arxiv →

arxiv · Jun 18 · bullish

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

arXiv:2505.23851v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 va

#mathematics #evaluation #benchmark Read on arxiv →

arxiv · Jun 18 · bullish

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text on

1 model · #computer-use-agents #multimodal-skills #gui-interaction Read on arxiv →

arxiv · Jun 18

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

arXiv:2606.18733v1 Announce Type: cross Abstract: Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid

#software-engineering #artificial-intelligence #benchmark Read on arxiv →

huggingface · Jun 17 · bullish

GLM-5.2: Built for Long-Horizon Tasks

7 models · #open-source #benchmark #long-horizon tasks Read on huggingface →

arxiv · Jun 17

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

arXiv:2606.17266v1 Announce Type: new Abstract: Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker

#production-planning #workforce-management #benchmark Read on arxiv →

arxiv · Jun 16 · bullish

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

arXiv:2603.02668v2 Announce Type: replace Abstract: We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yi

1 model · #benchmark #open-source #mathematics Read on arxiv →

arxiv · Jun 12 · bullish

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

arXiv:2606.13608v1 Announce Type: new Abstract: Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse

#benchmark #evaluation #artificial-intelligence Read on arxiv →

arxiv · Jun 12

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

arXiv:2601.13591v2 Announce Type: replace Abstract: Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a

4 models · #benchmark #evaluation #data science Read on arxiv →

arxiv · Jun 12

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

arXiv:2606.13602v1 Announce Type: new Abstract: We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark

4 models · #benchmark #evaluation #epigenomics Read on arxiv →

arxiv · Jun 12 · bearish

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

arXiv:2606.13385v1 Announce Type: cross Abstract: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks

#security #benchmark #vulnerability Read on arxiv →

arxiv · Jun 12

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

arXiv:2605.26418v2 Announce Type: replace Abstract: A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reprod

6 models · #reinforcement-learning #benchmark #resource-control Read on arxiv →

arxiv · Jun 10 · bearish

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv:2606.10254v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we

1 model · #evaluation #benchmark #mathematics Read on arxiv →

arxiv · Jun 10 · bullish

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

arXiv:2602.12424v2 Announce Type: replace-cross Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to

#evaluation #benchmark #language-models Read on arxiv →

arxiv · Jun 6

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

arXiv:2504.10823v4 Announce Type: replace-cross Abstract: Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-base

2 models · #value-based decision-making #llms #benchmark Read on arxiv →

arxiv · Jun 6 · bullish

Benchmark Everything Everywhere All at Once

arXiv:2606.06462v1 Announce Type: new Abstract: Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalabili

#benchmark #llms #autonomous-systems Read on arxiv →

arxiv · Jun 5 · bullish

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

arXiv:2606.05557v1 Announce Type: new Abstract: A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AUR

2 models · #natural-language-processing #inference #benchmark Read on arxiv →

arxiv · Jun 2 · bullish

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

arXiv:2606.01230v1 Announce Type: new Abstract: Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and p

3 models · #smart-home #language-models #benchmark Read on arxiv →

arxiv · Jun 2

OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models

arXiv:2606.00815v1 Announce Type: new Abstract: Electroencephalography (EEG) supports a variety of brain-computer interface (BCI) tasks ranging from brain-state monitoring to human-LLM interactions. EEG foundation models are emerging, but evaluation remains fragmented due to heterogeneous datasets a

#benchmark #machine-learning #neuroscience Read on arxiv →

arxiv · Jun 2

Do Joint Audio-Video Generation Models Understand Physics?

arXiv:2605.07061v2 Announce Type: replace-cross Abstract: Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consisten

1 model · #benchmark #audio-video #generation Read on arxiv →

arxiv · Jun 1 · bullish

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

arXiv:2605.31183v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs

3 models · #language-models #benchmark #interpretability Read on arxiv →

arxiv · May 29

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

arXiv:2509.23571v3 Announce Type: replace-cross Abstract: As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat an

#cybersecurity #threat-hunting #benchmark Read on arxiv →

arxiv · May 29

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation

arXiv:2605.29741v1 Announce Type: new Abstract: The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific termin

4 models · #machine translation #african languages #scientific communication Read on arxiv →

arxiv · May 29

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

arXiv:2605.22100v2 Announce Type: replace Abstract: Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for r

#document-parsing #benchmark #information-systems Read on arxiv →

arxiv · May 28 · bullish

A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

arXiv:2604.17943v2 Announce Type: replace Abstract: RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and ev

1 model · #benchmark #evaluation #specialist-domains Read on arxiv →

arxiv · May 28

Revisiting Metafeatures to Explain Model Differences on Tabular Data

arXiv:2605.28418v1 Announce Type: new Abstract: With the rise of tabular foundation models alongside traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps bet

2 models · #machine learning #benchmark #tabular data Read on arxiv →

arxiv · May 27

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally va

#education #benchmark #video-generation Read on arxiv →

arxiv · May 26

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

arXiv:2605.23940v1 Announce Type: new Abstract: How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent

1 model · #reasoning #benchmark #multi-turn Read on arxiv →

arxiv · May 22

Robust Reasoning Benchmark

arXiv:2604.08571v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13

1 model · #benchmark #mathematical reasoning #large language models Read on arxiv →

arxiv · May 21

Refining and Reusing Annotation Guidelines for LLM Annotation

arXiv:2605.20809v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelin

3 models · #research #language models #benchmark Read on arxiv →

arxiv · May 19

Fidelity Probes for Specification--Code Alignment

arXiv:2605.17246v1 Announce Type: cross Abstract: We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into

7 models · #machine learning #artificial intelligence #benchmark Read on arxiv →

arxiv · May 19

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

arXiv:2605.14068v2 Announce Type: replace-cross Abstract: We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and

5 models · #benchmark #computer vision #topological reasoning Read on arxiv →

arxiv · May 16

PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

arXiv:2605.14002v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long context question answering into open-ended exploration. Yet real world use requires models to discover and synthesize "long-tail" fact

#benchmark #information-retrieval #multilingual Read on arxiv →

arxiv · May 16

FutureSim: Replaying World Events to Evaluate Adaptive Agents

arXiv:2605.15188v1 Announce Type: cross Abstract: AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay

#benchmark #adaptation #machine-learning Read on arxiv →

arxiv · May 14 · bullish

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

arXiv:2601.20255v2 Announce Type: replace-cross Abstract: SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supe

#benchmark #software-engineering #large-language-models Read on arxiv →

arxiv · May 11

How Value Induction Reshapes LLM Behaviour

arXiv:2605.07925v1 Announce Type: new Abstract: Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility,

#language-models #value-induction #safety Read on arxiv →

arxiv · May 11 · bullish

End-to-end PDDL Planning with Hardcoded and Dynamic Agents

arXiv:2512.09629v2 Announce Type: replace Abstract: We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem

5 models · #planning #natural-language-processing #large-language-models Read on arxiv →

arxiv · May 11

The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking

arXiv:2605.06707v1 Announce Type: cross Abstract: This paper presents an eight-week observational comparison of 68 single-file HTML generations collected across 17 public experiments in the "HTML AI Battle" project between December 10, 2025 and February 4, 2026. Four reasoning model families, GPT, G

4 models · #software engineering #artificial intelligence #benchmark Read on arxiv →

arxiv · May 8

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

arXiv:2605.05726v1 Announce Type: new Abstract: As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption b

#benchmark #llm #retrieval Read on arxiv →

arxiv · May 8

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

arXiv:2605.06136v1 Announce Type: cross Abstract: Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working

#benchmark #software-engineering #artificial-intelligence Read on arxiv →

arxiv · May 8

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv:2605.06327v1 Announce Type: cross Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an ob

OL OL MI · 7 models · +4 #safety #benchmark #evaluation Read on arxiv →

arxiv · May 8

Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

arXiv:2603.18257v2 Announce Type: replace-cross Abstract: When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observational select

SA · 1 model #machine-learning #artificial-intelligence #reinforcement-learning Read on arxiv →

arxiv · May 8

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

arXiv:2605.06455v1 Announce Type: new Abstract: Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are bri

#language-models #monitoring #safety Read on arxiv →

arxiv · May 5

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

arXiv:2508.07630v2 Announce Type: replace-cross Abstract: We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and publi

#benchmark #vision-language #multimodal-reasoning Read on arxiv →

arxiv · May 4

Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

arXiv:2605.00326v1 Announce Type: new Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is cons

ZE · 1 model #safety #benchmark #calibration Read on arxiv →

arxiv · May 1

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment

arXiv:2506.22500v2 Announce Type: replace-cross Abstract: Automated identification of surgical safety risks is critical for improving patient outcomes; however, Multimodal Large Language Models (MLLMs) frequently suffer from Visual-Semantic Knowledge Conflicts (VS-KC), a phenomenon where models poss

#safety #medical #computer-vision Read on arxiv →

arxiv · May 1 · bullish

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

arXiv:2604.28039v1 Announce Type: new Abstract: Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a p

#multimodal #benchmark #scientific-research Read on arxiv →

arxiv · May 1 · bearish

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

arXiv:2604.28139v1 Announce Type: cross Abstract: LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult

#benchmark #workflow #evaluation Read on arxiv →

arxiv · Apr 30

Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

arXiv:2505.21190v2 Announce Type: replace-cross Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-g

#radiology #benchmark #evaluation Read on arxiv →

arxiv · Apr 29 · bullish

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

arXiv:2604.24544v1 Announce Type: new Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, re

LA TG · 2 models #benchmark #evaluation #language-models Read on arxiv →

arxiv · Apr 27 · bullish

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

arXiv:2604.14306v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describ

LA · 1 model #multilingual #medical-ai #benchmark Read on arxiv →

arxiv · Apr 24 · bullish

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

arXiv:2604.19856v1 Announce Type: cross Abstract: Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches su

CH MA CH · 4 models · +1 #hardware #synthesis #generation Read on arxiv →

arxiv · Apr 24

MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv:2604.21916v1 Announce Type: new Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. W

#benchmark #evaluation #language-models Read on arxiv →

arxiv · Apr 24 · bullish

Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

arXiv:2506.12721v2 Announce Type: replace Abstract: Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To

#large-language-models #compute-optimization #bandit-learning Read on arxiv →

arxiv · Apr 24

Survey on Evaluation of LLM-based Agents

arXiv:2503.16416v2 Announce Type: replace Abstract: LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasing

#evaluation #agents #benchmark Read on arxiv →

arxiv · Apr 23 · bullish

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

arXiv:2604.19750v1 Announce Type: cross Abstract: Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and struggle

GE · 1 model #gui #debugging #benchmark Read on arxiv →

arxiv · Apr 23

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

arXiv:2604.16902v2 Announce Type: replace Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this g

#research #language-models #multimodal Read on arxiv →

arxiv · Apr 22

Owner-Harm: A Missing Threat Model for AI Agent Safety

arXiv:2604.18658v1 Announce Type: cross Abstract: Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-wor

AG AG LL · 3 models #safety #security #benchmark Read on arxiv →

arxiv · Apr 22

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

arXiv:2604.19354v1 Announce Type: new Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agent

LL · 1 model #cybersecurity #benchmark #open-source Read on arxiv →

arxiv · Apr 21 · bearish

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

arXiv:2604.10577v2 Announce Type: replace-cross Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit thre

CL · 1 model #safety #security #benchmark Read on arxiv →

arxiv · Apr 21

Using large language models for embodied planning introduces systematic safety risks

arXiv:2604.18463v1 Announce Type: cross Abstract: Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normativ

#safety #benchmark #robotics Read on arxiv →

arxiv · Apr 21 · bullish

Multilingual Training and Evaluation Resources for Vision-Language Models

arXiv:2604.18347v1 Announce Type: new Abstract: Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for trainin

PI PI CO · 3 models #multilingual #multimodal #benchmark Read on arxiv →

arxiv · Apr 21

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

arXiv:2603.24621v2 Announce Type: replace Abstract: We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action

#benchmark #intelligence #research Read on arxiv →

arxiv · Apr 17 · bullish

ExpSeek: Self-Triggered Experience Seeking for Web Agents

arXiv:2601.08605v2 Announce Type: replace-cross Abstract: Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience p

QW QW 4B · 3 models #experience-intervention #web-agents #benchmark Read on arxiv →

arxiv · Apr 17

Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

arXiv:2604.14210v1 Announce Type: new Abstract: A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching t

MI GL · 2 models #language-models #efficiency #benchmark Read on arxiv →

arxiv · Apr 13 · bullish

Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

arXiv:2604.08970v1 Announce Type: cross Abstract: We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and

LI · 1 model #multilingual #evaluation #benchmark Read on arxiv →

arxiv · Apr 10 · bearish

Benchmarking LLM Tool-Use in the Wild

arXiv:2604.06185v1 Announce Type: cross Abstract: Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behav

LA · 1 model #human-computer-interaction #language-models #benchmark Read on arxiv →

arxiv · Apr 10

Matrix Profile for Anomaly Detection on Multidimensional Time Series

arXiv:2409.09298v2 Announce Type: replace-cross Abstract: The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurre

#time-series #anomaly-detection #machine-learning Read on arxiv →

arxiv · Apr 8

PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

arXiv:2604.05775v1 Announce Type: new Abstract: Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical val

LA · 1 model #genomics #benchmark #biological sequences Read on arxiv →

arxiv · Apr 7 · bullish

Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark

arXiv:2604.04226v1 Announce Type: cross Abstract: Agentic Web, as a new paradigm that redefines the internet through autonomous, goal-driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive

#multiagent #artificial-intelligence #benchmark Read on arxiv →

arxiv · Apr 7

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

arXiv:2512.03666v2 Announce Type: replace-cross Abstract: A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain large

#computer-vision #benchmark #embodied-intelligence Read on arxiv →