arxiv 3d ago

MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv:2604.21916v1 Announce Type: new Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. W

#benchmark #evaluation #language-models Read on arxiv →

arxiv 3d ago

Survey on Evaluation of LLM-based Agents

arXiv:2503.16416v2 Announce Type: replace Abstract: LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasing

#evaluation #agents #benchmark Read on arxiv →

arxiv 5d ago

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

arXiv:2604.19354v1 Announce Type: new Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agent

1 model #cybersecurity #benchmark #open-source Read on arxiv →

arxiv 6d ago bullish

Multilingual Training and Evaluation Resources for Vision-Language Models

arXiv:2604.18347v1 Announce Type: new Abstract: Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for trainin

3 models #multilingual #multimodal #benchmark Read on arxiv →

arxiv 6d ago bullish

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

arXiv:2604.17358v1 Announce Type: new Abstract: While recent Spoken Language Models (SLMs) have been actively deployed in real-world scenarios, they lack the capability to discern Third-Party Interruptions (TPI) from the primary user's ongoing flow, leaving them vulnerable to contextual failures. To

#spoken-language #dataset #evaluation Read on arxiv →

arxiv Apr 18 bearish

Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

arXiv:2604.09982v2 Announce Type: replace-cross Abstract: Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, both models show a drop

2 models #information-retrieval #reproducibility #evaluation Read on arxiv →

arxiv Apr 17

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

arXiv:2604.14799v1 Announce Type: new Abstract: Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerabi

#multimodal #evaluation #abstention Read on arxiv →

arxiv Apr 16

From Seeing it to Experiencing it: Interactive Evaluation of Intersectional Voice Bias in Human-AI Speech Interaction

arXiv:2604.13067v1 Announce Type: cross Abstract: SpeechLLMs process spoken language directly from audio, but accent and vocal identity cues can lead to biased behaviour. Current bias evaluations often miss how such bias manifests in end-to-end speech interactions and how users experience it. We dis

#bias #speech #conversational-ai Read on arxiv →

arxiv Apr 14 bullish

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

arXiv:2604.10520v1 Announce Type: cross Abstract: As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designe

#evaluation #code-summarization #language-models Read on arxiv →

arxiv Apr 14

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

arXiv:2604.10673v1 Announce Type: new Abstract: AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too

#alignment #interpretability #evaluation Read on arxiv →

arxiv Apr 13 bullish

Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

arXiv:2604.08970v1 Announce Type: cross Abstract: We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and

1 model #multilingual #evaluation #benchmark Read on arxiv →

arxiv Apr 10 bullish

The Art of Building Verifiers for Computer Use Agents

arXiv:2604.06240v1 Announce Type: cross Abstract: Verifying the success of computer use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signal can be trusted. In this paper, we present lessons learned from building a best-in-class veri

3 models #verification #evaluation #artificial-intelligence Read on arxiv →

arxiv Apr 10 bearish

Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models

arXiv:2601.05529v5 Announce Type: replace Abstract: High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complet

3 models #navigation #decision-making #safety Read on arxiv →

arxiv Apr 10 bearish

Benchmarking LLM Tool-Use in the Wild

arXiv:2604.06185v1 Announce Type: cross Abstract: Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behav

1 model #human-computer-interaction #language-models #benchmark Read on arxiv →

arxiv Apr 9

ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

arXiv:2604.06484v1 Announce Type: new Abstract: Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can gro

#multimodal #evaluation #culture Read on arxiv →

arxiv Apr 9

Continuous Interpretive Steering for Scalar Diversity

arXiv:2604.07006v1 Announce Type: new Abstract: Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. Howe

1 model #pragmatic-inference #language-models #interpretability Read on arxiv →

arxiv Apr 7

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

arXiv:2604.02368v2 Announce Type: replace Abstract: As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks su

#benchmark #evaluation #expert-level Read on arxiv →

arxiv Apr 7

TimeSeek: Temporal Reliability of Agentic Forecasters

arXiv:2604.04220v1 Announce Type: new Abstract: We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with a

1 model #benchmark #forecasting #evaluation Read on arxiv →

arxiv Apr 6 bearish

Multimodal Language Models Cannot Spot Spatial Inconsistencies

arXiv:2604.00799v2 Announce Type: replace-cross Abstract: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D ge

#computer-vision #machine-learning #evaluation Read on arxiv →