arxiv Jul 16

The Illusion of Robustness: Aggregate Accuracy Hides Prediction Flips under Task-Irrelevant Context

arXiv:2607.12963v2 Announce Type: new Abstract: As large language models (LLMs) grow more capable, they are increasingly deployed in context-rich settings where task inputs are often accompanied by long, partially irrelevant context. In a controlled setting, we find that state-of-the-art models ofte

#language-models #reliability #evaluation

arxiv Jun 10

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

arXiv:2606.10298v1 Announce Type: new Abstract: When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm tha

#reliability #language-models #evaluation

arxiv May 4

Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

arXiv:2605.00326v1 Announce Type: new Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is cons

#safety #benchmark #calibration

arxiv Apr 28

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

arXiv:2604.24278v1 Announce Type: cross Abstract: Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses so

#speech-recognition #reliability #evaluation

arxiv Apr 21 bullish

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

arXiv:2507.16727v3 Announce Type: replace Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based

#reliability #research #question-answering

arxiv Apr 17

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

arXiv:2604.14799v1 Announce Type: new Abstract: Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerabi

#multimodal #evaluation #abstention

arxiv Apr 10 bullish

QA-MoE: Towards a Continuous Reliability Spectrum with Quality-Aware Mixture of Experts for Robust Multimodal Sentiment Analysis

arXiv:2604.05704v2 Announce Type: replace Abstract: Multimodal Sentiment Analysis (MSA) aims to infer human sentiment from textual, acoustic, and visual signals. In real-world scenarios, however, multimodal inputs are often compromised by dynamic noise or modality missingness. Existing methods typic

#multimodal-sentiment-analysis #reliability #quality-aware

arxiv Apr 3 bullish

Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents

arXiv:2604.00137v1 Announce Type: new Abstract: Tool-integrated LLMs can retrieve, compute, and take real-world actions via external tools, but reliability remains a key bottleneck. We argue that failures stem from both tool-use accuracy (how well an agent invokes a tool) and intrinsic tool accuracy

#reliability #benchmark #open-source