Tag

#llms

11 articles tagged #llms

arxiv · 4d ago

Are Diversity Metrics Measuring Diversity? A Capability-Controlled Audit of Majority-Vote Gain in LLM Ensembles

arXiv:2607.20768v1 Announce Type: cross Abstract: Majority voting over LLMs is widely assumed to benefit from diversity, and diversity measures are used to choose which models to combine. We ask whether five such measures track diversity or mainly re-express capability, auditing them as predictors o

#llms #diversity #machine-learning Read on arxiv →

arxiv · Jul 17 · bearish

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

arXiv:2607.00724v3 Announce Type: replace Abstract: Multilingual fluency often invites a stronger assumption: a model that can speak a user's language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption directly, we intr

LL · 1 model · #multilingual #benchmark #cultural-alignment Read on arxiv →

arxiv · Jul 3 · bullish

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

arXiv:2607.01440v1 Announce Type: new Abstract: Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it shoul

FA QW · 2 models · #medicine #llms #reinforcement-learning Read on arxiv →

arxiv · Jun 12

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

arXiv:2601.13591v2 Announce Type: replace Abstract: Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a

CL MI GP · 4 models · +1 · #benchmark #evaluation #data science Read on arxiv →

arxiv · Jun 12 · bullish

MiniMax Sparse Attention

arXiv:2606.13392v1 Announce Type: new Abstract: Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadr

#attention-mechanisms #llms #optimization Read on arxiv →

arxiv · Jun 6

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

arXiv:2504.10823v4 Announce Type: replace-cross Abstract: Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-base

GP CL · 2 models · #value-based decision-making #llms #benchmark Read on arxiv →

arxiv · Jun 6 · bullish

Benchmark Everything Everywhere All at Once

arXiv:2606.06462v1 Announce Type: new Abstract: Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalabili

#benchmark #llms #autonomous-systems Read on arxiv →

arxiv · May 25 · bullish

AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

arXiv:2601.17261v4 Announce Type: replace Abstract: Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically emp

QW PA · 2 models · #optimization #llms #fine-tuning Read on arxiv →

arxiv · May 25 · bullish

Task-Awareness Improves LLM Generations and Uncertainty

arXiv:2601.21500v2 Announce Type: replace Abstract: In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space

#llms #machine-learning #uncertainty-estimation Read on arxiv →

mit-tech-review · May 21 · bullish

Roundtables: Can AI Learn to Understand the World?

Listen to the session or watch below. AI companies want to build systems that understand the external world and overcome the limitations of LLMs. Recent developments have brought world models to the forefront of the AI discussion. Watch a conversation with editor in chief Mat Honan, senior AI editor

#world-models #llms #ai-development Read on mit-tech-review →

arxiv · Apr 13 · bullish

ConvoLearn: A Learning Sciences Grounded Dataset for Fine-Tuning Dialogic AI Tutors

arXiv:2601.08950v4 Announce Type: replace Abstract: Despite their growing adoption in education, LLMs remain misaligned with the core principle of effective tutoring: the dialogic construction of knowledge. We introduce ConvoLearn, a dataset of 2,134 semi-synthetic tutor-student dialogues operationa

MI · 1 model · #education #dialogic #tutoring Read on arxiv →
