DataBubble
Tag

#language-models

94 articles tagged #language-models

arxiv · 22h ago · bullish

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

arXiv:2606.04694v2 Announce Type: replace Abstract: Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framewor

DU · 1 model · #multilingual #distillation #language-models

arxiv · 1d ago · bullish

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

arXiv:2606.11119v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, aris

QW · 1 model · #reinforcement-learning #language-models #optimization

arxiv · 1d ago · bullish

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

arXiv:2606.09866v1 Announce Type: cross Abstract: Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our diagnostics show ta

LL · 1 model · #safety #fine-tuning #language-models

arxiv · 1d ago

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

arXiv:2606.10298v1 Announce Type: new Abstract: When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm tha

#reliability #language-models #evaluation

arxiv · 1d ago

Advancing the State-of-the-Art in Empirical Privacy Auditing

arXiv:2606.10481v1 Announce Type: cross Abstract: Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (M

#privacy #language-models #auditing

arxiv · 1d ago · bullish

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

arXiv:2602.12424v2 Announce Type: replace-cross Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to

#evaluation #benchmark #language-models

arxiv · 5d ago

A Systematic Analysis of Biases in Large Language Models

arXiv:2512.15792v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and resp

#fairness #bias #language-models

arxiv · 5d ago · bullish

FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

arXiv:2606.05644v1 Announce Type: new Abstract: When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditi

FI · 1 model · #retrieval-augmentation #contrastive-decoding #language-models

arxiv · 5d ago · bearish

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

arXiv:2504.10020v4 Announce Type: replace-cross Abstract: Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the outp

#multimodal #hallucinations #language-models

arxiv · 6d ago · bearish

Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

arXiv:2604.23600v2 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditi

LL · 1 model · #bias #language-models #stereotypes

arxiv · Jun 4 · bullish

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

arXiv:2606.04694v1 Announce Type: new Abstract: Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework th

DU · 1 model · #multilingual #distillation #language-models

arxiv · Jun 3 · bullish

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most info

FI · 1 model · #on-policy #distillation #optimization

arxiv · Jun 3 · bullish

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

arXiv:2606.03113v1 Announce Type: new Abstract: Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We reframe this optimization

ME ME · 2 models · #optimization #reinforcement-learning #language-models

arxiv · Jun 3 · bullish

Coherence Maximization Improves Pluralistic Alignment

arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these examples effective, u

IN · 1 model · #value-alignment #unsupervised-learning #language-models

arxiv · Jun 2 · bullish

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

arXiv:2606.01230v1 Announce Type: new Abstract: Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and p

HO HO GP · 3 models · #smart-home #language-models #benchmark

arxiv · Jun 2 · bullish

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

arXiv:2605.12969v3 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy

GR CO · 2 models · #reinforcement-learning #language-models #optimization

arxiv · Jun 2 · bullish

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

arXiv:2510.05342v2 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing over

DI IP $\ · 4 models · +1 · #machine-learning #optimization #language-models

arxiv · Jun 1 · bullish

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

arXiv:2605.31183v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs

SP LA LO · 3 models · #language-models #benchmark #interpretability

arxiv · Jun 1

LLM Anonymization Against Agentic Re-Identification

arXiv:2605.30848v1 Announce Type: cross Abstract: Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become cross-referenceable evidence for re-identification, yet those same details also carry downstream analytic value of the text. Existing defense

AU · 1 model · #anonymization #privacy #security

arxiv · May 29 · bullish

WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models

arXiv:2512.00837v2 Announce Type: replace Abstract: Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Exi

LA · 1 model · #watermarking #language-models #security

arxiv · May 29 · bullish

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

arXiv:2605.28919v1 Announce Type: cross Abstract: Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We p

CO · 1 model · #compact-models #reasoning #autoregressive

arxiv · May 29 · bullish

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

arXiv:2605.28864v1 Announce Type: new Abstract: The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matc

CO GP GP · 3 models · #language-models #transformers #cognitive-science

arxiv · May 29

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

arXiv:2605.29007v1 Announce Type: new Abstract: Personalized tutoring, teacher training, and education research need access to targeted synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate synthetic err

LL · 1 model · #education #synthetic-data #language-models

arxiv · May 29 · bullish

Unlocking the Working Memory of Large Language Models for Latent Reasoning

arXiv:2605.30343v1 Announce Type: cross Abstract: To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates intern

#reasoning #language-models #working-memory

arxiv · May 29 · bullish

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

arXiv:2601.08654v2 Announce Type: replace-cross Abstract: Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem:

#evaluation #language-models #rubric-scoring

arxiv · May 29

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

arXiv:2605.29025v1 Announce Type: new Abstract: Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy a

#evaluation #interpretability #language-models

arxiv · May 25

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

arXiv:2605.23190v1 Announce Type: new Abstract: Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MG

LA · 1 model · #machine-generated-texts #detection #language-models

arxiv · May 22 · bullish

Token-weighted Direct Preference Optimization with Attention

arXiv:2605.21883v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existi

LA · 1 model · #optimization #language-models #reinforcement-learning

arxiv · May 22

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

arXiv:2605.20744v1 Announce Type: cross Abstract: Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Rewar

LA · 1 model · #reward-hacking #evaluation #autonomous-agents

arxiv · May 22 · bullish

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

arXiv:2605.20199v1 Announce Type: cross Abstract: We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high qual

FL · 1 model · #language-models #diffusion #fine-tuning

arxiv · May 19 · bullish

EmoMind: Decoding Affective Captions from Human Brain fMRI

arXiv:2605.16739v1 Announce Type: cross Abstract: Decoding visual experience from brain activity has advanced substantially, but current brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with cate

EM OP · 2 models · #neuroscience #affective-computing #brain-decoding

arxiv · May 18

Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis

arXiv:2605.15440v1 Announce Type: new Abstract: Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surpr

RE · 1 model · #language-models #sentence-processing #syntactic-ambiguity

arxiv · May 16 · bearish

Quantifying and Mitigating Premature Closure in Frontier LLMs

arXiv:2605.15000v1 Announce Type: cross Abstract: Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate c

LL · 1 model · #safety #evaluation #language-models

arxiv · May 16

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

arXiv:2605.14857v1 Announce Type: new Abstract: Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, an

QW QW · 2 models · #tariff-classification #language-models #expert-systems

arxiv · May 15

GradShield: Alignment Preserving Finetuning

arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligne

#safety #finetuning #language-models

arxiv · May 15 · bullish

A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

arXiv:2506.11067v3 Announce Type: replace Abstract: Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS section from the clinical note using

ME GE MI · 4 models · +1 · #healthcare #language-models #open-source

arxiv · May 11

How Value Induction Reshapes LLM Behaviour

arXiv:2605.07925v1 Announce Type: new Abstract: Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility,

#language-models #value-induction #safety

arxiv · May 11

Searching for Privacy Risks in LLM Agents via Simulation

arXiv:2508.10880v3 Announce Type: replace-cross Abstract: The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. However, the evolving nature of such

LL · 1 model · #privacy #security #language-models

arxiv · May 11 · bullish

Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

arXiv:2605.06885v1 Announce Type: cross Abstract: Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregr

DI AU · 2 models · #diffusion #language-models #representation-learning

arxiv · May 8 · bullish

ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild

arXiv:2512.06721v2 Announce Type: replace Abstract: Recent studies have begun to explore proactive large language model (LLM) agents that provide unobtrusive assistance by automatically leveraging contextual information, such as in code editing and in-app suggestions. However, most focus on short, t

PR · 1 model · #proactive-assistance #language-models #human-computer-interaction

arxiv · May 8

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

arXiv:2605.06455v1 Announce Type: new Abstract: Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are bri

#language-models #monitoring #safety

arxiv · May 8 · bullish

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

arXiv:2605.05927v1 Announce Type: new Abstract: Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech ge

TE WH · 2 models · #speech-processing #language-models #modality-gap

arxiv · May 7 · bullish

Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

arXiv:2604.27201v2 Announce Type: replace Abstract: Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduce

PA QW · 2 models · #language-models #architecture #hybrid-thinking

arxiv · May 7

Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning

arXiv:2605.00364v2 Announce Type: replace Abstract: Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens despite only

LL TO WM · 3 models · #machine-unlearning #language-models #privacy

arxiv · May 6

E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

arXiv:2605.00955v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) equips large language models (LLMs) with external evidence by retrieving documents at inference time, but it also turns the retrieval corpus into a sensitive asset. Under a black-box setting, an adversary given a c

RE · 1 model · #security #language-models #inference

arxiv · May 5

Compute Optimal Tokenization

arXiv:2605.01188v1 Announce Type: new Abstract: Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tok

BL · 1 model · #tokenization #language-models #scaling-laws

arxiv · May 5 · bearish

Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs

arXiv:2605.01224v1 Announce Type: new Abstract: This paper argues that contemporary multilingual NLP has converged on a fragile and misleading paradigm of incidental multilingualism. Today's LLMs appear multilingual largely because they are trained on massive, uneven web corpora, not because multili

LL · 1 model · #nlp #multilingualism #language-models

arxiv · May 5 · bullish

Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

arXiv:2605.01372v1 Announce Type: new Abstract: Large language models (LLMs) have been widely explored for embedding generation. While recent studies show that in-context learning (ICL) effectively enhances the representational capability of LLMs by prepending a few task-related demonstrations, it c

#embedding #in-context-learning #language-models

arxiv · May 1

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

arXiv:2604.27019v1 Announce Type: cross Abstract: Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbreak robustness, yet do

#safety #language-models #adversarial-training

arxiv · May 1

Geometry-Calibrated Conformal Abstention for Language Models

arXiv:2604.27914v1 Announce Type: new Abstract: When language models lack relevant knowledge for a given query, they frequently generate plausible responses that can be hallucinations, rather than admitting being agnostic about the answer. Retraining models to reward admitting ignorance can lead to

#conformal-prediction #language-models #calibration

arxiv · May 1

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

arXiv:2604.27996v1 Announce Type: new Abstract: This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction parad

#scientific-visualization #language-models #human-computer-interaction

arxiv · May 1 · bullish

Proactive Dialogue Model with Intent Prediction

arXiv:2604.27379v1 Announce Type: new Abstract: Dialogue models are inherently reactive, responding to the current user turn without anticipating upcoming intents, which leads to redundant interactions in multi-intent settings. We address this limitation by introducing a lightweight intent-transitio

TE · 1 model · #dialogue-systems #intent-recognition #language-models

arxiv · Apr 30

Calibrated Surprise: An Information-Theoretic Account of Creative Quality

arXiv:2604.26269v1 Announce Type: cross Abstract: The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space collapses into a narrow region, and the surviving choices look least predictable from an unconstraine

LL · 1 model · #creative-writing #language-models #evaluation

arxiv · Apr 30

Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

arXiv:2604.26148v1 Announce Type: cross Abstract: AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond

VI AN · 2 models · #ui-interpretation #animation #human-computer-interaction

arxiv · Apr 30 · bullish

A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models

arXiv:2604.26351v1 Announce Type: new Abstract: Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such as reading times. However, it remains unclear whether such constraints similarly affect sentence comp

GP O3 O4 · 3 models · #language-models #cognitive-resources #sentence-comprehension

arxiv · Apr 30 · bullish

Test-Time Safety Alignment

arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion mode

#safety #language-models #optimization

arxiv · Apr 30

Differentially-Private Text Rewriting reshapes Linguistic Style

arXiv:2604.26656v1 Announce Type: new Abstract: Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing form

#differential-privacy #language-models #text-rewriting

arxiv · Apr 29 · bullish

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

arXiv:2604.24544v1 Announce Type: new Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, re

LA TG · 2 models · #benchmark #evaluation #language-models

arxiv · Apr 27 · bearish

The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

arXiv:2601.12979v3 Announce Type: replace Abstract: The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, does such efficiency gain

LL DR · 2 models · #diffusion-based #language-models #agentic-interaction

arxiv · Apr 24 · bullish

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

arXiv:2604.21357v1 Announce Type: new Abstract: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, includi

RE · 1 model · #geocoding #language-models #reinforcement-learning

arxiv · Apr 24

MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv:2604.21916v1 Announce Type: new Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. W

#benchmark #evaluation #language-models

arxiv · Apr 24 · bullish

DWTSumm: Discrete Wavelet Transform for Document Summarization

arXiv:2604.21070v1 Announce Type: new Abstract: Summarizing long, domain-specific documents with large language models (LLMs) remains challenging due to context limitations, information loss, and hallucinations, particularly in clinical and legal settings. We propose a Discrete Wavelet Transform (DW

GP BE · 2 models · #summarization #domain-specific #language-models

arxiv · Apr 23 · bullish

LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

arXiv:2308.03303v2 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) is crucial for improving their performance on downstream tasks, but full-parameter fine-tuning (Full-FT) is computationally expensive and memory-intensive. Parameter-efficient fine-tuning (PEFT) methods, suc

LO LO · 2 models · #fine-tuning #language-models #optimization

arxiv · Apr 23

Knowledge Capsules: Structured Nonparametric Memory Units for LLMs

arXiv:2604.20487v1 Announce Type: cross Abstract: Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely

#research #language-models #knowledge-retrieval

arxiv · Apr 23

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

arXiv:2604.16902v2 Announce Type: replace Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this g

#research #language-models #multimodal

arxiv · Apr 21 · bullish

VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation

arXiv:2510.27617v2 Announce Type: replace Abstract: Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametri

LA · 1 model · #hardware-design #language-models #automation

arxiv · Apr 20 · bullish

CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents

arXiv:2604.15802v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that

LA RA · 2 models · #retrieval #language-models #information-retrieval

arxiv · Apr 18

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

arXiv:2604.07941v2 Announce Type: replace-cross Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned, capable, and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), proce

#post-training #language-models #survey

arxiv · Apr 18

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

arXiv:2604.14258v1 Announce Type: cross Abstract: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a tra

#language-models #fine-tuning #reinforcement-learning

arxiv · Apr 18 · bullish

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

arXiv:2604.14267v1 Announce Type: new Abstract: Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agent

LL QW QW · 3 models · #machine-learning #reinforcement-learning #search-agents

arxiv · Apr 17 · bullish

ExpSeek: Self-Triggered Experience Seeking for Web Agents

arXiv:2601.08605v2 Announce Type: replace-cross Abstract: Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience p

QW QW 4B · 3 models · #experience-intervention #web-agents #benchmark

arxiv · Apr 17

Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model

arXiv:2604.14180v1 Announce Type: new Abstract: We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigat

TR · 1 model · #language-models #uncertainty #metacognition

arxiv · Apr 17 · bullish

Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions

arXiv:2502.16761v2 Announce Type: replace Abstract: Large language models (LLMs) present novel opportunities in public opinion research by predicting survey responses in advance during the early stages of survey design. Prior methods steer LLMs via descriptions of subpopulations as LLMs' input promp

#survey-research #language-models #fine-tuning

arxiv · Apr 17 · bullish

Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation

arXiv:2603.13683v2 Announce Type: replace Abstract: Although debiased large language models (LLMs) excel at handling known or low-bias prompts, they often fail on unfamiliar and high-bias prompts. We demonstrate via out-of-distribution (OOD) detection that these high-bias prompts cause a distributio

#debiasing #optimization #language-models

arxiv · Apr 17 · bullish

Training-Free Test-Time Contrastive Learning for Large Language Models

arXiv:2604.13552v1 Announce Type: cross Abstract: Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need s

#adaptation #reasoning #language-models

arxiv · Apr 17

Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration

arXiv:2604.13705v1 Announce Type: cross Abstract: Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a contro

RA · 1 model · #fairness #multiagent #language-models

arxiv · Apr 17

Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

arXiv:2604.14210v1 Announce Type: new Abstract: A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching t

MI GL · 2 models · #language-models #efficiency #benchmark

arxiv · Apr 17 · bullish

Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

arXiv:2604.14339v1 Announce Type: new Abstract: Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at th

ME QW · 2 models · #long-context #language-models #self-distillation

arxiv · Apr 17 · bullish

XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

arXiv:2604.05242v2 Announce Type: replace Abstract: Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, ex

LA · 1 model · #watermarking #language-models #cryptography

arxiv · Apr 17

Rhetorical Questions in LLM Representations: A Linear Probing Study

arXiv:2604.14128v1 Announce Type: cross Abstract: Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-med

LL · 1 model · #language-models #rhetorical-questions #natural-language-processing

arxiv · Apr 16 · bullish

METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues

arXiv:2604.11427v2 Announce Type: replace-cross Abstract: Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose METRO, a method that leverages large language models to autonomously induce both strategy actions and pla

ME · 1 model · #dialogue-agents #language-models #scalability

arxiv · Apr 16

Variation in Verification: Understanding Verification Dynamics in Large Language Models

arXiv:2509.17995v2 Announce Type: replace-cross Abstract: Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators produ

GP GE GE · 3 models · #test-time-scaling #language-models #verification

arxiv · Apr 16 · bullish

Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?

arXiv:2604.13077v1 Announce Type: new Abstract: Coronary angiography (CAG) reports contain clinically relevant physiological measurements, yet this information is typically in the form of unstructured natural language, limiting its use in research. We investigate the use of Large Language Models (LL

ME GP ME · 4 models · +1 · #medical-text #language-models #information-extraction

arxiv · Apr 14 · bullish

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

arXiv:2604.10520v1 Announce Type: cross Abstract: As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designe

#evaluation #code-summarization #language-models

arxiv · Apr 14 · bullish

ACE-TA: An Agentic Teaching Assistant for Grounded Q&A, Quiz Generation, and Code Tutoring

arXiv:2604.09572v1 Announce Type: cross Abstract: We introduce ACE-TA, the Agentic Coding and Explanations Teaching Assistant framework, that autonomously routes conceptual queries drawn from programming course material to grounded Q&A, stepwise coding guidance, and automated quiz generation using p

LA · 1 model · #education #programming #language-models

arxiv · Apr 11 · bullish

TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

arXiv:2604.07960v1 Announce Type: cross Abstract: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably,

LA · 1 model · #cad #language-models #autonomous-systems

arxiv · Apr 11 · bullish

Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

arXiv:2604.08260v1 Announce Type: new Abstract: Knowledge Tracing (KT) aims to predict learners' future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem s

BA · 1 model · #education #knowledge-tracing #language-models

arxiv · Apr 10 · bearish

Benchmarking LLM Tool-Use in the Wild

arXiv:2604.06185v1 Announce Type: cross Abstract: Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behav

LA · 1 model · #human-computer-interaction #language-models #benchmark

arxiv · Apr 9

Continuous Interpretive Steering for Scalar Diversity

arXiv:2604.07006v1 Announce Type: new Abstract: Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. Howe

LA · 1 model · #pragmatic-inference #language-models #interpretability

arxiv · Apr 8 · bullish

Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA

arXiv:2604.05387v1 Announce Type: cross Abstract: Large language models (LLMs) have been incorporated into numerous industrial applications. Meanwhile, a vast array of API assets is scattered across various functions in the financial domain. An online financial question-answering system can leverage

LA · 1 model · #financial-qa #language-models #data-augmentation

arxiv · Apr 8

Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

arXiv:2604.05536v1 Announce Type: cross Abstract: Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuation

TR · 1 model · #language-models #natural-language-processing #complex-systems

arxiv · Apr 4 · bullish

Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

arXiv:2604.01538v1 Announce Type: new Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a t

GA ME · 2 models · #open-source #clinical #domain-adaptation

arxiv · Apr 3 · bearish

How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models

arXiv:2511.06676v2 Announce Type: replace Abstract: Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flag

UN · 1 model · #bias #fairness #language-models

arxiv · Apr 3 · bearish

Can LLMs Perceive Time? An Empirical Investigation

arXiv:2604.00010v1 Announce Type: cross Abstract: Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4–7× (p < 0.001), with mod

GP · 1 model · #language-models #benchmark #safety