arxiv · 4d ago

DecodeShare: Tracing the Shared Subspace of LLM Decode-Time Decisions

arXiv:2607.20469v1 Announce Type: new Abstract: Large language models (LLMs) handle many tasks with one set of parameters, but under KV-cached inference it is unclear what task-general structure, if any, is used at decode time rather than during prefill. We propose DecodeShare, a protocol that ident

#large-language-models #inference #decode-time

arxiv · Jun 25

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

arXiv:2606.26027v1 Announce Type: new Abstract: Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use task

#machine-learning #reinforcement-learning #large-language-models

arxiv · Jun 24 · bullish

Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

arXiv:2603.15510v2 Announce Type: replace Abstract: The synthesis of inductive loop invariants remains a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on complex programs, producing invariants that are

5 models · #program-verification #large-language-models #fine-tuning

arxiv · Jun 10

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: cross Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignmen

1 model · #quantization #safety #large-language-models

arxiv · Jun 5

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

arXiv:2606.04778v1 Announce Type: new Abstract: Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens.

#safety #large-language-models #vulnerability

arxiv · Jun 2 · bullish

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

arXiv:2606.00532v1 Announce Type: new Abstract: Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that ca

1 model · #context-engineering #large-language-models #mathematical-reasoning

arxiv · May 29

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

arXiv:2509.23571v3 Announce Type: replace-cross Abstract: As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat an

#cybersecurity #threat-hunting #benchmark

arxiv · May 28

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

arXiv:2605.01735v2 Announce Type: replace Abstract: As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information a

#unlearning #large-language-models #privacy

arxiv · May 25 · bullish

Graph Alignment Topology as an Inductive Bias for Grounding Detection

arXiv:2605.22963v1 Announce Type: cross Abstract: Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does n

1 model · #large-language-models #factuality #graph-neural-networks

arxiv · May 25 · bullish

Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

arXiv:2602.02780v3 Announce Type: replace Abstract: Large language models (LLMs) are enabling reasoning over 2D and 3D structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such archite

1 model · #large-language-models #multimodal #reasoning

arxiv · May 22 · bullish

Retrospective Sparse Attention for Efficient Long-Context Generation

arXiv:2508.09001v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory foot

#large-language-models #optimization #attention-mechanisms

arxiv · May 22 · bullish

Spectra as Language: Large Language Models for Scalable Stellar Parameter and Abundance Inference

arXiv:2605.22162v1 Announce Type: cross Abstract: Stellar spectra encode key information on the physical properties and chemical compositions of stars. Accurate stellar parameter determination is essential for addressing major questions such as galaxy and stellar evolution. Large-scale spectroscopic

#astronomy #machine-learning #spectroscopy

arxiv · May 14 · bullish

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

arXiv:2601.20255v2 Announce Type: replace-cross Abstract: SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supe

#benchmark #software-engineering #large-language-models

arxiv · May 11 · bullish

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

arXiv:2605.07935v1 Announce Type: new Abstract: We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordinatio

#verification #multiagent #coordination

arxiv · May 11 · bullish

End-to-end PDDL Planning with Hardcoded and Dynamic Agents

arXiv:2512.09629v2 Announce Type: replace Abstract: We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem

5 models · #planning #natural-language-processing #large-language-models

arxiv · May 8 · bullish

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

arXiv:2605.05225v1 Announce Type: cross Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-

2 models · #multimodal #efficiency #inference

arxiv · May 8 · bullish

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

arXiv:2601.20375v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practi

1 model · #automl #data-processing #large-language-models

arxiv · May 6

Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

arXiv:2605.01392v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human expertise and judgme

1 model · #software-engineering #large-language-models #design

arxiv · May 5 · bullish

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arXiv:2605.00425v1 Announce Type: new Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it diffic

#reinforcement-learning #large-language-models #exploration-exploitation

arxiv · May 1 · bullish

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

arXiv:2604.27467v1 Announce Type: cross Abstract: Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verificatio

#research #large-language-models #code-training

arxiv · May 1

When Your LLM Reaches End-of-Life: A Framework for Confident Model Migration in Production Systems

arXiv:2604.27082v1 Announce Type: new Abstract: We present a framework for migrating production Large Language Model (LLM) based systems when the underlying model reaches end-of-life or requires replacement. The key contribution is a Bayesian statistical approach that calibrates automated evaluation

#migration #evaluation #large-language-models

arxiv · May 1

Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study

arXiv:2602.10140v2 Announce Type: replace-cross Abstract: Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports repli

2 models · #large-language-models #code-generation #agent-based-models

arxiv · Apr 27 · bullish

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

arXiv:2604.22199v1 Announce Type: cross Abstract: Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered

1 model · #autonomous-robots #open-environments #large-language-models

arxiv · Apr 24 · bullish

Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

arXiv:2506.12721v2 Announce Type: replace Abstract: Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To

#large-language-models #compute-optimization #bandit-learning

arxiv · Apr 23

LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

arXiv:2604.20556v1 Announce Type: cross Abstract: Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and

3 models · #large-language-models #architecture #interpretability

arxiv · Apr 21

Using large language models for embodied planning introduces systematic safety risks

arXiv:2604.18463v1 Announce Type: cross Abstract: Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normativ

#safety #benchmark #robotics

arxiv · Apr 21

To LLM, or Not to LLM: How Designers and Developers Navigate LLMs as Tools or Teammates

arXiv:2604.15344v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly integrated into design and development workflows, yet decisions about their use are rarely binary or purely technical. We report findings from a constructivist grounded theory study based on interviews wi

#human-computer-interaction #large-language-models #sociotechnical

arxiv · Apr 21

Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

arXiv:2604.15794v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantizat

#self-distillation #fine-tuning #large-language-models

arxiv · Apr 20 · bullish

C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

arXiv:2604.15675v1 Announce Type: new Abstract: Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these s

#cultural-alignment #large-language-models #data-synthesis

arxiv · Apr 17

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

arXiv:2510.23853v3 Announce Type: replace Abstract: Large language model (LLM) agents are increasingly used to interact with and execute tasks in dynamic environments. However, a critical yet overlooked limitation of these agents is that they, by default, assume a stationary context, failing to acco

#temporal-awareness #large-language-models #human-aligned

arxiv · Apr 17

Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models

arXiv:2604.13206v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstrea

#numerical-stability #large-language-models #transformer-architectures

arxiv · Apr 10 · bullish

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

arXiv:2604.07663v1 Announce Type: new Abstract: The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedd

1 model · #optimization #memory-efficiency #large-language-models

arxiv · Apr 9 · bullish

An Automated Survey of Generative Artificial Intelligence: Large Language Models, Architectures, Protocols, and Applications

arXiv:2306.02781v3 Announce Type: replace-cross Abstract: Generative artificial intelligence, and large language models in particular, have emerged as one of the most transformative paradigms in modern computer science. This automated survey provides an accessible treatment of the field as of early

17 models · #large-language-models #generative-ai #machine-learning

arxiv · Apr 8 · bullish

Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

arXiv:2604.05808v1 Announce Type: new Abstract: Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limit

1 model · #reinforcement-learning #hierarchical-learning #large-language-models

arxiv · Apr 7

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

arXiv:2604.02368v2 Announce Type: replace Abstract: As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks su

#benchmark #evaluation #expert-level