Tag

#inference

12 articles tagged #inference

arxiv 4d ago

DecodeShare: Tracing the Shared Subspace of LLM Decode-Time Decisions

arXiv:2607.20469v1 Announce Type: new Abstract: Large language models (LLMs) handle many tasks with one set of parameters, but under KV-cached inference it is unclear what task-general structure, if any, is used at decode time rather than during prefill. We propose DecodeShare, a protocol that ident

#large-language-models #inference #decode-time Read on arxiv →

arxiv 6d ago bullish

Beyond Accuracy and Cost: Latency-Aware LLM Query Routing for Dynamic Workloads

arXiv:2607.18253v1 Announce Type: new Abstract: Modern language query routers improve inference efficiency by assigning each query to a model that balances response quality and monetary cost. However, current query routers are largely latency-agnostic and do not consider the generation latency exper

#optimization #latency #inference Read on arxiv →

arxiv Jul 20

How Much Human Label Variation Does Formal Semantic Structure Explain?: Group-Level Effects and Item-Level Ceilings in NLI

arXiv:2607.15870v1 Announce Type: new Abstract: Human label variation in natural language inference is increasingly treated as signal rather than noise, but how much of it formal semantic structure explains has not been measured directly. We measure it on the 3,113 SNLI and MNLI items of ChaosNLI, u

CH SN MN 3 models #natural-language-processing #semantics #inference Read on arxiv →

arxiv Jul 16 bullish

NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

arXiv:2505.18231v3 Announce Type: replace-cross Abstract: Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of key-value (KV) cache. Vector Quantization (VQ) is recently adopted to alleviate this

#machine-learning #vector-quantization #optimization Read on arxiv →

arxiv Jun 10 bullish

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

arXiv:2606.10820v1 Announce Type: cross Abstract: Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion

K- TR 2 models #language-modeling #acceleration #inference Read on arxiv →

arxiv Jun 5 bullish

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

arXiv:2606.05557v1 Announce Type: new Abstract: A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AUR

AU RE 2 models #natural-language-processing #inference #benchmark Read on arxiv →

arxiv Jun 2 bullish

Efficient Test-time Inference for Generative Planning Models

arXiv:2606.00618v1 Announce Type: new Abstract: Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more eff

GE HE 2 models #planning #inference #optimization Read on arxiv →

arxiv May 14 bullish

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

arXiv:2605.13784v1 Announce Type: new Abstract: Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-

VL SG TE 3 models #streaming #inference #optimization Read on arxiv →

arxiv May 8 bullish

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

arXiv:2605.05225v1 Announce Type: cross Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-

MI MA 2 models #multimodal #efficiency #inference Read on arxiv →

arxiv May 6

E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

arXiv:2605.00955v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) equips large language models (LLMs) with external evidence by retrieving documents at inference time, but it also turns the retrieval corpus into a sensitive asset. Under a black-box setting, an adversary given a c

RE 1 model #security #language-models #inference Read on arxiv →

arxiv Apr 22 bullish

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

arXiv:2603.16091v2 Announce Type: replace-cross Abstract: In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair lay

GP GP BA 3 models #question answering #retrieval #inference Read on arxiv →

arxiv Apr 10 bullish

FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling

arXiv:2604.06779v1 Announce Type: new Abstract: We introduce Fleming-Viot Diffusion (FVD), an inference-time alignment method that resolves the diversity collapse commonly observed in Sequential Monte Carlo (SMC) based diffusion samplers. Existing SMC-based diffusion samplers often rely on multinomi

#diffusion #monte-carlo #inference Read on arxiv →