arxiv · Jul 14 · bullish

PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

arXiv:2505.18610v2 Announce Type: replace Abstract: Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large

1 model · #quantization #compression #language-models

arxiv · Jul 3 · bullish

The risk of KV cache compression

arXiv:2607.01520v1 Announce Type: new Abstract: Transformer inference on long sequences is expensive because softmax attention repeatedly reads from a large KV cache. The prevalent approach to this bottleneck is KV cache compression, which replaces the full cache with a compact summary. Despite its

#machine-learning #optimization #compression

arxiv · Jul 3

Parameter Golf: What Really Works?

arXiv:2607.01517v1 Announce Type: new Abstract: How far can a language model improve under a strict artifact budget? Parameter Golf posed this question as an open community challenge in which participants trained the best language model, with the complete artifact (training code + compressed weights

#optimization #language-models #benchmark

arxiv · Jun 26

What Survives When You Compress a Recursive Reasoner for the Edge?

arXiv:2606.26488v1 Announce Type: new Abstract: Recursive reasoning models can solve complex structured tasks with only a few million parameters by repeatedly updating a latent state. Deploying these models on edge hardware requires significant compression, but unlike conventional sequence models, q

#compression #quantization #edge-hardware

arxiv · Jun 18 · bullish

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

arXiv:2606.18304v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts o

4 models · #compression #pruning #moe

arxiv · Jun 10 · bullish

Few-step Generative Models as Lossy Compression

arXiv:2606.10450v1 Announce Type: cross Abstract: DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative mo

5 models · #compression #diffusion #generative-models

arxiv · Jun 2 · bullish

AdaCodec: A Predictive Visual Code for Video MLLMs

arXiv:2606.02569v1 Announce Type: cross Abstract: Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens

2 models · #video #multimodal #compression

arxiv · Apr 21 · bullish

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

arXiv:2604.15356v1 Announce Type: cross Abstract: Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that ac

1 model · #compression #quantization #transformers

arxiv · Apr 8 · bullish

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

arXiv:2604.05887v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to

1 model · #multimodal #compression #optimization