arxiv · Jul 14 · bullish

PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

arXiv:2505.18610v2 Announce Type: replace Abstract: Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large

1 model · #quantization #compression #language-models

arxiv · Jun 26

What Survives When You Compress a Recursive Reasoner for the Edge?

arXiv:2606.26488v1 Announce Type: new Abstract: Recursive reasoning models can solve complex structured tasks with only a few million parameters by repeatedly updating a latent state. Deploying these models on edge hardware requires significant compression, but unlike conventional sequence models, q

#compression #quantization #edge-hardware

arxiv · Jun 18 · bullish

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

arXiv:2606.18304v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts o

4 models · +1 · #compression #pruning #moe

arxiv · Jun 10

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: cross Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignmen

1 model · #quantization #safety #large-language-models

arxiv · Jun 6 · bullish

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

arXiv:2606.05688v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, h

1 model · #quantization #moe #foundation-models

arxiv · May 29 · bullish

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

arXiv:2605.29843v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-base

1 model · #quantization #machine-learning #optimization

arxiv · May 19 · bullish

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

arXiv:2511.06516v3 Announce Type: replace Abstract: Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over

1 model · #quantization #mixed-precision #task-aware

arxiv · Apr 24 · bullish

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

arXiv:2604.21649v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based appro

#knowledge-graph #natural-language-processing #quantization

arxiv · Apr 21 · bullish

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

arXiv:2604.15356v1 Announce Type: cross Abstract: Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that ac

1 model · #compression #quantization #transformers

arxiv · Apr 9 · bullish

STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

arXiv:2604.06836v1 Announce Type: new Abstract: Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across layers and train

2 models · #optimization #quantization #memory-reduction