Flux Attention: Context-Aware Hybrid Attention for Efficient LLM Inference