AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
AsyncTLS is a hierarchical sparse attention system designed to address the challenges of long-context inference in large language models (LLMs). Two obstacles limit long-context efficiency: the quadratic cost of attention and the prohibitive memory footprint of the KV cache. Token-level sparse attention offers superior accuracy but incurs costly indexing overhead, while block-level methods improve efficiency but sacrifice precision.
AsyncTLS combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency. This approach is coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. The result is a system that achieves a balance between accuracy and efficiency, making it suitable for a wide range of applications.
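The two-level idea can be illustrated with a minimal sketch: a coarse pass scores whole KV blocks with a cheap summary (here, the mean key per block), then an exact token-level pass runs only inside the surviving blocks. This is an illustrative reconstruction, not the paper's implementation; the function name, block summary, and parameters (`block_size`, `top_blocks`, `top_tokens`) are assumptions.

```python
import numpy as np

def two_level_select(q, k, block_size=4, top_blocks=2, top_tokens=4):
    """Hedged sketch of two-level sparse token selection.

    Stage 1 (coarse): score each KV block by its mean key and keep the
    top-scoring blocks. Stage 2 (fine): compute exact token scores only
    inside the surviving blocks and return the top token indices.
    All names and heuristics here are illustrative assumptions.
    """
    n = k.shape[0]
    n_blocks = n // block_size
    # Stage 1: coarse block filtering via a cheap per-block summary.
    block_means = k[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    block_scores = block_means @ q
    kept_blocks = np.argsort(block_scores)[-top_blocks:]
    # Stage 2: exact token scoring restricted to the kept blocks.
    candidates = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in kept_blocks]
    )
    token_scores = k[candidates] @ q
    top = candidates[np.argsort(token_scores)[-top_tokens:]]
    return np.sort(top)
```

In this sketch the coarse pass touches one summary vector per block instead of every key, which is where block-level methods get their speed, while the fine pass restores token-level precision within the filtered candidate set.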
Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering significant operator speedups and end-to-end throughput improvements. Specifically, AsyncTLS achieves 1.2x-10.0x operator speedups and 1.3x-4.7x end-to-end throughput improvements on 48k-96k contexts, making it a promising solution for efficient generative LLM inference.
The paper is available online at https://doi.org/10.48550/arXiv.2604.07815 and can be cited as arXiv:2604.07815 [cs.CL]. It was submitted by Yuxuan Hu on Thu, 9 Apr 2026 05:15:16 UTC (file size 2,971 KB), is classified under Computation and Language (cs.CL), and carries ACM class I.2.7.