Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining

Source

arxiv.org (full article)

Publisher summary · verbatim

arXiv:2509.10406v4. Abstract: Pretraining transformers on long sequences (entire code repositories, collections of related documents) is bottlenecked by quadratic attention costs. We present Multipole Semantic Attention (MuSe), which accelerates 64k-context pretraining by 36% w
