SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
Abstract: The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck: its optimizer states consume memory equivalent to twice the model's size. Although light-state optimizers such as SinkGD attempt to address this issue, we identify the embedding-layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW for the embedding layer and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models of up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including the SinkGD hybrid, while significantly reducing optimizer-state memory.

Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 13 pages, 4 figures, 4 tables
Subjects: Machine Learning (cs.LG)
MSC classes: 68T07, 68T50, 90C15
ACM classes: I.2.6; I.2.7
Cite as: arXiv:2604.07663 [cs.LG] (or arXiv:2604.07663v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.07663 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From Wooin Lee. [v1] Thu, 9 Apr 2026 00:07:38 UTC (588 KB)
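The abstract specifies only the high-level shape of the SAGE update: a Lion-style sign direction modulated by a per-dimension adaptive scale that is provably bounded by 1.0 and shrinks on high-variance dimensions. The sketch below is a hypothetical reading, not the paper's actual algorithm: it instantiates the scale as an EMA of |g| divided by an EMA root-mean-square of g, a ratio that is at most 1 by the Cauchy-Schwarz inequality and that decreases as gradient variance grows. The function name, state layout, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def sage_like_step(param, grad, m, a, v,
                   lr=1e-4, beta1=0.9, beta2=0.99, eps=1e-8):
    """One hypothetical SAGE-like step (illustrative, not the paper's
    exact rule). States m, a, v are each O(d) vectors like param.
    """
    # Lion-style direction: sign of a momentum/gradient interpolation.
    direction = np.sign(beta1 * m + (1 - beta1) * grad)

    # Assumed "safe damper": EMA of |g| over EMA-RMS of g. For EMA
    # weights w_i, Cauchy-Schwarz gives
    #   sum(w_i * |g_i|) <= sqrt(sum(w_i)) * sqrt(sum(w_i * g_i^2)),
    # and sum(w_i) < 1, so the ratio never exceeds 1.0; high-variance
    # dimensions drive the ratio down, damping their updates.
    a = beta2 * a + (1 - beta2) * np.abs(grad)
    v = beta2 * v + (1 - beta2) * grad ** 2
    scale = np.minimum(1.0, a / (np.sqrt(v) + eps))

    param = param - lr * scale * direction
    m = beta2 * m + (1 - beta2) * grad  # momentum EMA update
    return param, m, a, v
```

Because the update direction carries only signs, memory-wise this keeps one momentum vector plus the scale statistics, all $O(d)$, rather than AdamW's two full-precision moment estimates combined with its heavier per-step arithmetic.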