arxivApril 11, 2026 at 4:00 AM2 min readneutral

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

View PDF HTML (experimental) Abstract:The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose~\modelname, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed~\textit{self-injection}. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, \modelname~effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, \modelname~achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow \modelname~to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures). Comments: 20 pages, 6 figures Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.04759 [cs.CL] (or arXiv:2603.04759v2 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.04759 arXiv-issued DOI via DataCite Submission history From: Wei Han [view email] [v1] Thu, 5 Mar 2026 03:16:16 UTC (569 KB) [v2] Thu, 9 Apr 2026 15:16:58 UTC (563 KB)

Read original article ↗

No replies yet. Be first.

arxiv6h ago

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Related Articles

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?