arxiv, April 17, 2026 at 4:00 AM, 1 min read

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Abstract: CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints: existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.

Comments: DAC '26

Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)

Cite as: arXiv:2507.18454 [cs.AR] (or arXiv:2507.18454v2 [cs.AR] for this version)

DOI: https://doi.org/10.48550/arXiv.2507.18454

Related DOI: https://doi.org/10.1145/3770743.3803931

Submission history: From Juntao Zhao. [v1] Mon, 19 May 2025 06:37:29 UTC (508 KB); [v2] Wed, 15 Apr 2026 08:07:38 UTC (453 KB)
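To make the idea of a tree-based hardware abstraction for partial core allocation more concrete, here is a minimal sketch. It is not the paper's TopoTree: the abstract gives no implementation details, so the class `TopoNode`, the helpers `build_tree`, `smallest_subtree`, and `allocate`, the regular socket/NUMA/LLC-slice/core layout, and the "tightest subtree" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TopoNode:
    """One level of a hypothetical hardware tree (socket, NUMA node, LLC slice, or core)."""
    kind: str
    ident: int
    children: List["TopoNode"] = field(default_factory=list)

    def cores(self) -> List[int]:
        """Return the core ids under this node, left to right."""
        if self.kind == "core":
            return [self.ident]
        return [c for child in self.children for c in child.cores()]


def build_tree(numa_nodes: int, slices_per_numa: int, cores_per_slice: int) -> TopoNode:
    """Build a regular socket -> NUMA -> LLC slice -> core tree with consecutive core ids."""
    core_id = 0
    root = TopoNode("socket", 0)
    for n in range(numa_nodes):
        numa = TopoNode("numa", n)
        for s in range(slices_per_numa):
            llc = TopoNode("llc_slice", s)
            for _ in range(cores_per_slice):
                llc.children.append(TopoNode("core", core_id))
                core_id += 1
            numa.children.append(llc)
        root.children.append(numa)
    return root


def smallest_subtree(node: TopoNode, k: int) -> Optional[TopoNode]:
    """Find the deepest (leftmost) node whose subtree still holds at least k cores."""
    if len(node.cores()) < k:
        return None
    for child in node.children:
        found = smallest_subtree(child, k)
        if found is not None:
            return found
    return node


def allocate(root: TopoNode, k: int) -> List[int]:
    """Pick k cores from the tightest subtree that can hold them (assumed policy)."""
    node = smallest_subtree(root, k)
    if node is None:
        raise ValueError(f"requested {k} cores, only {len(root.cores())} available")
    return node.cores()[:k]


if __name__ == "__main__":
    tree = build_tree(numa_nodes=2, slices_per_numa=2, cores_per_slice=4)
    print(allocate(tree, 3))   # fits inside one LLC slice
    print(allocate(tree, 6))   # needs a whole NUMA node's worth of slices
```

In this sketch a request for three cores stays within a single LLC slice, while a request for six spills to the enclosing NUMA node. A real system would also need to track which cores are already in use and could pick a different plan per phase, neither of which is modeled here.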

