Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Abstract: CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints: existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.

Comments: DAC '26
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
Cite as: arXiv:2507.18454 [cs.AR] (or arXiv:2507.18454v2 [cs.AR] for this version)
DOI: https://doi.org/10.48550/arXiv.2507.18454 (arXiv-issued, via DataCite)
Related DOI: https://doi.org/10.1145/3770743.3803931

Submission history
From: Juntao Zhao
[v1] Mon, 19 May 2025 06:37:29 UTC (508 KB)
[v2] Wed, 15 Apr 2026 08:07:38 UTC (453 KB)
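
Illustrative sketch (not from the paper): the abstract describes TopoTree as a tree-based hardware abstraction that allows partial core allocation aware of substructures such as LLC slices, combined with a different plan per serving phase. The Python below is only a rough picture of that idea under stated assumptions. The names `TopoNode`, `build_machine`, `allocate`, `PLANS`, and `cores_for_phase` are hypothetical, the uniform machine layout is invented, and the per-phase core counts are placeholders rather than tuned values. It is not Sandwich's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopoNode:
    """One level of a hypothetical hardware tree: machine, NUMA node, LLC slice, or core."""
    kind: str
    ident: int
    children: List["TopoNode"] = field(default_factory=list)

    def cores(self) -> List[int]:
        """Return the IDs of all cores under this node, left to right."""
        if self.kind == "core":
            return [self.ident]
        out: List[int] = []
        for child in self.children:
            out.extend(child.cores())
        return out


def build_machine(numa_nodes: int, slices_per_numa: int, cores_per_slice: int) -> TopoNode:
    """Build a uniform machine -> NUMA -> LLC slice -> core tree with consecutive core IDs."""
    core_id = 0
    slice_id = 0
    machine = TopoNode("machine", 0)
    for n in range(numa_nodes):
        numa = TopoNode("numa", n)
        for _ in range(slices_per_numa):
            llc = TopoNode("llc_slice", slice_id)
            slice_id += 1
            for _ in range(cores_per_slice):
                llc.children.append(TopoNode("core", core_id))
                core_id += 1
            numa.children.append(llc)
        machine.children.append(numa)
    return machine


def allocate(root: TopoNode, cores_per_slice: int) -> List[int]:
    """Pick the first `cores_per_slice` cores from every LLC slice under `root`."""
    if root.kind == "llc_slice":
        return root.cores()[:cores_per_slice]
    picked: List[int] = []
    for child in root.children:
        picked.extend(allocate(child, cores_per_slice))
    return picked


# Assumed layout: 2 NUMA nodes x 2 LLC slices x 4 cores = 16 cores.
machine = build_machine(numa_nodes=2, slices_per_numa=2, cores_per_slice=4)

# Hypothetical per-phase plans (cores used per LLC slice); placeholder values only.
PLANS: Dict[str, int] = {"prefill": 4, "decode": 2}


def cores_for_phase(phase: str) -> List[int]:
    """Look up the core set for a serving phase under the placeholder plans."""
    return allocate(machine, PLANS[phase])
```

Under these assumptions, `cores_for_phase("prefill")` selects every core, while `cores_for_phase("decode")` selects a subset of each LLC slice. On Linux, such a list could then be applied to a worker with `os.sched_setaffinity`, though how Sandwich actually switches plans is not specified in the abstract.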