Institute of Computing Technology, Chinese Academy of Sciences — Institutional Repository
| Title | ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads |
| Authors | Hu, Cunchen1,2; Huang, Heyang1,2; Xu, Liangliang3; Chen, Xusheng4; Wang, Chenxi1,2; Xu, Jiang4; Chen, Shuang4; Feng, Hao4; Wang, Sa1,2; Bao, Yungang1; Sun, Ninghui1,2; Shan, Yizhou4 |
| Date | 2025-06-01 |
| Journal | ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION |
| ISSN | 1544-3566 |
| Volume | 22 |
| Issue | 2 |
| Pages | 24 |
| Abstract | Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in ShuffleInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that ShuffleInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively. |
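The first pillar in the abstract, partitioning prompts into fixed-size chunks so the accelerator stays near its compute-saturated limit, can be sketched in a few lines. This is a minimal illustration of the chunking idea only; the function name and chunk size below are hypothetical and not taken from the paper's implementation.

```python
# Hypothetical sketch of chunked prefill: a long prompt's token IDs are
# split into fixed-size chunks so each forward pass presents a bounded,
# near-saturating batch of tokens to the accelerator. The chunk size of
# 512 is an illustrative choice, not the paper's tuned value.

def chunk_prompt(token_ids, chunk_size=512):
    """Partition a prompt's token IDs into fixed-size chunks;
    the final chunk may be shorter than chunk_size."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

prompt = list(range(1200))        # a 1200-token prompt (dummy IDs)
chunks = chunk_prompt(prompt)
print([len(c) for c in chunks])   # -> [512, 512, 176]
```

Each chunk can then be scheduled as an independent unit of prefill work, which is what allows the disaggregated prefill instances to keep utilization high regardless of individual prompt lengths.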
| Keywords | LLM serving; disaggregated; interference; schedule |
| DOI | 10.1145/3732941 |
| Indexed By | SCI |
| Language | English |
| Funding Projects | Strategic Priority Research Program of Chinese Academy of Sciences [XDA0320000]; Strategic Priority Research Program of Chinese Academy of Sciences [XDA0320300]; National Natural Science Foundation of China [62090022]; National Natural Science Foundation of China [U24B6012]; National Natural Science Foundation of China [62172388]; China Postdoctoral Science Foundation [2024M762550]; Shaanxi Postdoctoral Research Foundation [2024BSHSDZZ102] |
| WOS Research Area | Computer Science |
| WOS Categories | Computer Science, Hardware & Architecture; Computer Science, Theory & Methods |
| WOS Record No. | WOS:001533499400010 |
| Publisher | ASSOC COMPUTING MACHINERY |
| Document Type | Journal article |
| Identifier | http://119.78.100.204/handle/2XEOYT63/42075 |
| Collection | Institute of Computing Technology, CAS — Journal Articles (English) |
| Corresponding Authors | Wang, Sa; Shan, Yizhou |
| Author Affiliations | 1. Chinese Acad Sci, State Key Lab Processors, Inst Comp Technol, Beijing, Peoples R China; 2. Univ Chinese Acad Sci, Beijing, Peoples R China; 3. Xidian Univ, Inst Math & Interdisciplinary Sci, Xian, Peoples R China; 4. Huawei Cloud, Hangzhou, Peoples R China |
| Recommended Citation (GB/T 7714) | Hu, Cunchen, Huang, Heyang, Xu, Liangliang, et al. ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads[J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2025, 22(2): 24. |
| APA | Hu, Cunchen., Huang, Heyang., Xu, Liangliang., Chen, Xusheng., Wang, Chenxi., ... & Shan, Yizhou. (2025). ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 22(2), 24. |
| MLA | Hu, Cunchen, et al. "ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads". ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION 22.2 (2025): 24. |
| Files in This Item | No files associated with this item. |
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.