ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads
Hu, Cunchen1,2; Huang, Heyang1,2; Xu, Liangliang3; Chen, Xusheng4; Wang, Chenxi1,2; Xu, Jiang4; Chen, Shuang4; Feng, Hao4; Wang, Sa1,2; Bao, Yungang1; Sun, Ninghui1,2; Shan, Yizhou4
2025-06-01
Journal: ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION
ISSN: 1544-3566
Volume: 22, Issue: 2, Pages: 24
Abstract: Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in ShuffleInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that ShuffleInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.
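The abstract states the three pillars only at a high level. As a rough illustration of the two mechanisms it does spell out, fixed-size prompt chunking and decode placement driven by predicted resource usage, a minimal Python sketch follows. Every name here (CHUNK_SIZE, partition_prompt, DecodeInstance, schedule_decode), the chunk size of 512, and the KV-cache load model are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import List

CHUNK_SIZE = 512  # hypothetical fixed chunk size; the abstract does not give a value

def partition_prompt(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a prompt into fixed-size chunks so each prefill step runs the
    accelerator close to its computation-saturated limit."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

@dataclass
class Request:
    rid: int
    prompt_len: int
    expected_output_len: int  # assumed to come from some output-length predictor

@dataclass
class DecodeInstance:
    name: str
    kv_cache_tokens: int = 0  # tokens currently resident in this instance's KV cache

    def predicted_usage(self, req: Request) -> int:
        # Predicted KV-cache footprint if this request were placed here:
        # existing cache plus prompt tokens plus expected output tokens.
        return self.kv_cache_tokens + req.prompt_len + req.expected_output_len

def schedule_decode(req: Request, instances: List[DecodeInstance]) -> DecodeInstance:
    # Place the request on the instance with the lowest predicted usage,
    # steering load away from decode scheduling hotspots.
    target = min(instances, key=lambda inst: inst.predicted_usage(req))
    target.kv_cache_tokens += req.prompt_len + req.expected_output_len
    return target

if __name__ == "__main__":
    chunks = partition_prompt(list(range(1300)))
    print([len(c) for c in chunks])  # -> [512, 512, 276]

    pool = [DecodeInstance("decode-0"), DecodeInstance("decode-1")]
    for rid, (plen, olen) in enumerate([(1300, 200), (400, 600), (900, 100)]):
        print(f"request {rid} -> {schedule_decode(Request(rid, plen, olen), pool).name}")

The prediction-then-place step shown here is only one plausible reading of "a smart two-level scheduling algorithm augmented with predicted resource usage"; the paper's actual scheduler structure is not described in this record.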
Keywords: LLM serving; disaggregated; interference; schedule
DOI: 10.1145/3732941
Indexed by: SCI
Language: English
Funding: Strategic Priority Research Program of Chinese Academy of Sciences [XDA0320000]; Strategic Priority Research Program of Chinese Academy of Sciences [XDA0320300]; National Natural Science Foundation of China [62090022]; National Natural Science Foundation of China [U24B6012]; National Natural Science Foundation of China [62172388]; China Postdoctoral Science Foundation [2024M762550]; Shaanxi Postdoctoral Research Foundation [2024BSHSDZZ102]
WOS Research Area: Computer Science
WOS Categories: Computer Science, Hardware & Architecture; Computer Science, Theory & Methods
WOS Record ID: WOS:001533499400010
Publisher: ASSOC COMPUTING MACHINERY
Document Type: Journal article
Identifier: http://119.78.100.204/handle/2XEOYT63/42075
Collection: Journal Articles of the Institute of Computing Technology, Chinese Academy of Sciences (English)
Corresponding Authors: Wang, Sa; Shan, Yizhou
Affiliations:
1. Chinese Acad Sci, State Key Lab Processors, Inst Comp Technol, Beijing, Peoples R China
2. Univ Chinese Acad Sci, Beijing, Peoples R China
3. Xidian Univ, Inst Math & Interdisciplinary Sci, Xian, Peoples R China
4. Huawei Cloud, Hangzhou, Peoples R China
Recommended Citation:
GB/T 7714: Hu, Cunchen, Huang, Heyang, Xu, Liangliang, et al. ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads[J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2025, 22(2): 24.
APA: Hu, Cunchen., Huang, Heyang., Xu, Liangliang., Chen, Xusheng., Wang, Chenxi., ... & Shan, Yizhou. (2025). ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 22(2), 24.
MLA: Hu, Cunchen, et al. "ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads". ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION 22.2 (2025): 24.
Files in This Item: No files associated with this item.