CSpace
Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference
Yu, Xiyue [1,2,3]; Bi, Jun [2]; Wen, Yuanbo [2]; Xu, Jianxing [1,2,3]; Huang, Di [2]; Guo, Jiaming [2]; Li, Wei [2]; Du, Zidong [2]; Li, Jing [1]; Chen, Tianshi [3]; Guo, Qi [2]
2025-12-01
Journal: ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION
ISSN: 1544-3566
Volume: 22; Issue: 4; Pages: 25
Abstract: Optimizing deep learning inference, particularly reducing the execution latency of tensor computations at small batch sizes, is crucial for the successful and widespread adoption of deep neural network (DNN) models. However, current deep learning compilers and hand-tuned libraries often fail to achieve high hardware efficiency when executing small-batch workloads. The primary reason is the inherently sequential nature of reductions (e.g., along the hidden dimension in the flattened GEMM for LLM decoding), which is difficult to parallelize and therefore fails to fully utilize available hardware resources. In this article, we propose Swift, a novel search-based approach for efficiently generating high-performance programs for GPUs by maximizing hardware utilization. The key insight is that reduction parallelization can be incorporated into a unified representation alongside the existing tile structure, significantly expanding the search space for high-performance programs. Concretely, by enumerating all possible parallel mappings of loops, we first generate a large search space that contains high-performance programs. Then, to efficiently explore the extended search space, we employ subspace shifting exploration to identify promising regions, effectively pruning large portions of the less-promising search space. We conduct experiments on three distinct GPU architectures using a diverse set of benchmarks representative of typical application scenarios. Experimental results demonstrate that Swift achieves an average speedup of 1.19x over state-of-the-art compiler-based approaches. Moreover, compared with vendor-provided hand-tuned libraries, Swift achieves an average speedup of 2.40x.
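As a rough illustration of the reduction parallelization the abstract refers to, the sketch below splits the hidden-dimension loop of a small matrix-vector product into independent chunks whose partial results are summed at the end. This is only a pure-Python analogy under stated assumptions: the function names `gemv_sequential` and `gemv_split_reduction` and the parameter `num_splits` are hypothetical, and the code does not reproduce Swift's program representation, its search space, or its generated GPU kernels.

```python
# Illustrative sketch only; not Swift's generated code or search procedure.
# All names here are hypothetical and chosen for this example.

def gemv_sequential(x, w):
    """Compute y[j] = sum_k x[k] * w[k][j] with one serial loop over k."""
    n_out = len(w[0])
    y = [0.0] * n_out
    for k, xk in enumerate(x):
        for j in range(n_out):
            y[j] += xk * w[k][j]
    return y


def gemv_split_reduction(x, w, num_splits):
    """Same result, but the k loop is cut into `num_splits` independent chunks.

    Each chunk produces a partial output vector. On a GPU, each chunk could
    be assigned to its own group of threads; here they simply run in turn.
    A final pass adds the partial vectors together.
    """
    k_total = len(x)
    n_out = len(w[0])
    chunk = (k_total + num_splits - 1) // num_splits  # ceiling division
    partials = []
    for s in range(num_splits):  # iterations do not depend on each other
        lo, hi = s * chunk, min((s + 1) * chunk, k_total)
        part = [0.0] * n_out
        for k in range(lo, hi):
            for j in range(n_out):
                part[j] += x[k] * w[k][j]
        partials.append(part)
    # Combine step: sum the partial results element-wise.
    return [sum(p[j] for p in partials) for j in range(n_out)]


# Small-batch example: one input row, hidden dimension K = 8, output N = 4.
x = [float(k + 1) for k in range(8)]
w = [[float((k + j) % 3) for j in range(4)] for k in range(8)]
y_seq = gemv_sequential(x, w)
y_split = gemv_split_reduction(x, w, num_splits=4)
```

With a single input row there is little parallelism in the output dimensions alone, so splitting the reduction is what exposes additional independent work; the cost is the extra combine step.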
Keywords: Code generation; compiler optimization; tensor computation
DOI: 10.1145/3762660
Indexed by: SCI
Language: English
WOS Research Area: Computer Science
WOS Category: Computer Science, Hardware & Architecture; Computer Science, Theory & Methods
WOS Accession Number: WOS:001667658800001
Publisher: ASSOC COMPUTING MACHINERY
Document Type: Journal article
Item Identifier: http://119.78.100.204/handle/2XEOYT63/42846
Collection: Institute of Computing Technology, Chinese Academy of Sciences
Corresponding Author: Guo, Qi
Affiliations:
1. Univ Sci & Technol China, Hefei, Peoples R China
2. Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing, Peoples R China
3. Cambricon Technol, Beijing, Peoples R China
Recommended Citation
GB/T 7714
Yu, Xiyue,Bi, Jun,Wen, Yuanbo,et al. Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference[J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION,2025,22(4):25.
APA Yu, Xiyue.,Bi, Jun.,Wen, Yuanbo.,Xu, Jianxing.,Huang, Di.,...&Guo, Qi.(2025).Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference.ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION,22(4),25.
MLA Yu, Xiyue,et al."Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference".ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION 22.4(2025):25.