Low-Latency PIM Accelerator for Edge LLM Inference
Wang, Xinyu (1,2); Sun, Xiaotian (1,2,3); Li, Wanqian (1,2); Min, Feng (3); Zhang, Xiaoyu (3); Zhang, Xinjiang (3); Han, Yinhe (3); Chen, Xiaoming (3)
2025-07-01
Journal: IEEE COMPUTER ARCHITECTURE LETTERS
ISSN: 1556-6056
Volume: 24; Issue: 2; Pages: 321-324
Abstract: Deploying large language models (LLMs) on edge devices has the potential to deliver low-latency inference and privacy protection. However, meeting the substantial bandwidth demands of latency-oriented edge devices is challenging because of their strict power constraints. Resistive random-access memory (RRAM)-based processing-in-memory (PIM) is an ideal solution to this challenge, thanks to its low read power and high internal bandwidth. Moreover, applying quantization methods, which require different precisions for weights and activations, is common practice in edge inference. Existing accelerators, however, cannot fully leverage the benefits of quantization because they lack multiply-accumulate (MAC) units optimized for mixed-precision operands. To achieve low-latency edge inference, we design an RRAM-based PIM die that integrates dedicated energy-efficient MAC units, providing both computation and storage capabilities. Coupling this die with a dynamic random-access memory (DRAM) die that stores the key-value (KV) cache, we propose Lyla, an accelerator for low-latency edge LLM inference. Experimental results show that Lyla achieves 3.8x, 2.4x, and 1.2x latency improvements over a GPU and two DRAM-based PIM accelerators, respectively.
Keywords: Random access memory; Low latency communication; Engines; Bandwidth; Vectors; Registers; Quantization (signal); Energy efficiency; Hardware; Computational modeling; Large language model inference; processing-in-memory; edge accelerator
DOI: 10.1109/LCA.2025.3618104
Indexed by: SCI
Language: English
Funding: National Natural Science Foundation of China [62488101]; National Natural Science Foundation of China [62495104]; National Natural Science Foundation of China [62025404]; Youth Innovation Promotion Association CAS
WOS research area: Computer Science
WOS category: Computer Science, Hardware & Architecture
WOS accession number: WOS:001600730100005
Publisher: IEEE COMPUTER SOC
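Document type: Journal article
Item identifier: http://119.78.100.204/handle/2XEOYT63/41587
Collection: Institute of Computing Technology, Chinese Academy of Sciences, Journal Articles (English)
Corresponding author: Chen, Xiaoming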
Author affiliations:
1. Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
2. Chinese Acad Sci, Univ Chinese Acad Sci, State Key Lab Processors, Beijing 101408, Peoples R China
3. Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100190, Peoples R China
Recommended citation:
GB/T 7714: Wang, Xinyu, Sun, Xiaotian, Li, Wanqian, et al. Low-Latency PIM Accelerator for Edge LLM Inference[J]. IEEE COMPUTER ARCHITECTURE LETTERS, 2025, 24(2): 321-324.
APA: Wang, Xinyu., Sun, Xiaotian., Li, Wanqian., Min, Feng., Zhang, Xiaoyu., ... & Chen, Xiaoming. (2025). Low-Latency PIM Accelerator for Edge LLM Inference. IEEE COMPUTER ARCHITECTURE LETTERS, 24(2), 321-324.
MLA: Wang, Xinyu, et al. "Low-Latency PIM Accelerator for Edge LLM Inference". IEEE COMPUTER ARCHITECTURE LETTERS 24.2 (2025): 321-324.