Institute of Computing Technology, Chinese Academy of Sciences IR
| Title | Low-Latency PIM Accelerator for Edge LLM Inference |
| Authors | Wang, Xinyu [1,2]; Sun, Xiaotian [1,2,3]; Li, Wanqian [1,2]; Min, Feng [3]; Zhang, Xiaoyu [3]; Zhang, Xinjiang [3]; Han, Yinhe [3]; Chen, Xiaoming [3] |
| Issue Date | 2025-07-01 |
| Journal | IEEE COMPUTER ARCHITECTURE LETTERS |
| ISSN | 1556-6056 |
| Volume | 24 |
| Issue | 2 |
| Pages | 321-324 |
| Abstract | Deploying large language models (LLMs) on edge devices has the potential to enable low-latency inference and privacy protection. However, meeting the substantial bandwidth demands of latency-oriented inference is challenging under the strict power constraints of edge devices. Resistive random-access memory (RRAM)-based processing-in-memory (PIM) is an ideal solution to this challenge, thanks to its low read power and high internal bandwidth. Moreover, applying quantization methods, which require different precisions for weights and activations, is common practice in edge inference. Yet existing accelerators cannot fully leverage the benefits of quantization, as they lack multiply-accumulate (MAC) units optimized for mixed-precision operands. To achieve low-latency edge inference, we design an RRAM-based PIM die that integrates dedicated energy-efficient MAC units, providing both computation and storage capabilities. Coupled with a dynamic random-access memory (DRAM) die for storing the key-value (KV) cache, we propose Lyla, an accelerator for low-latency edge LLM inference. Experimental results show that Lyla achieves 3.8x, 2.4x, and 1.2x latency improvements over a GPU and two DRAM-based PIM accelerators, respectively. (A conceptual sketch of the mixed-precision MAC idea appears below this record.) |
| Keywords | Random access memory; Low latency communication; Engines; Bandwidth; Vectors; Registers; Quantization (signal); Energy efficiency; Hardware; Computational modeling; Large language model inference; processing-in-memory; edge accelerator |
| DOI | 10.1109/LCA.2025.3618104 |
| Indexed By | SCI |
| Language | English |
| Funding Project | National Natural Science Foundation of China [62488101]; National Natural Science Foundation of China [62495104]; National Natural Science Foundation of China [62025404]; Youth Innovation Promotion Association CAS |
| WOS Research Area | Computer Science |
| WOS Subject | Computer Science, Hardware & Architecture |
| WOS Accession Number | WOS:001600730100005 |
| Publisher | IEEE COMPUTER SOC |
| Document Type | Journal Article |
| Identifier | http://119.78.100.204/handle/2XEOYT63/41587 |
| Collection | Journal Papers of the Institute of Computing Technology, CAS (English) |
| Corresponding Author | Chen, Xiaoming |
| Affiliations | 1. Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China; 2. Chinese Acad Sci, Univ Chinese Acad Sci, State Key Lab Processors, Beijing 101408, Peoples R China; 3. Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100190, Peoples R China |
| Recommended Citation (GB/T 7714) | Wang, Xinyu, Sun, Xiaotian, Li, Wanqian, et al. Low-Latency PIM Accelerator for Edge LLM Inference[J]. IEEE COMPUTER ARCHITECTURE LETTERS, 2025, 24(2): 321-324. |
| APA | Wang, Xinyu, Sun, Xiaotian, Li, Wanqian, Min, Feng, Zhang, Xiaoyu, ... & Chen, Xiaoming. (2025). Low-Latency PIM Accelerator for Edge LLM Inference. IEEE COMPUTER ARCHITECTURE LETTERS, 24(2), 321-324. |
| MLA | Wang, Xinyu, et al. "Low-Latency PIM Accelerator for Edge LLM Inference". IEEE COMPUTER ARCHITECTURE LETTERS 24.2 (2025): 321-324. |
| Files in This Item | There are no files associated with this item. |
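Below is a minimal sketch of the mixed-precision multiply-accumulate (MAC) idea described in the abstract: weights and activations quantized to different integer precisions, with integer accumulation and a single dequantization at the end. This is not Lyla's hardware design or the authors' code; the bit-widths (INT4 weights, INT8 activations), function names, and per-tensor symmetric quantization scheme are illustrative assumptions.

```python
# Conceptual sketch only (assumed scheme, not the paper's implementation):
# a mixed-precision MAC with low-bit weights and higher-bit activations,
# as is common in quantized edge LLM inference.
import numpy as np

def quantize(x, num_bits):
    """Symmetric per-tensor quantization to a signed num_bits integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def mixed_precision_mac(weights, activations, w_bits=4, a_bits=8):
    """Dot product with w_bits weights and a_bits activations, integer accumulation."""
    w_q, w_scale = quantize(weights, w_bits)
    a_q, a_scale = quantize(activations, a_bits)
    acc = np.dot(w_q, a_q)          # integer multiply-accumulate
    return acc * w_scale * a_scale  # dequantize the accumulated result

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024).astype(np.float32)
    a = rng.standard_normal(1024).astype(np.float32)
    print("float32 dot product:", float(np.dot(w, a)))
    print("mixed-precision MAC:", float(mixed_precision_mac(w, a)))
```

Keeping the accumulation in integers and applying the weight and activation scales once at the end roughly mirrors why MAC units tailored to mixed-precision operands can be more energy-efficient than dequantizing each operand before a floating-point MAC.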