Institute of Computing Technology, Chinese Academy of Sciences IR
Relative importance sampling for off-policy actor-critic in deep reinforcement learning
Humayoo, Mahammad1,2,3; Zheng, Gengzhong1; Dong, Xiaoqing1; Miao, Liming1; Qiu, Shuwei1; Zhou, Zexun1; Wang, Peitao1; Ullah, Zakir3,4; Junejo, Naveed Ur Rehman1; Cheng, Xueqi2,3
2025-04-24
Journal | SCIENTIFIC REPORTS
ISSN | 2045-2322
Volume | 15; Issue: 1; Pages: 17
Abstract | Off-policy learning exhibits greater instability than on-policy learning in reinforcement learning (RL). The difference in probability distribution between the target policy ($\pi$) and the behavior policy ($b$) is a major cause of this instability, and the same distributional mismatch is also a source of high variance. Importance sampling (IS) can reduce the discrepancy between the target policy's distribution and the behavior policy's distribution, but IS itself has high variance, which is exacerbated in sequential settings. We propose a smooth form of importance sampling, relative importance sampling (RIS), which mitigates variance and stabilizes learning. To control variance, we vary the smoothness parameter $\beta \in [0, 1]$ in RIS. Using this strategy, we develop the first model-free relative importance sampling off-policy actor-critic (RIS-off-PAC) algorithms in RL. Our method uses a network to generate the target policy (actor) and a value function (critic) to evaluate the current policy ($\pi$) from behavior-policy samples. Our algorithms are trained using behavior-policy action values in the reward function rather than target-policy ones. Both the actor and the critic are trained with deep neural networks. Our methods perform better than or on par with several state-of-the-art RL benchmarks on OpenAI Gym challenges and synthetic datasets.
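The abstract describes RIS as a smoothed importance weight controlled by $\beta \in [0, 1]$ but does not spell out the weight formula. The sketch below is a minimal illustration, assuming the relative density-ratio form $w_\beta = \pi / ((1-\beta)\pi + \beta b)$, so that $\beta = 1$ recovers the ordinary IS ratio $\pi / b$ and $\beta = 0$ flattens every weight to 1; the paper's exact convention, the `ris_weight` helper, and the toy bandit setup are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ris_weight(pi_prob, b_prob, beta=0.5, eps=1e-8):
    """Relative importance-sampling weight (assumed form).

    Assumes w_beta = pi / ((1 - beta) * pi + beta * b):
    beta = 1 gives the ordinary IS ratio pi / b, beta = 0 gives weight 1.
    The paper's actual parameterization may differ.
    """
    return pi_prob / ((1.0 - beta) * pi_prob + beta * b_prob + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 3-armed bandit: samples come from behavior policy b,
    # we estimate the value of target policy pi.
    b = np.array([0.6, 0.3, 0.1])
    pi = np.array([0.1, 0.3, 0.6])
    rewards = np.array([1.0, 2.0, 3.0])            # expected reward per action

    actions = rng.choice(3, size=100_000, p=b)      # drawn from b, not pi
    r = rewards[actions]

    for beta in (1.0, 0.5, 0.1):
        w = ris_weight(pi[actions], b[actions], beta=beta)
        est = np.mean(w * r)                        # off-policy value estimate
        print(f"beta={beta:.1f}  estimate={est:.3f}  weight variance={w.var():.3f}")
    # True value under pi is (pi * rewards).sum() = 2.5. Lowering beta reduces
    # the variance of the weights but biases the estimate toward the
    # behavior-policy value (1.5), illustrating the smoothness trade-off.
```

Run with NumPy installed; the printed output shows the bias-variance trade-off that the smoothness parameter is meant to control.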
Keywords | Actor-critic (AC); Discrepancy; Variance; Importance sampling (IS); Off-policy; Relative importance sampling (RIS)
DOI | 10.1038/s41598-025-96201-5 |
Indexed In | SCI
Language | English
Funding Project | The Innovation Teams of Ordinary Universities in Guangdong Province[2021KCXTD038] ; The Innovation Teams of Ordinary Universities in Guangdong Province[2023KCXTD022] ; Innovation Teams of Ordinary Universities in Guangdong Province[2022KSYS003] ; Key Laboratory of Ordinary Universities in Guangdong Province[2022A1515010990] ; Natural Science Foundation of Guangdong Province[2022KTSCX079] ; Natural Science Foundation of Guangdong Province[2023ZDZX1013] ; Natural Science Foundation of Guangdong Province[2022ZDZX3012] ; Natural Science Foundation of Guangdong Province[2022ZDZX3011] ; Natural Science Foundation of Guangdong Province[2023ZDZX2038] ; Research Fund of the Department of Education of Guangdong Province[z25025] ; Chaozhou Engineering Technology Research Center[XY202105] ; Hanshan Normal University project
WOS Research Area | Science & Technology - Other Topics
WOS Category | Multidisciplinary Sciences
WOS Accession Number | WOS:001475579000029
Publisher | NATURE PORTFOLIO
Document Type | Journal Article
Identifier | http://119.78.100.204/handle/2XEOYT63/40627
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English)
Corresponding Authors | Humayoo, Mahammad; Zheng, Gengzhong; Dong, Xiaoqing
Author Affiliations | 1. Hanshan Normal Univ, Chaozhou 521041, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, CAS Key Lab Network Data Sci & Technol, Beijing 100190, Peoples R China; 3. Univ Chinese Acad Sci, Beijing 101408, Peoples R China; 4. Capital Univ Econ & Business, Sch Data Sci, Beijing 100070, Peoples R China
Recommended Citation (GB/T 7714) | Humayoo, Mahammad, Zheng, Gengzhong, Dong, Xiaoqing, et al. Relative importance sampling for off-policy actor-critic in deep reinforcement learning[J]. SCIENTIFIC REPORTS, 2025, 15(1): 17.
APA | Humayoo, Mahammad, Zheng, Gengzhong, Dong, Xiaoqing, Miao, Liming, Qiu, Shuwei, ... & Cheng, Xueqi. (2025). Relative importance sampling for off-policy actor-critic in deep reinforcement learning. SCIENTIFIC REPORTS, 15(1), 17.
MLA | Humayoo, Mahammad, et al. "Relative importance sampling for off-policy actor-critic in deep reinforcement learning". SCIENTIFIC REPORTS 15.1 (2025): 17.
Files in This Item | There are no files associated with this item.
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.