Relative importance sampling for off-policy actor-critic in deep reinforcement learning
Humayoo, Mahammad1,2,3; Zheng, Gengzhong1; Dong, Xiaoqing1; Miao, Liming1; Qiu, Shuwei1; Zhou, Zexun1; Wang, Peitao1; Ullah, Zakir3,4; Junejo, Naveed Ur Rehman1; Cheng, Xueqi2,3
2025-04-24
Journal: SCIENTIFIC REPORTS
ISSN: 2045-2322
Volume: 15  Issue: 1  Pages: 17
Abstract: Off-policy learning exhibits greater instability than on-policy learning in reinforcement learning (RL). The difference in probability distribution between the target policy ($\pi$) and the behavior policy ($b$) is a major cause of instability, and this distributional mismatch also produces high variance. The discrepancy between the target policy's distribution and the behavior policy's distribution can be reduced using importance sampling (IS). However, importance sampling itself has high variance, which is exacerbated in sequential scenarios. We propose a smooth form of importance sampling, relative importance sampling (RIS), which mitigates variance and stabilizes learning. To control variance, we vary the smoothness parameter $\beta \in [0, 1]$ in RIS. Using this strategy, we develop the first model-free relative importance sampling off-policy actor-critic (RIS-off-PAC) algorithms in RL. Our method uses a network to generate the target policy (actor) and evaluates the current policy ($\pi$) with a value function (critic) based on behavior-policy samples. Our algorithms are trained using behavior-policy action values in the reward function rather than target-policy ones. Both the actor and the critic are trained with deep neural networks. Our methods performed better than or on par with several state-of-the-art RL benchmarks on OpenAI Gym challenges and synthetic datasets.
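A minimal sketch of the relative importance weight described in the abstract, assuming the $\beta$-relative density-ratio form $w_\beta = \pi(a|s) / (\beta\,\pi(a|s) + (1-\beta)\,b(a|s))$; the paper's exact RIS formulation may differ, and the names below (ris_weight, the toy policies pi and b) are hypothetical illustrations, not taken from the article. Under this form, $\beta = 0$ recovers the ordinary IS ratio $\pi/b$, $\beta = 1$ gives a constant weight of 1, and intermediate values bound the weight by $1/\beta$, which is how the smoothness parameter trades bias for variance.

    import numpy as np

    def ris_weight(pi_prob, b_prob, beta):
        # Relative importance weight: beta = 0 recovers ordinary IS (pi / b),
        # beta = 1 gives a constant weight of 1, and 0 < beta < 1 bounds the
        # weight above by 1/beta, which tames the variance of the estimator.
        return pi_prob / (beta * pi_prob + (1.0 - beta) * b_prob)

    # Toy check: actions are drawn from the behavior policy b, and the weights
    # are used to estimate an expectation under the target policy pi.
    rng = np.random.default_rng(0)
    pi = np.array([0.7, 0.2, 0.1])   # hypothetical target policy over 3 actions
    b = np.array([0.2, 0.3, 0.5])    # hypothetical behavior policy that generated the data
    actions = rng.choice(3, size=10_000, p=b)
    f = np.array([1.0, 0.0, -1.0])   # arbitrary per-action payoff

    for beta in (0.0, 0.5, 1.0):
        w = ris_weight(pi[actions], b[actions], beta)
        print(f"beta={beta}: estimate={np.mean(w * f[actions]):.3f}, "
              f"weight variance={w.var():.3f}")

In this sketch, larger beta shrinks the spread of the weights (lower variance) at the cost of a biased estimate of the target-policy expectation, consistent with the variance-control role the abstract assigns to the smoothness parameter.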
Keywords: Actor-critic (AC); Discrepancy; Variance; Importance sampling (IS); Off-policy; Relative importance sampling (RIS)
DOI: 10.1038/s41598-025-96201-5
Indexed by: SCI
Language: English
Funding: The Innovation Teams of Ordinary Universities in Guangdong Province [2021KCXTD038]; The Innovation Teams of Ordinary Universities in Guangdong Province [2023KCXTD022]; Innovation Teams of Ordinary Universities in Guangdong Province [2022KSYS003]; Key Laboratory of Ordinary Universities in Guangdong Province [2022A1515010990]; Natural Science Foundation of Guangdong Province [2022KTSCX079]; Natural Science Foundation of Guangdong Province [2023ZDZX1013]; Natural Science Foundation of Guangdong Province [2022ZDZX3012]; Natural Science Foundation of Guangdong Province [2022ZDZX3011]; Natural Science Foundation of Guangdong Province [2023ZDZX2038]; Research Fund of the Department of Education of Guangdong Province [z25025]; Chaozhou Engineering Technology Research Center [XY202105]; Hanshan Normal University project
WOS Research Area: Science & Technology - Other Topics
WOS Category: Multidisciplinary Sciences
WOS Record ID: WOS:001475579000029
Publisher: NATURE PORTFOLIO
Document Type: Journal article
Identifier: http://119.78.100.204/handle/2XEOYT63/40627
Collection: Journal Papers of the Institute of Computing Technology, Chinese Academy of Sciences (English)
Corresponding Authors: Humayoo, Mahammad; Zheng, Gengzhong; Dong, Xiaoqing
Affiliations:
1. Hanshan Normal Univ, Chaozhou 521041, Peoples R China
2. Chinese Acad Sci, Inst Comp Technol, CAS Key Lab Network Data Sci & Technol, Beijing 100190, Peoples R China
3. Univ Chinese Acad Sci, Beijing 101408, Peoples R China
4. Capital Univ Econ & Business, Sch Data Sci, Beijing 100070, Peoples R China
Recommended Citation:
GB/T 7714
Humayoo, Mahammad, Zheng, Gengzhong, Dong, Xiaoqing, et al. Relative importance sampling for off-policy actor-critic in deep reinforcement learning[J]. SCIENTIFIC REPORTS, 2025, 15(1): 17.
APA: Humayoo, Mahammad, Zheng, Gengzhong, Dong, Xiaoqing, Miao, Liming, Qiu, Shuwei, ... & Cheng, Xueqi. (2025). Relative importance sampling for off-policy actor-critic in deep reinforcement learning. SCIENTIFIC REPORTS, 15(1), 17.
MLA: Humayoo, Mahammad, et al. "Relative importance sampling for off-policy actor-critic in deep reinforcement learning." SCIENTIFIC REPORTS 15.1 (2025): 17.
Files in This Item:
There are no files associated with this item.