Institute of Computing Technology, Chinese Academy of Sciences IR
Relative importance sampling for off-policy actor-critic in deep reinforcement learning
Humayoo, Mahammad1,2,3; Zheng, Gengzhong1; Dong, Xiaoqing1; Miao, Liming1; Qiu, Shuwei1; Zhou, Zexun1; Wang, Peitao1; Ullah, Zakir3,4; Junejo, Naveed Ur Rehman1; Cheng, Xueqi2,3
2025-04-24
Journal | SCIENTIFIC REPORTS
ISSN | 2045-2322
Volume | 15; Issue: 1; Pages: 17
Abstract | Off-policy learning exhibits greater instability than on-policy learning in reinforcement learning (RL). The difference in probability distribution between the target policy ($\pi$) and the behavior policy ($b$) is a major cause of this instability, and the same distributional mismatch is also a source of high variance. Importance sampling (IS) can reduce the discrepancy between the target policy's distribution and the behavior policy's distribution, but IS itself has high variance, which is exacerbated in sequential settings. We propose a smooth form of importance sampling, relative importance sampling (RIS), which mitigates variance and stabilizes learning. To control variance, we vary the smoothness parameter $\beta \in [0, 1]$ in RIS. Using this strategy, we develop the first model-free relative importance sampling off-policy actor-critic (RIS-off-PAC) algorithms in RL. Our method uses a network to generate the target policy (actor) and a value function (critic) to evaluate the current policy ($\pi$) from behavior-policy samples. Our algorithms are trained using behavior-policy action values in the reward function rather than target-policy ones. Both the actor and the critic are trained with deep neural networks. Our methods perform better than or on par with several state-of-the-art RL benchmarks on OpenAI Gym challenges and synthetic datasets.
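The abstract describes RIS as a smoothed importance weight controlled by $\beta \in [0, 1]$ but does not spell out the weight formula. The sketch below is a minimal illustration, assuming the relative density-ratio form $w_\beta = \pi / ((1-\beta)\pi + \beta b)$, so that $\beta = 1$ recovers the ordinary IS ratio $\pi / b$ and $\beta = 0$ flattens every weight to 1; the paper's exact convention, the `ris_weight` helper, and the toy bandit setup are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ris_weight(pi_prob, b_prob, beta=0.5, eps=1e-8):
    """Relative importance-sampling weight (assumed form).

    Assumes w_beta = pi / ((1 - beta) * pi + beta * b):
    beta = 1 gives the ordinary IS ratio pi / b, beta = 0 gives weight 1.
    The paper's actual parameterization may differ.
    """
    return pi_prob / ((1.0 - beta) * pi_prob + beta * b_prob + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 3-armed bandit: samples come from behavior policy b,
    # we estimate the value of target policy pi.
    b = np.array([0.6, 0.3, 0.1])
    pi = np.array([0.1, 0.3, 0.6])
    rewards = np.array([1.0, 2.0, 3.0])            # expected reward per action

    actions = rng.choice(3, size=100_000, p=b)      # drawn from b, not pi
    r = rewards[actions]

    for beta in (1.0, 0.5, 0.1):
        w = ris_weight(pi[actions], b[actions], beta=beta)
        est = np.mean(w * r)                        # off-policy value estimate
        print(f"beta={beta:.1f}  estimate={est:.3f}  weight variance={w.var():.3f}")
    # True value under pi is (pi * rewards).sum() = 2.5. Lowering beta reduces
    # the variance of the weights but biases the estimate toward the
    # behavior-policy value (1.5), illustrating the smoothness trade-off.
```

Run with NumPy installed; the printed output shows the bias-variance trade-off that the smoothness parameter is meant to control.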
Keywords | Actor-critic (AC); Discrepancy; Variance; Importance sampling (IS); Off-policy; Relative importance sampling (RIS)
DOI | 10.1038/s41598-025-96201-5 |
Indexed In | SCI
Language | English
Funding Project | The Innovation Teams of Ordinary Universities in Guangdong Province[2021KCXTD038] ; The Innovation Teams of Ordinary Universities in Guangdong Province[2023KCXTD022] ; Innovation Teams of Ordinary Universities in Guangdong Province[2022KSYS003] ; Key Laboratory of Ordinary Universities in Guangdong Province[2022A1515010990] ; Natural Science Foundation of Guangdong Province[2022KTSCX079] ; Natural Science Foundation of Guangdong Province[2023ZDZX1013] ; Natural Science Foundation of Guangdong Province[2022ZDZX3012] ; Natural Science Foundation of Guangdong Province[2022ZDZX3011] ; Natural Science Foundation of Guangdong Province[2023ZDZX2038] ; Research Fund of the Department of Education of Guangdong Province[z25025] ; Chaozhou Engineering Technology Research Center[XY202105] ; Hanshan Normal University project
WOS Research Area | Science & Technology - Other Topics
WOS Category | Multidisciplinary Sciences
WOS Accession Number | WOS:001475579000029
Publisher | NATURE PORTFOLIO
Document Type | Journal Article
Identifier | http://119.78.100.204/handle/2XEOYT63/40627
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English)
Corresponding Authors | Humayoo, Mahammad; Zheng, Gengzhong; Dong, Xiaoqing
Author Affiliations | 1. Hanshan Normal Univ, Chaozhou 521041, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, CAS Key Lab Network Data Sci & Technol, Beijing 100190, Peoples R China; 3. Univ Chinese Acad Sci, Beijing 101408, Peoples R China; 4. Capital Univ Econ & Business, Sch Data Sci, Beijing 100070, Peoples R China
Recommended Citation (GB/T 7714) | Humayoo, Mahammad, Zheng, Gengzhong, Dong, Xiaoqing, et al. Relative importance sampling for off-policy actor-critic in deep reinforcement learning[J]. SCIENTIFIC REPORTS, 2025, 15(1): 17.
APA | Humayoo, Mahammad, Zheng, Gengzhong, Dong, Xiaoqing, Miao, Liming, Qiu, Shuwei, ... & Cheng, Xueqi. (2025). Relative importance sampling for off-policy actor-critic in deep reinforcement learning. SCIENTIFIC REPORTS, 15(1), 17.
MLA | Humayoo, Mahammad, et al. "Relative importance sampling for off-policy actor-critic in deep reinforcement learning". SCIENTIFIC REPORTS 15.1 (2025): 17.
Files in This Item | There are no files associated with this item.
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.