Online Attentive Kernel-Based Off-Policy Temporal Difference Learning

Temporal difference (TD) learning is a powerful framework for value function approximation in reinforcement learning. However, standard TD methods often struggle with feature representation and with the challenges of off-policy learning. In this paper, we propose a novel framework, online attentive kernel-based off-policy TD learning, and, by combining it with well-known algorithms, introduce OAKGTD2, OAKTDC, and OAKETD. The framework uses two-timescale optimization: on the slow timescale, a sparse representation of state features is learned with an online attentive kernel-based method; on the fast timescale, auxiliary variables are used to update the value-function parameters in the off-policy setting. We theoretically prove the convergence of all three algorithms. Through experiments in several standard reinforcement learning environments, we demonstrate the effectiveness of the improved algorithms and compare their performance with existing algorithms. In terms of cumulative reward, the proposed algorithms achieve an average improvement of 15% over on-policy algorithms and 25% over common off-policy algorithms.
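
The framework described in the abstract couples a slow, attention-driven kernel feature representation with fast, gradient-TD-style off-policy updates. The Python sketch below is only a rough illustration of that two-timescale structure, not the authors' implementation: it assumes Gaussian kernel features, a hard top-k attention gate, and GTD2-style fast updates with an importance-sampling ratio; the class name, step sizes, and the particular slow-timescale attention update are hypothetical.

import numpy as np

class OAKGTD2Sketch:
    """Illustrative two-timescale off-policy TD learner (not the paper's code)."""

    def __init__(self, centers, bandwidth=0.5, alpha=0.1, beta=0.2, eta=0.01,
                 gamma=0.99, top_k=5):
        self.centers = np.asarray(centers, dtype=float)  # kernel centers, shape (n_centers, state_dim)
        self.bandwidth = bandwidth
        n = len(self.centers)
        self.attn = np.zeros(n)      # slow-timescale attention scores over kernels
        self.theta = np.zeros(n)     # value-function weights (fast timescale)
        self.w = np.zeros(n)         # GTD2 auxiliary weights (fast timescale)
        self.alpha, self.beta, self.eta = alpha, beta, eta
        self.gamma = gamma
        self.top_k = top_k           # sparsity: keep only the top-k attended kernels

    def _kernels(self, s):
        # Gaussian kernel activations of state s at every center.
        d2 = np.sum((self.centers - np.asarray(s, dtype=float)) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.bandwidth ** 2))

    def features(self, s):
        # Sparse feature vector: kernel activations gated by the top-k attention scores.
        k = self._kernels(s)
        gate = np.zeros_like(self.attn)
        gate[np.argsort(self.attn)[-self.top_k:]] = 1.0
        return gate * k

    def update(self, s, r, s_next, rho):
        # rho: importance-sampling ratio pi(a|s) / mu(a|s) for the sampled action.
        phi, phi_next = self.features(s), self.features(s_next)
        delta = r + self.gamma * phi_next @ self.theta - phi @ self.theta
        # Fast timescale: standard GTD2 updates with the auxiliary variable w.
        self.theta += self.alpha * rho * (phi @ self.w) * (phi - self.gamma * phi_next)
        self.w += self.beta * rho * (delta - phi @ self.w) * phi
        # Slow timescale: a semi-gradient nudge of the attention scores toward
        # kernels whose activation would reduce the TD error (illustrative only).
        self.attn += self.eta * rho * delta * self._kernels(s) * self.theta
        return delta

In a training loop, update(s, r, s_next, rho) would be called on each transition collected by the behavior policy, with rho the ratio between target-policy and behavior-policy probabilities of the sampled action.
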
Bibliographic Details
Main Authors: Shangdong Yang, Shuaiqiang Zhang, Xingguo Chen (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Format: Article
Language: English
Published: MDPI AG, 2024-11-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app142311114
Subjects: online attentive learning; kernel-based methods; reinforcement learning; off-policy temporal difference learning; two-timescale analysis
Online Access: https://www.mdpi.com/2076-3417/14/23/11114
Collection: DOAJ