Online Attentive Kernel-Based Off-Policy Temporal Difference Learning

Temporal difference (TD) learning is a powerful framework for value function approximation in reinforcement learning. However, standard TD methods often struggle with feature representation and with the challenges of off-policy learning. In this paper, we propose a novel framework, online attentive kernel-based off-policy TD learning, and, by combining it with well-known algorithms, introduce OAKGTD2, OAKTDC, and OAKETD. The framework uses two-timescale optimization: on the slow timescale, a sparse representation of state features is learned with an online attentive kernel-based method; on the fast timescale, auxiliary variables are used to update the value-function parameters in the off-policy setting. We theoretically prove the convergence of all three algorithms. Through experiments in several standard reinforcement learning environments, we demonstrate the effectiveness of the improved algorithms and compare their performance with existing algorithms. In terms of cumulative reward, the proposed algorithms achieve an average improvement of 15% over on-policy algorithms and 25% over common off-policy algorithms.
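
The framework described in the abstract couples a slow, attention-driven kernel feature representation with fast, gradient-TD-style off-policy updates. The Python sketch below is only a rough illustration of that two-timescale structure, not the authors' implementation: it assumes Gaussian kernel features, a hard top-k attention gate, and GTD2-style fast updates with an importance-sampling ratio; the class name, step sizes, and the particular slow-timescale attention update are hypothetical.

import numpy as np

class OAKGTD2Sketch:
    """Illustrative two-timescale off-policy TD learner (not the paper's code)."""

    def __init__(self, centers, bandwidth=0.5, alpha=0.1, beta=0.2, eta=0.01,
                 gamma=0.99, top_k=5):
        self.centers = np.asarray(centers, dtype=float)  # kernel centers, shape (n_centers, state_dim)
        self.bandwidth = bandwidth
        n = len(self.centers)
        self.attn = np.zeros(n)      # slow-timescale attention scores over kernels
        self.theta = np.zeros(n)     # value-function weights (fast timescale)
        self.w = np.zeros(n)         # GTD2 auxiliary weights (fast timescale)
        self.alpha, self.beta, self.eta = alpha, beta, eta
        self.gamma = gamma
        self.top_k = top_k           # sparsity: keep only the top-k attended kernels

    def _kernels(self, s):
        # Gaussian kernel activations of state s at every center.
        d2 = np.sum((self.centers - np.asarray(s, dtype=float)) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.bandwidth ** 2))

    def features(self, s):
        # Sparse feature vector: kernel activations gated by the top-k attention scores.
        k = self._kernels(s)
        gate = np.zeros_like(self.attn)
        gate[np.argsort(self.attn)[-self.top_k:]] = 1.0
        return gate * k

    def update(self, s, r, s_next, rho):
        # rho: importance-sampling ratio pi(a|s) / mu(a|s) for the sampled action.
        phi, phi_next = self.features(s), self.features(s_next)
        delta = r + self.gamma * phi_next @ self.theta - phi @ self.theta
        # Fast timescale: standard GTD2 updates with the auxiliary variable w.
        self.theta += self.alpha * rho * (phi @ self.w) * (phi - self.gamma * phi_next)
        self.w += self.beta * rho * (delta - phi @ self.w) * phi
        # Slow timescale: a semi-gradient nudge of the attention scores toward
        # kernels whose activation would reduce the TD error (illustrative only).
        self.attn += self.eta * rho * delta * self._kernels(s) * self.theta
        return delta

In a training loop, update(s, r, s_next, rho) would be called on each transition collected by the behavior policy, with rho the ratio between target-policy and behavior-policy probabilities of the sampled action.
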
Bibliographic Details
Main Authors: Shangdong Yang, Shuaiqiang Zhang, Xingguo Chen (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Format: Article
Language: English
Published: MDPI AG, 2024-11-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app142311114
Subjects: online attentive learning; kernel-based methods; reinforcement learning; off-policy temporal difference learning; two-timescale analysis
Online Access: https://www.mdpi.com/2076-3417/14/23/11114
Collection: DOAJ