Online Attentive Kernel-Based Off-Policy Temporal Difference Learning
Temporal difference (TD) learning is a powerful framework for value function approximation in reinforcement learning. However, standard TD methods often struggle with feature representation and off-policy learning challenges. In this paper, we propose a novel framework, online attentive kernel-based off-policy TD learning, and, by combining it with well-known algorithms, introduce OAKGTD2, OAKTDC, and OAKETD. The framework uses two-timescale optimization: on the slow timescale, a sparse representation of state features is learned with an online attentive kernel-based method; on the fast timescale, auxiliary variables are used to update the value function parameters in the off-policy setting. We theoretically prove the convergence of all three algorithms. Through experiments in several standard reinforcement learning environments, we demonstrate the effectiveness of the improved algorithms and compare their performance with existing algorithms. In terms of cumulative reward, the proposed algorithms achieve an average improvement of 15% over on-policy algorithms and 25% over common off-policy algorithms.
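As a rough illustration of the two-timescale idea the abstract describes, the sketch below pairs a GTD2-style off-policy correction (fast timescale) with attention-weighted Gaussian kernel features whose attention weights are adjusted on a slower timescale. Every concrete choice here — the Gaussian kernels and centers, the `top_m` sparsification, the semi-gradient attention update, and the step sizes `alpha_fast` and `beta_slow` — is a hypothetical illustration, not the algorithm specified in the paper.

```python
# Illustrative sketch only: a two-timescale, kernel-based, GTD2-style update
# with attention-weighted Gaussian kernel features. Kernel centers, the
# attention rule, sparsification, and step sizes are assumptions made for
# illustration; they are not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1-D states, fixed Gaussian kernel centers.
centers = np.linspace(0.0, 1.0, 20)      # kernel centers c_i
sigma = 0.1                              # kernel bandwidth
gamma = 0.99                             # discount factor
alpha_fast = 0.05                        # fast timescale: value / auxiliary weights
beta_slow = 0.005                        # slow timescale: attention weights
top_m = 8                                # keep only the m largest features (sparsity)

theta = np.zeros(len(centers))           # value-function weights
w = np.zeros(len(centers))               # GTD2-style auxiliary weights
attn = np.ones(len(centers))             # attention weights over kernels

def kernels(s):
    """Raw Gaussian kernel activations k_i(s)."""
    return np.exp(-(s - centers) ** 2 / (2 * sigma ** 2))

def features(s):
    """Attention-weighted, sparsified kernel features phi(s)."""
    phi = attn * kernels(s)
    idx = np.argsort(phi)[:-top_m]       # indices of all but the top_m activations
    phi[idx] = 0.0                       # zero them out for a sparse representation
    return phi

def update(s, r, s_next, rho):
    """Process one transition; rho is the importance-sampling ratio behaviour -> target."""
    global theta, w, attn
    phi, phi_next = features(s), features(s_next)
    delta = r + gamma * theta @ phi_next - theta @ phi   # TD error

    # Fast timescale: GTD2-style corrections with an auxiliary weight vector.
    theta += alpha_fast * rho * (phi - gamma * phi_next) * (w @ phi)
    w += alpha_fast * rho * (delta - w @ phi) * phi

    # Slow timescale: nudge attention weights to reduce the squared TD error
    # (a simple semi-gradient on phi(s); one of many possible choices).
    attn += beta_slow * rho * delta * theta * kernels(s)
    attn = np.clip(attn, 0.0, None)      # keep attention non-negative

# Toy usage on random transitions of a hypothetical 1-D chain task.
for _ in range(1000):
    s = rng.uniform(0, 1)
    s_next = np.clip(s + rng.normal(0, 0.05), 0, 1)
    r = -abs(s_next - 0.5)               # hypothetical reward
    update(s, r, s_next, rho=1.0)        # rho = 1: behaviour equals target here
```

Presumably, swapping the GTD2-style correction on the fast timescale for TDC or emphatic-TD updates would give sketches analogous to OAKTDC and OAKETD, but the exact update rules are defined in the paper itself.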
| Main Authors: | Shangdong Yang, Shuaiqiang Zhang, Xingguo Chen |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-11-01 |
| Series: | Applied Sciences |
| Subjects: | online attentive learning; kernel-based methods; reinforcement learning; off-policy temporal difference learning; two-timescale analysis |
| Online Access: | https://www.mdpi.com/2076-3417/14/23/11114 |
| collection | DOAJ |
|---|---|
| institution | OA Journals |
| id | doaj-art-318b65f0260b42caabbbbb87af59ce5e |
| issn | 2076-3417 |
| doi | 10.3390/app142311114 |
| affiliation | Shangdong Yang, Shuaiqiang Zhang, and Xingguo Chen: School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China |