Variation-Aware Bernstein-Based Upper Confidence Reinforcement Learning for Environment With Endogenous and Exogenous Uncertainty

Bibliographic Details
Main Authors: Ruoqi Wen, Rongpeng Li
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11028620/
Description
Summary: Online Reinforcement Learning (RL) has yielded remarkable performance in dynamic wireless communication and networks by interacting with the environment and gradually improving the effectiveness of its policy. As it is normal to witness much uncertainty in such an environment due to the intrinsic randomness of channels and service demands, designing a sample-efficient RL with bounded regrets has significant merits. In this paper, we focus on general Markov Decision Processes (MDPs) with time-evolving rewards and state transition probability unknown a priori and develop a Variation-aware Bernstein-based Upper Confidence Reinforcement Learning (VB-UCRL). In particular, we allow for restarting VB-UCRL according to a variation-aware schedule. We successfully overcome the challenges due to both endogenous and exogenous uncertainty and establish a regret bound of saving at most $\sqrt{S}$ or $S^{\frac{1}{6}}T^{\frac{1}{12}}$ compared with the latest results in the literature, where S denotes the size of the state space of the MDP and T indicates the iteration index of learning time-steps. Finally, we show via simulation that our algorithm VB-UCRL significantly outperforms the existing algorithms in the literature.
ISSN: 2169-3536