Variation-Aware Bernstein-Based Upper Confidence Reinforcement Learning for Environment With Endogenous and Exogenous Uncertainty
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11028620/ |
| Summary: | Online Reinforcement Learning (RL) has yielded remarkable performance in dynamic wireless communication and networks by interacting with the environment and gradually improving the effectiveness of its policy. Because such environments are normally highly uncertain owing to the intrinsic randomness of channels and service demands, designing a sample-efficient RL algorithm with bounded regret has significant merit. In this paper, we focus on general Markov Decision Processes (MDPs) whose time-evolving rewards and state transition probabilities are unknown a priori, and we develop a Variation-aware Bernstein-based Upper Confidence Reinforcement Learning (VB-UCRL) algorithm. In particular, we allow VB-UCRL to restart according to a variation-aware schedule. We overcome the challenges posed by both endogenous and exogenous uncertainty and establish a regret bound that saves up to $\sqrt{S}$ or $S^{\frac{1}{6}}T^{\frac{1}{12}}$ compared with the latest results in the literature, where $S$ denotes the size of the state space of the MDP and $T$ denotes the iteration index of the learning time-steps. Finally, we show via simulation that VB-UCRL significantly outperforms the existing algorithms in the literature. |
| ISSN: | 2169-3536 |
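
To give a feel for the size of the two terms quoted in the summary, the sketch below evaluates $\sqrt{S}$ and $S^{\frac{1}{6}}T^{\frac{1}{12}}$ for a few example values. The function name `savings_terms` and the sample values of S and T are hypothetical and chosen only for illustration. The record does not say which baselines the terms are measured against or how they enter the regret bound, so this is plain arithmetic on the two expressions and not a reproduction of the paper's analysis.

```python
import math


def savings_terms(num_states: int, horizon: int) -> tuple[float, float]:
    """Evaluate the two terms quoted in the summary.

    num_states stands for S (size of the state space) and horizon stands
    for T (learning time-steps). Both names are illustrative assumptions.
    Returns (sqrt(S), S^(1/6) * T^(1/12)).
    """
    if num_states < 1 or horizon < 1:
        raise ValueError("num_states and horizon must be positive integers")
    sqrt_s = math.sqrt(num_states)
    mixed = num_states ** (1 / 6) * horizon ** (1 / 12)
    return sqrt_s, mixed


if __name__ == "__main__":
    # Hypothetical problem sizes, not taken from the paper.
    for s, t in [(64, 4096), (100, 10**6), (1000, 10**8)]:
        root_term, mixed_term = savings_terms(s, t)
        print(f"S={s}, T={t}: sqrt(S)={root_term:.2f}, "
              f"S^(1/6)*T^(1/12)={mixed_term:.2f}")
```

The second term grows very slowly in T because of the 1/12 exponent, so for moderate horizons the $\sqrt{S}$ term is the larger of the two in these examples.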