Safety Verification of Non-Deterministic Policies in Reinforcement Learning

Bibliographic Details
Main Authors: Ryeonggu Kwon, Gihwon Kwon
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10786219/
Description
Summary: Reinforcement Learning (RL) represents a powerful paradigm in artificial intelligence, enabling agents to learn optimal behaviors through interactions with their environment. However, ensuring the safety of policies learned in non-deterministic environments, where outcomes are inherently uncertain and variable, remains a critical challenge. Safety in this context refers to the assurance that an agent's actions will not lead to undesirable or hazardous states, which is paramount in high-risk domains like autonomous driving and healthcare. This study proposes a novel methodology to rigorously verify the safety properties of non-deterministic policies in RL. We transform non-deterministic policies into Markov Decision Process (MDP) models and employ the state-of-the-art probabilistic model checker STORM for formal verification. Our approach achieves a maximum probability of 1.0 for reaching the goal state in non-slippery environments, while in slippery environments the probability declines to approximately 0.041 on the 4×4 map and 0.0023 on the 8×8 map. Additionally, we observed a significant reduction in the probability of not reaching the goal state in non-slippery conditions, while slippery conditions increased the likelihood of failure, with probabilities as high as 0.934 in complex scenarios. Empirical validation through extensive experiments in the FrozenLake environment highlights the challenges and effectiveness of our approach under varying conditions. Our contributions are fourfold: First, we propose a formal verification method for RL policies in non-deterministic environments. Second, we provide a comprehensive framework for rigorously modeling and analyzing non-deterministic policy safety properties by integrating MDP models with the STORM model checker. Third, we use model-free RL to generate an MDP model from the trained policy, which allows us to evaluate goal achievement and identify potential unsafe states caused by non-deterministic environments. Fourth, our extensive empirical validation offers concrete insights into non-deterministic policy verification challenges and solutions. In conclusion, this study addresses the pivotal issue of ensuring the safety of non-deterministic policies in RL. Through formal verification techniques and comprehensive empirical validation, we provide a robust methodology that enhances the trustworthiness and applicability of non-deterministic policies in real-world applications.
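The verification step described in the summary can be sketched with STORM's Python bindings (stormpy): assuming the policy-induced MDP has been exported in PRISM format, reachability queries of the form Pmax=? [ F "goal" ] give the maximum probability of reaching the goal state. The file name frozenlake_policy.prism and the state labels "goal" and "hole" below are illustrative assumptions, not artifacts from the paper.

    # Minimal sketch: querying reachability probabilities of a policy-induced
    # MDP with the STORM model checker via its Python bindings (stormpy).
    # The model file and the labels "goal"/"hole" are assumed for illustration.
    import stormpy

    # Parse the PRISM-format MDP exported from the trained FrozenLake policy.
    program = stormpy.parse_prism_program("frozenlake_policy.prism")

    # Two reachability queries: maximum probability of eventually reaching the
    # goal state, and maximum probability of eventually falling into a hole.
    properties = stormpy.parse_properties(
        'Pmax=? [ F "goal" ]; Pmax=? [ F "hole" ]', program
    )

    # Build the MDP and evaluate each property in the initial state.
    model = stormpy.build_model(program, properties)
    init = model.initial_states[0]
    for prop in properties:
        result = stormpy.model_checking(model, prop)
        print(prop.raw_formula, "=", result.at(init))

On a non-slippery map, the first query should report 1.0, matching the abstract; on slippery maps, the same query is expected to yield the much lower probabilities reported above.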
ISSN:2169-3536