Moor: Model-based offline policy optimization with a risk dynamics model
Abstract Offline reinforcement learning (RL) has been widely used in safety-critical domains by avoiding dangerous and costly online interaction. A significant challenge is addressing uncertainties and risks outside of offline data. Risk-sensitive offline RL attempts to solve this issue through risk aversion. However, current model-based approaches extract only state transition and reward information using dynamics models, which cannot capture the risk information implicit in offline data and may result in the misuse of high-risk data. In this work, we propose a model-based offline policy optimization approach with a risk dynamics model (MOOR). Specifically, we construct a risk dynamics model using a quantile network that can learn the risk information of the data, and then reshape model-generated data based on the errors of the risk dynamics model and the risk information of the data. Finally, we use a risk-averse algorithm to learn the policy on the combined dataset of offline and generated data. We theoretically prove that MOOR can identify the risk information of data and avoid utilizing high-risk data, and our experiments show that MOOR outperforms existing approaches and achieves state-of-the-art results on risk-sensitive D4RL and risky navigation tasks.
Main Authors: | Xiaolong Su, Peng Li, Shaofei Chen |
---|---|
Format: | Article |
Language: | English |
Published: | Springer, 2024-11-01 |
Series: | Complex & Intelligent Systems |
Subjects: | Offline Reinforcement Learning; Risk-sensitive Reinforcement Learning; Risk Dynamics Model; Reward Relabelling |
Online Access: | https://doi.org/10.1007/s40747-024-01621-x |
_version_ | 1832571183188410368 |
---|---|
author | Xiaolong Su Peng Li Shaofei Chen |
author_facet | Xiaolong Su Peng Li Shaofei Chen |
author_sort | Xiaolong Su |
collection | DOAJ |
description | Abstract Offline reinforcement learning (RL) has been widely used in safety-critical domains by avoiding dangerous and costly online interaction. A significant challenge is addressing uncertainties and risks outside of offline data. Risk-sensitive offline RL attempts to solve this issue through risk aversion. However, current model-based approaches extract only state transition and reward information using dynamics models, which cannot capture the risk information implicit in offline data and may result in the misuse of high-risk data. In this work, we propose a model-based offline policy optimization approach with a risk dynamics model (MOOR). Specifically, we construct a risk dynamics model using a quantile network that can learn the risk information of the data, and then reshape model-generated data based on the errors of the risk dynamics model and the risk information of the data. Finally, we use a risk-averse algorithm to learn the policy on the combined dataset of offline and generated data. We theoretically prove that MOOR can identify the risk information of data and avoid utilizing high-risk data, and our experiments show that MOOR outperforms existing approaches and achieves state-of-the-art results on risk-sensitive D4RL and risky navigation tasks. |
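The abstract describes a three-step pipeline: a quantile network learns risk information from the data, model-generated transitions are relabelled using the risk model's errors and the estimated risk, and a risk-averse algorithm is then trained on the combined offline and generated data. The sketch below illustrates what such a pipeline could look like in PyTorch; it is a minimal sketch under assumptions, not the authors' implementation, and every name, penalty term, and weight in it (`QuantileRiskModel`, `quantile_huber_loss`, `relabel_reward`, `alpha`, `beta`) is illustrative rather than taken from the paper.

```python
# Hedged sketch of the quantile risk model and reward relabelling idea
# outlined in the abstract. All names and penalty choices are assumptions.
import torch
import torch.nn as nn


class QuantileRiskModel(nn.Module):
    """Predicts N quantiles of the outcome (e.g. return) for a (state, action) pair."""

    def __init__(self, state_dim, action_dim, n_quantiles=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles),
        )
        # Fixed quantile fractions tau_1 < ... < tau_N in (0, 1).
        self.register_buffer(
            "taus",
            (torch.arange(n_quantiles, dtype=torch.float32) + 0.5) / n_quantiles,
        )

    def forward(self, state, action):
        # Returns a (batch, n_quantiles) tensor of predicted quantile values.
        return self.net(torch.cat([state, action], dim=-1))


def quantile_huber_loss(pred_quantiles, target, taus, kappa=1.0):
    """Quantile-Huber regression loss, as commonly used to train quantile networks."""
    # target: (batch, 1) observed sample, broadcast against (batch, n_quantiles).
    td = target - pred_quantiles
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (taus - (td.detach() < 0).float()).abs()  # asymmetric quantile weighting
    return (weight * huber / kappa).mean()


def relabel_reward(reward, model_error, risk_quantiles, alpha=1.0, beta=1.0):
    """Penalize model-generated rewards by model error and a tail-risk estimate.

    `model_error` could be, e.g., ensemble disagreement on the predicted next
    state; the tail risk here is a CVaR-style mean of the lowest quantiles.
    Both penalty terms and their weights are illustrative assumptions.
    """
    k = max(1, risk_quantiles.shape[-1] // 10)  # lowest ~10% of quantiles
    tail_risk = -risk_quantiles.topk(k, dim=-1, largest=False).values.mean(dim=-1)
    return reward - alpha * model_error - beta * tail_risk
```

Under these assumptions, the relabelled rewards for model-generated transitions would be merged with the original offline dataset, and any risk-averse offline RL algorithm could then be trained on the combined data, matching the final step the abstract describes.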
format | Article |
id | doaj-art-92c83d0398244043abe7ab85bee3513b |
institution | Kabale University |
issn | 2199-4536 2198-6053 |
language | English |
publishDate | 2024-11-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | doaj-art-92c83d0398244043abe7ab85bee3513b2025-02-02T12:49:54ZengSpringerComplex & Intelligent Systems2199-45362198-60532024-11-0111111510.1007/s40747-024-01621-xMoor: Model-based offline policy optimization with a risk dynamics modelXiaolong Su0Peng Li1Shaofei Chen2College of Intelligence Science and Technology, National University of Defense TechnologyCollege of Intelligence Science and Technology, National University of Defense TechnologyCollege of Intelligence Science and Technology, National University of Defense TechnologyAbstract Offline reinforcement learning (RL) has been widely used in safety-critical domains by avoiding dangerous and costly online interaction. A significant challenge is addressing uncertainties and risks outside of offline data. Risk-sensitive offline RL attempts to solve this issue by risk aversion. However, current model-based approaches only extract state transition information and reward information using dynamics models, which cannot capture risk information implicit in offline data and may result in the misuse of high-risk data. In this work, we propose a model-based offline policy optimization approach with a risk dynamics model (MOOR). Specifically, we construct a risk dynamics model using a quantile network that can learn the risk information of data, then we reshape model-generated data based on errors of the risk dynamics model and the risk information of data. Finally, we use a risk-averse algorithm to learn the policy on the combined dataset of offline and generated data. We theoretically prove that MOOR can identify risk information of data and avoid utilizing high-risk data, our experiments show that MOOR outperforms existing approaches and achieves state-of-the-art results in risk-sensitive D4RL and risky navigation tasks.https://doi.org/10.1007/s40747-024-01621-xOffline Reinforcement LearningRisk-sensitive Reinforcement LearningRisk Dynamics ModelReward Relabelling |
spellingShingle | Xiaolong Su Peng Li Shaofei Chen Moor: Model-based offline policy optimization with a risk dynamics model Complex & Intelligent Systems Offline Reinforcement Learning Risk-sensitive Reinforcement Learning Risk Dynamics Model Reward Relabelling |
title | Moor: Model-based offline policy optimization with a risk dynamics model |
title_full | Moor: Model-based offline policy optimization with a risk dynamics model |
title_fullStr | Moor: Model-based offline policy optimization with a risk dynamics model |
title_full_unstemmed | Moor: Model-based offline policy optimization with a risk dynamics model |
title_short | Moor: Model-based offline policy optimization with a risk dynamics model |
title_sort | moor model based offline policy optimization with a risk dynamics model |
topic | Offline Reinforcement Learning Risk-sensitive Reinforcement Learning Risk Dynamics Model Reward Relabelling |
url | https://doi.org/10.1007/s40747-024-01621-x |
work_keys_str_mv | AT xiaolongsu moormodelbasedofflinepolicyoptimizationwithariskdynamicsmodel AT pengli moormodelbasedofflinepolicyoptimizationwithariskdynamicsmodel AT shaofeichen moormodelbasedofflinepolicyoptimizationwithariskdynamicsmodel |