MOOR: Model-based offline policy optimization with a risk dynamics model

Abstract: Offline reinforcement learning (RL) has been widely used in safety-critical domains because it avoids dangerous and costly online interaction. A significant challenge is addressing uncertainties and risks outside of the offline data. Risk-sensitive offline RL attempts to solve this issue through risk aversion. However, current model-based approaches extract only state-transition and reward information with their dynamics models; they cannot capture the risk information implicit in offline data and may therefore misuse high-risk data. In this work, we propose a model-based offline policy optimization approach with a risk dynamics model (MOOR). Specifically, we construct a risk dynamics model using a quantile network that learns the risk information of the data, and we then reshape model-generated data based on the errors of the risk dynamics model and the risk information of the data. Finally, we use a risk-averse algorithm to learn the policy on the combined dataset of offline and generated data. We theoretically prove that MOOR can identify the risk information of data and avoid utilizing high-risk data, and our experiments show that MOOR outperforms existing approaches and achieves state-of-the-art results on risk-sensitive D4RL and risky navigation tasks.
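
The abstract describes a three-step pipeline: train a quantile-network risk dynamics model, relabel model-generated data using that model's errors and risk information, then run a risk-averse policy-learning algorithm on the combined data. The snippet below is a minimal illustrative sketch of the first two steps, not the authors' implementation: it assumes PyTorch, and the names `RiskDynamicsModel`, `quantile_huber_loss`, `relabel_rewards`, and the penalty coefficients `lam` and `beta` are hypothetical, chosen only to make the idea concrete.

```python
# Minimal sketch (not the paper's code): a quantile network as a "risk dynamics model"
# and an error/risk-based reward relabelling step for model-generated transitions.
import torch
import torch.nn as nn


class RiskDynamicsModel(nn.Module):
    """Predicts n_quantiles quantiles of a per-transition risk/return signal,
    so the tails of the distribution (the risk information) are modelled,
    not just the mean that an ordinary dynamics model would capture."""

    def __init__(self, state_dim, action_dim, n_quantiles=32, hidden=256):
        super().__init__()
        self.n_quantiles = n_quantiles
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles),  # one output per quantile level
        )

    def forward(self, state, action):
        # (batch, n_quantiles) predicted quantile values for the risk signal
        return self.net(torch.cat([state, action], dim=-1))


def quantile_huber_loss(pred_quantiles, target, kappa=1.0):
    """Standard quantile-regression (Huber) loss used to fit distributional models."""
    n = pred_quantiles.shape[-1]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n      # midpoint quantile levels
    diff = target.unsqueeze(-1) - pred_quantiles                 # (batch, n_quantiles)
    huber = torch.where(diff.abs() <= kappa,
                        0.5 * diff.pow(2),
                        kappa * (diff.abs() - 0.5 * kappa))
    return ((taus - (diff.detach() < 0).float()).abs() * huber / kappa).mean()


def relabel_rewards(reward, model_error, risk_tail, lam=1.0, beta=1.0):
    """Down-weight model-generated transitions whose risk-model error is large or
    whose predicted lower tail indicates high risk, before policy learning on the
    combined offline + generated dataset."""
    return reward - lam * model_error - beta * risk_tail
```

Quantile regression is used here because it recovers the shape of the predicted distribution rather than only its expectation, which is what carries the risk information the abstract refers to; the relabelled rewards then feed a risk-averse offline RL algorithm of the reader's choosing.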

Bibliographic Details
Main Authors: Xiaolong Su, Peng Li, Shaofei Chen
Affiliation: College of Intelligence Science and Technology, National University of Defense Technology
Format: Article
Language: English
Published: Springer, 2024-11-01
Series: Complex & Intelligent Systems
ISSN: 2199-4536, 2198-6053
Subjects: Offline Reinforcement Learning; Risk-sensitive Reinforcement Learning; Risk Dynamics Model; Reward Relabelling
Online Access: https://doi.org/10.1007/s40747-024-01621-x