Offline Safe Reinforcement Learning for Sepsis Treatment: Tackling Variable-Length Episodes with Sparse Rewards

Abstract: In critical care medicine, data-driven methods that assist physician decisions must respond accurately and keep safety risks controllable. Most recent reinforcement learning models developed for clinical research use fixed-length, very short time series. Unfortunately, such methods generalize poorly to variable-length data, which can be very long; in that case, a single terminal reward signal becomes extremely sparse. Meanwhile, safety is overlooked by many models, leading them to make excessively extreme recommendations. In this paper, we study how to recommend effective and safe treatments for critically ill septic patients. We develop an offline reinforcement learning model based on CQL (Conservative Q-Learning), which underestimates the expected rewards of treatments rarely seen in the data and thus maintains a high safety standard. We further enhance the model with intermediate rewards derived from the APACHE II scoring system, which effectively handles variable-length episodes with sparse rewards. Extensive experiments on the MIMIC-III database demonstrate improved performance and robust safety. Our code for data extraction, preprocessing, and modeling is available at https://github.com/OOPSDINOSAUR/RL_safety_model.
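The abstract names two concrete mechanisms: an intermediate reward computed from the APACHE II severity score, which densifies the single terminal survival/death signal over long, variable-length ICU episodes, and the CQL penalty, which pushes down the estimated value of treatments rarely chosen by clinicians in the data. The Python sketch below illustrates both under stated assumptions; it is not the authors' released implementation (that lives at the GitHub repository linked above), and the network architecture, the discrete treatment space, and hyperparameters such as `cql_alpha` are illustrative guesses.

```python
# Minimal sketch of the two mechanisms named in the abstract. This is NOT the
# authors' released code (see the GitHub link above); network sizes, the
# discrete treatment space, and all hyperparameters are illustrative guesses.
import torch
import torch.nn as nn


def intermediate_reward(apache2_before: float, apache2_after: float,
                        scale: float = 0.1) -> float:
    """Reward a drop in the APACHE II score between successive time steps.

    Lower APACHE II means less severe illness, so the score difference
    densifies the otherwise sparse terminal survival/death signal over
    long, variable-length episodes.
    """
    return scale * (apache2_before - apache2_after)


class QNetwork(nn.Module):
    """Q(s, a) over a discrete treatment space (e.g., binned fluid and
    vasopressor doses, a common choice in sepsis RL studies)."""

    def __init__(self, state_dim: int, num_treatments: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_treatments),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def cql_loss(q_net, target_net, batch, gamma=0.99, cql_alpha=1.0):
    """One offline training loss: Bellman error plus the CQL penalty.

    The term logsumexp_a Q(s, a) - Q(s, a_clinician) lowers the Q-values of
    treatments rarely taken in the data, which is the conservatism the
    abstract credits for safety.
    """
    q_all = q_net(batch["state"])                  # (B, num_treatments)
    # batch["action"] holds integer treatment indices, shape (B,)
    q_taken = q_all.gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_net(batch["next_state"]).max(dim=1).values
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q

    bellman = nn.functional.mse_loss(q_taken, target)
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return bellman + cql_alpha * conservative
```

In this reading, `batch["reward"]` would already include the APACHE II intermediate reward added to any terminal outcome reward, so the same Bellman backup handles both dense and sparse components.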

Bibliographic Details
Main Authors: Rui Tu, Zhipeng Luo, Chuanliang Pan, Zhong Wang, Jie Su, Yu Zhang, Yifan Wang
Format: Article
Language: English
Published: Springer Nature, 2025-02-01
Series: Human-Centric Intelligent Systems
ISSN: 2667-1336
Subjects: Offline reinforcement learning; Intermediate rewards; Variable-length time series; Sepsis treatment
Author Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Department of Intensive Care Units, The Third People's Hospital; Department of Intensive Care Units, Tangshan People's Hospital
Online Access: https://doi.org/10.1007/s44230-025-00093-7