Offline Safe Reinforcement Learning for Sepsis Treatment: Tackling Variable-Length Episodes with Sparse Rewards
| Main Authors: | Rui Tu, Zhipeng Luo, Chuanliang Pan, Zhong Wang, Jie Su, Yu Zhang, Yifan Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer Nature, 2025-02-01 |
| Series: | Human-Centric Intelligent Systems |
| Subjects: | Offline reinforcement learning; Intermediate rewards; Variable-length time series; Sepsis treatment |
| Online Access: | https://doi.org/10.1007/s44230-025-00093-7 |
| author | Rui Tu, Zhipeng Luo, Chuanliang Pan, Zhong Wang, Jie Su, Yu Zhang, Yifan Wang |
|---|---|
| description | Abstract In critical medicine, data-driven methods that assist physicians' decisions must respond accurately while keeping safety risks controllable. Most recent reinforcement learning models developed for clinical research use fixed-length, very short time series data. Unfortunately, such methods generalize poorly to variable-length data, which can be very long; in that case, a single terminal reward signal becomes very sparse. Meanwhile, many models overlook safety, leading them to make excessively extreme recommendations. In this paper, we study how to recommend effective and safe treatments for critically ill septic patients. We develop an offline reinforcement learning model based on CQL (Conservative Q-Learning), which underestimates the expected rewards of treatments rarely seen in the data and thus maintains a high safety standard. We further enhance the model with intermediate rewards derived from the APACHE II scoring system, which lets it deal effectively with variable-length episodes and sparse rewards. Extensive experiments on the MIMIC-III database demonstrate the model's improved performance and robustness in safety. Our code for data extraction, preprocessing, and modeling can be found at https://github.com/OOPSDINOSAUR/RL_safety_model. (Illustrative sketches of the two core techniques follow the record below.) |
| format | Article |
| issn | 2667-1336 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | Springer Nature |
| series | Human-Centric Intelligent Systems |
| affiliations | Rui Tu, Zhipeng Luo, Yifan Wang: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Chuanliang Pan, Zhong Wang, Jie Su: Department of Intensive Care Units, The Third People's Hospital; Yu Zhang: Department of Intensive Care Units, Tangshan People's Hospital |
| title | Offline Safe Reinforcement Learning for Sepsis Treatment: Tackling Variable-Length Episodes with Sparse Rewards |
| topic | Offline reinforcement learning; Intermediate rewards; Variable-length time series; Sepsis treatment |
| url | https://doi.org/10.1007/s44230-025-00093-7 |
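The abstract's first key idea is the conservative Q-learning penalty: push down the Q-values of treatments that rarely appear in the offline data, so the learned policy never prefers them. Below is a minimal PyTorch sketch of the discrete-action CQL objective, assuming a simple MLP Q-network and a discretized treatment action space; the architecture and the `gamma` and `alpha` values are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Small MLP Q-network over a discrete treatment action space
    (sizes are illustrative, not the paper's)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    """One CQL step on an offline batch: the usual TD error plus a
    conservative term that underestimates rarely seen treatments."""
    s, a, r, s_next, done = batch                        # offline transitions
    q_all = q_net(s)                                     # Q(s, .) for all actions
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, logged action)

    with torch.no_grad():                                # bootstrapped target
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_data, target)
    # logsumexp over all actions minus the logged action's value:
    # minimizing this pushes down Q for out-of-distribution treatments.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative
```

Raising `alpha` trades expected return for conservatism, which is the safety knob the abstract alludes to.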
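The second key idea is replacing the single, sparse end-of-stay reward with intermediate rewards built from APACHE II severity scores. The sketch below rewards each step by the drop in severity and keeps a terminal reward for the patient outcome; the scaling factor `k` and the terminal magnitude are illustrative assumptions, as the paper's exact shaping scheme is not given in this record.

```python
def shaped_rewards(apache_scores, survived, k=0.1, terminal=15.0):
    """Convert a variable-length episode's APACHE II trajectory into dense
    per-step rewards: a falling severity score earns a positive reward, and
    the final step also carries the survival outcome. `k` and `terminal`
    are illustrative values, not taken from the paper."""
    rewards = [k * (prev - curr)
               for prev, curr in zip(apache_scores[:-1], apache_scores[1:])]
    if not rewards:                  # degenerate single-measurement episode
        rewards = [0.0]
    rewards[-1] += terminal if survived else -terminal
    return rewards

# Severity improves 24 -> 20 -> 18 and the patient survives:
# every step now carries signal instead of one sparse terminal reward.
print(shaped_rewards([24, 20, 18], survived=True))   # [0.4, 15.2]
```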