Offline reinforcement learning combining generalized advantage estimation and modality decomposition interaction
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-05-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-98572-1 |
| Summary: | Transformers show great potential in offline reinforcement learning via trajectory sequence modeling for action prediction. However, existing Transformer-based methods face limitations such as ineffective trajectory stitching and the neglect of deep interactions within and between the modalities of trajectory information. We propose CGM, an offline reinforcement learning approach that combines Generalized Advantage Estimation with Modality Decomposition Interaction (MDI) to address these challenges. Generalized Advantage Estimation relabels the dataset to improve the effectiveness of trajectory stitching (a minimal sketch of the standard estimator follows this record). MDI consists of an encoder and a decoder. The encoder integrates an intra-modal interaction mechanism based on ConvFormer and an inter-modal interaction mechanism based on a dual-Transformer architecture, enabling information exchange within and across modalities. In intra-modal interaction, the convolutional properties of ConvFormer capture the associative information within the state and action modalities. In inter-modal interaction, the dual-Transformer architecture exchanges multimodal information for states and actions separately, exploring latent correlations between the modalities to achieve deep cross-modal interaction. The decoder uses advantage values to optimize action prediction. We compared CGM with state-of-the-art baselines on the D4RL benchmark; on the MuJoCo tasks, CGM outperforms the strongest baseline by 2.89%. |
| ISSN: | 2045-2322 |
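
As referenced in the summary above, the sketch below shows the standard Generalized Advantage Estimation recursion (Schulman et al., 2015), which CGM uses to relabel the offline dataset. This record does not describe how CGM obtains its value estimates or how the relabeled quantities enter training, so the function name, array shapes, and usage example here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard Generalized Advantage Estimation (Schulman et al., 2015).

    rewards: shape (T,)   -- per-step rewards of one stored trajectory
    values:  shape (T+1,) -- value estimates V(s_0), ..., V(s_T); use 0 for
                             V(s_T) at a terminal state
    Computes, in a single backward pass,
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = delta_t + gamma * lam * A_{t+1},
    which equals the discounted sum A_t = sum_{l>=0} (gamma*lam)^l * delta_{t+l}.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Hypothetical usage: attach advantage labels to one offline trajectory; the
# summary states that CGM's decoder conditions action prediction on such
# advantage values.
rewards = np.array([1.0, 0.5, 0.0, 2.0])
values = np.array([1.2, 0.9, 0.7, 1.5, 0.0])  # terminal state: V(s_T) = 0
advantages = gae_advantages(rewards, values)
```

The `lam` parameter trades bias against variance: `lam = 0` reduces the estimator to the one-step TD error, while `lam = 1` recovers the Monte Carlo advantage; 0.95 is a common default.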