RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints
Abstract Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for data is rapidly increasing. However, available retrosynthesis data are limited...
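The data-generation step described in the abstract, applying reaction templates to molecules to enumerate new reactions, can be illustrated with a short sketch. This is not the authors' pipeline; it only shows, under assumed inputs, how a single retrosynthesis template (a reaction SMARTS) applied to a product yields candidate reactant sets using RDKit. The template and the example product are illustrative assumptions.

```python
# Minimal sketch (assumed inputs, not the authors' pipeline): apply one
# retrosynthesis template (reaction SMARTS) to a product molecule and
# enumerate candidate reactant sets with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical retro-template: disconnect an amide into a carboxylic acid and an amine.
retro_template = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[NH:3][C:4]>>[C:1](=[O:2])[OH].[NH2:3][C:4]"
)

# Example product (an assumption for demonstration): N-benzylacetamide.
product = Chem.MolFromSmiles("CC(=O)NCc1ccccc1")

# Each outcome is one candidate reactant set implied by this disconnection;
# a template library applied to many products yields reaction data at scale.
for reactant_set in retro_template.RunReactants((product,)):
    smiles = []
    for mol in reactant_set:
        Chem.SanitizeMol(mol)  # outputs of RunReactants are unsanitized
        smiles.append(Chem.MolToSmiles(mol))
    print(" + ".join(smiles), ">>", Chem.MolToSmiles(product))
```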
| Main Authors: | Yafeng Deng, Xinda Zhao, Hanyu Sun, Yu Chen, Xiaorui Wang, Xi Xue, Liangning Li, Jianfei Song, Chang-Yu Hsieh, Tingjun Hou, Xiandao Pan, Taghrid Saad Alomar, Xiangyang Ji, Xiaojian Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Nature Communications |
| Online Access: | https://doi.org/10.1038/s41467-025-62308-6 |
| author | Yafeng Deng, Xinda Zhao, Hanyu Sun, Yu Chen, Xiaorui Wang, Xi Xue, Liangning Li, Jianfei Song, Chang-Yu Hsieh, Tingjun Hou, Xiandao Pan, Taghrid Saad Alomar, Xiangyang Ji, Xiaojian Wang |
|---|---|
| collection | DOAJ |
| description | Abstract Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for data is rapidly increasing. However, available retrosynthesis data are limited to only millions of reactions. Therefore, we pioneer the use of a template-based algorithm to generate chemical reaction data, producing over 10 billion reaction datapoints. A generative pretrained transformer model is subsequently developed for template-free retrosynthesis planning by pre-training on the 10 billion generated datapoints. Inspired by the strategies of large language models, we introduce reinforcement learning to capture the relationships among products, reactants, and templates more accurately. Experiments demonstrate that our model achieves state-of-the-art performance on the benchmark, with a Top-1 accuracy of 63.4%, substantially outperforming previous models. |
| format | Article |
| id | doaj-art-baa619b0ced246dbabf59d0ce34210cc |
| institution | Kabale University |
| issn | 2041-1723 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Nature Communications |
| spelling | Nature Communications (Nature Portfolio), ISSN 2041-1723, 2025-07-01, https://doi.org/10.1038/s41467-025-62308-6. RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints. Author affiliations: Yafeng Deng (Department of Automation, Tsinghua University); Xinda Zhao (Hangzhou Carbonsilicon AI Technology Co., Ltd); Hanyu Sun (State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences); Yu Chen (Hangzhou Carbonsilicon AI Technology Co., Ltd); Xiaorui Wang (Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University); Xi Xue (State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences); Liangning Li (State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences); Jianfei Song (Hangzhou Carbonsilicon AI Technology Co., Ltd); Chang-Yu Hsieh (Hangzhou Carbonsilicon AI Technology Co., Ltd); Tingjun Hou (Hangzhou Carbonsilicon AI Technology Co., Ltd); Xiandao Pan (State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences); Taghrid Saad Alomar (Department of Chemistry, College of Science, Princess Nourah bint Abdulrahman University); Xiangyang Ji (Department of Automation, Tsinghua University); Xiaojian Wang (Hangzhou Carbonsilicon AI Technology Co., Ltd). |
| title | RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints |
| url | https://doi.org/10.1038/s41467-025-62308-6 |