RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints

Abstract Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for data is rapidly increasing; however, available retrosynthesis data are limited to only a few million reactions. We therefore pioneer the use of a template-based algorithm to generate chemical reaction data, producing over 10 billion reaction datapoints. A generative pre-trained transformer model is subsequently developed for template-free retrosynthesis planning by pre-training on the 10 billion generated datapoints. Inspired by the strategies of large language models, we introduce reinforcement learning to capture the relationships among products, reactants, and templates more accurately. Experiments demonstrate that our model achieves state-of-the-art performance on the benchmark, with a Top-1 accuracy of 63.4%, substantially outperforming previous models.

Bibliographic Details
Main Authors: Yafeng Deng, Xinda Zhao, Hanyu Sun, Yu Chen, Xiaorui Wang, Xi Xue, Liangning Li, Jianfei Song, Chang-Yu Hsieh, Tingjun Hou, Xiandao Pan, Taghrid Saad Alomar, Xiangyang Ji, Xiaojian Wang
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-62308-6
author Yafeng Deng
Xinda Zhao
Hanyu Sun
Yu Chen
Xiaorui Wang
Xi Xue
Liangning Li
Jianfei Song
Chang-Yu Hsieh
Tingjun Hou
Xiandao Pan
Taghrid Saad Alomar
Xiangyang Ji
Xiaojian Wang
collection DOAJ
description Abstract Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for data is rapidly increasing; however, available retrosynthesis data are limited to only a few million reactions. We therefore pioneer the use of a template-based algorithm to generate chemical reaction data, producing over 10 billion reaction datapoints. A generative pre-trained transformer model is subsequently developed for template-free retrosynthesis planning by pre-training on the 10 billion generated datapoints. Inspired by the strategies of large language models, we introduce reinforcement learning to capture the relationships among products, reactants, and templates more accurately. Experiments demonstrate that our model achieves state-of-the-art performance on the benchmark, with a Top-1 accuracy of 63.4%, substantially outperforming previous models.
format Article
id doaj-art-baa619b0ced246dbabf59d0ce34210cc
institution Kabale University
issn 2041-1723
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Nature Communications
spelling Nature Communications 16(1):1-14, 2025-07-01; ISSN 2041-1723; doi:10.1038/s41467-025-62308-6 (record added 2025-08-20T04:03:06Z)
Author affiliations:
Yafeng Deng: Department of Automation, Tsinghua University
Xinda Zhao: Hangzhou Carbonsilicon AI Technology Co., Ltd
Hanyu Sun: State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences
Yu Chen: Hangzhou Carbonsilicon AI Technology Co., Ltd
Xiaorui Wang: Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University
Xi Xue: State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences
Liangning Li: State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences
Jianfei Song: Hangzhou Carbonsilicon AI Technology Co., Ltd
Chang-Yu Hsieh: Hangzhou Carbonsilicon AI Technology Co., Ltd
Tingjun Hou: Hangzhou Carbonsilicon AI Technology Co., Ltd
Xiandao Pan: State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences
Taghrid Saad Alomar: Department of Chemistry, College of Science, Princess Nourah bint Abdulrahman University
Xiangyang Ji: Department of Automation, Tsinghua University
Xiaojian Wang: Hangzhou Carbonsilicon AI Technology Co., Ltd
title RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints
url https://doi.org/10.1038/s41467-025-62308-6