Instruction multi-constraint molecular generation using a teacher-student large language model

Bibliographic Details
Main Authors: Peng Zhou, Jianmin Wang, Chunyan Li, Zixu Wang, Yiping Liu, Siqi Sun, Jianxin Lin, Leyi Wei, Xibao Cai, Houtim Lai, Wei Liu, Longyue Wang, Yuansheng Liu, Xiangxiang Zeng
Format: Article
Language: English
Published: BMC 2025-04-01
Series: BMC Biology
Subjects:
Online Access: https://doi.org/10.1186/s12915-025-02200-3
collection DOAJ
description Abstract

Background: While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge.

Results: We introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the “teachers.” To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these “teachers,” enabling it to generate novel molecules that conform to the descriptions given in various text prompts. We show experimentally that TSMMG performs well in generating molecules that meet complex property requirements described in natural language across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratios of 82.58%, 68.03%, and 67.48%, respectively. The model also exhibits adaptability in zero-shot testing, creating molecules that satisfy combinations of properties it has not previously encountered. It can comprehend text inputs in various language styles, extending beyond the outlined prompts.

Conclusions: TSMMG presents an effective model for multi-constraint molecular generation using natural language. This framework is not only applicable to drug discovery but also serves as a reference for other related fields.
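The teacher-student data construction and the success-ratio metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two "teachers" below (`ring_teacher`, `halogen_teacher`), the helper names `build_pairs` and `success_ratio`, and the sentence template are all hypothetical, and the teachers are crude string heuristics on SMILES standing in for real property predictors or cheminformatics tools.

```python
from typing import Callable, List, Tuple

# Hypothetical "teachers": each maps a SMILES string to a short text fact.
# These are toy string heuristics (assumption), not real chemistry tools.
def ring_teacher(smiles: str) -> str:
    # Crude check: ring-closure digits usually indicate a ring in SMILES.
    has_ring = any(ch.isdigit() for ch in smiles)
    return "contains a ring" if has_ring else "contains no ring"

def halogen_teacher(smiles: str) -> str:
    # Crude check: looks for halogen element symbols as substrings.
    has_halogen = any(tok in smiles for tok in ("F", "Cl", "Br", "I"))
    return "contains a halogen" if has_halogen else "contains no halogen"

TEACHERS: List[Callable[[str], str]] = [ring_teacher, halogen_teacher]

def build_pairs(molecules: List[str]) -> List[Tuple[str, str]]:
    """Ask every teacher about each molecule and join the facts into one
    text description, giving (text, molecule) training pairs."""
    pairs = []
    for smi in molecules:
        facts = [teacher(smi) for teacher in TEACHERS]
        text = "The molecule " + " and ".join(facts) + "."
        pairs.append((text, smi))
    return pairs

def success_ratio(generated: List[str],
                  constraints: List[Callable[[str], bool]]) -> float:
    """Fraction of generated molecules that satisfy ALL constraints."""
    if not generated:
        return 0.0
    hits = sum(all(check(smi) for check in constraints) for smi in generated)
    return hits / len(generated)

if __name__ == "__main__":
    for text, smi in build_pairs(["c1ccccc1", "CCCl"]):
        print(smi, "->", text)
    # Two-constraint task: a ring AND a halogen must both be present.
    wanted = [
        lambda s: ring_teacher(s) == "contains a ring",
        lambda s: halogen_teacher(s) == "contains a halogen",
    ]
    print(success_ratio(["c1ccccc1", "CCCl", "Clc1ccccc1"], wanted))
```

In this sketch a molecule counts as a success only if every constraint holds at once, which is one plausible reading of why the reported success ratio falls as the number of constraints grows.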
id doaj-art-59baed04f1c94afc977fdcb61edddcf2
institution OA Journals
issn 1741-7007
Citation: BMC Biology, vol. 23, no. 1, pp. 1-17, 2025-04-01. ISSN 1741-7007. https://doi.org/10.1186/s12915-025-02200-3
Author affiliations:
Peng Zhou: College of Information Science and Engineering, Hunan University
Jianmin Wang: The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University
Chunyan Li: School of Informatics, Yunnan Normal University
Zixu Wang: Department of Computer Science, University of Tsukuba
Yiping Liu: College of Information Science and Engineering, Hunan University
Siqi Sun: Research Institute of Intelligent Complex Systems, Fudan University
Jianxin Lin: College of Information Science and Engineering, Hunan University
Leyi Wei: Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University
Xibao Cai: College of Information Science and Engineering, Hunan University
Houtim Lai: AI for Life Sciences Lab, Tencent
Wei Liu: AI for Life Sciences Lab, Tencent
Longyue Wang: Alibaba International Digital Commerce
Yuansheng Liu: College of Information Science and Engineering, Hunan University
Xiangxiang Zeng: College of Information Science and Engineering, Hunan University
topic Molecular generation
Large language model
Multi-constraint