Positional embeddings and zero-shot learning using BERT for molecular-property prediction

Abstract Recently, advancements in cheminformatics such as representation learning for chemical structures, deep learning (DL) for property prediction, data-driven discovery, and optimization of chemical data handling, have led to increased demands for handling chemical simplified molecular input li...

Bibliographic Details
Main Authors:	Medard Edmund Mswahili, JunHa Hwang, Jagath C. Rajapakse, Kyuri Jo, Young-Seob Jeong
Format:	Article
Language:	English
Published:	BMC 2025-02-01
Series:	Journal of Cheminformatics
Subjects:	Transformers; BERT; Positional embedding/encoding; Zero-shot learning; Molecular-property prediction; SMILES
Online Access:	https://doi.org/10.1186/s13321-025-00959-9

collection	DOAJ
description	Abstract: Recent advances in cheminformatics, such as representation learning for chemical structures, deep learning (DL) for property prediction, data-driven discovery, and optimization of chemical data handling, have increased the demand for handling chemical simplified molecular input line entry system (SMILES) data, particularly in text analysis tasks. These advances have driven the need to optimize components such as positional encoding and positional embeddings (PEs) in transformer models to better capture the sequential and contextual information embedded in molecular representations. SMILES data represent complex relationships among atoms or elements, rendering them critical for various learning tasks within the field of cheminformatics. This study addresses the critical challenge of encoding complex relationships among atoms in SMILES strings by exploring various PEs within a transformer-based framework to increase the accuracy and generalization of molecular-property predictions. The success of transformer-based models, such as bidirectional encoder representations from transformers (BERT), in natural language processing tasks has sparked growing interest in the cheminformatics domain. However, the performance of these models during pretraining and fine-tuning is significantly influenced by positional information such as PEs, which help in understanding the intricate relationships within sequences. Integrating position information within transformer architectures has emerged as a promising approach. This encoding mechanism provides essential supervision for modeling dependencies among elements situated at different positions within a given sequence. In this study, we first conduct pretraining experiments using various PEs to explore diverse methodologies for incorporating positional information into the BERT model for chemical text analysis using SMILES strings. Next, for each PE, we fine-tune the best-performing BERT (masked language modeling) model on downstream tasks for molecular-property prediction. Here, we use two molecular representations, SMILES and DeepSMILES, to comprehensively assess the potential and limitations of the PEs in zero-shot learning analysis, demonstrating the model’s proficiency in predicting properties of unseen molecular representations in the context of newly proposed and existing datasets.
Scientific contribution: This study explores the previously unexplored potential of PEs in the BERT model for molecular-property prediction. The study involved pretraining and fine-tuning the BERT model on various datasets related to COVID-19, bioassay data, and other molecular and biological properties using SMILES and DeepSMILES representations. The study details the pretraining architecture, the fine-tuning datasets, and the performance of the BERT model with different PEs. It also explores zero-shot learning analysis and the model’s performance on various classification and regression tasks. Newly proposed datasets from different domains were introduced during fine-tuning in addition to the existing, commonly used datasets. The study highlights the robustness of the BERT model in predicting chemical properties and its potential applications in cheminformatics and bioinformatics.
id	doaj-art-80f2673127b54cec9f3da71aacd9a3e5
institution	Kabale University
issn	1758-2946
spelling	Positional embeddings and zero-shot learning using BERT for molecular-property prediction. Journal of Cheminformatics (BMC), ISSN 1758-2946, vol. 17, no. 1, pp. 1-22, 2025-02-01. Language: eng. https://doi.org/10.1186/s13321-025-00959-9. Record doaj-art-80f2673127b54cec9f3da71aacd9a3e5, timestamp 2025-02-09T12:52:17Z.
Author affiliations: Medard Edmund Mswahili (Department of Computer Engineering, Chungbuk National University); JunHa Hwang (Department of Computer Engineering, Chungbuk National University); Jagath C. Rajapakse (School of Computer Science and Engineering, Nanyang Technological University); Kyuri Jo (Department of Computer Engineering, Chungbuk National University); Young-Seob Jeong (Department of Computer Engineering, Chungbuk National University).


Similar Items