Contrastive learning and mixture of experts enables precise vector embeddings in biological databases

Abstract: The advancement of transformer neural networks has significantly enhanced the performance of sentence similarity models. However, these models often struggle with highly discriminative tasks and generate sub-optimal representations of complex documents such as peer-reviewed scientific literature. With the increased reliance on retrieval augmentation and search, representing structurally and thematically-varied research documents as concise and descriptive vectors is crucial. This study improves upon the vector embeddings of scientific text by assembling domain-specific datasets using co-citations as a similarity metric, focusing on biomedical domains. We introduce a novel Mixture of Experts (MoE) extension pipeline applied to pretrained BERT models, where every multi-layer perceptron section is copied into distinct experts. Our MoE variants are trained to classify whether two publications are cited together (co-cited) in a third paper based on their scientific abstracts across multiple biological domains. Notably, because of our unique routing scheme based on special tokens, the throughput of our extended MoE system is exactly the same as regular transformers. This holds promise for versatile and efficient One-Size-Fits-All transformer networks for encoding heterogeneous biomedical inputs. Our methodology marks advancements in representation learning and holds promise for enhancing vector database search and compilation.
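The MoE extension described in the abstract (each pretrained feed-forward block cloned into several experts, with a whole sequence routed through one expert chosen by a domain-specific special token) can be sketched roughly as follows. This is an illustrative plain-PyTorch sketch with invented class and variable names, not the authors' released code; the co-citation classification objective that would sit on top of the pooled encoder outputs is omitted.

# Illustrative sketch only: clone a pretrained dense MLP into several identical
# experts and route every token of a sequence through the single expert selected
# by a domain token, so per-sequence compute matches the original dense model.
import copy
import torch
import torch.nn as nn

class MoEFromPretrainedMLP(nn.Module):
    def __init__(self, pretrained_mlp: nn.Module, num_experts: int):
        super().__init__()
        # Each expert starts as an exact weight copy of the pretrained MLP.
        self.experts = nn.ModuleList(
            [copy.deepcopy(pretrained_mlp) for _ in range(num_experts)]
        )

    def forward(self, hidden_states, expert_idx):
        # hidden_states: (batch, seq_len, d_model); expert_idx: (batch,) long tensor,
        # e.g. derived from a domain-specific special token prepended to each abstract.
        out = torch.empty_like(hidden_states)
        for e, expert in enumerate(self.experts):
            chosen = expert_idx == e
            if chosen.any():
                # One expert processes the whole sequence, so throughput is the
                # same as the original dense feed-forward block.
                out[chosen] = expert(hidden_states[chosen])
        return out

# Toy usage under the assumptions above. A two-layer MLP stands in for a BERT
# layer's intermediate/output feed-forward section.
d_model, num_experts = 768, 3
pretrained_mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
)
moe_block = MoEFromPretrainedMLP(pretrained_mlp, num_experts)
abstract_states = torch.randn(4, 128, d_model)   # 4 abstracts, 128 tokens each
domain_ids = torch.tensor([0, 2, 1, 0])          # one expert choice per abstract
mixed = moe_block(abstract_states, domain_ids)   # shape: (4, 128, 768)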

Bibliographic Details
Main Authors: Logan Hallee (Center for Bioinformatics and Computational Biology, University of Delaware), Rohan Kapur (Lincoln Laboratory, Massachusetts Institute of Technology), Arjun Patel (The College of the University of Chicago), Jason P. Gleghorn (Department of Biomedical Engineering, University of Delaware), Bohdan B. Khomtchouk (Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University)
Format: Article
Language: English
Published: Nature Portfolio, 2025-04-01
Series: Scientific Reports
ISSN: 2045-2322
Subjects: Natural language processing; Biomedical literature; Biological databases; Machine learning
Collection: DOAJ
Record ID: doaj-art-83e55b5fd4a343ffa6d4e09b05309403
Institution: Kabale University
Online Access: https://doi.org/10.1038/s41598-025-98185-8