Contrastive learning and mixture of experts enables precise vector embeddings in biological databases
Abstract: The advancement of transformer neural networks has significantly enhanced the performance of sentence similarity models. However, these models often struggle with highly discriminative tasks and generate sub-optimal representations of complex documents such as peer-reviewed scientific literature…
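The abstract describes two technical ideas: each pretrained feed-forward (MLP) block of a BERT encoder is copied into several experts, and a routing scheme based on special tokens sends each sequence to exactly one expert, so throughput matches the original dense model. The snippet below is a minimal sketch of that idea in PyTorch, not the authors' code; the class name `DomainRoutedMoEBlock`, the `n_experts` argument, and the per-sequence `expert_idx` derived from a domain special token are illustrative assumptions.

```python
# Hedged sketch (assumptions noted above): a dense encoder block whose MLP is
# copied into n_experts identical experts, with deterministic per-sequence routing
# chosen by a special domain token. One expert runs per sequence, so compute per
# token matches the dense model.
import copy
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Standard transformer MLP: Linear -> GELU -> Linear."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DomainRoutedMoEBlock(nn.Module):
    """Encoder block whose (pretrained) MLP is copied into `n_experts` experts.

    Routing is deterministic: a domain id read from a special token prepended to the
    input selects which expert processes the whole sequence (no learned top-k gating).
    """
    def __init__(self, d_model: int, n_heads: int, d_ff: int, n_experts: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        base_mlp = FeedForward(d_model, d_ff)  # stands in for the pretrained MLP
        # "Every multi-layer perceptron section is copied into distinct experts."
        self.experts = nn.ModuleList([copy.deepcopy(base_mlp) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); expert_idx: (batch,) domain id per sequence.
        h, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + h)
        out = torch.empty_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():                      # run only the expert each sequence routes to
                out[mask] = expert(x[mask])
        return self.norm2(x + out)


if __name__ == "__main__":
    block = DomainRoutedMoEBlock(d_model=64, n_heads=4, d_ff=256, n_experts=3)
    tokens = torch.randn(8, 16, 64)             # 8 abstracts, 16 tokens, 64-dim embeddings
    domains = torch.randint(0, 3, (8,))         # domain id taken from each sequence's special token
    print(block(tokens, domains).shape)         # torch.Size([8, 16, 64])
```

In the study's setup, pooled embeddings of two abstracts would then be scored by a head that predicts whether the two papers are co-cited in a third paper; the pairing and loss details are not shown here and would follow the article.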
| Main Authors: | Logan Hallee, Rohan Kapur, Arjun Patel, Jason P. Gleghorn, Bohdan B. Khomtchouk |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
| Subjects: | Natural language processing; Biomedical literature; Biological databases; Machine learning |
| Online Access: | https://doi.org/10.1038/s41598-025-98185-8 |
| _version_ | 1849314778172882944 |
|---|---|
| author | Logan Hallee; Rohan Kapur; Arjun Patel; Jason P. Gleghorn; Bohdan B. Khomtchouk |
| author_facet | Logan Hallee; Rohan Kapur; Arjun Patel; Jason P. Gleghorn; Bohdan B. Khomtchouk |
| author_sort | Logan Hallee |
| collection | DOAJ |
| description | Abstract The advancement of transformer neural networks has significantly enhanced the performance of sentence similarity models. However, these models often struggle with highly discriminative tasks and generate sub-optimal representations of complex documents such as peer-reviewed scientific literature. With the increased reliance on retrieval augmentation and search, representing structurally and thematically-varied research documents as concise and descriptive vectors is crucial. This study improves upon the vector embeddings of scientific text by assembling domain-specific datasets using co-citations as a similarity metric, focusing on biomedical domains. We introduce a novel Mixture of Experts (MoE) extension pipeline applied to pretrained BERT models, where every multi-layer perceptron section is copied into distinct experts. Our MoE variants are trained to classify whether two publications are cited together (co-cited) in a third paper based on their scientific abstracts across multiple biological domains. Notably, because of our unique routing scheme based on special tokens, the throughput of our extended MoE system is exactly the same as regular transformers. This holds promise for versatile and efficient One-Size-Fits-All transformer networks for encoding heterogeneous biomedical inputs. Our methodology marks advancements in representation learning and holds promise for enhancing vector database search and compilation. |
| format | Article |
| id | doaj-art-83e55b5fd4a343ffa6d4e09b05309403 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-83e55b5fd4a343ffa6d4e09b05309403; last indexed 2025-08-20T03:52:20Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; 15(1):1-12; 10.1038/s41598-025-98185-8; Contrastive learning and mixture of experts enables precise vector embeddings in biological databases; Logan Hallee (Center for Bioinformatics and Computational Biology, University of Delaware); Rohan Kapur (Lincoln Laboratory, Massachusetts Institute of Technology); Arjun Patel (The College of the University of Chicago); Jason P. Gleghorn (Department of Biomedical Engineering, University of Delaware); Bohdan B. Khomtchouk (Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University); https://doi.org/10.1038/s41598-025-98185-8; Natural language processing; Biomedical literature; Biological databases; Machine learning |
| spellingShingle | Logan Hallee; Rohan Kapur; Arjun Patel; Jason P. Gleghorn; Bohdan B. Khomtchouk; Contrastive learning and mixture of experts enables precise vector embeddings in biological databases; Scientific Reports; Natural language processing; Biomedical literature; Biological databases; Machine learning |
| title | Contrastive learning and mixture of experts enables precise vector embeddings in biological databases |
| title_full | Contrastive learning and mixture of experts enables precise vector embeddings in biological databases |
| title_fullStr | Contrastive learning and mixture of experts enables precise vector embeddings in biological databases |
| title_full_unstemmed | Contrastive learning and mixture of experts enables precise vector embeddings in biological databases |
| title_short | Contrastive learning and mixture of experts enables precise vector embeddings in biological databases |
| title_sort | contrastive learning and mixture of experts enables precise vector embeddings in biological databases |
| topic | Natural language processing; Biomedical literature; Biological databases; Machine learning |
| url | https://doi.org/10.1038/s41598-025-98185-8 |
| work_keys_str_mv | AT loganhallee contrastivelearningandmixtureofexpertsenablesprecisevectorembeddingsinbiologicaldatabases AT rohankapur contrastivelearningandmixtureofexpertsenablesprecisevectorembeddingsinbiologicaldatabases AT arjunpatel contrastivelearningandmixtureofexpertsenablesprecisevectorembeddingsinbiologicaldatabases AT jasonpgleghorn contrastivelearningandmixtureofexpertsenablesprecisevectorembeddingsinbiologicaldatabases AT bohdanbkhomtchouk contrastivelearningandmixtureofexpertsenablesprecisevectorembeddingsinbiologicaldatabases |