Small, open-source text-embedding models as substitutes to OpenAI models for gene analysis
While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI's text-embedding function to encode background information, whic...
| Main Authors: | Dailin Gan, Jun Li |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier 2025-01-01 |
| Series: | Computational and Structural Biotechnology Journal |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037025003137 |
| _version_ | 1849389594979598336 |
|---|---|
| author | Dailin Gan; Jun Li |
| author_facet | Dailin Gan; Jun Li |
| author_sort | Dailin Gan |
| collection | DOAJ |
| description | While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI's text-embedding function to encode background information, which is in textual form, about genes. However, the closed-source, online nature of OpenAI's text-embedding service raises concerns regarding data privacy, among other issues. In this paper, we explore the possibility of replacing OpenAI's models with open-source transformer-based text-embedding models. We identified ten models from Hugging Face that are small in size, easy to install, and light in computation. Across all four gene classification tasks we considered, some of these models have outperformed OpenAI's, demonstrating their potential as viable, or even superior, alternatives. Additionally, we find that fine-tuning these models often does not lead to significant improvements in performance. |
| format | Article |
| id | doaj-art-ae33e790a02f489a836bb8c2bab63d69 |
| institution | Kabale University |
| issn | 2001-0370 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computational and Structural Biotechnology Journal |
| spelling | doaj-art-ae33e790a02f489a836bb8c2bab63d69; 2025-08-20T03:41:54Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2025-01-01; 27; 3598; 3608; 10.1016/j.csbj.2025.07.053; Small, open-source text-embedding models as substitutes to OpenAI models for gene analysis; Dailin Gan 0; Jun Li 1; Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA; Corresponding author.; Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA; While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI's text-embedding function to encode background information, which is in textual form, about genes. However, the closed-source, online nature of OpenAI's text-embedding service raises concerns regarding data privacy, among other issues. In this paper, we explore the possibility of replacing OpenAI's models with open-source transformer-based text-embedding models. We identified ten models from Hugging Face that are small in size, easy to install, and light in computation. Across all four gene classification tasks we considered, some of these models have outperformed OpenAI's, demonstrating their potential as viable, or even superior, alternatives. Additionally, we find that fine-tuning these models often does not lead to significant improvements in performance.; http://www.sciencedirect.com/science/article/pii/S2001037025003137 |
| title | Small, open-source text-embedding models as substitutes to OpenAI models for gene analysis |
| url | http://www.sciencedirect.com/science/article/pii/S2001037025003137 |