Small, open-source text-embedding models as substitutes to OpenAI models for gene analysis

Bibliographic Details
Main Authors: Dailin Gan, Jun Li
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037025003137
author Dailin Gan
Jun Li
collection DOAJ
description While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI's text-embedding function to encode background information, which is in textual form, about genes. However, the closed-source, online nature of OpenAI's text-embedding service raises concerns regarding data privacy, among other issues. In this paper, we explore the possibility of replacing OpenAI's models with open-source transformer-based text-embedding models. We identified ten models from Hugging Face that are small in size, easy to install, and light in computation. Across all four gene classification tasks we considered, some of these models have outperformed OpenAI's, demonstrating their potential as viable, or even superior, alternatives. Additionally, we find that fine-tuning these models often does not lead to significant improvements in performance.
format Article
id doaj-art-ae33e790a02f489a836bb8c2bab63d69
institution Kabale University
issn 2001-0370
spelling doaj-art-ae33e790a02f489a836bb8c2bab63d69; 2025-08-20T03:41:54Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2025-01-01; vol. 27, pp. 3598-3608; DOI 10.1016/j.csbj.2025.07.053
Dailin Gan; Jun Li (corresponding author); both at the Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA
title Small, open-source text-embedding models as substitutes to OpenAI models for gene analysis