Investigating the Relationship Between Text Vectorization Cosine Similarity and Classification Performance
In recent years, optimizing classification pipelines has become increasingly critical due to the growing volume of textual data and the computational challenges associated with exhaustive hyperparameter tuning. This paper proposes a similarity-based approach for selecting the most promising vectoriz...
| Main Authors: | Fernando Rezende Zagatti, Gilson Yuuji Shimizu, Daniel Lucredio, Helena de Medeiros Caseli |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Natural language processing; cosine similarity; vectorization; hyperparameter tuning |
| Online Access: | https://ieeexplore.ieee.org/document/11108167/ |
| _version_ | 1849390201529434112 |
|---|---|
| author | Fernando Rezende Zagatti; Gilson Yuuji Shimizu; Daniel Lucredio; Helena de Medeiros Caseli |
| author_sort | Fernando Rezende Zagatti |
| collection | DOAJ |
| description | In recent years, optimizing classification pipelines has become increasingly critical because of the growing volume of textual data and the computational cost of exhaustive hyperparameter tuning. This paper proposes a similarity-based approach for selecting the most promising vectorization configurations, specifically for Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec, by analyzing the average cosine similarity of the generated vectors. The method preselects configurations that yield more diverse textual representations, on the hypothesis that increased diversity in text representations enhances the discriminative capacity of classification models. Experimental evaluations on five datasets show that the similarity-based approach achieves accuracy and F1-score results very close to those obtained via exhaustive search, with notable reductions in processing time. Correlation analyses further reveal a strong inverse relationship between vector similarity and model performance for BoW and TF-IDF, and a moderate relationship for Word2Vec. These findings support the proposed method as a practical alternative to exhaustive hyperparameter selection in vectorization pipelines, offering significant benefits for applications where exhaustive exploration is infeasible. |
| format | Article |
| id | doaj-art-30e6947c6f774cc7b2df1b620ca726f5 |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-30e6947c6f774cc7b2df1b620ca726f5; 2025-08-20T03:41:44Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13, pp. 137348-137363; 10.1109/ACCESS.2025.3595423; 11108167; Investigating the Relationship Between Text Vectorization Cosine Similarity and Classification Performance; Fernando Rezende Zagatti (https://orcid.org/0000-0002-7083-5789), Department of Computing, Federal University of São Carlos, São Carlos, São Paulo, Brazil; Gilson Yuuji Shimizu (https://orcid.org/0000-0003-3711-5592), DIMEC, Center for Information Technology Renato Archer, Campinas, São Paulo, Brazil; Daniel Lucredio (https://orcid.org/0000-0002-1360-4036), Department of Computing, Federal University of São Carlos, São Carlos, São Paulo, Brazil; Helena de Medeiros Caseli (https://orcid.org/0000-0003-3996-8599), Department of Computing, Federal University of São Carlos, São Carlos, São Paulo, Brazil; https://ieeexplore.ieee.org/document/11108167/; Natural language processing; cosine similarity; vectorization; hyperparameter tuning |
| title | Investigating the Relationship Between Text Vectorization Cosine Similarity and Classification Performance |
| topic | Natural language processing; cosine similarity; vectorization; hyperparameter tuning |
| url | https://ieeexplore.ieee.org/document/11108167/ |
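
The selection idea summarized in the description (rank vectorization configurations by the average cosine similarity of the vectors they produce, and prefer the more diverse ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`bow_vectors`, `cosine`, `average_similarity`, `rank_configs`), the whitespace tokenizer, and the two parameters `ngram` and `min_df` are all assumptions made for illustration, and only a count-based bag of n-grams is covered, not TF-IDF or Word2Vec.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bow_vectors(docs, ngram=1, min_df=1):
    """Count-based bag-of-n-grams vectors as sparse Counters (hypothetical helper)."""
    tokenized = []
    for doc in docs:
        words = doc.lower().split()  # assumption: naive whitespace tokenizer
        grams = [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
        tokenized.append(grams)
    # Document frequency: number of documents in which each n-gram appears.
    df = Counter(g for grams in tokenized for g in set(grams))
    return [Counter(g for g in grams if df[g] >= min_df) for grams in tokenized]


def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_similarity(vectors):
    """Mean cosine similarity over all unordered pairs of documents."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def rank_configs(docs, configs):
    """Order configurations from lowest to highest average similarity."""
    scored = [(average_similarity(bow_vectors(docs, **cfg)), cfg) for cfg in configs]
    return sorted(scored, key=lambda item: item[0])


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "stock prices rose sharply today",
        "the market fell today",
    ]
    configs = [{"ngram": 1, "min_df": 1}, {"ngram": 2, "min_df": 1}]
    for score, cfg in rank_configs(docs, configs):
        print(f"{cfg}: average cosine similarity = {score:.3f}")
```

Under the hypothesis stated in the description, the configurations at the top of this ranking (lowest average similarity) would be the candidates to pass on to classifier training, which avoids training a model for every configuration. Whether a low-similarity configuration actually performs well on a given dataset still has to be checked empirically.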