Investigating the Relationship Between Text Vectorization Cosine Similarity and Classification Performance

In recent years, optimizing classification pipelines has become increasingly critical due to the growing volume of textual data and the computational challenges associated with exhaustive hyperparameter tuning. This paper proposes a similarity-based approach for selecting the most promising vectoriz...

Bibliographic Details
Main Authors: Fernando Rezende Zagatti, Gilson Yuuji Shimizu, Daniel Lucredio, Helena de Medeiros Caseli
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11108167/
collection DOAJ
description In recent years, optimizing classification pipelines has become increasingly critical because of the growing volume of textual data and the computational cost of exhaustive hyperparameter tuning. This paper proposes a similarity-based approach for selecting the most promising vectorization configurations for Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec by analyzing the average cosine similarity of the generated vectors. The method preselects configurations that yield more diverse textual representations, on the hypothesis that greater diversity in text representations enhances the discriminative capacity of classification models. Experimental evaluations on five datasets show that the similarity-based approach achieves accuracy and F1-Score results very close to those obtained via exhaustive search, with notable reductions in processing time. Correlation analyses further reveal a strong inverse relationship between vector similarity and model performance for BoW and TF-IDF, and a moderate relationship for Word2Vec. These findings support the proposed method as a practical alternative to exhaustive hyperparameter search in vectorization pipelines, offering significant benefits for applications where exhaustive exploration is infeasible.
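The selection idea in the description can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pure-Python Bag of Words with the word n-gram size as the only hyperparameter, and the function names (`bow_vector`, `cosine`, `average_similarity`, `rank_configurations`) are hypothetical. It computes the average pairwise cosine similarity for each configuration and ranks the configurations from least to most similar, so the most diverse representation comes first.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bow_vector(text, ngram):
    """Count word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    return Counter(grams)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_similarity(texts, ngram):
    """Mean cosine similarity over all document pairs (needs at least two texts)."""
    vectors = [bow_vector(t, ngram) for t in texts]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def rank_configurations(texts, ngrams=(1, 2)):
    """Return (ngram, average similarity) pairs, lowest similarity first."""
    scores = {n: average_similarity(texts, n) for n in ngrams}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "a bird flew over the house",
    ]
    for ngram, score in rank_configurations(corpus):
        print(f"ngram={ngram} average cosine similarity={score:.3f}")
```

Under the hypothesis stated in the description, the configurations at the top of this ranking would be the candidates passed on to classifier training, instead of training on every configuration.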
id doaj-art-30e6947c6f774cc7b2df1b620ca726f5
institution Kabale University
issn 2169-3536
spelling 2025-08-20T03:41:44Z
Volume 13, pages 137348-137363
DOI 10.1109/ACCESS.2025.3595423
IEEE document number 11108167
Fernando Rezende Zagatti (https://orcid.org/0000-0002-7083-5789), Department of Computing, Federal University of São Carlos, São Carlos, São Paulo, Brazil
Gilson Yuuji Shimizu (https://orcid.org/0000-0003-3711-5592), DIMEC, Center for Information Technology Renato Archer, Campinas, São Paulo, Brazil
Daniel Lucredio (https://orcid.org/0000-0002-1360-4036), Department of Computing, Federal University of São Carlos, São Carlos, São Paulo, Brazil
Helena de Medeiros Caseli (https://orcid.org/0000-0003-3996-8599), Department of Computing, Federal University of São Carlos, São Carlos, São Paulo, Brazil
topic Natural language processing
cosine similarity
vectorization
hyperparameter tuning