Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges
The Bag-of-Words (BoW) representation, enhanced with a classifier, was a pioneering approach to solving text classification problems. However, with the advent of transformers and, in general, deep learning architectures, the field has dynamically shifted its focus towards customizing these architect...
| Main Authors: | Mario Graff, Daniela Moctezuma, Eric S. Téllez |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-06-01 |
| Series: | Natural Language Processing Journal |
| Subjects: | Text classification; Lexical and semantic Bag of Words; Stack generalization; Explainable models |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2949719125000305 |
| _version_ | 1850218996216889344 |
|---|---|
| author | Mario Graff Daniela Moctezuma Eric S. Téllez |
| author_facet | Mario Graff Daniela Moctezuma Eric S. Téllez |
| author_sort | Mario Graff |
| collection | DOAJ |
| description | The Bag-of-Words (BoW) representation, enhanced with a classifier, was a pioneering approach to solving text classification problems. However, with the advent of transformers and, in general, deep learning architectures, the field has dynamically shifted its focus towards customizing these architectures for various natural language processing tasks, including text classification problems. For a newcomer, it might be impossible to realize that for some text classification problems, the traditional approach is still competitive. This research analyzes the competitiveness of BoW-based representations in different text-classification competitions run in English, Spanish, and Italian. To analyze the performance of these BoW-based representations, we participated in 12 text classification international competitions, summing up 24 tasks comprising five English tasks, seven in Italian, and twelve in Spanish. The results show that the proposed BoW representations have a difference of just 10% w.r.t. the competition winner and less than 2% in three tasks corresponding to author profiling. BoW outperforms BERT solutions and dominates in author profiling tasks. |
| format | Article |
| id | doaj-art-28af9c34abd54ef0bbd067bbe2ce6170 |
| institution | OA Journals |
| issn | 2949-7191 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Natural Language Processing Journal |
| spelling | doaj-art-28af9c34abd54ef0bbd067bbe2ce6170 2025-08-20T02:07:31Z eng Elsevier Natural Language Processing Journal 2949-7191 2025-06-01 11 100154 10.1016/j.nlp.2025.100154 Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges Mario Graff [0], Daniela Moctezuma [1], Eric S. Téllez [2] [0] Consejo Nacional de Humanidades, Ciencia y Tecnología (CONAHCYT), Ciudad de México, Mexico; INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Aguascalientes, Mexico [1] CentroGEO Centro de Investigación en Ciencias de Información Geoespacial, Aguascalientes, Mexico [2] Consejo Nacional de Humanidades, Ciencia y Tecnología (CONAHCYT), Ciudad de México, Mexico; INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Aguascalientes, Mexico; Correspondence to: Circuito Tecnopolo Sur No. 112, Fraccionamiento Tecnopolo Pocitos, C.P. 20313, Aguascalientes, Ags., Mexico. The Bag-of-Words (BoW) representation, enhanced with a classifier, was a pioneering approach to solving text classification problems. However, with the advent of transformers and, in general, deep learning architectures, the field has dynamically shifted its focus towards customizing these architectures for various natural language processing tasks, including text classification problems. For a newcomer, it might be impossible to realize that for some text classification problems, the traditional approach is still competitive. This research analyzes the competitiveness of BoW-based representations in different text-classification competitions run in English, Spanish, and Italian. To analyze the performance of these BoW-based representations, we participated in 12 text classification international competitions, summing up 24 tasks comprising five English tasks, seven in Italian, and twelve in Spanish. The results show that the proposed BoW representations have a difference of just 10% w.r.t. the competition winner and less than 2% in three tasks corresponding to author profiling. BoW outperforms BERT solutions and dominates in author profiling tasks. http://www.sciencedirect.com/science/article/pii/S2949719125000305 Text classification Lexical and semantic Bag of Words Stack generalization Explainable models |
| spellingShingle | Mario Graff Daniela Moctezuma Eric S. Téllez Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges Natural Language Processing Journal Text classification Lexical and semantic Bag of Words Stack generalization Explainable models |
| title | Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges |
| title_full | Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges |
| title_fullStr | Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges |
| title_full_unstemmed | Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges |
| title_short | Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges |
| title_sort | bag of word approach is not dead a performance analysis on a myriad of text classification challenges |
| topic | Text classification Lexical and semantic Bag of Words Stack generalization Explainable models |
| url | http://www.sciencedirect.com/science/article/pii/S2949719125000305 |
| work_keys_str_mv | AT mariograff bagofwordapproachisnotdeadaperformanceanalysisonamyriadoftextclassificationchallenges AT danielamoctezuma bagofwordapproachisnotdeadaperformanceanalysisonamyriadoftextclassificationchallenges AT ericstellez bagofwordapproachisnotdeadaperformanceanalysisonamyriadoftextclassificationchallenges |
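
The description above refers to a Bag-of-Words representation paired with a classifier. As a rough illustration of that general idea, here is a minimal Python sketch using only the standard library. It is not the authors' system: this record does not detail their pipeline, and the multinomial Naive Bayes classifier, the names `tokenize`, `bag_of_words`, and `NaiveBayesBoW`, and the toy training texts are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text):
    """Map a text to token -> count; word order is discarded."""
    return Counter(tokenize(text))


class NaiveBayesBoW:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing).

    Hypothetical stand-in classifier, not the one used in the article.
    """

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.token_counts[label].update(bag_of_words(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        bow = bag_of_words(text)
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            counts = self.token_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(n / n_docs)           # log prior
            for token, freq in bow.items():
                if token in self.vocab:            # ignore unseen tokens
                    score += freq * math.log((counts[token] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this example only.
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful plot boring acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayesBoW().fit(train_texts, train_labels)
prediction = model.predict("loved the wonderful plot")
```

On this toy data `prediction` is `"pos"`, because "loved" and "wonderful" occur only in the positive training texts. A real experiment would use a held-out test set and, typically, weighted features such as TF-IDF rather than raw counts.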