Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges

The Bag-of-Words (BoW) representation, enhanced with a classifier, was a pioneering approach to solving text classification problems. However, with the advent of transformers and, in general, deep learning architectures, the field has dynamically shifted its focus towards customizing these architect...

Bibliographic Details
Main Authors: Mario Graff, Daniela Moctezuma, Eric S. Téllez
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Natural Language Processing Journal
Subjects: Text classification; Lexical and semantic Bag of Words; Stack generalization; Explainable models
Online Access: http://www.sciencedirect.com/science/article/pii/S2949719125000305
_version_ 1850218996216889344
author Mario Graff
Daniela Moctezuma
Eric S. Téllez
collection DOAJ
description The Bag-of-Words (BoW) representation, enhanced with a classifier, was a pioneering approach to solving text classification problems. However, with the advent of transformers and, in general, deep learning architectures, the field has dynamically shifted its focus towards customizing these architectures for various natural language processing tasks, including text classification problems. For a newcomer, it might be impossible to realize that for some text classification problems, the traditional approach is still competitive. This research analyzes the competitiveness of BoW-based representations in different text-classification competitions run in English, Spanish, and Italian. To analyze the performance of these BoW-based representations, we participated in 12 text classification international competitions, summing up 24 tasks comprising five English tasks, seven in Italian, and twelve in Spanish. The results show that the proposed BoW representations have a difference of just 10% w.r.t. the competition winner and less than 2% in three tasks corresponding to author profiling. BoW outperforms BERT solutions and dominates in author profiling tasks.
format Article
id doaj-art-28af9c34abd54ef0bbd067bbe2ce6170
institution OA Journals
issn 2949-7191
language English
publishDate 2025-06-01
publisher Elsevier
record_format Article
series Natural Language Processing Journal
spelling doaj-art-28af9c34abd54ef0bbd067bbe2ce6170
2025-08-20T02:07:31Z
eng
Elsevier
Natural Language Processing Journal
2949-7191
2025-06-01
11
100154
10.1016/j.nlp.2025.100154
Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges
Mario Graff (0)
Daniela Moctezuma (1)
Eric S. Téllez (2)
(0) Consejo Nacional de Humanidades, Ciencia y Tecnología (CONAHCYT), Ciudad de México, Mexico; INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Aguascalientes, Mexico
(1) CentroGEO Centro de Investigación en Ciencias de Información Geoespacial, Aguascalientes, Mexico
(2) Consejo Nacional de Humanidades, Ciencia y Tecnología (CONAHCYT), Ciudad de México, Mexico; INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Aguascalientes, Mexico; Correspondence to: Circuito Tecnopolo Sur No. 112, Fraccionamiento Tecnopolo Pocitos, C.P. 20313, Aguascalientes, Ags., Mexico.
http://www.sciencedirect.com/science/article/pii/S2949719125000305
Text classification
Lexical and semantic Bag of Words
Stack generalization
Explainable models
title Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges
topic Text classification
Lexical and semantic Bag of Words
Stack generalization
Explainable models
url http://www.sciencedirect.com/science/article/pii/S2949719125000305