Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models

The task of classifying linguistic acceptability, exemplified by the CoLA (Corpus of Linguistic Acceptability) dataset, poses unique challenges for natural language processing (NLP) models. These challenges include distinguishing between subtle grammatical errors, understanding complex syntactic str...

Bibliographic Details
Main Authors:	Abdessamad Benlahbib, Achraf Boumhidi, Anass Fahfouh, Hamza Alami
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Open Journal of the Computer Society
Subjects:	Large language models (LLMs); linguistic acceptability; natural language processing
Online Access:	https://ieeexplore.ieee.org/document/10829978/

author	Abdessamad Benlahbib, Achraf Boumhidi, Anass Fahfouh, Hamza Alami
collection	DOAJ
description	The task of classifying linguistic acceptability, exemplified by the CoLA (Corpus of Linguistic Acceptability) dataset, poses unique challenges for natural language processing (NLP) models. These challenges include distinguishing between subtle grammatical errors, understanding complex syntactic structures, and detecting semantic inconsistencies, all of which make the task difficult even for human annotators. In this article, we compare a range of techniques, from traditional methods such as Part-of-Speech (POS) tagging and feature extraction methods like CountVectorizer with Term Frequency-Inverse Document Frequency (TF-IDF) and N-grams, to modern embeddings such as FastText and Embeddings from Language Models (ELMo), as well as deep learning architectures like transformers and Large Language Models (LLMs). Our experiments show a clear improvement in performance as models evolve from traditional to more advanced approaches. Notably, state-of-the-art (SOTA) results were obtained by fine-tuning GPT-4o with extensive hyperparameter tuning, including experimenting with various epochs and batch sizes. This comparative analysis provides valuable insights into the relative strengths of each technique for identifying morphological, syntactic, and semantic violations, highlighting the effectiveness of LLMs in these tasks.
format	Article
id	doaj-art-f0c226bc8b2b439ebd40eba10d19d977
institution	Kabale University
issn	2644-1268
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Open Journal of the Computer Society
spelling	doaj-art-f0c226bc8b2b439ebd40eba10d19d977; 2025-01-29T00:01:24Z; eng; IEEE; IEEE Open Journal of the Computer Society; ISSN 2644-1268; 2025-01-01; vol. 6, pp. 248-260; DOI 10.1109/OJCS.2025.3526712; document 10829978; Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models; Abdessamad Benlahbib (https://orcid.org/0000-0002-0039-7832), Achraf Boumhidi (https://orcid.org/0000-0002-7396-9651), Anass Fahfouh (https://orcid.org/0000-0001-5793-0725), Hamza Alami (https://orcid.org/0000-0001-6945-6098); affiliation of all four authors: Computer Science Department, LISAC Laboratory, Faculty of Sciences Dhar EL Mehraz (F.S.D.M), Sidi Mohamed Ben Abdellah University, Fez, Morocco
title	Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models
topic	Large language models (LLMs); linguistic acceptability; natural language processing
url	https://ieeexplore.ieee.org/document/10829978/


Similar Items