Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models

Bibliographic Details
Main Authors: Abdessamad Benlahbib, Achraf Boumhidi, Anass Fahfouh, Hamza Alami
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Open Journal of the Computer Society
Subjects: Large language models (LLMs); linguistic acceptability; natural language processing
Online Access: https://ieeexplore.ieee.org/document/10829978/
collection DOAJ
description The task of classifying linguistic acceptability, exemplified by the CoLA (Corpus of Linguistic Acceptability) dataset, poses unique challenges for natural language processing (NLP) models. These challenges include distinguishing between subtle grammatical errors, understanding complex syntactic structures, and detecting semantic inconsistencies, all of which make the task difficult even for human annotators. In this article, we compare a range of techniques, from traditional methods such as Part-of-Speech (POS) tagging and feature extraction methods like CountVectorizer with Term Frequency-Inverse Document Frequency (TF-IDF) and N-grams, to modern embeddings such as FastText and Embeddings from Language Models (ELMo), as well as deep learning architectures like transformers and Large Language Models (LLMs). Our experiments show a clear improvement in performance as models evolve from traditional to more advanced approaches. Notably, state-of-the-art (SOTA) results were obtained by fine-tuning GPT-4o with extensive hyperparameter tuning, including experimenting with various epochs and batch sizes. This comparative analysis provides valuable insights into the relative strengths of each technique for identifying morphological, syntactic, and semantic violations, highlighting the effectiveness of LLMs in these tasks.
id doaj-art-f0c226bc8b2b439ebd40eba10d19d977
institution Kabale University
issn 2644-1268
spelling doaj-art-f0c226bc8b2b439ebd40eba10d19d977 (indexed 2025-01-29T00:01:24Z)
Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models
IEEE Open Journal of the Computer Society (IEEE), ISSN 2644-1268, vol. 6, pp. 248-260, 2025-01-01
DOI: 10.1109/OJCS.2025.3526712 (IEEE document 10829978)
Language: eng
Authors:
Abdessamad Benlahbib (https://orcid.org/0000-0002-0039-7832)
Achraf Boumhidi (https://orcid.org/0000-0002-7396-9651)
Anass Fahfouh (https://orcid.org/0000-0001-5793-0725)
Hamza Alami (https://orcid.org/0000-0001-6945-6098)
Affiliation (all four authors): Computer Science Department, LISAC Laboratory, Faculty of Sciences Dhar EL Mehraz (F.S.D.M), Sidi Mohamed Ben Abdellah University, Fez, Morocco
Keywords: Large language models (LLMs); linguistic acceptability; natural language processing
https://ieeexplore.ieee.org/document/10829978/
topic Large language models (LLMs)
linguistic acceptability
natural language processing