Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models
The task of classifying linguistic acceptability, exemplified by the CoLA (Corpus of Linguistic Acceptability) dataset, poses unique challenges for natural language processing (NLP) models. These challenges include distinguishing between subtle grammatical errors, understanding complex syntactic str...
Main Authors: | Abdessamad Benlahbib, Achraf Boumhidi, Anass Fahfouh, Hamza Alami |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2025-01-01 |
Series: | IEEE Open Journal of the Computer Society |
Subjects: | Large language models (LLMs); linguistic acceptability; natural language processing |
Online Access: | https://ieeexplore.ieee.org/document/10829978/ |
_version_ | 1832583267380887552 |
---|---|
author | Abdessamad Benlahbib; Achraf Boumhidi; Anass Fahfouh; Hamza Alami |
author_facet | Abdessamad Benlahbib; Achraf Boumhidi; Anass Fahfouh; Hamza Alami |
author_sort | Abdessamad Benlahbib |
collection | DOAJ |
description | The task of classifying linguistic acceptability, exemplified by the CoLA (Corpus of Linguistic Acceptability) dataset, poses unique challenges for natural language processing (NLP) models. These challenges include distinguishing between subtle grammatical errors, understanding complex syntactic structures, and detecting semantic inconsistencies, all of which make the task difficult even for human annotators. In this article, we compare a range of techniques, from traditional methods such as Part-of-Speech (POS) tagging and feature extraction methods like CountVectorizer with Term Frequency-Inverse Document Frequency (TF-IDF) and N-grams, to modern embeddings such as FastText and Embeddings from Language Models (ELMo), as well as deep learning architectures like transformers and Large Language Models (LLMs). Our experiments show a clear improvement in performance as models evolve from traditional to more advanced approaches. Notably, state-of-the-art (SOTA) results were obtained by fine-tuning GPT-4o with extensive hyperparameter tuning, including experimenting with various epochs and batch sizes. This comparative analysis provides valuable insights into the relative strengths of each technique for identifying morphological, syntactic, and semantic violations, highlighting the effectiveness of LLMs in these tasks. |
format | Article |
id | doaj-art-f0c226bc8b2b439ebd40eba10d19d977 |
institution | Kabale University |
issn | 2644-1268 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of the Computer Society |
spelling | doaj-art-f0c226bc8b2b439ebd40eba10d19d977; 2025-01-29T00:01:24Z; eng; IEEE; IEEE Open Journal of the Computer Society; ISSN 2644-1268; 2025-01-01; vol. 6, pp. 248-260; DOI 10.1109/OJCS.2025.3526712; document 10829978; Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models; Abdessamad Benlahbib (https://orcid.org/0000-0002-0039-7832), Achraf Boumhidi (https://orcid.org/0000-0002-7396-9651), Anass Fahfouh (https://orcid.org/0000-0001-5793-0725), Hamza Alami (https://orcid.org/0000-0001-6945-6098); affiliation of all four authors: Computer Science Department, LISAC Laboratory, Faculty of Sciences Dhar EL Mehraz (F.S.D.M), Sidi Mohamed Ben Abdellah University, Fez, Morocco; https://ieeexplore.ieee.org/document/10829978/; Large language models (LLMs), linguistic acceptability, natural language processing |
title | Comparative Analysis of Traditional and Modern NLP Techniques on the CoLA Dataset: From POS Tagging to Large Language Models |
title_sort | comparative analysis of traditional and modern nlp techniques on the cola dataset from pos tagging to large language models |
topic | Large language models (LLMs); linguistic acceptability; natural language processing |
url | https://ieeexplore.ieee.org/document/10829978/ |
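
The description above lists TF-IDF over N-grams among the traditional feature extraction methods compared in the article. As a rough illustration of what such features look like, here is a minimal pure-Python sketch. It is not the authors' pipeline: the two toy sentences are invented (they are not CoLA examples), the function names `word_ngrams` and `tfidf_vectors` are hypothetical, and the smoothed IDF formula is one common variant chosen for this sketch.

```python
import math
from collections import Counter


def word_ngrams(sentence, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    tokens = sentence.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(sentences, n_max=2):
    """Map each sentence to a dict of n-gram -> L2-normalized TF-IDF weight."""
    counts = [Counter(word_ngrams(s, n_max)) for s in sentences]
    n_docs = len(sentences)

    # Document frequency: in how many sentences each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    # Smoothed IDF (an assumption of this sketch): ln((1 + N) / (1 + df)) + 1.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0
           for g, df in doc_freq.items()}

    vectors = []
    for c in counts:
        raw = {g: tf * idf[g] for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({g: w / norm for g, w in raw.items()})
    return vectors


# Invented toy pair: same words, different order.
sentences = ["the cat sat on the mat", "the cat sat the on mat"]
vectors = tfidf_vectors(sentences)
```

Both toy sentences contain exactly the same words, so unigram features alone cannot tell them apart; only the bigrams (for example `on the` versus `the on`) differ. Sparse vectors like these would typically be passed to a separate classifier, which this sketch leaves out.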