Ensemble Transformer–Based Detection of Fake and AI–Generated News

Bibliographic Details
Main Authors: Md. Ishraquzzaman, Mohammed Ashraful Islam Chowdhury, Shahreen Rahman, Riasat Khan
Format: Article
Language: English
Published: Wiley 2025-01-01
Series: Applied Computational Intelligence and Soft Computing
Online Access: http://dx.doi.org/10.1155/acis/3268456
_version_ 1849716687326150656
author Md. Ishraquzzaman
Mohammed Ashraful Islam Chowdhury
Shahreen Rahman
Riasat Khan
author_sort Md. Ishraquzzaman
collection DOAJ
description The proliferation of fake and AI–generated online news content poses a significant threat to information integrity. This work applies natural language processing, machine learning, and deep learning algorithms to detect fake and AI–generated content. The dataset, compiled from multiple open-source datasets, comprises 43,000 real, 31,000 fake, and 80,000 AI–generated news articles and is augmented with an ensemble large language model (LLM). We combined three open-source LLMs (GPT-2, GPT-NEO, and Distil-GPT-2) into an ensemble LLM to generate new news titles, selecting the best outputs through majority voting for further dataset expansion. Preprocessing involved data cleaning, lowercasing, stop word removal, tokenization, and lemmatization. We applied six machine learning and five natural language processing models to this dataset. The two top-performing natural language processing models (RoBERTa and DeBERTa) were combined into an ensemble transformer model. Among the machine learning models, random forest performed best, with an accuracy of 92.49% and an F1 score of 92.60%. Among the natural language processing models, the ensemble transformer model performed best, with 96.65% accuracy and an F1 score of 96.66%. The proposed ensemble model is optimized with model pruning (reducing parameters from 265M to 210M and improving training time by 25%) and dynamic quantization (reducing model size by 50% while maintaining 95.68% accuracy), which improves scalability and efficiency while reducing computational overhead. A DistilBERT student model, trained with a balanced combination of feature- and logit-based distillation from a RoBERTa-base teacher network, achieved 96.17% accuracy. Attention-map visualizations are constructed for different news categories to improve the interpretability of the transformer–based ensemble news detection models. Finally, a website was developed that lets users identify whether news content is fake, real, or AI–generated. The dataset, including the AI–generated news articles, and the implementation scripts are available at https://github.com/ishraqisheree99/Combined-News-Dataset.git.
format Article
id doaj-art-7a1d22e4b47d4eb5b70ae60e7519c6b2
institution DOAJ
issn 1687-9732
language English
publishDate 2025-01-01
publisher Wiley
record_format Article
series Applied Computational Intelligence and Soft Computing
spelling doaj-art-7a1d22e4b47d4eb5b70ae60e7519c6b2; 2025-08-20T03:12:54Z; eng; Wiley; Applied Computational Intelligence and Soft Computing; 1687-9732; 2025-01-01; 2025; 10.1155/acis/3268456; Ensemble Transformer–Based Detection of Fake and AI–Generated News; Md. Ishraquzzaman (0), Mohammed Ashraful Islam Chowdhury (1), Shahreen Rahman (2), Riasat Khan (3); Electrical and Computer Engineering (listed for each of the four authors); http://dx.doi.org/10.1155/acis/3268456
title Ensemble Transformer–Based Detection of Fake and AI–Generated News
url http://dx.doi.org/10.1155/acis/3268456
work_keys_str_mv AT mdishraquzzaman ensembletransformerbaseddetectionoffakeandaigeneratednews
AT mohammedashrafulislamchowdhury ensembletransformerbaseddetectionoffakeandaigeneratednews
AT shahreenrahman ensembletransformerbaseddetectionoffakeandaigeneratednews
AT riasatkhan ensembletransformerbaseddetectionoffakeandaigeneratednews
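
The description states that RoBERTa and DeBERTa were combined into an ensemble transformer model but does not say how the two outputs are merged. The sketch below illustrates one common option, soft voting (a weighted average of each model's softmax probabilities), and is not taken from the article. The label order, the `weight_a` parameter, the function names, and the example logits are all hypothetical and exist only for illustration.

```python
import math

# Assumed label order; the record names the three classes but not their indices.
LABELS = ["real", "fake", "ai_generated"]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(logits_a, logits_b, weight_a=0.5):
    """Weighted average of two models' class probabilities (soft voting)."""
    probs_a = softmax(logits_a)
    probs_b = softmax(logits_b)
    return [weight_a * pa + (1.0 - weight_a) * pb
            for pa, pb in zip(probs_a, probs_b)]


def predict(logits_a, logits_b, weight_a=0.5):
    """Return the ensemble's predicted label and the averaged probabilities."""
    probs = soft_vote(logits_a, logits_b, weight_a)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Made-up logits standing in for the outputs of two fine-tuned classifiers.
roberta_logits = [2.0, 0.5, 0.1]
deberta_logits = [0.2, 0.4, 3.0]
label, probs = predict(roberta_logits, deberta_logits)
print(label, [round(p, 3) for p in probs])
```

With equal weights, the more confident model dominates the averaged distribution. Setting `weight_a=1.0` reduces the ensemble to the first model alone, which is a convenient sanity check when tuning the weight on a validation set.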