Spot the bot: the inverse problems of NLP

Bibliographic Details
Main Authors: Vasilii A. Gromov, Quynh Nhu Dang, Alexandra S. Kogan, Assel Yerbolova
Format: Article
Language: English
Published: PeerJ Inc. 2024-12-01
Series: PeerJ Computer Science
Subjects: Bot detection, NLP, Inverse problems, Clustering, Strange attractors
Online Access: https://peerj.com/articles/cs-2550.pdf
collection DOAJ
description This article concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, in which the focus falls on one type of bot only, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the language semantic space. To construct the training and test datasets, we propose to separate not the texts of bots, but the bots themselves, so that the test sample contains the texts of those bots (and people) that were not in the training sample. We aim to find efficient and versatile features, rather than a complex classification model architecture that deals only with a particular type of bot. In the study, we derive features for human-written and bot-generated texts, using clustering (Wishart and K-Means, as well as fuzzy variations) and nonlinear dynamic techniques (entropy-complexity measures). We then deliberately use the simplest of classifiers (support vector machine, decision tree, random forest) and the derived characteristics to identify whether the text is human-written or not. The large-scale simulation shows good classification results (a classification quality of over 96%), although the results vary for languages of different language families.
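The description above proposes splitting the data by source (bot or person) rather than by individual text, so that the test sample contains only sources unseen during training. A minimal sketch of that idea follows. It is not the authors' code: the function name `split_by_source`, the `(source_id, text)` pair layout, and the `test_fraction` and `seed` parameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_source(samples, test_fraction=0.25, seed=0):
    """Split (source_id, text) pairs so that no source lands in both parts.

    Hypothetical helper: `samples` is any iterable of (source_id, text)
    pairs, where source_id names a bot or a human author.
    """
    # Group texts under the bot or person that produced them.
    by_source = defaultdict(list)
    for source_id, text in samples:
        by_source[source_id].append(text)

    # Shuffle the sources (not the texts) reproducibly.
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    # Hold out whole sources; always keep at least one for testing.
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = sources[:n_test]
    train_sources = sources[n_test:]

    train = [(s, t) for s in train_sources for t in by_source[s]]
    test = [(s, t) for s in test_sources for t in by_source[s]]
    return train, test


if __name__ == "__main__":
    # Made-up toy data: 4 bots and 4 humans, 3 texts each.
    data = [(f"bot_{i}", f"bot text {i}-{j}") for i in range(4) for j in range(3)]
    data += [(f"human_{i}", f"human text {i}-{j}") for i in range(4) for j in range(3)]
    train, test = split_by_source(data, test_fraction=0.25, seed=1)
    print(len(train), "training texts,", len(test), "test texts")
```

Shuffling sources instead of texts is what keeps a classifier from scoring well merely by recognising the style of a generator it has already seen. This sketch does not balance bots against humans within each part; a stratified variant would split the two groups separately.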
format Article
id doaj-art-e6dd9bfb28444520ae8ba76458d13ac8
institution OA Journals
issn 2376-5992
language English
publishDate 2024-12-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-e6dd9bfb28444520ae8ba76458d13ac8
Record updated: 2025-08-20T01:58:59Z
Language: eng
Publisher: PeerJ Inc.
Journal: PeerJ Computer Science
ISSN: 2376-5992
Published: 2024-12-01
Volume: 10
Article number: e2550
DOI: 10.7717/peerj-cs.2550
Title: Spot the bot: the inverse problems of NLP
Authors: Vasilii A. Gromov (HSE University, Moscow, Russia); Quynh Nhu Dang (HSE University, Moscow, Russia); Alexandra S. Kogan (HSE University, Moscow, Russia); Assel Yerbolova (HSE University, Moscow, Russia)
title Spot the bot: the inverse problems of NLP
topic Bot detection
NLP
Inverse problems
Clustering
Strange attractors
url https://peerj.com/articles/cs-2550.pdf