Spot the bot: the inverse problems of NLP

This article concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, in which the focus falls on one type of bot only, we consider the problem of distinguishing texts written by any person from those generated by any bot; this i...

Bibliographic Details
Main Authors:	Vasilii A. Gromov, Quynh Nhu Dang, Alexandra S. Kogan, Assel Yerbolova
Format:	Article
Language:	English
Published:	PeerJ Inc. 2024-12-01
Series:	PeerJ Computer Science
Subjects:	Bot detection; NLP; Inverse problems; Clustering; Strange attractors
Online Access:	https://peerj.com/articles/cs-2550.pdf

_version_	1850247283744964608
author	Vasilii A. Gromov; Quynh Nhu Dang; Alexandra S. Kogan; Assel Yerbolova
author_facet	Vasilii A. Gromov; Quynh Nhu Dang; Alexandra S. Kogan; Assel Yerbolova
author_sort	Vasilii A. Gromov
collection	DOAJ
description	This article concerns the problem of distinguishing human-written from bot-generated texts. In contrast to the classical problem formulation, in which the focus falls on one type of bot only, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the language semantic space. To construct the training and test datasets, we propose to separate not the texts of bots, but the bots themselves, so that the test sample contains the texts of those bots (and people) that were not in the training sample. We aim to find efficient and versatile features, rather than a complex classification model architecture that deals only with a particular type of bot. In this study, we derive features for human-written and bot-generated texts, using clustering (Wishart and K-Means, as well as fuzzy variations) and nonlinear dynamics techniques (entropy-complexity measures). We then deliberately use the simplest of classifiers (support vector machine, decision tree, random forest) and the derived characteristics to identify whether the text is human-written or not. A large-scale simulation shows good classification results (a classification quality of over 96%), although the results vary across languages of different language families.
format	Article
id	doaj-art-e6dd9bfb28444520ae8ba76458d13ac8
institution	OA Journals
issn	2376-5992
language	English
publishDate	2024-12-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ Computer Science
spelling	doaj-art-e6dd9bfb28444520ae8ba76458d13ac8 | 2025-08-20T01:58:59Z | eng | PeerJ Inc. | PeerJ Computer Science | 2376-5992 | 2024-12-01 | 10 | e2550 | 10.7717/peerj-cs.2550 | Spot the bot: the inverse problems of NLP | Vasilii A. Gromov (0); Quynh Nhu Dang (1); Alexandra S. Kogan (2); Assel Yerbolova (3) | (0) HSE University, Moscow, Russia; (1) HSE University, Moscow, Russia; (2) HSE University, Moscow, Russia; (3) HSE University, Moscow, Russia | https://peerj.com/articles/cs-2550.pdf | Bot detection; NLP; Inverse problems; Clustering; Strange attractors
title	Spot the bot: the inverse problems of NLP
topic	Bot detection; NLP; Inverse problems; Clustering; Strange attractors
url	https://peerj.com/articles/cs-2550.pdf


Similar Items