Spot the bot: the inverse problems of NLP
This article concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, in which the focus falls on one type of bot only, we consider the problem of distinguishing texts written by any person from those generated by any bot; this i...
| Main Authors: | Vasilii A. Gromov, Quynh Nhu Dang, Alexandra S. Kogan, Assel Yerbolova |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc., 2024-12-01 |
| Series: | PeerJ Computer Science |
| Subjects: | Bot detection, NLP, Inverse problems, Clustering, Strange attractors |
| Online Access: | https://peerj.com/articles/cs-2550.pdf |
| _version_ | 1850247283744964608 |
|---|---|
| author | Vasilii A. Gromov, Quynh Nhu Dang, Alexandra S. Kogan, Assel Yerbolova |
| author_sort | Vasilii A. Gromov |
| collection | DOAJ |
| description | This article concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, in which the focus falls on one type of bot only, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the language semantic space. To construct the training and test datasets, we propose to separate not the texts of bots, but the bots themselves, so that the test sample contains the texts of those bots (and people) that were not in the training sample. We aim to find efficient and versatile features, rather than a complex classification model architecture that only deals with a particular type of bot. In the study, we derive features for human-written and bot-generated texts, using clustering (Wishart and K-Means, as well as fuzzy variations) and nonlinear dynamic techniques (entropy-complexity measures). We then deliberately use the simplest of classifiers (support vector machine, decision tree, random forest) and the derived characteristics to identify whether the text is human-written or not. The large-scale simulation shows good classification results (a classification quality of over 96%), although the results vary for languages of different language families. |
| format | Article |
| id | doaj-art-e6dd9bfb28444520ae8ba76458d13ac8 |
| institution | OA Journals |
| issn | 2376-5992 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | PeerJ Inc. |
| record_format | Article |
| series | PeerJ Computer Science |
| spelling | doaj-art-e6dd9bfb28444520ae8ba76458d13ac8; 2025-08-20T01:58:59Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2024-12-01; 10; e2550; 10.7717/peerj-cs.2550; Spot the bot: the inverse problems of NLP; Vasilii A. Gromov (HSE University, Moscow, Russia); Quynh Nhu Dang (HSE University, Moscow, Russia); Alexandra S. Kogan (HSE University, Moscow, Russia); Assel Yerbolova (HSE University, Moscow, Russia); https://peerj.com/articles/cs-2550.pdf; Bot detection; NLP; Inverse problems; Clustering; Strange attractors |
| title | Spot the bot: the inverse problems of NLP |
| topic | Bot detection, NLP, Inverse problems, Clustering, Strange attractors |
| url | https://peerj.com/articles/cs-2550.pdf |