Integrated Technique of Natural Language Texts and Source Codes Authorship Verification in the Academic Environment

Bibliographic Details
Main Authors: Aleksandr Romanov, Anna Kurtukova, Anastasiia Fedotova, Alexander Shelupanov
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11059894/
Description
Summary: Text plagiarism in academic and educational environments becomes more relevant every year. The quality of research articles and student works is declining because students copy fragments of others’ works and use modern generative models to create text and source code. The article proposes an integrated technique for authorship verification of both natural language and programming language texts, based on a combination of statistical methods, machine learning, and deep neural networks. The technique addresses several related tasks: assessing text homogeneity, detecting plagiarism by solving closed-set authorship attribution problems, and identifying texts and fragments created by generative models. The experimental data include a multi-domain dataset of natural language texts consisting of research articles on natural and technical sciences, PhD dissertations, and artificially generated samples on related topics. To evaluate the technique on programming language texts, a multilingual program dataset was used, consisting of source code written by technical students as well as artificially generated program code. The experimental results demonstrate the effectiveness of the proposed technique for plagiarism detection and copyright protection in the educational process. The accuracy of identifying heterogeneous fragments in text or code is 93-94%, authorship attribution accuracy is 89-99% depending on the number of co-authors, and verification accuracy is 97.5-99.4%.
ISSN: 2169-3536