TriPlaNet: Enhancing machine-paraphrasing plagiarism detection through triplet network and contrastive learning
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-09-01 |
| Series: | Egyptian Informatics Journal |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1110866525001458 |
| Summary: | Powerful large language models (LLMs) can generate and paraphrase texts that are difficult for humans to distinguish from human-authored texts, raising concerns about their potential misuse. Previous studies on detecting LLM-paraphrased texts have either proposed ineffective solutions or failed to consider academic texts. To address these challenges, we propose a novel plagiarism detection framework called Triplet Plagiarism Network (TriPlaNet). The proposed framework combines three distinct Style Representation Transformers for Authorship (SRTA), each with its own set of parameters, with a few-shot classifier. The three SRTA encoders operate independently during contrastive training to capture nuanced variations in writing style. Our approach reframes plagiarism detection as an authorship attribution problem. To diversify the dataset, we fine-tune an 11B-parameter T5 XXL model with Low-Rank Adaptation on a large-scale plagiarism dataset (more than 200k samples) to construct a controlled plagiarizer, thereby producing a new additional dataset. TriPlaNet outperformed existing models on two datasets, achieving F1 scores of 99.37% and 99.48%, respectively. TriPlaNet also demonstrated robust performance in cross-dataset evaluations, with F1 scores remaining above 80.50% and 81.49% on the two datasets. |
| ISSN: | 1110-8665 |
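The contrastive training mentioned in the summary rests on a triplet margin loss: given style embeddings of an anchor text, a same-author (positive) text, and a different-author (negative) text, the loss pushes the positive closer to the anchor than the negative by at least a margin. A minimal sketch of that objective is below; the embedding vectors, dimensionality, and margin are illustrative assumptions, not the paper's actual settings.

```python
# Hypothetical sketch of a triplet margin loss over style embeddings.
# Vectors and margin are illustrative, not taken from the paper.
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # L = max(0, d(anchor, positive) - d(anchor, negative) + margin)
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor        = [0.9, 0.1, 0.0]    # style embedding of the suspicious text
positive      = [0.8, 0.2, 0.1]    # text by the same author
far_negative  = [0.1, 0.9, 0.5]    # clearly different writing style
near_negative = [0.85, 0.15, 0.05] # style close to the anchor (hard negative)

loss_easy = triplet_margin_loss(anchor, positive, far_negative)   # triplet already satisfied
loss_hard = triplet_margin_loss(anchor, positive, near_negative)  # positive loss drives separation
```

A zero loss on the easy triplet means the embeddings already satisfy the margin, so only hard triplets contribute gradient; this is why contrastive pipelines typically mine hard negatives during training.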
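The summary also mentions fine-tuning T5 XXL with Low-Rank Adaptation (LoRA). LoRA's core idea is to freeze a weight matrix W (d_out x d_in) and learn only a low-rank update B A with B of shape (d_out x r) and A of shape (r x d_in), so the trainable parameter count per matrix drops from d_out * d_in to r * (d_out + d_in). The arithmetic below illustrates that reduction; the dimensions and rank are illustrative assumptions, not T5 XXL's actual shapes or the paper's LoRA configuration.

```python
# Illustrative LoRA parameter-count arithmetic (dimensions are assumptions).
def lora_param_counts(d_out, d_in, r):
    full = d_out * d_in        # parameters updated by full fine-tuning of W
    lora = r * (d_out + d_in)  # parameters trained by the low-rank factors B and A
    return full, lora

# Example: a 4096x4096 projection with rank r = 8.
full, lora = lora_param_counts(4096, 4096, 8)
# 16,777,216 full parameters vs 65,536 LoRA parameters: a 256x reduction.
```

This reduction is what makes adapting an 11B-parameter model to a 200k-sample plagiarism dataset tractable on modest hardware.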