CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS

Bibliographic Details
Main Authors: Chunguang Li, Adisorn Sirikham, Jessada Konpang, Yan Wang
Format: Article
Language: English
Published: Regional Association for Security and Crisis Management, Belgrade, Serbia, 2024-12-01
Series: Operational Research in Engineering Sciences: Theory and Applications
Subjects: Code Clone Detection; Self-Supervised Learning; Graph Neural Network
Online Access: https://oresta.org/menu-script/index.php/oresta/article/view/845
_version_ 1849432308762804224
author Chunguang Li
Adisorn Sirikham
Jessada Konpang
Yan Wang
author_facet Chunguang Li
Adisorn Sirikham
Jessada Konpang
Yan Wang
author_sort Chunguang Li
collection DOAJ
description Code clone detection underpins a wide range of maintenance tasks, from automated refactoring to real-time plagiarism policing, yet single-view methods that rely on raw tokens, Abstract Syntax Trees, or Control Flow Graphs still struggle to detect type-4 (semantic) clones. We present DG Clone, a dual-graph self-supervised framework that couples a textual call-dependency graph with an AST-derived structural graph and fuses them through a lightweight cross-graph attention module implemented in PyTorch 2.2 and PyTorch Geometric 2.5. The textual graph excels at capturing lexical context, while the AST graph models hierarchical syntax. Their fusion recovers semantic equivalence that each view alone misses, outperforming token-sequence models (e.g., GraphCodeBERT) and single-graph GNNs. Training employs a graph-aware triplet loss that obviates manual labels by dynamically constructing positive/negative triplets from unlabelled repositories. DG Clone boosts F1 on BigCloneBench from 90.3% to 98.4% and on Google Code Jam from 81.7% to 89.8%. It lifts MAP by +6.8 pp over tree-based CNNs and +5.4 pp over GraphCodeBERT, while cutting inference latency in an online judge by 31% in a Python implementation. These findings demonstrate that integrating complementary graph views affords a label-efficient and practically viable route to uncovering subtle semantic similarities in source code.
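To make the description above concrete, the following is a minimal, illustrative sketch in Python, using PyTorch and PyTorch Geometric as named in the abstract, of a dual-graph encoder fused by cross-graph attention and trained with a triplet margin loss. It is not the authors' DG Clone implementation: the class and function names (DualGraphEncoder, toy_graph), the choice of GCN layers, and all hyperparameters are assumptions made for illustration.

    # Illustrative sketch only (not the published DG Clone code). Assumes each code
    # fragment is given as two torch_geometric Data graphs -- a textual call-dependency
    # view and an AST-derived view -- whose node features are already embedded vectors.
    import torch
    import torch.nn as nn
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv, global_mean_pool


    class DualGraphEncoder(nn.Module):
        """Encodes the textual and AST views with separate GNNs, then fuses them
        with a lightweight cross-graph attention layer (AST nodes attend over
        textual nodes) and pools to one embedding per code fragment."""

        def __init__(self, in_dim: int, hidden_dim: int = 128, heads: int = 4):
            super().__init__()
            self.text_gnn = GCNConv(in_dim, hidden_dim)   # call-dependency / textual graph
            self.ast_gnn = GCNConv(in_dim, hidden_dim)    # AST-derived structural graph
            self.cross_attn = nn.MultiheadAttention(hidden_dim, heads, batch_first=True)

        def forward(self, text_g: Data, ast_g: Data) -> torch.Tensor:
            h_text = self.text_gnn(text_g.x, text_g.edge_index).relu()
            h_ast = self.ast_gnn(ast_g.x, ast_g.edge_index).relu()
            # Treat each node set as a length-N sequence with batch size 1.
            q, kv = h_ast.unsqueeze(0), h_text.unsqueeze(0)
            fused, _ = self.cross_attn(q, kv, kv)         # AST queries, textual keys/values
            fused = fused.squeeze(0) + h_ast              # residual connection
            batch = torch.zeros(fused.size(0), dtype=torch.long)
            return global_mean_pool(fused, batch)         # [1, hidden_dim] fragment embedding


    def toy_graph(num_nodes: int = 5, in_dim: int = 16) -> Data:
        """Chain graph with random node features, only to demonstrate shapes."""
        edge_index = torch.tensor([list(range(num_nodes - 1)),
                                   list(range(1, num_nodes))])
        return Data(x=torch.randn(num_nodes, in_dim), edge_index=edge_index)


    # Self-supervised objective: a standard triplet margin loss over fragment
    # embeddings. In the paper's setting the positive/negative fragments would be
    # mined dynamically from unlabelled repositories rather than sampled at random.
    encoder = DualGraphEncoder(in_dim=16)
    triplet_loss = nn.TripletMarginLoss(margin=1.0)
    anchor = encoder(toy_graph(), toy_graph())
    positive = encoder(toy_graph(), toy_graph())
    negative = encoder(toy_graph(), toy_graph())
    loss = triplet_loss(anchor, positive, negative)
    loss.backward()

The toy usage above only checks that shapes and gradients flow end to end; the abstract's self-supervision comes from how the triplets are constructed from unlabelled code, not from the loss function itself.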
format Article
id doaj-art-b6904b904f7848bb8c149649ff9d4e0a
institution Kabale University
issn 2620-1607
2620-1747
language English
publishDate 2024-12-01
publisher Regional Association for Security and Crisis Management, Belgrade, Serbia
record_format Article
series Operational Research in Engineering Sciences: Theory and Applications
spelling doaj-art-b6904b904f7848bb8c149649ff9d4e0a 2025-08-20T03:27:23Z eng Regional Association for Security and Crisis Management, Belgrade, Serbia Operational Research in Engineering Sciences: Theory and Applications 2620-1607 2620-1747 2024-12-01 7 4 CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS Chunguang Li 0 Adisorn Sirikham 1 Jessada Konpang 2 Yan Wang 3 Faculty of Engineering, Rajamangala University of Technology Krungthep, Bangkok, Thailand, 10120 Faculty of Engineering, Rajamangala University of Technology Krungthep, Bangkok, Thailand, 10120 Faculty of Engineering, Rajamangala University of Technology Krungthep, Bangkok, Thailand, 10120 Jiangsu College of Finance and Accounting, Lianyungang, China, 222061 Code clone detection underpins a wide range of maintenance tasks, from automated refactoring to real-time plagiarism policing, yet single-view methods that rely on raw tokens, Abstract Syntax Trees, or Control Flow Graphs still struggle to detect type-4 (semantic) clones. We present DG Clone, a dual-graph self-supervised framework that couples a textual call-dependency graph with an AST-derived structural graph and fuses them through a lightweight cross-graph attention module implemented in PyTorch 2.2 and PyTorch Geometric 2.5. The textual graph excels at capturing lexical context, while the AST graph models hierarchical syntax. Their fusion recovers semantic equivalence that each view alone misses, outperforming token-sequence models (e.g., GraphCodeBERT) and single-graph GNNs. Training employs a graph-aware triplet loss that obviates manual labels by dynamically constructing positive/negative triplets from unlabelled repositories. DG Clone boosts F1 on BigCloneBench from 90.3% to 98.4% and on Google Code Jam from 81.7% to 89.8%. It lifts MAP by +6.8 pp over tree-based CNNs and +5.4 pp over GraphCodeBERT, while cutting inference latency in an online judge by 31% in a Python implementation. These findings demonstrate that integrating complementary graph views affords a label-efficient and practically viable route to uncovering subtle semantic similarities in source code. https://oresta.org/menu-script/index.php/oresta/article/view/845 Code Clone Detection Self-Supervised Learning Graph Neural Network
spellingShingle Chunguang Li
Adisorn Sirikham
Jessada Konpang
Yan Wang
CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS
Operational Research in Engineering Sciences: Theory and Applications
Code Clone Detection
Self-Supervised Learning
Graph Neural Network
title CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS
title_full CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS
title_fullStr CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS
title_full_unstemmed CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS
title_short CODE CLONE DETECTION WITH SELF-SUPERVISION ON DUAL GRAPHS
title_sort code clone detection with self supervision on dual graphs
topic Code Clone Detection
Self-Supervised Learning
Graph Neural Network
url https://oresta.org/menu-script/index.php/oresta/article/view/845
work_keys_str_mv AT chunguangli codeclonedetectionwithselfsupervisionondualgraphs
AT adisornsirikham codeclonedetectionwithselfsupervisionondualgraphs
AT jessadakonpang codeclonedetectionwithselfsupervisionondualgraphs
AT yanwang codeclonedetectionwithselfsupervisionondualgraphs