CLFormer: a cross-lingual transformer framework for temporal forgery localization

Bibliographic Details
Main Authors: Haonan Cheng, Hanyue Liu, Juanjuan Cai, Long Ye
Format: Article
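Language: English
Published: Springer 2025-07-01
Series: Visual Intelligence
Subjects: Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement
Online Access: https://doi.org/10.1007/s44267-025-00084-z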
collection DOAJ
description Temporal forgery localization (TFL) is crucial in deepfake detection. It focuses on identifying subtle temporal manipulations within video content. However, current TFL methods generalize poorly, especially across languages, which limits their performance in diverse environments. This limitation stems from two key factors. First, most existing datasets are English-centric. Second, there is inadequate learning from multi-modal information, where visual features are often prioritized over audio features. To address this gap, we created the Chinese audio-visual deepfake (CHAV-DF) dataset, which is the first dataset designed for TFL in the Chinese context. This dataset provides a valuable benchmark for evaluating TFL methods in cross-lingual settings. Additionally, we introduced a cross-lingual transformer framework (CLFormer), which prioritizes audio features and utilizes a pre-trained multi-lingual Wav2Vec2 to enhance cross-lingual generalization, while incorporating visual features to further refine TFL. Moreover, we incorporated a refinement module into CLFormer to enhance the accuracy of forgery localization. Experiments on the LAV-DF, CHAV-DF, and AV-Deepfake1M datasets demonstrate that CLFormer performs well in both same-language and cross-language settings. Specifically, CLFormer achieves an average precision (AP) of 57.68% at a temporal intersection over union (tIoU) threshold of 0.50 when trained on CHAV-DF and tested on LAV-DF, surpassing the state-of-the-art method by 47.59%, and validating its cross-language generalization capability.
id doaj-art-5eb3df1f5ce54294b2f870fd003e73b2
institution Kabale University
issn 2097-3330
2731-9008
spelling doaj-art-5eb3df1f5ce54294b2f870fd003e73b2
2025-08-20T04:02:50Z
eng
Springer
Visual Intelligence
2097-3330
2731-9008
2025-07-01
31113
10.1007/s44267-025-00084-z
CLFormer: a cross-lingual transformer framework for temporal forgery localization
Haonan Cheng (State Key Laboratory of Media Convergence and Communication, Communication University of China)
Hanyue Liu (School of Information and Communication Engineering, Communication University of China)
Juanjuan Cai (Key Laboratory of Media Audio & Video (Communication University of China), Ministry of Education, Communication University of China)
Long Ye (State Key Laboratory of Media Convergence and Communication, Communication University of China)
https://doi.org/10.1007/s44267-025-00084-z
topic Temporal forgery localization (TFL)
Cross-lingual
Audio feature
Wav2Vec2
Boundary refinement