CLFormer: a cross-lingual transformer framework for temporal forgery localization

Abstract Temporal forgery localization (TFL) is crucial in deepfake detection. It focuses on identifying subtle temporal manipulations within video content. However, the generalization capabilities of current TFL methods are limited, especially across different languages, which limits their performa...

Bibliographic Details
Main Authors:	Haonan Cheng, Hanyue Liu, Juanjuan Cai, Long Ye
Format:	Article
Language:	English
Published:	Springer 2025-07-01
Series:	Visual Intelligence
Subjects:	Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement
Online Access:	https://doi.org/10.1007/s44267-025-00084-z

author	Haonan Cheng, Hanyue Liu, Juanjuan Cai, Long Ye
collection	DOAJ
description	Temporal forgery localization (TFL) is crucial in deepfake detection. It focuses on identifying subtle temporal manipulations within video content. However, the generalization capabilities of current TFL methods are limited, especially across languages, which constrains their performance in diverse environments. This limitation stems from two key factors. First, most existing datasets are English-centric. Second, there is inadequate learning from multi-modal information, where visual features are often prioritized over audio features. To address this gap, we created the Chinese audio-visual deepfake (CHAV-DF) dataset, which is the first dataset designed for TFL in the Chinese context. This dataset provides a valuable benchmark for evaluating TFL methods in cross-lingual settings. Additionally, we introduced a cross-lingual transformer framework (CLFormer), which prioritizes audio features and utilizes a pre-trained multi-lingual Wav2Vec2 to enhance cross-lingual generalization, while incorporating visual features to further refine TFL. Moreover, we incorporated a refinement module into CLFormer to enhance the accuracy of forgery localization. Experiments on the LAV-DF, CHAV-DF, and AV-Deepfake1M datasets demonstrate that CLFormer performs well in both same-language and cross-language settings. Specifically, CLFormer achieves an average precision (AP) of 57.68% at a temporal intersection over union (tIoU) of 0.50 when trained on CHAV-DF and tested on LAV-DF, surpassing the state-of-the-art method by 47.59% and validating its cross-language generalization capability.
format	Article
id	doaj-art-5eb3df1f5ce54294b2f870fd003e73b2
institution	Kabale University
issn	2097-3330 2731-9008
language	English
publishDate	2025-07-01
publisher	Springer
series	Visual Intelligence
spelling	doaj-art-5eb3df1f5ce54294b2f870fd003e73b2; 2025-08-20T04:02:50Z; eng; Springer; Visual Intelligence; 2097-3330; 2731-9008; 2025-07-01; 31113; 10.1007/s44267-025-00084-z; CLFormer: a cross-lingual transformer framework for temporal forgery localization; Haonan Cheng [0], Hanyue Liu [1], Juanjuan Cai [2], Long Ye [3]; [0] State Key Laboratory of Media Convergence and Communication, Communication University of China; [1] School of Information and Communication Engineering, Communication University of China; [2] Key Laboratory of Media Audio & Video (Communication University of China), Ministry of Education, Communication University of China; [3] State Key Laboratory of Media Convergence and Communication, Communication University of China; https://doi.org/10.1007/s44267-025-00084-z; Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement
title	CLFormer: a cross-lingual transformer framework for temporal forgery localization
topic	Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement
url	https://doi.org/10.1007/s44267-025-00084-z