CLFormer: a cross-lingual transformer framework for temporal forgery localization
Abstract Temporal forgery localization (TFL) is crucial in deepfake detection. It focuses on identifying subtle temporal manipulations within video content. However, the generalization capabilities of current TFL methods are limited, especially across different languages, which limits their performa...
| Main Authors: | Haonan Cheng, Hanyue Liu, Juanjuan Cai, Long Ye |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer 2025-07-01 |
| Series: | Visual Intelligence |
| Subjects: | Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement |
| Online Access: | https://doi.org/10.1007/s44267-025-00084-z |
| _version_ | 1849235300521345024 |
|---|---|
| author | Haonan Cheng Hanyue Liu Juanjuan Cai Long Ye |
| author_sort | Haonan Cheng |
| collection | DOAJ |
| description | Abstract Temporal forgery localization (TFL) is crucial in deepfake detection. It focuses on identifying subtle temporal manipulations within video content. However, current TFL methods generalize poorly, especially across languages, which limits their performance in diverse environments. This limitation stems from two key factors. First, most existing datasets are English-centric. Second, multi-modal information is learned inadequately, with visual features often prioritized over audio features. To address this gap, we created the Chinese audio-visual deepfake (CHAV-DF) dataset, the first dataset designed for TFL in the Chinese context. This dataset provides a benchmark for evaluating TFL methods in cross-lingual settings. Additionally, we introduced a cross-lingual transformer framework (CLFormer), which prioritizes audio features and uses a pre-trained multi-lingual Wav2Vec2 to enhance cross-lingual generalization, while incorporating visual features to further refine TFL. Moreover, we incorporated a refinement module into CLFormer to improve the accuracy of forgery localization. Experiments on the LAV-DF, CHAV-DF, and AV-Deepfake1M datasets demonstrate that CLFormer performs well in both same-language and cross-language settings. Specifically, CLFormer achieves an average precision (AP) of 57.68% at a temporal intersection over union (tIoU) of 0.50 when trained on CHAV-DF and tested on LAV-DF, surpassing the state-of-the-art method by 47.59% and validating its cross-language generalization capability. |
| format | Article |
| id | doaj-art-5eb3df1f5ce54294b2f870fd003e73b2 |
| institution | Kabale University |
| issn | 2097-3330 2731-9008 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Springer |
| record_format | Article |
| series | Visual Intelligence |
| spelling | doaj-art-5eb3df1f5ce54294b2f870fd003e73b2; 2025-08-20T04:02:50Z; eng; Springer; Visual Intelligence; 2097-3330; 2731-9008; 2025-07-01; 31113; 10.1007/s44267-025-00084-z; CLFormer: a cross-lingual transformer framework for temporal forgery localization; Haonan Cheng (0); Hanyue Liu (1); Juanjuan Cai (2); Long Ye (3); State Key Laboratory of Media Convergence and Communication, Communication University of China; School of Information and Communication Engineering, Communication University of China; Key Laboratory of Media Audio & Video (Communication University of China), Ministry of Education, Communication University of China; State Key Laboratory of Media Convergence and Communication, Communication University of China; https://doi.org/10.1007/s44267-025-00084-z; Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement |
| title | CLFormer: a cross-lingual transformer framework for temporal forgery localization |
| topic | Temporal forgery localization (TFL) Cross-lingual Audio feature Wav2Vec2 Boundary refinement |
| url | https://doi.org/10.1007/s44267-025-00084-z |
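
The description in this record reports average precision at a temporal intersection over union (tIoU) threshold of 0.50. As a minimal sketch of how tIoU between one predicted segment and one ground-truth segment is commonly computed, the snippet below may help in reading that figure. The function name `temporal_iou` and the example segments are illustrative assumptions and are not taken from the article.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds.

    Hypothetical helper for illustration; not the authors' code.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Length of the overlap, clamped at zero for disjoint segments.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union length = sum of both lengths minus the overlap.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Example segments (made up): a prediction shifted one second late.
predicted = (2.0, 4.0)
ground_truth = (3.0, 5.0)
score = temporal_iou(predicted, ground_truth)

# At a threshold of 0.50, a prediction counts as a hit only if score >= 0.5.
is_hit = score >= 0.5
print(f"tIoU = {score:.3f}, hit at 0.50: {is_hit}")
```

In this example the overlap is 1 second and the union is 3 seconds, so the prediction falls below the 0.50 threshold and would not count as a correct localization at that setting.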