VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
Video Super-Resolution (VSR) aims to reconstruct high-quality high-resolution (HR) videos from low-resolution (LR) inputs. Recent studies have explored diffusion models (DMs) for VSR by exploiting their generative priors to produce realistic details. However, the inherent randomness of diffusion models presents significant challenges for controlling content. In particular, current DM-based VSR methods often neglect inter-frame temporal coherence and reconstruction-oriented objectives, leading to visual distortion and temporal inconsistency. In this paper, we introduce VSRDiff, a DM-based framework for VSR that emphasizes inter-frame temporal coherence and adopts a novel reconstruction perspective. Specifically, the Inter-Frame Aggregation Guidance (IFAG) module is developed to learn contextual inter-frame aggregation guidance, alleviating visual distortion caused by the randomness of diffusion models. Furthermore, the Progressive Reconstruction Sampling (PRS) approach is employed to generate reconstruction-oriented latents, balancing fidelity and detail richness. Additionally, temporal consistency is enhanced through second-order bidirectional latent propagation using the Flow-guided Latent Correction (FLC) module. Extensive experiments on the REDS4 and Vid4 datasets demonstrate that VSRDiff achieves highly competitive VSR performance with more realistic details, surpassing existing state-of-the-art methods in both visual fidelity and temporal consistency. Specifically, VSRDiff achieves the best scores on the REDS4 dataset in LPIPS, DISTS, and NIQE, with values of 0.1137, 0.0445, and 2.970, respectively. The results will be released at https://github.com/aigcvsr/VSRDiff.
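As a rough illustration of the second-order bidirectional latent propagation described in the abstract, here is a toy NumPy sketch. It is hypothetical: the function names and the mixing weights `alpha` and `beta` are invented for illustration, and the paper's FLC module additionally aligns latents with optical flow, which is omitted here.

```python
import numpy as np

def second_order_propagate(latents, alpha=0.5, beta=0.3):
    """Toy second-order propagation: each frame's latent is refined using
    the already-refined latents of the previous one and two frames."""
    out = [latents[0].copy()]
    for t in range(1, len(latents)):
        prev1 = out[t - 1]
        prev2 = out[t - 2] if t >= 2 else np.zeros_like(prev1)
        out.append(latents[t] + alpha * prev1 + beta * prev2)
    return out

def bidirectional_propagate(latents, alpha=0.5, beta=0.3):
    """Run the propagation forward and backward in time, then fuse
    the two passes so every frame sees both past and future context."""
    fwd = second_order_propagate(latents, alpha, beta)
    bwd = second_order_propagate(latents[::-1], alpha, beta)[::-1]
    return [0.5 * (f + b) for f, b in zip(fwd, bwd)]
```

In the actual method each latent would be a noisy diffusion latent and the neighbors would be flow-warped before mixing; the sketch only shows the second-order, two-direction information flow.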
Main Authors: | Linlin Liu, Lele Niu, Jun Tang, Yong Ding |
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Video super-resolution; diffusion models; denoising diffusion probabilistic models; deep learning; convolutional neural network |
Online Access: | https://ieeexplore.ieee.org/document/10840194/ |
_version_ | 1832590356522205184 |
author | Linlin Liu; Lele Niu; Jun Tang; Yong Ding |
author_facet | Linlin Liu; Lele Niu; Jun Tang; Yong Ding |
author_sort | Linlin Liu |
collection | DOAJ |
description | Video Super-Resolution (VSR) aims to reconstruct high-quality high-resolution (HR) videos from low-resolution (LR) inputs. Recent studies have explored diffusion models (DMs) for VSR by exploiting their generative priors to produce realistic details. However, the inherent randomness of diffusion models presents significant challenges for controlling content. In particular, current DM-based VSR methods often neglect inter-frame temporal coherence and reconstruction-oriented objectives, leading to visual distortion and temporal inconsistency. In this paper, we introduce VSRDiff, a DM-based framework for VSR that emphasizes inter-frame temporal coherence and adopts a novel reconstruction perspective. Specifically, the Inter-Frame Aggregation Guidance (IFAG) module is developed to learn contextual inter-frame aggregation guidance, alleviating visual distortion caused by the randomness of diffusion models. Furthermore, the Progressive Reconstruction Sampling (PRS) approach is employed to generate reconstruction-oriented latents, balancing fidelity and detail richness. Additionally, temporal consistency is enhanced through second-order bidirectional latent propagation using the Flow-guided Latent Correction (FLC) module. Extensive experiments on the REDS4 and Vid4 datasets demonstrate that VSRDiff achieves highly competitive VSR performance with more realistic details, surpassing existing state-of-the-art methods in both visual fidelity and temporal consistency. Specifically, VSRDiff achieves the best scores on the REDS4 dataset in LPIPS, DISTS, and NIQE, with values of 0.1137, 0.0445, and 2.970, respectively. The results will be released at https://github.com/aigcvsr/VSRDiff. |
format | Article |
id | doaj-art-c115a6f4eda34d0682b78d8cadf24ab3 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-c115a6f4eda34d0682b78d8cadf24ab3; 2025-01-24T00:01:24Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 11447-11462; DOI 10.1109/ACCESS.2025.3529758; IEEE document 10840194; VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution; Linlin Liu (https://orcid.org/0009-0008-1914-7033), Lele Niu (https://orcid.org/0009-0005-1395-1707), Jun Tang (https://orcid.org/0000-0003-0122-9512), Yong Ding (https://orcid.org/0000-0002-5226-7511), all with the College of Integrated Circuits, Zhejiang University, Hangzhou, China; abstract as in the description field above; https://ieeexplore.ieee.org/document/10840194/; Video super-resolution; diffusion models; denoising diffusion probabilistic models; deep learning; convolutional neural network |
spellingShingle | Linlin Liu; Lele Niu; Jun Tang; Yong Ding; VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution; IEEE Access; Video super-resolution; diffusion models; denoising diffusion probabilistic models; deep learning; convolutional neural network |
title | VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution |
title_full | VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution |
title_fullStr | VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution |
title_full_unstemmed | VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution |
title_short | VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution |
title_sort | vsrdiff learning inter frame temporal coherence in diffusion model for video super resolution |
topic | Video super-resolution; diffusion models; denoising diffusion probabilistic models; deep learning; convolutional neural network |
url | https://ieeexplore.ieee.org/document/10840194/ |
work_keys_str_mv | AT linlinliu vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution AT leleniu vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution AT juntang vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution AT yongding vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution |