VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution

Video Super-Resolution (VSR) aims to reconstruct high-quality high-resolution (HR) videos from low-resolution (LR) inputs. Recent studies have explored diffusion models (DMs) for VSR by exploiting their generative priors to produce realistic details. However, the inherent randomness of diffusion models makes the generated content difficult to control. In particular, current DM-based VSR methods often neglect inter-frame temporal coherence and reconstruction-oriented objectives, leading to visual distortion and temporal inconsistency. In this paper, we introduce VSRDiff, a DM-based framework for VSR that emphasizes inter-frame temporal coherence and adopts a novel reconstruction perspective. Specifically, the Inter-Frame Aggregation Guidance (IFAG) module is developed to learn contextual inter-frame aggregation guidance, alleviating the visual distortion caused by the randomness of diffusion models. Furthermore, the Progressive Reconstruction Sampling (PRS) approach is employed to generate reconstruction-oriented latents, balancing fidelity and detail richness. Additionally, temporal consistency is enhanced through second-order bidirectional latent propagation using the Flow-guided Latent Correction (FLC) module. Extensive experiments on the REDS4 and Vid4 datasets demonstrate that VSRDiff achieves highly competitive VSR performance with more realistic details, surpassing existing state-of-the-art methods in both visual fidelity and temporal consistency. In particular, VSRDiff achieves the best LPIPS, DISTS, and NIQE scores on the REDS4 dataset, with values of 0.1137, 0.0445, and 2.970, respectively. The results will be released at https://github.com/aigcvsr/VSRDiff.


Bibliographic Details
Main Authors: Linlin Liu, Lele Niu, Jun Tang, Yong Ding
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Video super-resolution; diffusion models; denoising diffusion probabilistic models; deep learning; convolutional neural network
Online Access: https://ieeexplore.ieee.org/document/10840194/
author Linlin Liu
Lele Niu
Jun Tang
Yong Ding
author_facet Linlin Liu
Lele Niu
Jun Tang
Yong Ding
author_sort Linlin Liu
collection DOAJ
description Video Super-Resolution (VSR) aims to reconstruct high-quality high-resolution (HR) videos from low-resolution (LR) inputs. Recent studies have explored diffusion models (DMs) for VSR by exploiting their generative priors to produce realistic details. However, the inherent randomness of diffusion models makes the generated content difficult to control. In particular, current DM-based VSR methods often neglect inter-frame temporal coherence and reconstruction-oriented objectives, leading to visual distortion and temporal inconsistency. In this paper, we introduce VSRDiff, a DM-based framework for VSR that emphasizes inter-frame temporal coherence and adopts a novel reconstruction perspective. Specifically, the Inter-Frame Aggregation Guidance (IFAG) module is developed to learn contextual inter-frame aggregation guidance, alleviating the visual distortion caused by the randomness of diffusion models. Furthermore, the Progressive Reconstruction Sampling (PRS) approach is employed to generate reconstruction-oriented latents, balancing fidelity and detail richness. Additionally, temporal consistency is enhanced through second-order bidirectional latent propagation using the Flow-guided Latent Correction (FLC) module. Extensive experiments on the REDS4 and Vid4 datasets demonstrate that VSRDiff achieves highly competitive VSR performance with more realistic details, surpassing existing state-of-the-art methods in both visual fidelity and temporal consistency. In particular, VSRDiff achieves the best LPIPS, DISTS, and NIQE scores on the REDS4 dataset, with values of 0.1137, 0.0445, and 2.970, respectively. The results will be released at https://github.com/aigcvsr/VSRDiff.
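The abstract names its three components (IFAG, PRS, FLC) without implementation detail. As a rough illustration only, the sketch below shows one way second-order bidirectional latent propagation with flow-guided correction could be structured in PyTorch; every name here (flow_warp, FlowGuidedCorrection, propagate), the (dx, dy) flow convention, and the concatenate-and-fuse design are assumptions inferred from the abstract, not the authors' released code. A second pass over the frames in reverse order, mirroring propagate, would complete the bidirectional scheme.

```python
# Hypothetical sketch of second-order flow-guided latent propagation,
# inferred from the abstract; not the authors' FLC implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(latent, flow):
    """Backward-warp a latent map (N, C, H, W) with a dense flow (N, 2, H, W).

    Assumes flow channels are (dx, dy) pixel offsets.
    """
    n, _, h, w = latent.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(latent.device)  # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)             # add offsets
    grid[..., 0] = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0         # x -> [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0         # y -> [-1, 1]
    return F.grid_sample(latent, grid, align_corners=True)

class FlowGuidedCorrection(nn.Module):
    """Fuse the current latent with warped latents from t-1 and t-2
    (the 'second-order' part); residual concatenate-and-fuse design."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, cur, warp1, warp2):
        return cur + self.fuse(torch.cat([cur, warp1, warp2], dim=1))

def propagate(latents, flows1, flows2, flc):
    """One forward pass over a list of per-frame latents; flows1[t] maps
    frame t to t-1, flows2[t] maps frame t to t-2."""
    out, hist = [], []
    for t, cur in enumerate(latents):
        w1 = flow_warp(hist[-1], flows1[t]) if len(hist) >= 1 else cur
        w2 = flow_warp(hist[-2], flows2[t]) if len(hist) >= 2 else cur
        cur = flc(cur, w1, w2)
        hist.append(cur)
        out.append(cur)
    return out

if __name__ == "__main__":
    # Toy example: 5 latent frames, zero flow (warping is then the identity).
    lat = [torch.randn(1, 4, 32, 32) for _ in range(5)]
    zero = [torch.zeros(1, 2, 32, 32) for _ in range(5)]
    flc = FlowGuidedCorrection(4)
    fused = propagate(lat, zero, zero, flc)
    print(fused[0].shape)  # torch.Size([1, 4, 32, 32])
```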
format Article
id doaj-art-c115a6f4eda34d0682b78d8cadf24ab3
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-c115a6f4eda34d0682b78d8cadf24ab3 (indexed 2025-01-24T00:01:24Z)
VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
IEEE Access (IEEE), ISSN 2169-3536, vol. 13, pp. 11447-11462, published 2025-01-01
DOI: 10.1109/ACCESS.2025.3529758; IEEE document 10840194; https://ieeexplore.ieee.org/document/10840194/
Linlin Liu (https://orcid.org/0009-0008-1914-7033), Lele Niu (https://orcid.org/0009-0005-1395-1707), Jun Tang (https://orcid.org/0000-0003-0122-9512), Yong Ding (https://orcid.org/0000-0002-5226-7511), all with the College of Integrated Circuits, Zhejiang University, Hangzhou, China
spellingShingle Linlin Liu
Lele Niu
Jun Tang
Yong Ding
VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
IEEE Access
Video super-resolution
diffusion models
denoising diffusion probabilistic models
deep learning
convolutional neural network
title VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
title_full VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
title_fullStr VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
title_full_unstemmed VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
title_short VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution
title_sort vsrdiff learning inter frame temporal coherence in diffusion model for video super resolution
topic Video super-resolution
diffusion models
denoising diffusion probabilistic models
deep learning
convolutional neural network
url https://ieeexplore.ieee.org/document/10840194/
work_keys_str_mv AT linlinliu vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution
AT leleniu vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution
AT juntang vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution
AT yongding vsrdifflearninginterframetemporalcoherenceindiffusionmodelforvideosuperresolution