CNN-Based Time Series Decomposition Model for Video Prediction

Bibliographic Details
Main Authors: Jinyoung Lee, Gyeyoung Kim
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Access
Subjects: Convolutional neural networks; deep learning architecture; spatiotemporal representation learning; time series forecasting; video prediction
Online Access: https://ieeexplore.ieee.org/document/10676971/
author Jinyoung Lee
Gyeyoung Kim
collection DOAJ
description Video prediction is a formidable challenge because it requires effectively processing the spatial and temporal information embedded in videos. While recurrent neural network (RNN) and transformer-based models have been extensively explored to capture spatial changes over time, recent advances in convolutional neural networks (CNNs) have also yielded high-performance video prediction models. CNN-based models are easier to parallelize and have lower computational complexity than RNN and transformer-based models, which makes them attractive in practical applications. However, existing CNN-based video prediction models typically treat the spatiotemporal channels of a video like the channel axis of a static image: they stack frames in temporal order to form a single spatiotemporal axis and apply standard $1\times 1$ convolutions to it. This approach has limitations. Applying a $1\times 1$ convolution directly to the stacked spatiotemporal axis mixes temporal and spatial information, which can cause computational inefficiency and reduced accuracy, and the operation is poorly suited to modeling temporal dynamics. This study introduces a CNN-based time series decomposition model for video prediction. The proposed model first splits the $1\times 1$ convolution in the channel aggregation module so that the temporal and spatial dimensions are processed independently. To capture evolving features, the temporal axis is then separated into trend and residual components, and a time series decomposition forecasting method is applied to them. The proposed technique was evaluated on the Moving MNIST, KTH, and KITTI-Caltech benchmark datasets. On Moving MNIST, despite approximately 55% fewer parameters and 37% lower computational cost, the proposed method improved accuracy by up to 7% over the previous approach.
format Article
id doaj-art-fa4e35b9dbee4db583d746eefe366b5d
institution Kabale University
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling IEEE Access, vol. 12, pp. 131205-131216, 2024-01-01; ISSN 2169-3536; DOI 10.1109/ACCESS.2024.3458460; IEEE article no. 10676971. Authors: Jinyoung Lee (https://orcid.org/0009-0000-9950-6738) and Gyeyoung Kim (https://orcid.org/0000-0001-6908-6920), both with the School of Software, Soongsil University, Seoul, South Korea. Record doaj-art-fa4e35b9dbee4db583d746eefe366b5d (2025-01-16T00:02:15Z); full text at https://ieeexplore.ieee.org/document/10676971/
title CNN-Based Time Series Decomposition Model for Video Prediction
topic Convolutional neural networks
deep learning architecture
spatiotemporal representation learning
time series forecasting
video prediction
url https://ieeexplore.ieee.org/document/10676971/
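
The description field above outlines two architectural ideas: splitting the single 1x1 convolution that CNN-based predictors usually apply to the stacked spatiotemporal (T x C) axis into separate channel-mixing and time-mixing steps, and decomposing the temporal axis into trend and residual components before forecasting. The sketch below is a minimal, illustrative PyTorch rendering of how those two ideas can fit together; it is not the authors' implementation, and the class names, the moving-average kernel size, and the specific layer choices are assumptions made only for illustration.

# Illustrative sketch (assumed PyTorch); names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    # Split a (B, T, C, H, W) tensor into trend and residual along the time axis
    # using a moving average, as in standard time series decomposition.
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2,
                                count_include_pad=False)

    def forward(self, x: torch.Tensor):
        b, t, c, h, w = x.shape
        series = x.permute(0, 2, 3, 4, 1).reshape(b * c * h * w, 1, t)
        trend = self.avg(series).reshape(b, c, h, w, t).permute(0, 4, 1, 2, 3)
        return trend, x - trend

class DecoupledSpatioTemporalMixer(nn.Module):
    # Mix feature channels and time steps with separate 1x1 convolutions instead of
    # one 1x1 convolution applied to the stacked (T*C) spatiotemporal axis.
    def __init__(self, t: int, c: int, kernel_size: int = 3):
        super().__init__()
        self.spatial_mix = nn.Conv2d(c, c, kernel_size=1)   # channel mixing, per frame
        self.trend_mix = nn.Conv2d(t, t, kernel_size=1)     # time mixing for the trend part
        self.residual_mix = nn.Conv2d(t, t, kernel_size=1)  # time mixing for the residual part
        self.decomp = SeriesDecomposition(kernel_size)

    def _mix_time(self, x: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
        # Treat the time axis as the convolution's channel axis.
        b, t, c, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * c, t, h, w)
        return conv(x).reshape(b, c, t, h, w).permute(0, 2, 1, 3, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Channel mixing applied independently to each frame.
        x = self.spatial_mix(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Temporal mixing applied separately to the trend and residual components.
        trend, residual = self.decomp(x)
        return self._mix_time(trend, self.trend_mix) + self._mix_time(residual, self.residual_mix)

# Example: a batch of 10-frame, 16-channel, 32x32 feature maps; output keeps the input shape.
frames = torch.randn(2, 10, 16, 32, 32)
out = DecoupledSpatioTemporalMixer(t=10, c=16)(frames)

One practical consequence of this split, visible in the sketch, is that each 1x1 convolution's weight matrix scales with either the channel count or the number of frames rather than with their product, which is in line with the parameter and computation reductions reported in the description.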