Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices

Given the explosive growth in video content generation, there is a rising demand for efficient and scalable video recognition. Deep learning has shown its remarkable performance in video analytics, by applying 2D or 3D Convolutional Neural Networks (CNNs) across multiple video frames. However, high...

Bibliographic Details
Main Authors: Qingli Wang, Chengwu Yu, Shan Chen, Weiwei Fang, Naixue Xiong
Format: Article
Language: English
Published: Tsinghua University Press 2025-05-01
Series: Big Data Mining and Analytics
Subjects: deep learning; edge intelligence; resolution selection; early exit; video analytics
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2024.9020093
author Qingli Wang
Chengwu Yu
Shan Chen
Weiwei Fang
Naixue Xiong
collection DOAJ
description Given the explosive growth in video content generation, there is a rising demand for efficient and scalable video recognition. Deep learning has shown remarkable performance in video analytics by applying 2D or 3D Convolutional Neural Networks (CNNs) across multiple video frames. However, high data quantities, intensive computational costs, and various performance requirements restrict the deployment and application of these video-oriented models on resource-constrained edge devices, e.g., Internet-of-Things (IoT) and mobile devices. To tackle this issue, we propose a joint optimization system, RSEE, which combines adaptive Resolution Selection (RS) and conditional Early Exiting (EE) to facilitate efficient video recognition based on 2D CNN backbones. Given a video frame, RSEE first uses the dynamic resolution selector to determine which input resolution to process, then sends the resolution-adjusted frame into the backbone network to extract features, and finally decides at the early-exiting gate whether to stop further processing, based on the accumulated features of the current video. Extensive experiments conducted on benchmark datasets indicate that RSEE remarkably outperforms current state-of-the-art solutions in terms of computational cost (by up to 84.72% on UCF101 and 78.93% on HMDB51) and inference speed (up to 3.18× on UCF101 and 3.50× on HMDB51), while still preserving competitive recognition accuracy (up to 7.81% on UCF101 and 7.21% on HMDB51). Furthermore, the superiority of RSEE on resource-constrained edge devices is validated on the NVIDIA Jetson Nano, where processing speeds, controlled by hyperparameters, range from about 12 to 60 Frames Per Second (FPS), which is sufficient for real-time analysis.
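The three per-frame steps named in the description (resolution selection, feature extraction, early-exit decision) can be sketched as a simple control loop. This is only an illustrative sketch and not the authors' implementation: the names `recognize`, `downsample`, `toy_selector`, `toy_backbone`, and `toy_classifier` are hypothetical, the running mean used to accumulate features is an assumption, and a fixed confidence threshold stands in for the early-exiting gate, whose actual form the record does not describe.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[List[float]]  # toy square grayscale frame: rows of pixel values


def downsample(frame: Frame, size: int) -> Frame:
    """Nearest-neighbour resize of a square toy frame to size x size."""
    n = len(frame)
    return [[frame[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def recognize(
    frames: Sequence[Frame],
    select_resolution: Callable[[Frame], int],
    backbone: Callable[[Frame], List[float]],
    classify: Callable[[List[float]], List[float]],
    exit_threshold: float,
) -> Tuple[int, int]:
    """Return (predicted class index, number of frames actually processed)."""
    if not frames:
        raise ValueError("at least one frame is required")
    accumulated: List[float] = []
    probs: List[float] = []
    used = 0
    for frame in frames:
        size = select_resolution(frame)                # step 1: pick a resolution
        features = backbone(downsample(frame, size))   # step 2: extract features
        used += 1
        if not accumulated:
            accumulated = list(features)
        else:
            # Assumption: accumulate features as a running mean over frames.
            accumulated = [a + (f - a) / used
                           for a, f in zip(accumulated, features)]
        probs = classify(accumulated)
        # Step 3 (assumption): exit once the top class is confident enough.
        if max(probs) >= exit_threshold:
            break
    return probs.index(max(probs)), used


# Hypothetical stand-ins so the sketch runs without a trained model.
def toy_selector(frame: Frame) -> int:
    flat = [p for row in frame for p in row]
    # Low-contrast frames are processed at the lower resolution.
    return 4 if max(flat) - min(flat) > 0.5 else 2


def toy_backbone(frame: Frame) -> List[float]:
    flat = [p for row in frame for p in row]
    mean = sum(flat) / len(flat)
    return [mean, 1.0 - mean]


def toy_classifier(features: List[float]) -> List[float]:
    total = sum(features)
    return [f / total for f in features]
```

With the toy stand-ins, calling `recognize(frames, toy_selector, toy_backbone, toy_classifier, exit_threshold=0.8)` stops as soon as the accumulated features give one class a probability of at least 0.8; raising the threshold makes the loop consume more frames, which is the kind of speed/accuracy trade-off the description attributes to hyperparameters.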
format Article
id doaj-art-01878a0f6dfb4e88bb86052a52f39f3b
institution Kabale University
issn 2096-0654
2097-406X
language English
publishDate 2025-05-01
publisher Tsinghua University Press
record_format Article
series Big Data Mining and Analytics
spelling doaj-art-01878a0f6dfb4e88bb86052a52f39f3b
2025-08-20T03:36:12Z
eng
Tsinghua University Press
Big Data Mining and Analytics
2096-0654
2097-406X
2025-05-01
Volume 8, Issue 3, pp. 661-677
10.26599/BDMA.2024.9020093
Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices
Qingli Wang (Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China)
Chengwu Yu (Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China)
Shan Chen (Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China)
Weiwei Fang (Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China)
Naixue Xiong (Department of Computer Science and Mathematics, Sul Ross State University, Alpine, TX 79832, USA)
title Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices
topic deep learning
edge intelligence
resolution selection
early exit
video analytics
url https://www.sciopen.com/article/10.26599/BDMA.2024.9020093