Hardware/Software Co-Design Optimization for Training Recurrent Neural Networks at the Edge

Bibliographic Details
Main Authors: Yicheng Zhang, Bojian Yin, Manil Dev Gomony, Henk Corporaal, Carsten Trinitis, Federico Corradi
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Journal of Low Power Electronics and Applications
Online Access: https://www.mdpi.com/2079-9268/15/1/15
Description
Summary: Edge devices execute pre-trained Artificial Intelligence (AI) models optimized on large Graphics Processing Units (GPUs); however, they frequently require fine-tuning when deployed in the real world. This fine-tuning, referred to as edge learning, is essential for personalized tasks such as speech and gesture recognition, which often necessitate the use of recurrent neural networks (RNNs). However, training RNNs on edge devices presents major challenges due to limited memory and computing resources. In this study, we propose a system for RNN training through sequence partitioning using the Forward Propagation Through Time (FPTT) training method, thereby enabling edge learning. Our optimized hardware/software co-design for FPTT represents a novel contribution in this domain. This research demonstrates the viability of FPTT for fine-tuning in real-world applications by implementing a complete computational framework for training Long Short-Term Memory (LSTM) networks utilizing FPTT. Moreover, this work incorporates the optimization and exploration of a scalable digital hardware architecture using an open-source hardware-design framework named Chipyard, and its implementation on a Field-Programmable Gate Array (FPGA) for cycle-accurate verification. The empirical results demonstrate that partitioned training on the proposed architecture enables an 8.2-fold reduction in memory usage with only a 0.2× increase in latency for small-batch sequential MNIST (S-MNIST) compared to traditional non-partitioned training.
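
For readers unfamiliar with FPTT, the sketch below illustrates the general idea of partitioned FPTT-style LSTM training in software (PyTorch): the input sequence is split into K partitions, parameters are updated after each partition with a proximal term that pulls them toward a running parameter average, and the hidden state is detached at partition boundaries so activation memory scales with the partition length rather than the full sequence length. The model sizes, hyperparameters, and simplified averaging rule here are illustrative assumptions, not the authors' exact formulation or their hardware mapping.

# Minimal sketch of FPTT-style partitioned training for an LSTM (PyTorch).
# Shapes, hyperparameters, and the averaging rule are illustrative only.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, input_size=1, hidden_size=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)
        return self.fc(out[:, -1]), state

def fptt_partitioned_step(model, x, y, num_partitions=8, alpha=0.1, lr=1e-3):
    """One training step: split the sequence into K partitions and update the
    parameters after each partition, regularizing them toward a running
    average (the FPTT proximal term). The hidden state is detached between
    partitions, so the backward graph only spans one partition at a time."""
    criterion = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    # Running parameter averages used by the FPTT regularizer.
    avg = [p.detach().clone() for p in model.parameters()]
    state = None
    for chunk in torch.chunk(x, num_partitions, dim=1):  # split along time
        logits, state = model(chunk, state)
        loss = criterion(logits, y)
        # Proximal term pulls parameters toward their running average.
        reg = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), avg))
        opt.zero_grad()
        (loss + 0.5 * alpha * reg).backward()
        opt.step()
        # Truncate the computation graph at the partition boundary.
        state = tuple(s.detach() for s in state)
        # Update running averages (simplified averaging rule).
        with torch.no_grad():
            for a, p in zip(avg, model.parameters()):
                a.mul_(0.5).add_(0.5 * p)
    return loss.item()

# Example: sequential MNIST-style input, 784 time steps of 1 pixel each.
model = LSTMClassifier()
x = torch.randn(32, 784, 1)          # (batch, time, features)
y = torch.randint(0, 10, (32,))
print(fptt_partitioned_step(model, x, y))

Detaching the state at each partition boundary is what bounds activation memory by the partition length rather than the sequence length, which is the effect behind the memory reduction reported in the abstract; in a full training loop the optimizer and running averages would persist across steps rather than being recreated as in this sketch.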
ISSN: 2079-9268