Multimodal Fall Detection Using Spatial–Temporal Attention and Bi-LSTM-Based Feature Fusion

Human fall detection is a significant healthcare concern, particularly among the elderly, due to its links to muscle weakness, cardiovascular issues, and locomotive syndrome. Accurate fall detection is crucial for timely intervention and injury prevention, which has led many researchers to work on d...

Full description

Saved in:
Bibliographic Details
Main Authors: Jungpil Shin, Abu Saleh Musa Miah, Rei Egawa, Najmul Hassan, Koki Hirooka, Yoichi Tomioka
Format: Article
Language:English
Published: MDPI AG 2025-04-01
Series:Future Internet
Subjects:
Online Access:https://www.mdpi.com/1999-5903/17/4/173
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Human fall detection is a significant healthcare concern, particularly among the elderly, due to its links to muscle weakness, cardiovascular issues, and locomotive syndrome. Accurate fall detection is crucial for timely intervention and injury prevention, which has led many researchers to work on developing effective detection systems. However, existing unimodal systems that rely solely on skeleton or sensor data face challenges such as poor robustness, computational inefficiency, and sensitivity to environmental conditions. While some multimodal approaches have been proposed, they often struggle to capture long-range dependencies effectively. In order to address these challenges, we propose a multimodal fall detection framework that integrates skeleton and sensor data. The system uses a Graph-based Spatial-Temporal Convolutional and Attention Neural Network (GSTCAN) to capture spatial and temporal relationships from skeleton and motion data information in stream-1, while a Bi-LSTM with Channel Attention (CA) processes sensor data in stream-2, extracting both spatial and temporal features. The GSTCAN model uses AlphaPose for skeleton extraction, calculates motion between consecutive frames, and applies a graph convolutional network (GCN) with a CA mechanism to focus on relevant features while suppressing noise. In parallel, the Bi-LSTM with CA processes inertial signals, with Bi-LSTM capturing long-range temporal dependencies and CA refining feature representations. The features from both branches are fused and passed through a fully connected layer for classification, providing a comprehensive understanding of human motion. The proposed system was evaluated on the Fall Up and UR Fall datasets, achieving a classification accuracy of 99.09% and 99.32%, respectively, surpassing existing methods. This robust and efficient system demonstrates strong potential for accurate fall detection and continuous healthcare monitoring.
ISSN:1999-5903