Leveraging Deep Learning for Fault Detection and Localization in Distributed Systems

Bibliographic Details
Main Authors: Debolina Ghosh, Jay Prakash Singh
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11075581/
Description
Summary: The dynamic and complex nature of distributed systems makes fault localization extremely difficult, frequently leading to extended outages and higher operating expenses. This study proposes a deep learning-based fault localization framework that combines Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), LSTM+CNN, and Autoencoder+LSTM models. The input logs undergo extensive preprocessing, including log parsing, feature extraction using TF-IDF and Word2Vec, and min-max normalisation, before the models are trained and assessed on five benchmark datasets: HDFS, OpenStack, Spark, Hadoop, and BGL. To ensure robustness, the methodology incorporates a 5-fold cross-validation strategy, model-specific architecture tuning, and 1-D sequence modelling. According to the experimental results, CNN performs best overall on the HDFS dataset, with a Mean Squared Error (MSE) of 0.00002 and a Coefficient of Determination (R² score) of 0.996. CNN consistently outperforms the other models in accuracy and performance across all datasets. The key contributions of this study are: 1) a thorough deep learning-based fault localization framework for distributed systems; 2) a comparison of five cutting-edge architectures on five real-world datasets; and 3) statistically validated performance benchmarks backed by Wilcoxon signed-rank tests and t-tests. These contributions provide useful guidance for implementing accurate and scalable fault localization in real-world distributed computing environments.
ISSN: 2169-3536
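
Illustrative note: the summary names min-max normalisation, 5-fold cross-validation, MSE, and the R² score without defining them. The sketch below shows the standard textbook definitions of these four pieces in plain Python. It is not the authors' implementation; the function names (`min_max_normalise`, `mean_squared_error`, `r2_score`, `k_fold_indices`) are hypothetical, and the contiguous, unshuffled fold layout is an assumption made for simplicity.

```python
def min_max_normalise(values):
    """Rescale values linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (SS_tot would otherwise be zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop
```

In a cross-validated evaluation, each model would be fitted on the `train` indices of a fold and scored with `mean_squared_error` and `r2_score` on the held-out `test` indices; the per-fold scores are what paired tests such as the Wilcoxon signed-rank test would then compare.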