A Design of Workflow-Based Automated Failure Recovery Framework in IoT Edge Environment

Bibliographic Details
Main Authors: Phuong Bac Ta, Vitumbiko Mafeni, Younghan Kim
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11030607/
Description
Summary: In IoT Edge environments, microservice-based architectures decompose workloads into functional units for efficient deployment and scaling across edge nodes. However, service dependencies and layered infrastructure often cause cascading failures, which complicate automated failure recovery. Existing failure recovery methods usually rely on static, isolated recovery strategies that ignore system-wide dependencies. This results in recovery conflicts (e.g., multiple processes modifying the same target or violating recovery order), which in turn lead to redundant actions and prolonged downtime. To address these limitations, we propose a workflow-based automated recovery framework that structures recovery actions into coordinated workflows to enable conflict-free execution while flexibly adapting to evolving infrastructure, dependencies, and failure scenarios. The framework incorporates a Conflict-Aware Workflow Execution (CAWE) mechanism that leverages a cluster dependency graph (CDG) and priority-based coordination to enforce correct recovery order and prevent conflicting operations. Experimental results demonstrate that eliminating conflicts during recovery reduces recovery completion time by up to 47% compared to the baseline method without conflict awareness. Furthermore, by ensuring dependency-consistent coordination and prioritization of target objects, our approach achieves a higher recovery success rate, at least 88% under severe failure conditions, and substantially reduces peak data loss compared to traditional methods such as failover and retry-failover.
Although short-term resource usage increases (not exceeding 30% of system capacity) occur during recovery, this trade-off is justified by significant gains in recovery speed, reliability, and data preservation, ensuring stable service operation in IoT Edge environments.
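The summary names two ingredients of conflict-aware execution: a dependency graph that fixes recovery order, and priority-based coordination that keeps two actions from modifying the same target. The following is a minimal illustrative sketch of that idea, not the authors' CAWE implementation. The names `RecoveryAction` and `plan_recovery`, the example targets, and the convention that a lower priority value means more urgent are all assumptions made for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryAction:
    target: str      # object to recover (hypothetical names, e.g. a node or service)
    operation: str   # e.g. "restart", "reboot", "failover"
    priority: int    # assumed convention: lower value = more urgent


def plan_recovery(actions, depends_on):
    """Return a conflict-free, dependency-consistent execution order.

    depends_on maps a target to the targets that must be recovered first.
    """
    # 1. Conflict resolution: keep only the most urgent action per target,
    #    so no two actions modify the same object.
    chosen = {}
    for action in actions:
        current = chosen.get(action.target)
        if current is None or action.priority < current.priority:
            chosen[action.target] = action

    # 2. Restrict the dependency graph to targets that are being recovered.
    pending = {t: {d for d in depends_on.get(t, ()) if d in chosen} for t in chosen}
    dependents = {t: [] for t in chosen}
    for target, deps in pending.items():
        for dep in deps:
            dependents[dep].append(target)

    # 3. Kahn's topological sort; a heap breaks ties by priority.
    ready = [(chosen[t].priority, t) for t, deps in pending.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, target = heapq.heappop(ready)
        order.append(chosen[target])
        for nxt in dependents[target]:
            pending[nxt].discard(target)
            if not pending[nxt]:
                heapq.heappush(ready, (chosen[nxt].priority, nxt))

    if len(order) != len(chosen):
        raise ValueError("cyclic dependency among recovery targets")
    return order


# Two actions compete for "db"; "api" depends on "db", which depends on the node.
actions = [
    RecoveryAction("db", "restart", 2),
    RecoveryAction("edge-node-1", "reboot", 1),
    RecoveryAction("api", "restart", 3),
    RecoveryAction("db", "failover", 5),
]
depends_on = {"db": ["edge-node-1"], "api": ["db"]}

plan = plan_recovery(actions, depends_on)
for step in plan:
    print(step.target, step.operation)
```

In this sketch the lower-priority `failover` on `db` is dropped in favor of `restart`, and the remaining actions run in the order node, database, API. A real system would also need to handle actions that fail partway through and dependencies that change during recovery, which this example leaves out.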
ISSN: 2169-3536