A Design of Workflow-Based Automated Failure Recovery Framework in IoT Edge Environment
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11030607/ |
| Summary: | In IoT Edge environments, microservice-based architectures decompose workloads into functional units for efficient deployment and scaling across edge nodes. However, service dependencies and layered infrastructure often cause cascading failures, complicating automated failure recovery in IoT Edge environments. Existing failure recovery methods usually rely on static, isolated recovery strategies that ignore system-wide dependencies. This produces recovery conflicts (e.g., multiple processes modifying the same target or violating recovery order), which in turn lead to redundant actions and prolonged downtime. To address these limitations, we propose a workflow-based automated recovery framework that structures recovery actions into coordinated workflows to enable conflict-free execution while flexibly adapting to evolving infrastructure, dependencies, and failure scenarios. The framework incorporates a Conflict-Aware Workflow Execution (CAWE) mechanism that leverages a cluster dependency graph (CDG) and priority-based coordination to enforce correct recovery order and prevent conflicting operations. Experimental results demonstrate that eliminating conflicts during recovery improves recovery completion time by up to 47% compared to the baseline method without conflict awareness. Furthermore, by ensuring dependency-consistent coordination and prioritization of target objects, our approach achieves a higher recovery success rate, with a minimum of 88% under severe failure conditions, and substantially reduces peak data loss compared to traditional methods such as failover and retry-failover. Although resource usage increases briefly during recovery (not exceeding 30% of system capacity), this trade-off is justified by significant gains in recovery speed, reliability, and data preservation, ensuring stable service operation in IoT Edge environments. |
|---|---|
| ISSN: | 2169-3536 |
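
The summary describes ordering recovery by a dependency graph and priorities so that no target is recovered twice or out of order. The sketch below illustrates that general idea only; it is not the CAWE algorithm from the article. The function name `plan_recovery`, the object names, and the priority values are all hypothetical, and the ordering technique used here is a plain priority-aware topological sort.

```python
import heapq
from collections import defaultdict


def plan_recovery(depends_on, failed, priority):
    """Return one recovery order for the failed targets (illustrative only).

    depends_on: dict mapping a target to the set of targets it depends on
                (a stand-in for a dependency graph; hypothetical structure).
    failed:     iterable of failed targets; duplicate reports are collapsed
                so each target is recovered exactly once.
    priority:   dict mapping a target to a number; lower means recover first
                among targets whose failed dependencies are already recovered.
    """
    failed = set(failed)  # collapse duplicate reports of the same target
    indegree = {t: 0 for t in failed}
    dependents = defaultdict(list)
    for target in failed:
        for dep in depends_on.get(target, ()):
            if dep in failed:  # only failed dependencies constrain the order
                indegree[target] += 1
                dependents[dep].append(target)

    # Targets with no failed dependency are ready; ties break by name.
    ready = [(priority.get(t, 0), t) for t, d in indegree.items() if d == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, target = heapq.heappop(ready)
        order.append(target)
        for nxt in dependents[target]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (priority.get(nxt, 0), nxt))

    if len(order) != len(failed):
        raise ValueError("dependency cycle among failed targets")
    return order


# Hypothetical example: a node failure cascades to two pods and a service.
depends_on = {
    "pod-a": {"node-1"},
    "pod-b": {"node-1"},
    "service-a": {"pod-a"},
}
failed = ["service-a", "pod-a", "node-1", "pod-b", "pod-a"]  # pod-a reported twice
priority = {"node-1": 0, "pod-a": 1, "service-a": 1, "pod-b": 2}
print(plan_recovery(depends_on, failed, priority))
```

In this toy case the node is recovered before the pods that run on it, and the service only after its pod, which mirrors the "correct recovery order" requirement stated in the summary. A real system would also need to handle actions that fail midway, which this sketch does not attempt.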