Automated Root Cause Analysis of Network Failures in IP/MPLS Network Using Machine Learning and Case-Based Reasoning
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11053841/ |
| Summary: | Managing IP/MPLS networks requires advanced tools due to their inherent complexity. Problems such as chain failures can be particularly challenging to resolve, as a single issue may impact multiple devices. This study introduces an integrated system aimed at improving the management of IP/MPLS networks by automatically identifying the root causes of network failures, particularly in large-scale environments. The proposed system features a dual-layered architecture comprising a Log Analysis Layer and an Operation and Maintenance Layer. The Log Analysis Layer enhances message uniformity by standardizing event logs through template generation, which replaces variable elements with wildcards. The Operation and Maintenance Layer includes components such as an event analysis service, node chain lookup, and node test service. These modules work together to identify affected devices, collect diagnostic metrics, and filter out critical events. A supervised learning model is employed to classify event messages, trained on a dataset of over seven million entries. The use of Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction improves classification accuracy by emphasizing distinctive terms over commonly occurring ones. Among the models evaluated, the support vector machine (SVM) algorithm achieved the highest performance, with an F1-score of 0.969. The system integrates Apache Kafka as a high-throughput message broker to enable real-time processing of SNMP Traps and Syslog data. Additionally, a case-based fault identification service automates fault analysis and provides actionable insights via an interactive dashboard and a notification system that delivers alerts through modern messaging platforms. Experimental results demonstrate significant improvements in network resilience, including reduced reliance on manual troubleshooting, enhanced decision-making accuracy, and faster fault recovery times. |
|---|---|
| ISSN: | 2169-3536 |
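
The template-generation step described in the summary (replacing variable elements of an event log with wildcards) can be illustrated with a minimal sketch. This is not the authors' implementation: the masking rules, the `<*>` wildcard token, the function names, and the sample messages below are all assumptions chosen for illustration.

```python
import re
from collections import Counter

WILDCARD = "<*>"  # assumed wildcard token; the article does not specify one

# Hypothetical masking rules. Order matters: IPv4 addresses are masked first
# so that their octets are not picked up by the plain-number rule.
MASK_RULES = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),   # IPv4 address
    re.compile(r"\b[A-Za-z]+\d+(?:/\d+)+\b"),     # interface name, e.g. GigabitEthernet0/0/1
    re.compile(r"\b\d+\b"),                       # standalone number
]


def to_template(message: str) -> str:
    """Replace variable fields in a log message with the wildcard token."""
    for rule in MASK_RULES:
        message = rule.sub(WILDCARD, message)
    return message


def count_templates(messages: list[str]) -> Counter:
    """Group raw messages by the template they reduce to."""
    return Counter(to_template(m) for m in messages)


# Made-up sample messages, for illustration only.
SAMPLE_LOGS = [
    "Interface GigabitEthernet0/0/1 on 10.1.2.3 changed state to down",
    "Interface GigabitEthernet0/0/2 on 10.9.8.7 changed state to down",
    "LDP session 192.168.0.1 down after 30 seconds",
]

if __name__ == "__main__":
    for template, count in count_templates(SAMPLE_LOGS).items():
        print(count, template)
```

In this sketch the two interface messages collapse into one template, which is the uniformity the summary refers to: a downstream classifier sees one recurring message form instead of many device-specific variants.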