Development and Analysis of a Methodology for Selecting Infrastructure Metrics for Predictive Incident Monitoring

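The growth of telemetry volume in distributed IT systems leads to "information noise" and increases the computational costs of AIOps platforms. This paper proposes a formalized two-stage metric selection procedure designed to improve the accuracy and efficiency of predictive monitoring: (1) a multicriteria correlation filter using Pearson coefficients (|r| > 0.60), Kendall’s τ (> 0.50), and Maximal Information Coefficient (MICe > 0.35) to eliminate redundant and non-linearly related features; (2) verification of causal relationships using the Granger test (lag = 5, p < 0.01), the PCMCI algorithm (FDR = 10%), and the Directed Information metric (DI > 0.1 bits/step) to identify true drivers of the target metric. Experimental validation was conducted on a 14-day fragment of Prometheus metrics from the industrial cluster of the "Sber Antifraud" system (≈7 billion data points, 1379 initial metrics). The results showed a 43% reduction in the Mean Absolute Error (MAE) of 30-minute CPU utilization forecasts, a 14-fold decrease in input time series, and an 89% reduction in model inference time. The methodology is integrated into an industrial data processing pipeline (Prometheus → Kafka → Spark 3.5 → MLflow 2.11) and aligns with the data minimization principle outlined in GOST R 57580.1-2017 and FSTEC guidelines for information protection.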

Bibliographic Details
Main Author: Andrew Egorkin
Format: Article
Language: Russian
Published: The Fund for Promotion of Internet media, IT education, human development «League Internet Media» 2025-04-01
Series: Современные информационные технологии и IT-образование (Modern Information Technologies and IT-Education)
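Subjects: aiops; predictive monitoring; feature selection; correlation; mice; causality; causal analysis; pcmci; directed information; sre; green sre; time series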
Online Access: https://sitito.cs.msu.ru/index.php/SITITO/article/view/1193
author Andrew Egorkin
collection DOAJ
description The growth of telemetry volume in distributed IT systems leads to "information noise" and increases the computational costs of AIOps platforms. This paper proposes a formalized two-stage metric selection procedure designed to improve the accuracy and efficiency of predictive monitoring: (1) a multicriteria correlation filter using Pearson coefficients (|r| > 0.60), Kendall’s τ (> 0.50), and Maximal Information Coefficient (MICe > 0.35) to eliminate redundant and non-linearly related features; (2) verification of causal relationships using the Granger test (lag = 5, p < 0.01), the PCMCI algorithm (FDR = 10%), and the Directed Information metric (DI > 0.1 bits/step) to identify true drivers of the target metric. Experimental validation was conducted on a 14-day fragment of Prometheus metrics from the industrial cluster of the "Sber Antifraud" system (≈7 billion data points, 1379 initial metrics). The results showed a 43% reduction in the Mean Absolute Error (MAE) of 30-minute CPU utilization forecasts, a 14-fold decrease in input time series, and an 89% reduction in model inference time. The methodology is integrated into an industrial data processing pipeline (Prometheus → Kafka → Spark 3.5 → MLflow 2.11) and aligns with the data minimization principle outlined in GOST R 57580.1-2017 and FSTEC guidelines for information protection.
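The first stage named in the description (the correlation filter) can be illustrated with a minimal sketch. This is not the author's implementation: the record does not say how the criteria are combined or whether they are measured against the target metric or between candidate metrics. The sketch assumes each candidate is compared with the target and kept if it passes either the Pearson or the Kendall threshold. The MICe criterion is omitted because it needs a third-party estimator, and the metric names and values are hypothetical.

```python
from math import sqrt

PEARSON_MIN = 0.60  # |r| threshold quoted in the description
KENDALL_MIN = 0.50  # tau threshold quoted in the description


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / all pairs, O(n^2)."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                score += 1
            elif s < 0:
                score -= 1
            # tied pairs (s == 0) count as neither
    return score / (n * (n - 1) / 2)


def correlation_filter(metrics, target):
    """Keep metrics passing either threshold (the any-of rule is an assumption)."""
    kept = []
    for name, series in metrics.items():
        r = pearson_r(series, target)
        tau = kendall_tau(series, target)
        if abs(r) > PEARSON_MIN or tau > KENDALL_MIN:
            kept.append(name)
    return kept


# Hypothetical toy data: a CPU-utilization target and three candidate metrics.
target = [10, 12, 15, 14, 18, 21, 20, 25]
metrics = {
    "node_load1": [1.0, 1.1, 1.5, 1.4, 1.9, 2.2, 2.0, 2.6],  # tracks the target
    "disk_free_bytes": [5, 3, 6, 2, 5, 3, 6, 4],              # unrelated noise
    "build_info": [1, 1, 1, 1, 1, 1, 1, 1],                   # constant series
}

kept = correlation_filter(metrics, target)
print(kept)  # ['node_load1']
```

On real telemetry the quadratic Kendall loop would be too slow for long series, so a library routine would be the practical choice. The second-stage causal checks (Granger, PCMCI, Directed Information) are outside the scope of this sketch.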
format Article
id doaj-art-be7e7bc2f55e4b239acdf6eddf62129c
institution DOAJ
issn 2411-1473
language Russian
publishDate 2025-04-01
publisher The Fund for Promotion of Internet media, IT education, human development «League Internet Media»
record_format Article
series Современные информационные технологии и IT-образование
spelling doaj-art-be7e7bc2f55e4b239acdf6eddf62129c | 2025-08-20T02:48:34Z | rus | The Fund for Promotion of Internet media, IT education, human development «League Internet Media» | Современные информационные технологии и IT-образование | ISSN 2411-1473 | 2025-04-01 | vol. 21, no. 1, pp. 36-45 | DOI 10.25559/SITITO.021.202501.36-45 | Development and Analysis of a Methodology for Selecting Infrastructure Metrics for Predictive Incident Monitoring | Andrew Egorkin | https://orcid.org/0009-0002-9329-3641 | Lomonosov Moscow State University; Sberbank of Russia, Moscow, Russia | https://sitito.cs.msu.ru/index.php/SITITO/article/view/1193 | aiops, predictive monitoring, feature selection, correlation, mice, causality, causal analysis, pcmci, directed information, sre, green sre, time series
title Development and Analysis of a Methodology for Selecting Infrastructure Metrics for Predictive Incident Monitoring
topic aiops
predictive monitoring
feature selection
correlation
mice
causality
causal analysis
pcmci
directed information
sre
green sre
time series
url https://sitito.cs.msu.ru/index.php/SITITO/article/view/1193