About the confusion-matrix-based assessment of the results of imbalanced data classification

When classifiers are applied in real applications, data imbalance often occurs: one class contains many more elements than another. The article examines how classification results are evaluated for this type of data. The paper answers three questions: which term is a mor...

Bibliographic Details
Main Authors:	V. V. Starovoitov, Yu. I. Golub
Format:	Article
Language:	Russian
Published:	National Academy of Sciences of Belarus, the United Institute of Informatics Problems 2021-03-01
Series:	Informatika
Subjects:	classification; imbalanced data; confusion matrix; classification accuracy functions; accuracy paradox; neural network; disease diagnosis
Online Access:	https://inf.grid.by/jour/article/view/1121

author	V. V. Starovoitov, Yu. I. Golub
collection	DOAJ
description	When classifiers are applied in real applications, data imbalance often occurs: one class contains many more elements than another. The article examines how classification results are evaluated for this type of data. The paper answers three questions: which term is the more accurate translation of the phrase "confusion matrix", how data are best represented in this matrix, and which functions are best used to evaluate classification results from such a matrix. The paper demonstrates on real data that the popular accuracy function cannot correctly estimate classification errors for imbalanced data. Nor can values of this function be compared when one is calculated from a matrix of absolute classification counts and the other from a matrix normalized by class. If the data are imbalanced, the accuracy calculated from the class-normalized confusion matrix will usually be lower, since it is calculated by a different formula. The same conclusion holds for most of the classification accuracy functions used in the literature to evaluate classification results. It is shown that confusion matrices are better represented with absolute counts of objects per class than with relative values, since absolute counts show how much data was tested for each class and how imbalanced the classes are. When constructing classifiers, it is recommended to evaluate errors with functions that do not depend on the data imbalance, which gives grounds to expect more correct classification results on real data.
format	Article
id	doaj-art-4ec8375139694ee3a7048a98b60d6b50
institution	Kabale University
issn	1816-0301
language	Russian
publishDate	2021-03-01
publisher	National Academy of Sciences of Belarus, the United Institute of Informatics Problems
record_format	Article
series	Informatika
spelling	doaj-art-4ec8375139694ee3a7048a98b60d6b50; 2025-02-03T11:46:28Z; rus; Informatika, ISSN 1816-0301; 2021-03-01; vol. 18, no. 1, pp. 61-71; DOI 10.37661/10.37661/1816-0301-2021-18-1-61-71; 959; V. V. Starovoitov, Yu. I. Golub (both: The United Institute of Informatics Problems of the National Academy of Sciences of Belarus)
title	About the confusion-matrix-based assessment of the results of imbalanced data classification
topic	classification; imbalanced data; confusion matrix; classification accuracy functions; accuracy paradox; neural network; disease diagnosis
url	https://inf.grid.by/jour/article/view/1121
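
The effect described in the abstract can be illustrated with a minimal Python sketch. The counts below are invented for illustration and are not taken from the article; the rows-are-true-classes layout and the helper names (`accuracy`, `normalize_by_class`, `balanced_accuracy`) are assumptions of this sketch. Balanced accuracy (the mean of per-class recalls) is used here as one well-known measure that does not depend on class sizes; the record does not say which functions the authors themselves recommend.

```python
def accuracy(cm):
    """Accuracy: sum of the diagonal divided by the sum of all entries."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def normalize_by_class(cm):
    """Divide every row (one true class per row) by its row sum."""
    return [[value / sum(row) for value in row] for row in cm]


def balanced_accuracy(cm):
    """Mean of per-class recalls; unaffected by how large each class is."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)


# Hypothetical absolute counts: rows = true class, columns = predicted class.
# 900 objects of the majority class, 100 objects of the minority class.
cm = [
    [891, 9],   # majority class: 99% recognized correctly
    [80, 20],   # minority class: only 20% recognized correctly
]

print(f"accuracy, absolute counts:  {accuracy(cm):.3f}")                      # 0.911
print(f"accuracy, normalized rows:  {accuracy(normalize_by_class(cm)):.3f}")  # 0.595
print(f"balanced accuracy:          {balanced_accuracy(cm):.3f}")             # 0.595
```

On the absolute counts the majority class dominates the result and hides the poor recognition of the minority class. After row normalization every class carries equal weight, so the same accuracy formula in effect computes the mean per-class recall, which is why the two values cannot be compared directly.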

Similar Items