Application of weighted low rank approximations: outlier detection in a data matrix

Bibliographic Details
Main Authors: Marisol García-Peña, Sergio Arciniegas-Alarcón, Kaye E. Basford
Format: Article
Language: English
Published: BMC 2025-05-01
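Series: BMC Research Notes
Subjects: Criss-cross regression; Data preprocessing; Genotype-by-environment interaction; Exploratory analysis; Atypical elements
Online Access: https://doi.org/10.1186/s13104-025-07284-2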
collection DOAJ
description Abstract
Objective: A mandatory step in the exploratory analysis of any rectangular database is the identification of possible outliers, because their presence determines which type of explanatory and/or predictive modeling should be used subsequently. This paper presents strategies to identify outliers in any data set using weighted approximations of a matrix. The strategies are evaluated through artificial contamination of sixteen real data sets, of which two have multivariate characteristics and fourteen come from multi-environment trials. As an evaluation criterion, a statistic is proposed whose value is small when the detection method performs well and large when false positives or false negatives appear.
Results: Six criteria for identifying outliers from weighted approximations were considered, including simple residuals, squared residuals with differential weights, the jackknife, and their corresponding iterative versions. They were compared with the gold-standard method based on limits from a bias-adjusted boxplot. All methods are applicable to any numerical data set written in matrix form, e.g. experiments with genotype-by-environment interaction. In the presence of random outliers in a matrix with numerical entries, identifying outliers using weighted approximations was found to be more effective than detection based on limits from a bias-adjusted boxplot.
id doaj-art-821d9ecf8df943a79ecf098ff72fae6d
institution OA Journals
issn 1756-0500
Author affiliations: Marisol García-Peña (Departamento de Matemáticas, Pontificia Universidad Javeriana); Sergio Arciniegas-Alarcón (Facultad de Ingeniería, Universidad de La Sabana); Kaye E. Basford (School of Agriculture and Food Sustainability, The University of Queensland)
topic Criss-cross regression
Data preprocessing
Genotype-by-environment interaction
Exploratory analysis
Atypical elements