Informational rescaling of PCA maps with application to genetic distance

Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform...

Bibliographic Details
Main Authors: Nassim Nicholas Taleb, Pierre Zalloua, Khaled Elbassioni, Haralampos Hatzikirou, Andreas Henschel, Daniel E. Platt
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Entropy; Mutual information; Information theory; Genetic distance; Genetic maps
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037024004136
_version_ 1846117994394025984
author Nassim Nicholas Taleb
Pierre Zalloua
Khaled Elbassioni
Haralampos Hatzikirou
Andreas Henschel
Daniel E. Platt
collection DOAJ
description Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Moreover, we show that in certain instances our proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI results in a representation of relative statistical associations when, as in genetics, it is applied on bit measurements between individuals' genomic mutual information. This entropy-rescaled PCA, while preserving order relationships (along a dimension), quantifies relative distances into information units, such as “bits”. We illustrate the effect of this rescaling using genomics data derived from world populations and describe how the interpretation of results is impacted.
format Article
id doaj-art-a2b3eabe5a6b4ec6bccb2ac57a17da9b
institution Kabale University
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj-art-a2b3eabe5a6b4ec6bccb2ac57a17da9b
2024-12-18T08:48:05Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2025-01-01
27
48
56
Informational rescaling of PCA maps with application to genetic distance
Nassim Nicholas Taleb (0)
Pierre Zalloua (1)
Khaled Elbassioni (2)
Haralampos Hatzikirou (3)
Andreas Henschel (4)
Daniel E. Platt (5)
(0) Risk Engineering, School of Engineering, New York, USA; Maroun Semaan Faculty of Engineering and Architecture, American University of Beirut, Beirut, Lebanon; Corresponding author.
(1) College of Medicine and Health Sciences, Dept of Public Health and Epidemiology, Khalifa University, Abu Dhabi, United Arab Emirates; Harvard T. H. Chan School of Public Health, Boston, MA, USA
(2) College of Computing and Mathematical Sciences, Dept. of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates; Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, United Arab Emirates
(3) College of Computing and Mathematical Sciences, Dept of Mathematics, Abu Dhabi, United Arab Emirates; Center for Interdisciplinary Digital Sciences (CIDS), Department Information Services and High Performance Computing (ZIH), TUD Dresden University of Technology, Dresden, Germany
(4) College of Computing and Mathematical Sciences, Dept. of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates; Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, United Arab Emirates
(5) IBM, New York, NY, USA
http://www.sciencedirect.com/science/article/pii/S2001037024004136
Entropy
Mutual information
Information theory
Genetic distance
Genetic maps
title Informational rescaling of PCA maps with application to genetic distance
topic Entropy
Mutual information
Information theory
Genetic distance
Genetic maps
url http://www.sciencedirect.com/science/article/pii/S2001037024004136