Informational rescaling of PCA maps with application to genetic distance
Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform...
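Main Authors: | Nassim Nicholas Taleb, Pierre Zalloua, Khaled Elbassioni, Haralampos Hatzikirou, Andreas Henschel, Daniel E. Platt |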
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-01-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | Entropy; Mutual information; Information theory; Genetic distance; Genetic maps |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037024004136 |
_version_ | 1846117994394025984 |
---|---|
author | Nassim Nicholas Taleb Pierre Zalloua Khaled Elbassioni Haralampos Hatzikirou Andreas Henschel Daniel E. Platt |
author_sort | Nassim Nicholas Taleb |
collection | DOAJ |
description | Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Moreover, we show that in certain instances our proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI results in a representation of relative statistical associations when, as in genetics, it is applied on bit measurements between individuals' genomic mutual information. This entropy-rescaled PCA, while preserving order relationships (along a dimension), quantifies relative distances into information units, such as “bits”. We illustrate the effect of this rescaling using genomics data derived from world populations and describe how the interpretation of results is impacted. |
format | Article |
id | doaj-art-a2b3eabe5a6b4ec6bccb2ac57a17da9b |
institution | Kabale University |
issn | 2001-0370 |
language | English |
publishDate | 2025-01-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj-art-a2b3eabe5a6b4ec6bccb2ac57a17da9b; 2024-12-18T08:48:05Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2025-01-01; 274856; Informational rescaling of PCA maps with application to genetic distance; Nassim Nicholas Taleb [0], Pierre Zalloua [1], Khaled Elbassioni [2], Haralampos Hatzikirou [3], Andreas Henschel [4], Daniel E. Platt [5]; [0] Risk Engineering, School of Engineering, New York, USA; Maroun Semaan Faculty of Engineering and Architecture, American University of Beirut, Beirut, Lebanon; Corresponding author. [1] College of Medicine and Health Sciences, Dept of Public Health and Epidemiology, Khalifa University, Abu Dhabi, United Arab Emirates; Harvard T. H. Chan School of Public Health, Boston, MA, USA. [2] College of Computing and Mathematical Sciences, Dept. of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates; Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, United Arab Emirates. [3] College of Computing and Mathematical Sciences, Dept of Mathematics, Abu Dhabi, United Arab Emirates; Center for Interdisciplinary Digital Sciences (CIDS), Department Information Services and High Performance Computing (ZIH), TUD Dresden University of Technology, Dresden, Germany. [4] College of Computing and Mathematical Sciences, Dept. of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates; Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, United Arab Emirates. [5] IBM, New York, NY, USA. Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Moreover, we show that in certain instances our proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI results in a representation of relative statistical associations when, as in genetics, it is applied on bit measurements between individuals' genomic mutual information. This entropy-rescaled PCA, while preserving order relationships (along a dimension), quantifies relative distances into information units, such as “bits”. We illustrate the effect of this rescaling using genomics data derived from world populations and describe how the interpretation of results is impacted. http://www.sciencedirect.com/science/article/pii/S2001037024004136; Entropy; Mutual information; Information theory; Genetic distance; Genetic maps |
title | Informational rescaling of PCA maps with application to genetic distance |
topic | Entropy Mutual information Information theory Genetic distance Genetic maps |
url | http://www.sciencedirect.com/science/article/pii/S2001037024004136 |
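The description above refers to expressing distances in information units ("bits") when the variables are asymptotically Gaussian. As a rough illustration of that idea, and not the authors' actual rescaling procedure, the sketch below uses the standard closed form for the mutual information of a bivariate Gaussian with correlation `rho`, namely `-0.5 * log2(1 - rho**2)`. The function name `gaussian_mi_bits` and the sample correlations are assumptions chosen for this example.

```python
import math


def gaussian_mi_bits(rho: float) -> float:
    """Mutual information, in bits, of a bivariate Gaussian with correlation rho.

    Standard closed form: I = -1/2 * log2(1 - rho^2).
    Illustrative only; not taken from the article itself.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log2(1.0 - rho * rho)


if __name__ == "__main__":
    # Hypothetical correlations, chosen only to show the nonlinearity.
    for rho in (0.0, 0.5, math.sqrt(3) / 2, 0.99):
        print(f"rho={rho:.3f}  MI={gaussian_mi_bits(rho):.3f} bits")
```

The mapping is strongly nonlinear: a correlation of 0 carries no information, while a correlation of sqrt(3)/2 corresponds to one bit, because 1 - rho^2 = 1/4 there. This nonlinearity is one reason why equal steps in correlation do not correspond to equal steps in information.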