Super Partition: fast, flexible, and interpretable large-scale data reduction in R

Motivation: As data sets increase in size and complexity with advancing technology, flexible and interpretable data reduction methods that quantify information preservation become increasingly important. Results: Super Partition is a large-scale approximation of the original Partition data reduction a...

Bibliographic Details
Main Authors:	Katelyn J. Queen, Malcolm Barrett, Joshua Millstein
Format:	Article
Language:	English
Published:	PeerJ Inc. 2025-01-01
Series:	PeerJ
Subjects:	Data reduction; Clustering; Big data
Online Access:	https://peerj.com/articles/18580.pdf

_version_	1832582530656632832
author	Katelyn J. Queen Malcolm Barrett Joshua Millstein
author_sort	Katelyn J. Queen
collection	DOAJ
description	Motivation: As data sets increase in size and complexity with advancing technology, flexible and interpretable data reduction methods that quantify information preservation become increasingly important. Results: Super Partition is a large-scale approximation of the original Partition data reduction algorithm that allows the user to flexibly specify the minimum amount of information captured for each input feature. In an initial step, Genie, a fast hierarchical clustering algorithm, forms a super-partition, which improves computational tractability by allowing Partition to be applied to the resulting subsets. Applications to high-dimensional data sets show scalability to hundreds of thousands of features with reasonable computation times. Availability and implementation: Super Partition is a new function within the partition R package, available on the CRAN repository (https://cran.r-project.org/web/packages/partition/index.html).
format	Article
id	doaj-art-a02f84d6c47242f2b80e5152b8dc9244
institution	Kabale University
issn	2167-8359
language	English
publishDate	2025-01-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ
spelling	doaj-art-a02f84d6c47242f2b80e5152b8dc9244; 2025-01-29T15:05:18Z; eng; PeerJ Inc.; PeerJ; 2167-8359; 2025-01-01; 13; e18580; 10.7717/peerj.18580; Super Partition: fast, flexible, and interpretable large-scale data reduction in R; Katelyn J. Queen (Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, United States); Malcolm Barrett (Department of Health Policy, Stanford University, Stanford, California, United States); Joshua Millstein (Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, United States); https://peerj.com/articles/18580.pdf; Data reduction; Clustering; Big data
title	Super Partition: fast, flexible, and interpretable large-scale data reduction in R
topic	Data reduction; Clustering; Big data
url	https://peerj.com/articles/18580.pdf

Similar Items