A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets

Abstract. Background: Rapid advancements in high-throughput sequencing technologies allow for detailed and accurate measurement of omics features within their biological context. The integration of different omics types creates heterogeneous datasets, presenting challenges in analysis due to variation...

Bibliographic Details
Main Authors:	Eonyong Han, Hwijun Kwon, Inuk Jung
Format:	Article
Language:	English
Published:	BMC 2025-08-01
Series:	BMC Genomics
Subjects:	Multi-omics; Integration; Study design; Machine learning
Online Access:	https://doi.org/10.1186/s12864-025-11925-y

_version_	1849226572845809664
author	Eonyong Han, Hwijun Kwon, Inuk Jung
author_facet	Eonyong Han, Hwijun Kwon, Inuk Jung
author_sort	Eonyong Han
collection	DOAJ
description	Abstract. Background: Rapid advancements in high-throughput sequencing technologies allow for detailed and accurate measurement of omics features within their biological context. The integration of different omics types creates heterogeneous datasets, presenting challenges in analysis due to variations in measurement units, sample numbers, and features. Currently, there is a lack of generalized guidelines for making decisions in multi-omics study design (MOSD), such as selecting an appropriate number of samples and features, type of preprocessing and integration for robust analysis results. We propose a suggestive guideline for MOSD, involving nine important factors: sample size, feature selection, preprocessing strategy, noise characterization, class balance, number of classes, cancer subtype combination, omics combination, and clinical features. Results: To assess the effectiveness of our proposed MOSD guidelines, we designed and conducted seven benchmark tests using 10 clustering methods on various TCGA cancer datasets with an objective of clustering cancer subtypes. The results indicated robust performance in terms of cancer subtype discrimination when adhering to the following criteria: 26 or more samples per class, selecting less than 10% of omics features, maintaining a sample balance under a 3:1 ratio, and keeping the noise level below 30%. Feature selection was particularly important, improving clustering performance by 34%. Conclusion: These findings provide evidence-based recommendations for MOSD, enabling researchers to optimize analytical approaches and enhance the reliability of results across cancer datasets. The proposed MOSD framework offers a suggestive guideline addressing both computational and biological factors for multi-omics data integration.
format	Article
id	doaj-art-5dccffe3007b412193224e1ca17bfc3d
institution	Kabale University
issn	1471-2164
language	English
publishDate	2025-08-01
publisher	BMC
record_format	Article
series	BMC Genomics
spelling	doaj-art-5dccffe3007b412193224e1ca17bfc3d | 2025-08-24T11:09:31Z | eng | BMC | BMC Genomics | 1471-2164 | 2025-08-01 | 26 | 1 | 1 | 19 | 10.1186/s12864-025-11925-y | A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets | Eonyong Han, Hwijun Kwon, Inuk Jung (all: School of Computer Science and Engineering, Kyungpook National University) | https://doi.org/10.1186/s12864-025-11925-y | Multi-omics; Integration; Study design; Machine learning
title	A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets
topic	Multi-omics; Integration; Study design; Machine learning
url	https://doi.org/10.1186/s12864-025-11925-y
