A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets

Abstract Background Rapid advancements in high-throughput sequencing technologies allow for detailed and accurate measurement of omics features within their biological context. The integration of different omics types creates heterogeneous datasets, presenting challenges in analysis due to variation...

Bibliographic Details
Main Authors: Eonyong Han, Hwijun Kwon, Inuk Jung
Format: Article
Language: English
Published: BMC 2025-08-01
Series: BMC Genomics
Subjects: Multi-omics, Integration, Study design, Machine learning
Online Access: https://doi.org/10.1186/s12864-025-11925-y
_version_ 1849226572845809664
author Eonyong Han
Hwijun Kwon
Inuk Jung
collection DOAJ
description Abstract
Background: Rapid advancements in high-throughput sequencing technologies allow for detailed and accurate measurement of omics features within their biological context. The integration of different omics types creates heterogeneous datasets, presenting challenges in analysis due to variations in measurement units, sample numbers, and features. Currently, there is a lack of generalized guidelines for making decisions in multi-omics study design (MOSD), such as selecting an appropriate number of samples and features, type of preprocessing and integration for robust analysis results. We propose a suggestive guideline for MOSD, involving nine important factors: sample size, feature selection, preprocessing strategy, noise characterization, class balance, number of classes, cancer subtype combination, omics combination, and clinical features.
Results: To assess the effectiveness of our proposed MOSD guidelines, we designed and conducted seven benchmark tests using 10 clustering methods on various TCGA cancer datasets with an objective of clustering cancer subtypes. The results indicated robust performance in terms of cancer subtype discrimination when adhering to the following criteria: 26 or more samples per class, selecting less than 10% of omics features, maintaining a sample balance under a 3:1 ratio, and keeping the noise level below 30%. Feature selection was particularly important, improving clustering performance by 34%.
Conclusion: These findings provide evidence-based recommendations for MOSD, enabling researchers to optimize analytical approaches and enhance the reliability of results across cancer datasets. The proposed MOSD framework offers a suggestive guideline addressing both computational and biological factors for multi-omics data integration.
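The four numeric criteria reported in the abstract can be read as a pre-analysis checklist. The following is a minimal sketch in Python, not the authors' benchmark code: the function name `check_mosd_design` and all parameter names are hypothetical, and it assumes that "a sample balance under a 3:1 ratio" means the largest class divided by the smallest class is below 3.

```python
from collections import Counter


def check_mosd_design(labels, n_features_total, n_features_selected, noise_fraction):
    """Check a planned study against the four thresholds quoted in the abstract.

    labels: one class (subtype) label per sample.
    n_features_total / n_features_selected: feature counts before and after selection.
    noise_fraction: estimated noise level as a fraction between 0 and 1.
    All names are illustrative assumptions, not taken from the paper.
    """
    counts = Counter(labels)
    smallest = min(counts.values())
    largest = max(counts.values())
    return {
        # 26 or more samples per class
        "samples_per_class": smallest >= 26,
        # less than 10% of omics features selected
        "feature_fraction": n_features_selected / n_features_total < 0.10,
        # class balance under 3:1 (assumed: largest class / smallest class)
        "class_balance": largest / smallest < 3.0,
        # noise level below 30%
        "noise_level": noise_fraction < 0.30,
    }


if __name__ == "__main__":
    # Hypothetical two-subtype design with 40 and 30 samples.
    design = ["A"] * 40 + ["B"] * 30
    report = check_mosd_design(design, 20000, 1500, 0.10)
    for criterion, passed in report.items():
        print(criterion, "ok" if passed else "violated")
```

A design that fails any entry is not necessarily unusable; the abstract presents these thresholds as suggestive guidelines derived from clustering benchmarks on TCGA data.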
format Article
id doaj-art-5dccffe3007b412193224e1ca17bfc3d
institution Kabale University
issn 1471-2164
language English
publishDate 2025-08-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj-art-5dccffe3007b412193224e1ca17bfc3d 2025-08-24T11:09:31Z eng BMC BMC Genomics 1471-2164 2025-08-01 261119 10.1186/s12864-025-11925-y
A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets
Eonyong Han (0), Hwijun Kwon (1), Inuk Jung (2)
(0) School of Computer Science and Engineering, Kyungpook National University
(1) School of Computer Science and Engineering, Kyungpook National University
(2) School of Computer Science and Engineering, Kyungpook National University
https://doi.org/10.1186/s12864-025-11925-y
Multi-omics; Integration; Study design; Machine learning
title A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets
topic Multi-omics
Integration
Study design
Machine learning
url https://doi.org/10.1186/s12864-025-11925-y