TCGADownloadHelper: simplifying TCGA data extraction and preprocessing

The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, complex file naming conventions and the necessity of linking disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilit...

Bibliographic Details
Main Authors:	Alexandra Anke Baumann, Olaf Wolkenhauer, Markus Wolfien
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2025-05-01
Series:	Frontiers in Genetics
Subjects:	the cancer genome atlas (TCGA), sample preprocessing, Jupyter Notebook, lung cancer, genomic data commons (GDC) portal
Online Access:	https://www.frontiersin.org/articles/10.3389/fgene.2025.1569290/full

_version_	1850193274978959360
author	Alexandra Anke Baumann, Olaf Wolkenhauer, Markus Wolfien
author_facet	Alexandra Anke Baumann, Olaf Wolkenhauer, Markus Wolfien
author_sort	Alexandra Anke Baumann
collection	DOAJ
description	The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, complex file naming conventions and the necessity of linking disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilitate TCGA data handling, they lack a straightforward combination of all required steps. To address this, we developed a streamlined pipeline using the Genomic Data Commons (GDC) portal’s cart system for file selection and the GDC Data Transfer Tool for data downloads. We use the Sample Sheet provided by the GDC portal to replace the default 36-character opaque file IDs and filenames with human-readable case IDs. We developed a pipeline integrating customizable Python scripts in a Jupyter Notebook and a Snakemake pipeline for ID mapping along with automating data preprocessing tasks (https://github.com/alex-baumann-ur/TCGADownloadHelper). Our pipeline simplifies the data download process by modifying manifest files to focus on specific subsets, facilitating the handling of multimodal data sets related to single patients. The pipeline essentially reduced the effort required to preprocess data. Overall, this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By establishing a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. It is particularly beneficial for researchers new to genomic data analysis, offering them a practical framework prior to conducting their TCGA studies.
format	Article
id	doaj-art-37ae3b8e5d58450ca1c0d693e485df8f
institution	OA Journals
issn	1664-8021
language	English
publishDate	2025-05-01
publisher	Frontiers Media S.A.
record_format	Article
series	Frontiers in Genetics
spelling	doaj-art-37ae3b8e5d58450ca1c0d693e485df8f; 2025-08-20T02:14:19Z; eng; Frontiers Media S.A.; Frontiers in Genetics; 1664-8021; 2025-05-01; 16; 10.3389/fgene.2025.1569290; 1569290; TCGADownloadHelper: simplifying TCGA data extraction and preprocessing; Alexandra Anke Baumann [0, 1]; Olaf Wolkenhauer [2, 3, 4]; Markus Wolfien [5, 6]; [0] Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; [1] Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany; [2] Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; [3] Leibniz-Institute for Food Systems Biology at the Technical University of Munich, Freising, Germany; [4] Wallenberg Research Centre, Stellenbosch Institute of Advanced Study, Stellenbosch University, Stellenbosch, South Africa; [5] Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany; [6] Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany
title	TCGADownloadHelper: simplifying TCGA data extraction and preprocessing
topic	the cancer genome atlas (TCGA), sample preprocessing, Jupyter Notebook, lung cancer, genomic data commons (GDC) portal
url	https://www.frontiersin.org/articles/10.3389/fgene.2025.1569290/full
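
The description in this record mentions two steps: replacing the 36-character file IDs with human-readable case IDs via the GDC Sample Sheet, and trimming a manifest file to a subset of files. The following Python sketch illustrates the general idea only; it is not taken from the TCGADownloadHelper repository. It assumes that the Sample Sheet is a tab-separated file with "File ID", "File Name" and "Case ID" columns and that the manifest is a tab-separated file whose first column is "id". The function names, the naming scheme for renamed files, and the example rows are hypothetical.

```python
import csv
import io


def build_rename_map(sample_sheet_tsv: str) -> dict[str, str]:
    """Map each file ID (UUID) to a filename prefixed with its case ID.

    Assumes the column names "File ID", "File Name" and "Case ID".
    """
    reader = csv.DictReader(io.StringIO(sample_sheet_tsv), delimiter="\t")
    rename = {}
    for row in reader:
        # A file may list several cases ("A, B"); this sketch keeps the first.
        case_id = row["Case ID"].split(",")[0].strip()
        rename[row["File ID"]] = f"{case_id}_{row['File Name']}"
    return rename


def filter_manifest(manifest_tsv: str, keep_ids: set[str]) -> str:
    """Return the manifest restricted to rows whose first column is in keep_ids."""
    lines = manifest_tsv.splitlines()
    header, rows = lines[0], lines[1:]
    kept = [r for r in rows if r.split("\t")[0] in keep_ids]
    return "\n".join([header, *kept]) + "\n"


# Made-up example data (IDs and filenames are placeholders, not real TCGA files).
ID_A = "11111111-1111-1111-1111-111111111111"
ID_B = "22222222-2222-2222-2222-222222222222"

SAMPLE_SHEET = (
    "File ID\tFile Name\tData Category\tCase ID\n"
    f"{ID_A}\ta.vcf.gz\tSimple Nucleotide Variation\tTCGA-AA-0001\n"
    f"{ID_B}\tb.tsv\tTranscriptome Profiling\tTCGA-AA-0002, TCGA-AA-0002\n"
)

MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    f"{ID_A}\ta.vcf.gz\tabc\t10\treleased\n"
    f"{ID_B}\tb.tsv\tdef\t20\treleased\n"
)

rename_map = build_rename_map(SAMPLE_SHEET)
subset_manifest = filter_manifest(MANIFEST, {ID_A})
```

In this sketch, `rename_map` could drive a renaming step after download, and `subset_manifest` could be written to disk and passed to the download tool in place of the full manifest.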

Similar Items