TCGADownloadHelper: simplifying TCGA data extraction and preprocessing

Main Authors: Alexandra Anke Baumann, Olaf Wolkenhauer, Markus Wolfien
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-05-01
Series: Frontiers in Genetics
Subjects:
Online Access: https://www.frontiersin.org/articles/10.3389/fgene.2025.1569290/full
collection DOAJ
description The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, complex file naming conventions and the necessity of linking disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilitate TCGA data handling, they lack a straightforward combination of all required steps. To address this, we developed a streamlined pipeline using the Genomic Data Commons (GDC) portal’s cart system for file selection and the GDC Data Transfer Tool for data downloads. We use the Sample Sheet provided by the GDC portal to replace the default 36-character opaque file IDs and filenames with human-readable case IDs. We developed a pipeline integrating customizable Python scripts in a Jupyter Notebook and a Snakemake pipeline for ID mapping along with automating data preprocessing tasks (https://github.com/alex-baumann-ur/TCGADownloadHelper). Our pipeline simplifies the data download process by modifying manifest files to focus on specific subsets, facilitating the handling of multimodal data sets related to single patients. The pipeline essentially reduced the effort required to preprocess data. Overall, this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By establishing a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. It is particularly beneficial for researchers new to genomic data analysis, offering them a practical framework prior to conducting their TCGA studies.
id doaj-art-37ae3b8e5d58450ca1c0d693e485df8f
institution OA Journals
issn 1664-8021
spelling doaj-art-37ae3b8e5d58450ca1c0d693e485df8f (record timestamp 2025-08-20T02:14:19Z)
citation Frontiers in Genetics, vol. 16 (2025-05-01), article 1569290, doi:10.3389/fgene.2025.1569290
affiliations Alexandra Anke Baumann: Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany
Olaf Wolkenhauer: Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany; Leibniz-Institute for Food Systems Biology at the Technical University of Munich, Freising, Germany; Wallenberg Research Centre, Stellenbosch Institute of Advanced Study, Stellenbosch University, Stellenbosch, South Africa
Markus Wolfien: Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany
topic the cancer genome atlas (TCGA)
sample preprocessing
Jupyter Notebook
lung cancer
genomic data commons (GDC) portal
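
The abstract describes using the GDC Sample Sheet to replace the 36-character file IDs and filenames with human-readable case IDs. The sketch below illustrates that mapping step only; it is not the authors' code from the TCGADownloadHelper repository. It assumes a tab-separated sample sheet with the columns "File ID", "File Name", and "Case ID", and a download layout of one folder per file ID. The function names and the renaming scheme (case ID plus the original file-type suffix) are hypothetical choices made for illustration.

```python
import csv
from pathlib import PurePosixPath


def load_sample_sheet(handle):
    """Map each GDC file ID to (file name, case ID).

    Assumes a tab-separated sheet with "File ID", "File Name",
    and "Case ID" columns (an assumption about the export format).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["File ID"]: (row["File Name"], row["Case ID"]) for row in reader}


def readable_name(file_name, case_id):
    """Build a case-ID-based name, keeping everything after the first dot."""
    _, _, suffix = file_name.partition(".")
    return f"{case_id}.{suffix}" if suffix else case_id


def plan_renames(mapping, download_dir):
    """Return {current path: readable path} without touching the file system.

    Assumes each file sits at <download_dir>/<file ID>/<file name>.
    """
    root = PurePosixPath(download_dir)
    return {
        str(root / file_id / file_name): str(root / readable_name(file_name, case_id))
        for file_id, (file_name, case_id) in mapping.items()
    }
```

Returning a rename plan instead of moving files makes it possible to inspect the mapping first. Two files from the same case with the same suffix would collide under this naming scheme, so a real workflow would need an extra disambiguator such as the sample ID.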