TCGADownloadHelper: simplifying TCGA data extraction and preprocessing
The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, complex file naming conventions and the necessity of linking disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilit...
| Main Authors: | Alexandra Anke Baumann, Olaf Wolkenhauer, Markus Wolfien |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A. 2025-05-01 |
| Series: | Frontiers in Genetics |
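| Subjects: | the cancer genome atlas (TCGA); sample preprocessing; Jupyter Notebook; lung cancer; genomic data commons (GDC) portal |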
| Online Access: | https://www.frontiersin.org/articles/10.3389/fgene.2025.1569290/full |
| _version_ | 1850193274978959360 |
|---|---|
| author | Alexandra Anke Baumann; Olaf Wolkenhauer; Markus Wolfien |
| author_facet | Alexandra Anke Baumann; Olaf Wolkenhauer; Markus Wolfien |
| author_sort | Alexandra Anke Baumann |
| collection | DOAJ |
| description | The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, complex file naming conventions and the necessity of linking disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilitate TCGA data handling, they lack a straightforward combination of all required steps. To address this, we developed a streamlined pipeline using the Genomic Data Commons (GDC) portal’s cart system for file selection and the GDC Data Transfer Tool for data downloads. We use the Sample Sheet provided by the GDC portal to replace the default 36-character opaque file IDs and filenames with human-readable case IDs. We developed a pipeline integrating customizable Python scripts in a Jupyter Notebook and a Snakemake pipeline for ID mapping along with automating data preprocessing tasks (https://github.com/alex-baumann-ur/TCGADownloadHelper). Our pipeline simplifies the data download process by modifying manifest files to focus on specific subsets, facilitating the handling of multimodal data sets related to single patients. The pipeline essentially reduced the effort required to preprocess data. Overall, this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By establishing a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. It is particularly beneficial for researchers new to genomic data analysis, offering them a practical framework prior to conducting their TCGA studies. |
| format | Article |
| id | doaj-art-37ae3b8e5d58450ca1c0d693e485df8f |
| institution | OA Journals |
| issn | 1664-8021 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Frontiers Media S.A. |
| record_format | Article |
| series | Frontiers in Genetics |
| spelling | doaj-art-37ae3b8e5d58450ca1c0d693e485df8f; 2025-08-20T02:14:19Z; eng; Frontiers Media S.A.; Frontiers in Genetics; 1664-8021; 2025-05-01; 16; 10.3389/fgene.2025.1569290; 1569290; TCGADownloadHelper: simplifying TCGA data extraction and preprocessing; Alexandra Anke Baumann ((1) Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany, (2) Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany); Olaf Wolkenhauer ((1) Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany, (2) Leibniz-Institute for Food Systems Biology at the Technical University of Munich, Freising, Germany, (3) Wallenberg Research Centre, Stellenbosch Institute of Advanced Study, Stellenbosch University, Stellenbosch, South Africa); Markus Wolfien ((1) Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany, (2) Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany); https://www.frontiersin.org/articles/10.3389/fgene.2025.1569290/full; the cancer genome atlas (TCGA); sample preprocessing; Jupyter Notebook; lung cancer; genomic data commons (GDC) portal |
| title | TCGADownloadHelper: simplifying TCGA data extraction and preprocessing |
| topic | the cancer genome atlas (TCGA); sample preprocessing; Jupyter Notebook; lung cancer; genomic data commons (GDC) portal |
| url | https://www.frontiersin.org/articles/10.3389/fgene.2025.1569290/full |
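
The description mentions two steps that a short example may make more concrete: using the GDC Sample Sheet to swap 36-character file IDs for human-readable case IDs, and trimming a manifest file to a subset of files. The following is a minimal Python sketch of that idea, not the code from the TCGADownloadHelper repository. It assumes a tab-separated Sample Sheet with `File ID`, `File Name`, and `Case ID` columns and a tab-separated manifest with an `id` column; the function names, the `caseID_filename` naming scheme, and all IDs in the sample data are hypothetical.

```python
import csv
import io

# Hypothetical Sample Sheet excerpt (tab-separated). The column names are an
# assumption about the GDC portal export; the IDs are made up for illustration.
SAMPLE_SHEET = (
    "File ID\tFile Name\tCase ID\n"
    "11111111-1111-1111-1111-111111111111\tvariants.vcf.gz\tTCGA-XX-0001\n"
    "22222222-2222-2222-2222-222222222222\tcounts.tsv\tTCGA-XX-0002\n"
)

# Hypothetical manifest excerpt (tab-separated) with an assumed "id" column.
MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "11111111-1111-1111-1111-111111111111\tvariants.vcf.gz\tabc\t10\treleased\n"
    "22222222-2222-2222-2222-222222222222\tcounts.tsv\tdef\t20\treleased\n"
)


def load_id_map(sample_sheet_text):
    """Return a dict mapping each file ID to its case ID."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    return {row["File ID"]: row["Case ID"] for row in reader}


def readable_name(file_id, filename, id_map):
    """Build a human-readable name by prefixing the filename with the case ID."""
    return f"{id_map[file_id]}_{filename}"


def filter_manifest(manifest_text, keep_ids):
    """Return manifest text containing only the rows whose id is in keep_ids."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row["id"] in keep_ids:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    id_map = load_id_map(SAMPLE_SHEET)
    file_id = "11111111-1111-1111-1111-111111111111"
    print(readable_name(file_id, "variants.vcf.gz", id_map))
    print(filter_manifest(MANIFEST, {file_id}), end="")
```

In this sketch a reduced manifest would be written to disk and passed to the GDC Data Transfer Tool, and the ID map would then be applied to the downloaded files; how the published pipeline organizes these steps is described in the article and repository.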