High-fidelity in silico generation and augmentation of TCR repertoire data using generative adversarial networks

Bibliographic Details
Main Authors: Piotr Religa, Michel-Edwar Mickael, Norwin Kubick, Jarosław Olav Horbańczuk, Nikko Floretes, Mariusz Sacharczuk, Atanas G. Atanasov
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-01172-2
author Piotr Religa
Michel-Edwar Mickael
Norwin Kubick
Jarosław Olav Horbańczuk
Nikko Floretes
Mariusz Sacharczuk
Atanas G. Atanasov
author_sort Piotr Religa
collection DOAJ
description Abstract Engineered T-cell receptor (eTCR) systems rely on accurately generated T-cell receptor (TCR) sequences to enhance immunotherapy predictability and efficacy. The most variable and crucial part of the TCR receptor is the CDR3 sequence region. Current methods for generating CDR3 sequences, including motif-based and Markov models, struggle to generate reliable, diverse, and novel TCR sequences. In this study, we present the first application of Generative Adversarial Networks (GANs) for producing biologically reliable CDR3 sequences, using Long Short-Term Memory (LSTM) networks and LeakyReLU-based GANs. Our results show that LSTM models generate more diverse sequences with higher accuracy, lower discriminator loss, and higher AUC compared to LeakyReLU. However, LeakyReLU provides greater stability with a lower generator loss, achieving a total Pearson correlation score of over 0.9. Both models demonstrate the ability to produce highly realistic TCR sequences, as validated by t-SNE clustering, frequency distribution analysis, TCRd3 BLAST analysis, and in silico docking. These findings highlight the potential of GANs as a powerful tool for generating synthetic yet biologically relevant TCR sequences, a crucial step toward improving eTCR-based therapies. Further refinement of amino acid frequency distributions and clinical validation will enhance their applicability for therapeutic purposes.
format Article
id doaj-art-2810ceccc32c4b43be98db87302f075f
institution DOAJ
issn 2045-2322
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-2810ceccc32c4b43be98db87302f075f
2025-08-20T03:22:12Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-05-01
vol. 15, no. 1, pp. 1-13
10.1038/s41598-025-01172-2
High-fidelity in silico generation and augmentation of TCR repertoire data using generative adversarial networks
Piotr Religa (Department of Medicine, Karolinska Institute)
Michel-Edwar Mickael (Institute of Genetics and Animal Biotechnology, Polish Academy of Sciences)
Norwin Kubick (Department of Biology, Institute of Plant Science and Microbiology, University of Hamburg)
Jarosław Olav Horbańczuk (Department of Medicine, Karolinska Institute)
Nikko Floretes (College of Engineering, Samar State University)
Mariusz Sacharczuk (Department of Pharmacodynamics, Faculty of Pharmacy, Medical University of Warsaw)
Atanas G. Atanasov (Institute of Genetics and Animal Biotechnology, Polish Academy of Sciences)
https://doi.org/10.1038/s41598-025-01172-2
title High-fidelity in silico generation and augmentation of TCR repertoire data using generative adversarial networks
url https://doi.org/10.1038/s41598-025-01172-2