Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis.
Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC b...
| Main Authors: | Leonardo V Castorina, Filippo Grazioli, Pierre Machart, Anja Mösch, Federico Errica |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0324011 |
| _version_ | 1850269904162258944 |
|---|---|
| author | Leonardo V Castorina; Filippo Grazioli; Pierre Machart; Anja Mösch; Federico Errica |
| author_sort | Leonardo V Castorina |
| collection | DOAJ |
| description | Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC binding within training data, these models often fail to generalize to peptides outside their training distributions, raising concerns about their applicability in therapeutic settings. Understanding and improving the generalization of these models is therefore critical to ensure real-world applications. To address this issue, we evaluate the effect of the distance between training and testing peptide distributions on ML model empirical risk assessments, using sequence-based and 3D structure-based distance metrics. In our analysis we use several state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB), NetTCR-2.0 and -2.2, and ERGO II (pre-trained autoencoder) and ERGO II (LSTM). In this work, we introduce a novel approach for assessing the generalization capabilities of TCR binding predictors: the Distance Split (DS) algorithm. The DS algorithm controls the distance between training and testing peptides based on both sequence and structure, allowing for a more nuanced evaluation of model performance. We show that lower 3D shape similarity between training and test peptides is associated with a harder out-of-distribution task definition, which is more interesting when measuring the ability to generalize to unseen peptides. However, we observe the opposite effect when splitting using sequence-based similarity. These findings highlight the importance of using a distance-based splitting approach to benchmark models. This could then be used to estimate a confidence score on predictions on novel and unseen peptides, based on how different they are from the training ones. Additionally, our results may hint that employing 3D shape to complement sequence information could improve the accuracy of TCR-pMHC binding predictors. |
| format | Article |
| id | doaj-art-cc343ea8e24943e6946e62b22de3ba48 |
| institution | OA Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spelling | doaj-art-cc343ea8e24943e6946e62b22de3ba48; 2025-08-20T01:52:54Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2025-01-01; 20; 5; e0324011; 10.1371/journal.pone.0324011; Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis.; Leonardo V Castorina; Filippo Grazioli; Pierre Machart; Anja Mösch; Federico Errica; https://doi.org/10.1371/journal.pone.0324011 |
| title | Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis. |
| url | https://doi.org/10.1371/journal.pone.0324011 |