Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis.

Bibliographic Details
Main Authors: Leonardo V Castorina, Filippo Grazioli, Pierre Machart, Anja Mösch, Federico Errica
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0324011
collection DOAJ
description Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC binding within training data, these models often fail to generalize to peptides outside their training distributions, raising concerns about their applicability in therapeutic settings. Understanding and improving the generalization of these models is therefore critical for real-world applications. To address this issue, we evaluate the effect of the distance between training and testing peptide distributions on ML model empirical risk assessments, using sequence-based and 3D structure-based distance metrics. In our analysis, we use several state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB), NetTCR-2.0 and NetTCR-2.2, and two ERGO II variants (pre-trained autoencoder and LSTM). We introduce a novel approach for assessing the generalization capabilities of TCR binding predictors: the Distance Split (DS) algorithm. The DS algorithm controls the distance between training and testing peptides based on both sequence and structure, allowing for a more nuanced evaluation of model performance. We show that lower 3D shape similarity between training and test peptides is associated with a harder out-of-distribution task definition, which is more interesting when measuring the ability to generalize to unseen peptides. However, we observe the opposite effect when splitting using sequence-based similarity. These findings highlight the importance of using a distance-based splitting approach to benchmark models. Such a split could then be used to estimate a confidence score for predictions on novel, unseen peptides, based on how different they are from the training peptides. Additionally, our results may hint that employing 3D shape to complement sequence information could improve the accuracy of TCR-pMHC binding predictors.
id doaj-art-cc343ea8e24943e6946e62b22de3ba48
institution OA Journals
issn 1932-6203
citation PLoS ONE 20(5): e0324011