A predictive language model for SARS-CoV-2 evolution

Abstract: Modeling and predicting viral mutations is critical for preparedness against COVID-19 and similar pandemics. However, existing predictive models have yet to integrate the regularity and randomness of viral mutations while keeping data requirements minimal. Here, we develop an undemanding language model that exploits both regularity and randomness to predict candidate SARS-CoV-2 variants and mutations that might prevail. We constructed "grammatical frameworks" of the available S1 sequences for dimension reduction and semantic representation, allowing the model to capture the latent regularity of viral evolution. A mutational profile, defined as the frequency of each mutation, was introduced into the model to incorporate randomness. With this model, we identified several variants with significantly enhanced viral infectivity and immune evasion, and validated them by wet-lab experiments. By inputting sequence data from three different time points, we detected circulating strains or vital mutations of the XBB.1.16, EG.5, JN.1, and BA.2.86 lineages before their emergence. In addition, our results predicted previously unknown variants that may cause future epidemics. Supported by both data validation and experimental evidence, our study presents a fast-responding, concise language model, potentially generalizable to other viral pathogens, for forecasting viral evolution and detecting crucial mutation hotspots, thereby providing early warning of emerging variants that might raise public health concern.

Bibliographic Details
Main Authors: Enhao Ma, Xuan Guo, Mingda Hu, Penghua Wang, Xin Wang, Congwen Wei, Gong Cheng
Affiliations: Enhao Ma, Xuan Guo, Gong Cheng (School of Basic Medical Science, Tsinghua University); Mingda Hu, Xin Wang, Congwen Wei (Beijing Institute of Biotechnology); Penghua Wang (Department of Immunology, School of Medicine, University of Connecticut Health Center)
Format: Article
Language: English
Published: Nature Publishing Group, 2024-12-01
Series: Signal Transduction and Targeted Therapy
ISSN: 2059-3635
Online Access: https://doi.org/10.1038/s41392-024-02066-x
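
Method sketch: The abstract describes the approach only at a high level, combining a "grammatical framework" of S1 sequences (regularity) with a mutational profile of observed mutation frequencies (randomness). The following is a speculative, minimal sketch of that idea and nothing more: the k-mer "grammar", the multiplicative scoring, and every function name are illustrative assumptions, not the authors' actual implementation, which this record does not include.

```python
# Speculative sketch only: everything here (k-mer "grammar", multiplicative
# scoring, all names) is an illustrative assumption, not the paper's method.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_profile(seq: str, k: int = 3) -> Counter:
    """A crude stand-in for a 'grammatical framework': k-mer composition."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def regularity_score(candidate: str, background: Counter, k: int = 3) -> float:
    """How well a candidate sequence fits the prevailing k-mer 'grammar'."""
    total = sum(background.values()) or 1
    return sum(n * background[kmer] / total
               for kmer, n in kmer_profile(candidate, k).items())

def score_single_site_mutants(reference: str, sequences: list[str], k: int = 3):
    """Rank substitutions by regularity (grammar fit) times randomness
    (observed per-site frequency, standing in for the 'mutational profile')."""
    background = Counter()
    for s in sequences:
        background.update(kmer_profile(s, k))
    length = min(len(s) for s in sequences)
    scored = []
    for pos in range(length):
        site = Counter(s[pos] for s in sequences)  # residues observed at this site
        for aa in AMINO_ACIDS:
            if aa == reference[pos]:
                continue
            frequency = site[aa] / sum(site.values())           # randomness term
            mutant = reference[:pos] + aa + reference[pos + 1:]
            score = regularity_score(mutant, background, k) * frequency
            scored.append((f"{reference[pos]}{pos + 1}{aa}", score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy demonstration with made-up fragments (real input would be S1 sequences):
seqs = ["MFVFLVLLPL", "MFVFLVLLPL", "MFVFLVILPL", "MFVFLVLLPS"]
print(score_single_site_mutants(seqs[0], seqs)[:3])
```

Under these assumptions, the multiplicative combination means a candidate mutation ranks highly only if it is both grammatically plausible against the background of circulating sequences and actually observed at non-trivial frequency; the paper may weight or combine the two signals differently.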