Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning.

Bibliographic Details
Main Authors: Liam Brierley, Anna Fowler
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2021-04-01
Series: PLoS Pathogens
Online Access: https://journals.plos.org/plospathogens/article/file?id=10.1371/journal.ppat.1009149&type=printable
Collection: DOAJ
Description: The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses belonging to the family Coronaviridae, respectively. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating evolutionary signal in spike proteins to be just as informative as whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
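To illustrate the kind of composition feature the description refers to, here is a minimal sketch of a dinucleotide bias calculation. It assumes the common observed/expected definition (the frequency of a dinucleotide divided by the product of its two base frequencies); the record does not state the exact formula the authors used, and the function name `dinucleotide_bias` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_bias(seq: str) -> dict[str, float]:
    """Return an observed/expected ratio for each of the 16 dinucleotides.

    Assumed definition (not taken from the record): f(xy) / (f(x) * f(y)).
    """
    seq = seq.upper()
    # Count single bases, ignoring ambiguous symbols such as N.
    base_counts = Counter(b for b in seq if b in BASES)
    # Count overlapping pairs only where both positions are unambiguous.
    pair_counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_bases = sum(base_counts.values())
    n_pairs = sum(pair_counts.values())

    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = (
            (base_counts[x] / n_bases) * (base_counts[y] / n_bases)
            if n_bases else 0.0
        )
        observed = pair_counts[x + y] / n_pairs if n_pairs else 0.0
        # The ratio is undefined when a base is absent from the sequence.
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias
```

A vector of these 16 ratios per sequence, possibly alongside codon usage measures, is the sort of input a random forest classifier could be trained on to predict a host category.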
ISSN: 1553-7366, 1553-7374
Citation: PLoS Pathogens 17(4): e1009149 (2021-04-01)
DOI: 10.1371/journal.ppat.1009149