DPhil the Future: Predicting SARS-CoV-2 mutational landscape
Posted: 15th December 2022
DPhil student Hunar Batra, under the supervision of Professor Peter Minary, looks at the capability of MuFormer to capture phylogenetic and evolutionary properties for generating mutations
Protein-protein interactions between the SARS-CoV-2 spike protein, human receptors, and antibodies are important factors in the virus's virulence and its ability to escape the human immune system. Since the SARS-CoV-2 pandemic began spreading globally in December 2019, new variants have emerged regularly, each with distinct transmission and infection rates, fitness levels, risks, and capacity to evade antibody neutralisation.
The ability of RNA-based coronaviruses to mutate, and the possibility that mutations with a higher fitness rate will emerge, call for leveraging SARS-CoV-2 proteomic data to anticipate viral features and future alterations, which could considerably improve disease control, prevention and drug development. Early discovery of high-risk mutations is critical for making data-informed therapeutic design decisions and managing the pandemic effectively.
In a recent research project, the Computational Biology group explored the question of deciphering evolutionary mutations in the SARS-CoV-2 spike protein by introducing MuFormer, a machine learning-based model. As evolution amongst protein structures is mostly neutral and the majority of mutations usually occur within protein sequences, MuFormer leverages both the protein's sequential and geometric space, learning from the encoded evolutionary representations to design mutational sequences in an iterative, gradient-based, fixed-backbone design process.
The proposed model consists of an inverted implementation of AlphaFold2*1, used as a structure prediction oracle for inverse folding, injected with frozen sequence embeddings from a pre-trained protein language model as an inductive bias. At each design step, MuFormer maximises the likelihood of the amino acids appearing at each position in the sequence, which results in amino acids being mutated at positions with low sequential or structural likelihood. Without any information about the target sequence, MuFormer exploits the phylogenetic information from the sequential space to mutate the protein sequence at each design iteration so that it fits the configuration of the backbone atoms with high confidence.
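The design step described above can be pictured with a deliberately simplified Python sketch. It is not MuFormer and uses no AlphaFold2 or language-model component: the function toy_likelihoods is a hypothetical stand-in that returns a fixed table of per-position amino-acid probabilities, and the names design_step, design and the threshold value are assumptions made purely for illustration. The sketch only shows the general idea of repeatedly replacing residues whose likelihood is low with the most likely residue at that position.

```python
# Toy illustration only. A hypothetical likelihood function stands in for the
# model's per-position amino-acid likelihoods; this is NOT MuFormer itself.

def design_step(sequence, likelihood_fn, threshold=0.2):
    """Mutate every position whose current residue has likelihood < threshold."""
    probs = likelihood_fn(sequence)  # one {amino_acid: probability} dict per position
    mutated = []
    for residue, dist in zip(sequence, probs):
        if dist.get(residue, 0.0) < threshold:
            # Replace the unlikely residue with the most likely one here.
            residue = max(dist, key=dist.get)
        mutated.append(residue)
    return "".join(mutated)


def design(sequence, likelihood_fn, max_steps=10, threshold=0.2):
    """Repeat design steps until the sequence stops changing or steps run out."""
    for _ in range(max_steps):
        new_sequence = design_step(sequence, likelihood_fn, threshold)
        if new_sequence == sequence:
            break
        sequence = new_sequence
    return sequence


def toy_likelihoods(sequence):
    """Hypothetical fixed table; a real model would recompute this per sequence."""
    return [
        {"A": 0.7, "G": 0.3},
        {"D": 0.1, "G": 0.9},
        {"K": 0.5, "R": 0.5},
    ]


print(design("ADK", toy_likelihoods))  # prints AGK
```

In this toy example only the second position is changed, because D has a likelihood of 0.1 there, below the assumed threshold of 0.2. In a real model the likelihoods would depend on the current sequence and on the target backbone, so they would be recomputed at every iteration.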
The generated mutational sequences were validated against historical SARS-CoV-2 data from GISAID*2, which demonstrated the ability of MuFormer to capture phylogenetic and evolutionary properties for generating mutations. The model was able to mutate the Alpha variant’s sequence into the Delta and Omicron variants with no additional training, showcasing its ability to learn the evolutionary landscape. MuFormer outperformed vanilla AlphaFold2 by DeepMind on the in-vitro mutagenesis sequence generation task.
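The article does not describe how the comparison with historical data was carried out. As a minimal sketch of one plausible approach, the Python below lists the substitutions between a reference sequence and a generated sequence in the usual notation (original residue, position, new residue) and reports what fraction of a set of known substitutions was recovered. The function names, the short toy sequences and the "known" substitutions are all invented for illustration, and the sequences are assumed to be already aligned and of equal length.

```python
# Illustrative sketch only: sequences and "known" substitutions are made up,
# and the inputs are assumed to be pre-aligned and of equal length.

def substitutions(reference, variant):
    """List substitutions such as 'L8K' (1-based position) between two sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref}{pos}{var}"
        for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


def recovered_fraction(generated, known):
    """Fraction of the known substitutions that also appear in the generated set."""
    known = set(known)
    if not known:
        return 0.0
    return len(known & set(generated)) / len(known)


reference = "MFVFLVLLPL"   # toy reference fragment
generated = "MFVFLVLKPL"   # toy model output
known = ["L8K", "F2Y"]     # hypothetical substitutions seen in historical data

found = substitutions(reference, generated)
print(found)                              # prints ['L8K']
print(recovered_fraction(found, known))   # prints 0.5
```

Real spike protein sequences would first need a proper alignment step, since insertions and deletions shift positions and cannot be captured by a position-by-position comparison.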
While most work until now has focussed on evaluating emerging SARS-CoV-2 variants for their fitness levels, MuFormer marks the first model capable of predicting protein sequence mutations directly as well as flagging generated mutations that have a high fitness rate. The mutational sequence generation capability of MuFormer highlights the ability of transformer-based models to explore the representational language of biology, which could help control the spread of diseases by predicting mutations with higher infectivity and fitness in advance.
*1 AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence.
*2 The GISAID Initiative promotes the sharing of data from all influenza viruses and the coronavirus causing COVID-19.