Using evolutionary sequence variation to make inferences about protein structure and function
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. The explosive growth in the number of available protein sequences raises the possibility of using the natural variation present in homologous protein sequences to infer these constraints and thus identify residues that control different protein phenotypes. Because in many cases phenotypic changes are controlled by more than one amino acid, the mutations that separate one phenotype from another may not be independent, requiring us to understand the correlation structure of the data.
The challenge is to distinguish true interactions from the noisy and under-sampled set of observed correlations in a large multiple sequence alignment. To address this, we build a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair interactions. We translate these interactions into pairwise distance constraints between amino acids and use them to generate all-atom structural models. Using proteins of known structure, we show that correlations between amino acids at different sites in a protein contain sufficient information to predict the low-resolution tertiary structure of both globular and transmembrane proteins. We then apply our method to predict de novo the structures of 11 medically important transmembrane proteins of unknown structure. In addition, we are able to predict protein quaternary structure and alternative conformations. The next step is to develop a theoretical inference framework that clarifies how the amount of available input data relates to the reliability of structural predictions.
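To make the inference step concrete, the following is a minimal sketch of one common way to approximate a pairwise maximum entropy model: the mean-field approximation, in which the couplings are estimated as the negative inverse of the regularized covariance matrix of the one-hot encoded alignment. This is an illustrative assumption and not necessarily the inference scheme used in this work. The function names (`one_hot`, `mean_field_scores`, `top_pair`), the integer encoding of the alignment, and the pseudocount value of 0.5 are all hypothetical choices. Sequence reweighting and the average product correction, which are often applied in practice, are omitted for brevity.

```python
import numpy as np


def one_hot(msa, q):
    """Encode an (N, L) integer alignment with states 0..q-1 as an
    (N, L*(q-1)) binary matrix. The last state at each site is dropped
    so that the covariance matrix is invertible."""
    n, length = msa.shape
    x = np.zeros((n, length, q))
    x[np.arange(n)[:, None], np.arange(length)[None, :], msa] = 1.0
    return x[:, :, : q - 1].reshape(n, length * (q - 1))


def mean_field_scores(msa, q, pseudocount=0.5):
    """Return an (L, L) matrix of coupling strengths between sites.

    Assumption: mean-field approximation, J = -C^{-1}, with a uniform
    pseudocount mixed into the observed frequencies."""
    n, length = msa.shape
    d = q - 1
    x = one_hot(msa, q)

    # Single-site and pairwise frequencies, mixed with a uniform prior.
    f_i = (1 - pseudocount) * x.mean(axis=0) + pseudocount / q
    f_ij = (1 - pseudocount) * (x.T @ x) / n + pseudocount / q**2

    # Within one site, two different states never co-occur.
    for i in range(length):
        s = slice(i * d, (i + 1) * d)
        f_ij[s, s] = np.diag(f_i[s])

    # Connected correlations and mean-field couplings.
    c = f_ij - np.outer(f_i, f_i)
    j = -np.linalg.inv(c)

    # Summarize each (q-1) x (q-1) coupling block by its Frobenius norm.
    blocks = j.reshape(length, d, length, d)
    scores = np.sqrt((blocks**2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores


def top_pair(scores):
    """Return the (i, j) site pair, i < j, with the largest score."""
    i, j = np.triu_indices(scores.shape[0], k=1)
    k = np.argmax(scores[i, j])
    return int(i[k]), int(j[k])
```

In a sketch like this, the highest-scoring site pairs would be the candidates for translation into pairwise distance constraints. The inversion acts on the full covariance matrix, so it discounts correlations that arise only indirectly through chains of interacting sites, which a simple pairwise correlation measure cannot do.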