
Protein language modeling - understanding the language of life (joint project with Harvard Medical School)

Supervisors

Suitable for

MSc in Advanced Computer Science
Mathematics and Computer Science, Part C
Computer Science and Philosophy, Part C
Computer Science, Part C

Abstract

Proteins leverage the genetic information encoded in DNA to drive the functioning of all organisms around us. Each protein is a sequence of amino acids, and the structures these sequences form are extremely expressive and diverse. Deep learning techniques inspired by natural language processing have recently been very successful at implicitly teasing out the constraints underpinning these structures by posing the problem as a language modeling task.
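
To make the "language modeling" framing concrete, here is a minimal sketch in Python. It is not one of the deep models mentioned in this abstract: it swaps in a simple bigram model with add-alpha smoothing, purely to show what it means to assign a probability to an amino acid sequence. The names `train_bigram`, `log_likelihood` and the `START` marker are hypothetical, and the training sequences are made up rather than real proteins.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
START = "^"  # hypothetical start-of-sequence marker (an assumption of this sketch)


def train_bigram(sequences):
    """Count how often each residue follows the previous one."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        prev = START
        for residue in seq:
            counts[prev][residue] += 1
            prev = residue
    return counts


def log_likelihood(counts, seq, alpha=1.0):
    """Log-probability of seq under the bigram model with add-alpha smoothing."""
    total = 0.0
    prev = START
    for residue in seq:
        row = counts.get(prev, {})  # empty row if this context was never seen
        numerator = row.get(residue, 0) + alpha
        denominator = sum(row.values()) + alpha * len(AMINO_ACIDS)
        total += math.log(numerator / denominator)
        prev = residue
    return total


# Toy, made-up sequences standing in for a protein family.
train = ["MKVLA", "MKVIA", "MKALA"]
model = train_bigram(train)

seen = log_likelihood(model, "MKVLA")    # resembles the training data
unseen = log_likelihood(model, "WWWWW")  # unlike anything in the training data
print(f"family-like: {seen:.2f}, unrelated: {unseen:.2f}")
```

A sequence that resembles the training set receives a higher log-likelihood than an unrelated one. Deep sequence models apply the same idea with far richer context than a single preceding residue.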

This project aims to gain a finer understanding of the representations learnt by these models, to help answer key questions in computational genomics, from uncovering meaningful clusters within or across protein families to better understanding how viruses evolve.

You will get exposure to different deep learning architectures (e.g., VAEs, transformers), as well as techniques in dimensionality reduction, latent space visualization and clustering.

This project is a joint collaboration between OATML (https://oatml.cs.ox.ac.uk/) and the Marks lab (https://www.deboramarkslab.c

Prerequisites:

* Strong Python experience
* Experience with deep learning, generative models, and sequence models