# Computational Biology: 2022-2023

Schedule C1 (CS&P): Computer Science and Philosophy; Schedule C1: Computer Science

Michaelmas Term 2022 (20 lectures)

## Overview

This course is intended for students who want to understand modern computational (molecular/structural) biology. The course will provide an introduction to the central dogma of molecular biology and will cover fundamental methods for sequence and structure analysis as well as some essential concepts from statistical mechanics. It will then discuss algorithmic approaches for deciphering the relationship between sequence and structure and techniques to model biomolecular structures. The course will also present examples of the regulation of information flow in molecular biology, the effect of epigenetics, and computational methods that facilitate genome editing applications.

## Learning outcomes

The lectures outlined below are designed so that students will:

- have a basic understanding of molecular biology and its central dogma;
- gain familiarity with computational methods for biological sequence and structure analysis;
- acquire the basic concepts of statistical mechanics needed to understand the major challenges of physically realistic modelling and simulation of biomolecular events (such as protein folding);
- be able to formulate, and propose algorithmic solutions for, the problem of predicting structure (output) from biological sequence (input) and the inverse problem of finding sequences that fold into a given structure;
- understand the importance and limitations of modelling methods that take a 3D structure as input and produce dynamical and/or statistical information as output (such as distributions of certain observables), and acquire knowledge of advanced conformational sampling and optimisation algorithms and of approaches for dimensionality reduction;
- gain a basic understanding of the regulation of genetic information and of the associated computational problems;
- become familiar with some computational methods (prediction algorithms) that facilitate genome editing applications.

## Synopsis

The course consists of 18 lectures plus 2 guest lectures, structured as follows:

**Lecture 1 Genetic material and flow of genetic information.** Introduction to the cell (as the basic unit of all living organisms) and the three types of molecules (DNA, RNA, proteins) all life depends on. Genetic material and genes. Introduction to the central dogma of molecular biology (DNA -> mRNA -> proteins) and the way biological information is stored in DNA, transcribed to mRNA and translated into proteins, the functional molecular machines of life.
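The information flow described in Lecture 1 can be illustrated with a minimal sketch. It assumes the standard genetic code, a DNA input given as the coding (sense) strand, and translation that starts at the first base and stops at the first stop codon. The function names `transcribe` and `translate` are illustrative choices, not taken from the course material.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G
# (first codon position varies slowest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA codon by codon until the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

Real transcripts also involve promoters, untranslated regions and, in eukaryotes, splicing, none of which this sketch models.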

**Lecture 2 Restriction mapping (of DNA).** Introduction to restriction mapping. The Partial Digest Problem (or Turnpike problem in CS). Discussion and analysis of brute force and practical algorithms to solve the Partial Digest Problem. Introduction to the Double Digest Problem and related algorithms.
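As an illustration of the Partial Digest Problem named in Lecture 2, the following is a minimal sketch of a backtracking approach in the style usually presented as the practical alternative to brute force. It is not necessarily the formulation used in the lectures. The input is assumed to be the multiset of all pairwise distances between restriction sites, and the function name `partial_digest` is a hypothetical choice.

```python
from collections import Counter


def partial_digest(distances):
    """Return sorted site positions (starting at 0) whose pairwise
    distances reproduce `distances`, or None if no such set exists."""
    remaining = Counter(distances)
    width = max(remaining)       # the largest distance fixes the two end points
    remaining[width] -= 1
    points = [0, width]

    def place():
        if sum(remaining.values()) == 0:
            return sorted(points)
        # The largest unexplained distance must be measured from one of the ends.
        y = max(d for d, count in remaining.items() if count > 0)
        for candidate in (y, width - y):
            deltas = Counter(abs(candidate - p) for p in points)
            if all(remaining[d] >= c for d, c in deltas.items()):
                remaining.subtract(deltas)
                points.append(candidate)
                result = place()
                if result is not None:
                    return result
                points.pop()                 # backtrack
                remaining.update(deltas)
        return None

    return place()


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

A solution and its mirror image produce the same distances, so the function returns whichever of the two it finds first.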

**Lectures 3-4 Analysing nucleic acid and protein sequences.** DNA sequence assembly. The Motif Finding Problem. Feature extraction from biological sequences. The importance of sequence comparison. Hamming distance, edit distance, the alignment of two sequences and alignment scoring. Longest common subsequence. Global and local sequence alignments. Common matrices for protein sequence comparison (BLOSUM and PAM). Multiple sequence alignment.
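Three of the sequence comparison measures listed for Lectures 3-4 can be sketched in a few lines. The sketch assumes unit costs for substitutions, insertions and deletions. Scoring with BLOSUM or PAM matrices and affine gap penalties is not covered here. The function names are illustrative.

```python
def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(hamming_distance("GATTACA", "GACTATA"))  # 2
print(edit_distance("kitten", "sitting"))      # 3
print(lcs_length("ABCBDAB", "BDCABA"))         # 4
```

Both dynamic programmes keep only two rows of the table, which is enough for the score. Recovering the alignment itself would need the full table or a traceback structure.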

**Lectures 5-7 Nucleic acid and protein structure (analysis).** The building blocks of proteins. Protein (super)secondary, tertiary and quaternary structures. Protein structure comparison. Protein structure classification (databases). Protein fold space. The Protein Data Bank. Deriving knowledge-based potentials by the analysis of known protein structures. Normal mode analysis of protein structures. Basic building blocks of nucleic acids. RNA (secondary) structure. RNA motifs and their description using a graph theory approach. Base-pair, base-step and helical parameters of double-stranded DNA. DNA curvature and deformability. Nucleosomes and compaction of genomic DNA in chromatin.

**Lectures 8-10 Towards the (statistical) mechanics of biomolecules.** Newton’s laws of motion and Lagrangian/Hamiltonian formulation of classical mechanics. Laws of thermodynamics and ensemble concept. Microcanonical ensemble. Calculating the thermodynamic and equilibrium properties by a numerical approach: Molecular Dynamics (MD). Equations of motion and numerical integrators for single and multiple time scales. Canonical and isokinetic ensembles, related non-Hamiltonian equations of motion and some numerical integrators. Simple models of chain molecules and their statistical mechanics. Protein folding as a physical process. Levinthal’s paradox and protein folding mechanisms. Limitations and advances in protein folding simulations.
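To make the idea of a numerical integrator for the equations of motion concrete, here is a minimal sketch of the velocity Verlet scheme, a common single-time-step integrator in MD. The synopsis does not say which integrators the lectures cover, so this choice is an assumption, as is the toy system (a one-dimensional harmonic oscillator in reduced units).

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation of motion for one degree of freedom.

    Returns the list of (position, velocity) pairs, including the start.
    """
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half kick with the old force
        x = x + dt * v_half                # drift
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # half kick with the new force
        trajectory.append((x, v))
    return trajectory


# Harmonic oscillator with unit mass and spring constant: x(t) = cos(t).
trajectory = velocity_verlet(x=1.0, v=0.0, force=lambda x: -x,
                             mass=1.0, dt=0.01, n_steps=1000)
x_end, v_end = trajectory[-1]
print(x_end, math.cos(10.0))
print(0.5 * v_end ** 2 + 0.5 * x_end ** 2)  # total energy, initially 0.5
```

With a fixed time step the total energy fluctuates within a small bound instead of drifting, which is one reason symplectic schemes of this kind are widely used for microcanonical simulations.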

**Lecture 11 Relationship between sequence and structure (Part I).** The general problem of mapping from biological sequence to structure. Protein folding formulated as a mapping problem from sequence to structure. Protein secondary structure prediction. 3D protein structure prediction methods. Critical Assessment of protein Structure Prediction (CASP). Introduction to RNA secondary structure prediction. Algorithms for predicting the primary structure of chromatin from DNA sequence: from procedural and learning-based algorithms to atomistic modelling approaches.
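One classical way to frame RNA secondary structure prediction is base-pair maximisation by dynamic programming (a Nussinov-style recursion). The synopsis does not name the algorithms taught, so the sketch below is only an assumed example. It allows Watson-Crick and G-U wobble pairs, forbids pseudoknots, and requires at least `min_loop` unpaired bases inside a hairpin. The name `max_base_pairs` and the default `min_loop=3` are illustrative choices.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case 1: j is unpaired
            for k in range(i, j - min_loop):        # case 2: j pairs with k
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3
```

Counting pairs is a crude stand-in for free energy. Thermodynamic methods use the same kind of recursion with experimentally derived energy parameters.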

**Lecture 12 Relationship between sequence and structure (Part II).** The general problem of finding sequences that fold into a given 3D structure. Introduction to the inverse protein folding (or protein design) problem. The designability of a 3D structure. Advances and applications in (*de novo*) protein design. The inverse RNA folding problem. Nucleosome positioning patterns, strong nucleosomes and search for the universal nucleosome mapping DNA sequence.

**Lectures 13-14 Modelling DNA, RNA and protein structures (Part I).** The birth of Computational Structural Biology. Atomistic and coarse-grained molecular models and energy functions. Modelling of interacting molecules and the Metropolis algorithm. Markov Chain Monte Carlo (MCMC), MD and Hybrid/Hamiltonian Monte Carlo (HMC) for biomolecular conformational sampling. Limitations of baseline (MCMC, MD, HMC) methods. Discussion of advanced conformational sampling algorithms including (but not limited to) Parallel Tempering (PT) (replica exchange MCMC) and the Equi-Energy Sampler. The Reverse Monte Carlo algorithm. Stochastic methods for conformational optimisation: Simulated Annealing (SA), Hybrid PT/SA, Monte Carlo Minimisation and Stochastic Tunneling.
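The Metropolis algorithm mentioned for Lectures 13-14 can be sketched for a single coordinate. The example assumes a symmetric uniform proposal and a toy harmonic energy function in reduced units, for which the sampled variance should approach kT divided by the spring constant. The function name `metropolis` and all parameter values are illustrative, not taken from the course.

```python
import math
import random


def metropolis(energy, x0, kT, step, n_steps, rng):
    """Sample x from the Boltzmann distribution exp(-energy(x) / kT).

    Returns the list of samples and the fraction of accepted moves.
    """
    x, e = x0, energy(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)     # symmetric trial move
        e_new = energy(x_new)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
            accepted += 1
        samples.append(x)                        # a rejected move repeats the old state
    return samples, accepted / n_steps


rng = random.Random(0)
samples, rate = metropolis(lambda x: 0.5 * x * x, x0=0.0, kT=1.0,
                           step=2.0, n_steps=50000, rng=rng)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(rate, mean, variance)
```

Lowering `kT` gradually during the run turns the same loop into a basic simulated annealing optimiser, and running several copies at different `kT` values with occasional swaps is the starting point for parallel tempering.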

**Lecture 15 Modelling DNA, RNA and protein structures (Part II).** The problem of high dimensionality (large number of Cartesian degrees of freedom) and illustrative examples. Dimensionality reduction by using torsional degrees of freedom. Invertible coordinate transformations. Limitations of torsional MCMC methods. Conformation updates based on a chain breakage/closure approach and introduction to the chain closure problem (inverse kinematics in biology). The Recursive Stochastic Closure (RSC) algorithm and the birth of the Natural Move Monte Carlo method. Hierarchical Natural Move Monte Carlo (HNMMC). Applications of (H)NMMC for RNA nanotechnology, cryo-Electron Microscopy (cryo-EM) and primary chromatin structure prediction.

**Lecture 16 Regulation of genetic information flow.** The central dogma of molecular biology (DNA -> mRNA -> proteins) revisited. Major milestones of deciphering the molecular basis of transcription (DNA -> mRNA) and translation (mRNA -> proteins). Examples of transcriptional, post-transcriptional and translational regulation. Epigenetics and epigenetic modifications. Recognition mechanisms of DNA epigenetic modifications by proteins. Computational studies on quantifying the effect of multiple epigenetic modifications on DNA structures.

**Lectures 17-18 Genome (and RNA) editing.** Main methods for genome editing. Introduction to CRISPR-based technologies: base editing, CRISPRa, CRISPRi, gene editing. Description of the ‘classical’ CRISPR/Cas9 gene editing system. Discussion of methods for CRISPR/Cas9 target guide RNA efficiency prediction. The crisprSQL database for off-target cleavage assays. Introduction to procedural and deep learning-based algorithms for CRISPR/Cas9 off-target cleavage activity prediction. RNA editing with Cas13.

**Guest Lecture 1 Inverse RNA folding and computational riboswitch detection.** Professor Danny Barash, *Department of Computer Science*, *Ben-Gurion University*.

**Guest Lecture 2 On finding similar structures in a large database.** Professor Rachel Kolodny, *Department of Computer Science*, *University of Haifa*.

## Syllabus

Genetic information flow, its regulation and its molecular machinery. Biological sequence and structure analysis. Relationship between biological sequence and structure. Algorithms for modelling biomolecular structures. Computational approaches facilitating genome editing applications.

## Reading list

The lectures will be supported by slides, and for some lectures references to journal (review) articles will be given. For biological sequence analysis, the textbook *An Introduction to Bioinformatics Algorithms* by Neil C. Jones and Pavel A. Pevzner may serve as a useful reference.

## Feedback

Students are formally asked for feedback at the end of the course. Students can also submit feedback at any point here. Feedback received here will go to the Head of Academic Administration, and will be dealt with confidentially when being passed on further. All feedback is welcome.