Dataset distillation for CRISPR-Cas9 guide-target library design

Supervisor

Jeffrey Mak

Suitable for

MSc in Advanced Computer Science

Computer Science, Part C

Abstract

Prerequisites:

Essential: Computational Biology, familiarity with PyTorch.
Desirable: Interest in dataset distillation.

Abstract:

Deep learning-based models for predicting CRISPR-Cas9 cleavage activity depend on large cleavage activity datasets [1]. These datasets contain tens of thousands to hundreds of thousands of guide-target pairs and are generated experimentally through high-throughput lentiviral guide-target library screens, in which the libraries are designed manually based on target genes of interest or on properties such as the GC content of the spacer sequence. Because these screens are costly, it is infeasible to run them at scale across hundreds of Cas9 variants.

To address this issue, this project explores the use of dataset distillation [2,3] to obtain smaller synthetic sets of guide-target pairs from the original large dataset, and investigates whether such synthetic sets can serve as guide-target libraries for other Cas9 variants. If successful, this project would reduce the cost of library screens per Cas9 variant, and thus the data curation cost of building a unified cleavage activity prediction tool across Cas9 variants.
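As a rough illustration of the idea, the sketch below follows the single-step formulation of dataset distillation in [2]: a small set of synthetic inputs and labels is optimised so that one gradient step on the synthetic set yields a model that fits the real data. Everything specific here is an assumption made for illustration and not part of the project specification: the linear model, the fixed zero initialisation, the one-hot encoding, the hyperparameters, and the helper names `one_hot`, `toy_data` and `distill`. The toy labels are simply the GC fraction of random spacers and stand in for measured cleavage activity.

```python
import random

import torch
import torch.nn.functional as F


def one_hot(seq):
    """Encode a DNA string as a flat one-hot vector over (A, C, G, T)."""
    idx = torch.tensor(["ACGT".index(base) for base in seq])
    return F.one_hot(idx, num_classes=4).float().flatten()


def toy_data(n=64, length=20, seed=0):
    """Hypothetical stand-in for a real screen: random spacers, GC fraction as label."""
    rng = random.Random(seed)
    seqs = ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]
    x = torch.stack([one_hot(s) for s in seqs])
    y = torch.tensor([(s.count("G") + s.count("C")) / length for s in seqs])
    return x, y


def distill(x_real, y_real, n_syn=8, steps=100, inner_lr=0.1, outer_lr=0.05, seed=0):
    """Learn n_syn synthetic (input, label) pairs via one unrolled inner step."""
    gen = torch.Generator().manual_seed(seed)
    dim = x_real.shape[1]
    x_syn = torch.randn(n_syn, dim, generator=gen, requires_grad=True)
    y_syn = torch.rand(n_syn, generator=gen, requires_grad=True)
    opt = torch.optim.Adam([x_syn, y_syn], lr=outer_lr)
    losses = []
    for _ in range(steps):
        # Fixed zero initialisation of a linear model (an assumption of this sketch).
        w = torch.zeros(dim, requires_grad=True)
        inner_loss = ((x_syn @ w - y_syn) ** 2).mean()
        # Keep the graph so the outer loss can be differentiated through this step.
        (grad_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_new = w - inner_lr * grad_w
        # Outer objective: the model trained on synthetic data should fit real data.
        outer_loss = ((x_real @ w_new - y_real) ** 2).mean()
        opt.zero_grad()
        outer_loss.backward()
        opt.step()
        losses.append(outer_loss.item())
    return x_syn.detach(), y_syn.detach(), losses


x_real, y_real = toy_data()
x_syn, y_syn, losses = distill(x_real, y_real)
```

Note that `x_syn` is a continuous tensor, not a set of valid nucleotide sequences. Turning distilled inputs into guide-target pairs that can be synthesised (for example by taking the most strongly weighted base at each position) is a separate design question that this sketch does not address.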

[1] Xiang, X., Corsi, G. I., Anthon, C., Qu, K., Pan, X., Liang, X., ... & Luo, Y. (2021). Enhancing CRISPR-Cas9 gRNA efficiency prediction by data integration and deep learning. Nature Communications, 12(1), 3238.

[2] Wang, T., Zhu, J. Y., Torralba, A., & Efros, A. A. (2018). Dataset distillation. arXiv preprint arXiv:1811.10959.

[3] Lei, S., & Tao, D. (2023). A comprehensive survey of dataset distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1), 17-32.
