Data Science Topics Related to Neurogenomics
My seminar will discuss various data-science issues related to neurogenomics. First, I will focus on classic disorders of the brain, which affect nearly a fifth of the world's population. Robust phenotype-genotype associations have been established for several psychiatric diseases (e.g., schizophrenia, bipolar disorder). However, understanding their molecular causes is still a challenge. To address this, the PsychENCODE consortium generated thousands of transcriptome (bulk and single-cell) datasets from 1,866 individuals. Using these data, we have developed interpretable machine-learning approaches for deciphering functional genomic elements and linkages in the brain and in psychiatric disorders. Specifically, we developed a deep-learning model that embeds the physical regulatory network to predict phenotype from genotype. The model uses a conditional Deep Boltzmann Machine architecture and introduces lateral connectivity at the visible layer to embed the biological structure learned from the regulatory network and QTL linkages. It improves disease prediction (6X compared to additive polygenic risk scores), highlights key genes for disorders, and imputes missing transcriptome information from genotype data alone. Next, I will look at the "data exhaust" from this activity, that is, how one can learn things from genomic analyses beyond what was originally intended. I will focus on genomic privacy, which is a major stumbling block in tackling problems in large-scale neurogenomics. In particular, I will look at how quantifications of expression levels can reveal information about the subjects studied, and how one can take steps to sanitize the data and protect patient anonymity. Finally, another stumbling block in neurogenomics is phenotyping individuals more accurately and precisely. I will discuss some preliminary work we have done in digital phenotyping.
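
For readers unfamiliar with the baseline mentioned above: an additive polygenic risk score is conventionally a weighted sum of an individual's allele dosages, with one weight per variant. The sketch below illustrates only that standard formula. It is not the deep-learning model described in the abstract, and the function name, effect sizes, and dosages are made-up values for illustration.

```python
def additive_prs(dosages, effect_sizes):
    """Additive polygenic risk score for one individual.

    dosages: allele counts (0, 1, or 2), one per variant.
    effect_sizes: per-variant weights (e.g. log odds ratios from an
        association study); the values used below are hypothetical.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    # Each variant contributes independently: weight times allele count.
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Made-up illustrative numbers, not PsychENCODE data.
effect_sizes = [0.5, -0.25, 1.0]
dosages = [2, 1, 0]
score = additive_prs(dosages, effect_sizes)
print(score)
```

Because each variant contributes independently, a score of this form cannot capture interactions between variants or regulatory structure, which is the gap a network-embedding model is meant to address.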