Monday, May 1, 2017 at 11:00 am in Rice 504
Committee: Yanjun Qi (Advisor), Mary Lou Soffa (Chair), Gabriel Robins, Christina Leslie (Memorial Sloan Kettering Cancer Center), and Mazhar Adli (UVA Dept. of Biochemistry and Molecular Genetics; Minor Representative).
Title: Fast and Interpretable Classification of Sequential Data in Biology
Machine learning models have been highly successful in helping biologists analyze sequential data (such as DNA sequences or measurements of activity levels along the genome). However, state-of-the-art machine learning methods face two hard challenges posed by sequential data: (1) limited interpretability of predictions, which hinders biological insight, and (2) slow computation due to the expanding search space of sequential patterns. The proposed research aims to address these two challenges by improving existing state-of-the-art methods. Specifically, we focus on two popular models: Neural Networks (NNs) and String Kernels with Support Vector Machines (SK-SVM).
+Challenge (1): NNs can handle large sequential datasets accurately and efficiently. However, NNs have widely been viewed as "black boxes" due to their complexity, making their predictions hard to understand.
+Solution (1): We propose a unified architecture, Attentive-DeepChrome, that handles prediction and understanding in an end-to-end manner.
+Challenge (2): SK-SVM methods achieve high accuracy and have theoretical guarantees for smaller datasets (<5,000 samples). However, current implementations run extremely slowly when the dictionary size increases or when more mismatches are allowed.
+Solution (2): We propose a novel algorithm for calculating the Gapped k-mer Kernel using Counting (GaKCo). This algorithm is fast, scalable, and naturally parallelizable.
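For readers unfamiliar with the quantity named in Challenge (2), the following is a minimal sketch of a gapped k-mer kernel computed by naive feature enumeration. It is not the GaKCo algorithm, whose details are not given in this abstract. It assumes one common formulation: every length-g window of a sequence contributes one feature for each choice of k kept positions, and the kernel value is the dot product of the two feature-count vectors. The function names and the parameters `g` and `k` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, g, k):
    """Count gapped k-mer features of seq (illustrative helper, hypothetical name).

    Assumed formulation: each length-g window yields one feature per choice
    of k kept positions; a feature is (kept positions, characters at them).
    """
    counts = Counter()
    for start in range(len(seq) - g + 1):
        window = seq[start:start + g]
        for positions in combinations(range(g), k):
            feature = (positions, "".join(window[p] for p in positions))
            counts[feature] += 1
    return counts


def gapped_kmer_kernel(s, t, g, k):
    """Naive kernel value: dot product of the two feature-count vectors."""
    counts_s = gapped_kmer_counts(s, g, k)
    counts_t = gapped_kmer_counts(t, g, k)
    # Counter returns 0 for features that never occur in t.
    return sum(c * counts_t[f] for f, c in counts_s.items())


if __name__ == "__main__":
    print(gapped_kmer_kernel("ACGT", "ACGT", g=3, k=2))
```

In this naive version each window produces C(g, k) features, so the work per window grows with the number of allowed gap positions (g - k). Setting k equal to g reduces it to a plain k-mer (spectrum) count comparison.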
In summary, this research extends existing machine learning methods to improve the analysis of sequential data in biology.