Almost linear time algorithms for problems of computational genomics
- Govinda Mangalore Kamath.
- [Stanford, California] : [Stanford University], 2019.
- Copyright notice
- Physical description
- 1 online resource.
Also available at
- Kamath, Govinda Mangalore, author.
- Tse, David, degree supervisor.
- Montanari, Andrea, degree committee member.
- Sidford, Aaron, degree committee member.
- Stanford University. Department of Electrical Engineering.
- In this thesis we discuss designing fast algorithms for three problems in computational genomics. 1) Genome Assembly: With the proliferation of DNA sequencing and decreasing prices, genome assembly is the fundamental problem in computational genomics. Every organism has a genome, which can be abstracted as a long string of A, C, G, T. The length of the genome varies from millions in bacteria to hundreds of billions in some plants. We currently do not have the technology to read genomes end to end and just get short-noisy substrings from the genome called reads. The lengths of these reads are around 10k and error rate is around 15%. Reconstructing the genome from the reads is the problem of genome assembly. The conditions for being able to recover the genome exactly depends on the repeat structure of the genome and are explored extensively in literature. But in practice, we find that there are many datasets where this is violated. We develop algorithms that can optimally recover the genome to within informationally possible limits given a set of reads. We also implement this in the form of an open source software: HINGE, which is used widely in bacterial genomics. This work is joint work with Ilan Shomorony, Fei Xia, Tom Courtade and David Tse. 2) Haplotype Assembly: This is a problem associated with reference based Sequence assembly. Humans are diploid: that is, they have two copies of each chromosome. Roughly speaking, the two chromosome are identical on all apart from a few positions (around 1\% of positions) called single nucleotide polymorphisms or SNPs. In this case reads can be substrings of either chromosome which is not known, and one wants to infer the sequence of SNPs on each chromosome. We show that this problem can be seen as decoding a convolutional code, and reduce it to a graph clustering problem. We then come up with a spectral machine learning algorithm to solve the problem. We show derive tight estimates of the number of measurements necessary theoretically. This work is joint work with Eren Sasoglu, Yuxin Chen and David Tse. 3) Speeding up Machine learning algorithms: The celebrated Monte Carlo method estimates a quantity that is expensive to compute by random sampling. We propose adaptive Monte Carlo optimization: a general framework for discrete optimization of an expensive-to-compute function by adaptive random sampling. Applications of this framework have already appeared in machine learning but are tied to their specific contexts and developed in isolation. We take a unified view and show that the framework has broad applicability by applying it on several common machine learning problems: k-nearest neighbors, k-medoids, hierarchical clustering and maximum mutual information feature selection. On real data we show that this framework allows us to develop algorithms that confer a gain of a magnitude or two over exact computation. We also characterize the performance gain theoretically under regularity assumptions on the data that we verify in real world data. We stumbled onto this problem and approach while trying to run k-medoids on a single-cell RNA-seq dataset. This is joint work with Vivek Bagaria, Tavor Baharav, Vasilis Ntranos, Martin Zhang, and David Tse.
- Publication date
- Copyright date
- Submitted to the Department of Electrical Engineering.
- Thesis Ph.D. Stanford University 2019.