Learning with N-grams [electronic resource] : from massive scales to compressed representations [2017]
 Description
 Book — 1 online resource.
 Summary

Machine learning has established itself as an important driver of industrial progress and scientific discovery. The quest to expand its usage to address ever deeper questions and harder problems places particular emphasis on building sophisticated and statistically rigorous models that can handle the deluge of information being generated. The stakes are higher than ever; the success of global, billion-dollar initiatives that can fundamentally change the landscape of human health rests on the existence of machine learning tools that can extract intricate relationships at unprecedented scales. In turn, machine learning paradigms are constantly evolving to address these needs, and some of the greatest advances have come from integrating combinatorial ideas with classical statistical ideas, such as the ability to perform principled feature selection using the Lasso. The underlying perspective of this thesis is that machine learning must rely on the algorithms and data structures that classically form the underpinnings of theoretical computer science in order to fully harness the potential of these combinatorial ideas. To this end, we contribute two advances to machine learning based on N-gram features, a feature representation for strings that has stood the test of time and continues to provide state-of-the-art results in natural language processing and genomics. The first addresses the computational and statistical issues of learning with long, and possibly all, N-grams in a document corpus. Our main result leverages suffix trees to provide a quadratic memory and processing time improvement over current machine learning systems by virtue of a fast matrix-vector multiplication routine whose computational requirements are at worst linear in the length of the underlying document corpus.
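The quadratic saving rests on a classical fact: a text of length L has up to O(L^2) distinct substrings (N-grams of every length), yet suffix structures represent them all in O(L) space. A minimal sketch of that fact using a textbook suffix automaton (an illustrative stand-in, not the thesis's actual suffix-tree implementation):

```python
def build_suffix_automaton(s):
    """Standard online suffix-automaton construction: at most 2*len(s)-1
    states, yet the states jointly encode every distinct substring of s."""
    states = [{"len": 0, "link": -1, "next": {}}]
    last = 0
    for ch in s:
        cur = len(states)
        states.append({"len": states[last]["len"] + 1, "link": -1, "next": {}})
        p = last
        while p != -1 and ch not in states[p]["next"]:
            states[p]["next"][ch] = cur
            p = states[p]["link"]
        if p == -1:
            states[cur]["link"] = 0
        else:
            q = states[p]["next"][ch]
            if states[p]["len"] + 1 == states[q]["len"]:
                states[cur]["link"] = q
            else:
                # Split state q by cloning it at the shorter length.
                clone = len(states)
                states.append({"len": states[p]["len"] + 1,
                               "link": states[q]["link"],
                               "next": dict(states[q]["next"])})
                while p != -1 and states[p]["next"].get(ch) == q:
                    states[p]["next"][ch] = clone
                    p = states[p]["link"]
                states[q]["link"] = clone
                states[cur]["link"] = clone
        last = cur
    return states

def distinct_ngrams(s):
    """Count all distinct substrings of s: each non-initial state v
    contributes len(v) - len(link(v)) substrings."""
    st = build_suffix_automaton(s)
    return sum(v["len"] - st[v["link"]]["len"] for v in st[1:])
```

For example, `distinct_ngrams("abab")` finds all 7 distinct substrings (a, b, ab, ba, aba, bab, abab) while the automaton itself has only 5 states; this linear-size sharing of a quadratic feature space is what makes routines over all N-grams tractable.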
As the majority of machine learning algorithms rely on and are bottlenecked by matrix-vector multiplication to learn, our routine can speed up almost any learning system by simply replacing its multiplication routine with ours. The practical savings are substantial, including an efficiency gain of four orders of magnitude for DNA sequence data, and open a new realm of possibilities for N-gram models. This routine also has large statistical implications; suffix trees perform a quadratic dimensionality reduction that substantially increases the robustness of machine learning systems when the appropriate level of data representation granularity is unknown. Finally, we provide an efficient persistent data storage system based on our algorithms that screens N-gram features according to a multitude of statistical criteria and produces data structures optimized for multiplication. Our second contribution looks to classical ideas from compression to devise a new form of combinatorial Deep Learning for text termed Dracula. Dracula is based on a generalization of the compression criterion underlying dictionary-based compressors like Lempel-Ziv 78. It learns a dictionary of N-grams that efficiently compresses a text corpus, and then recursively compresses its own dictionary for additional space savings. In doing so, it selects N-grams that are useful features for learning and induces a graph-based regularizer that orders the N-grams into low- and high-frequency components. Importantly, solving Dracula can be expressed as a binary linear program that may be further relaxed to a linear program, allowing a plurality of tools from optimization and computer science to be used to analyze its properties. Computationally, Dracula is NP-complete, but it exhibits substantial problem structure that allows approximate algorithms to scale to large datasets.
Statistically, we show how Dracula can learn a multitude of representations to accommodate an underlying storage cost model, and we identify parameters that control the behavior of its solutions in meaningful ways. We also demonstrate that Dracula is amenable to fine-tuning by proving that its solutions evolve in a predictable way as the storage cost model varies. We demonstrate the utility of Dracula's features in experiments over a variety of problem domains, including natural language processing and bioinformatics.
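As a point of reference for the compression criterion that Dracula generalizes, a minimal sketch of the classical LZ78 parse (an illustration only, not Dracula itself): each step emits the longest phrase already in the dictionary plus one new character, and the extended phrase becomes a new dictionary entry, i.e. a new candidate N-gram.

```python
def lz78_parse(text):
    """Classical LZ78 parse: greedily match the longest known phrase,
    emit (phrase_index, next_char), and add the extended phrase to the
    dictionary. The dictionary entries are the N-grams the scheme 'selects'."""
    dictionary = {"": 0}      # phrase -> index; the empty phrase is index 0
    output = []               # list of (phrase_index, next_char) pairs
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch      # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                # flush a trailing match with no new character
        output.append((dictionary[phrase], ""))
    return output, dictionary
```

For example, parsing "abab" yields the pairs (0, 'a'), (0, 'b'), (1, 'b') and a dictionary containing "a", "b", and "ab". Where LZ78 grows its dictionary greedily in one left-to-right pass, Dracula (per the abstract) instead poses dictionary selection as a global binary linear program and recursively compresses the dictionary itself.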

Special Collections
Special Collections  Status 

University Archives  Request via Aeon
3781 2017 P  In-library use