Topics in statistical learning with a focus on large-scale data
- Ya Le.
- [Stanford, California] : [Stanford University], 2018.
- Physical description: 1 online resource.
- The spread of modern information technologies to all spheres of society has led to a dramatic increase in data flow, giving rise to the "big data" phenomenon. Big data are data of massive volume, intensity, and complexity that exceed the capacity of traditional statistical methods and standard tools. When the data become extremely large, computing tasks may take too long to run, and it may even be infeasible to store all of the data on a single computer. It is therefore necessary to turn to distributed architectures and scalable statistical methods. Big data vary in shape and call for different approaches.
One type of big data is tall data: a very large number of samples but not too many features. Chapter 1 describes a general communication-efficient algorithm for distributed statistical learning on this type of big data. The algorithm distributes the samples uniformly across multiple machines and uses a common reference dataset to improve the performance of local estimates, enabling potentially much faster analysis at a small cost to statistical performance.
Another type of big data is wide data: too many features but a limited number of samples. Such data are also called high-dimensional, and many classical statistical methods do not apply to them. Chapter 2 discusses a method of dimensionality reduction for high-dimensional classification. The method partitions features into independent communities and splits the original classification problem into separate smaller ones, enabling parallel computing and producing more interpretable results.
For unsupervised learning methods such as principal component analysis and clustering, the key challenges are choosing the optimal tuning parameter and evaluating method performance. Chapter 3 proposes a general cross-validation approach for unsupervised learning. The approach randomly partitions the data matrix into K unstructured folds; for each fold, it fits a matrix completion algorithm to the remaining K − 1 folds and evaluates the prediction on the held-out fold. This provides a unified framework for parameter tuning in unsupervised learning and shows strong performance in practice.
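The entry-wise cross-validation scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the matrix completion step here is a simple rank-truncated SVD ("hard-impute") loop, and the function name `cv_unsupervised` and the choice of squared hold-out error are assumptions for the sake of example.

```python
import numpy as np

def cv_unsupervised(X, ranks, K=5, n_iters=50, seed=0):
    """Choose a low-rank tuning parameter by K-fold cross-validation
    over individual matrix entries ("unstructured" folds), fitting a
    simple hard-impute matrix completion on the remaining K-1 folds
    and scoring prediction error on the held-out entries."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Randomly assign every matrix entry to one of K unstructured folds.
    folds = rng.integers(0, K, size=(n, p))
    errors = np.zeros(len(ranks))
    for ri, r in enumerate(ranks):
        for k in range(K):
            mask = folds == k                       # held-out entries
            # Initialize held-out entries at the observed grand mean.
            obs_mean = X[~mask].mean()
            filled = np.where(mask, obs_mean, X)
            # Hard-impute: alternate rank-r SVD truncation and refilling
            # the observed entries with their true values.
            for _ in range(n_iters):
                U, s, Vt = np.linalg.svd(filled, full_matrices=False)
                low_rank = (U[:, :r] * s[:r]) @ Vt[:r]
                filled = np.where(mask, low_rank, X)
            # Prediction error on the hold-out fold only.
            errors[ri] += np.sum((low_rank[mask] - X[mask]) ** 2)
    return ranks[int(np.argmin(errors))]

# Illustrative usage on synthetic rank-3 data plus noise.
rng = np.random.default_rng(1)
signal = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
X = signal + 0.05 * rng.standard_normal((40, 30))
best_rank = cv_unsupervised(X, ranks=[1, 2, 3, 4, 5], K=4)
```

Because the held-out entries never enter the fit, the hold-out error typically decreases until the candidate rank matches the signal rank and then flattens or rises, which is what makes this a usable tuning criterion for methods without labels.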
- Submitted to the Department of Statistics.
- Thesis (Ph.D.)--Stanford University, 2018.