1. Lightweight statistical learning [electronic resource] : accelerating and avoiding empirical risk minimization [2017] Online
- Book
- 1 online resource.
In statistical machine learning, the goal is to train a model that, once deployed in the world, continues to predict accurately on fresh data. A unifying training principle is empirical risk minimization (ERM): globally minimizing the average prediction error incurred on a representative training set. Essentially, ERM prescribes optimization as a proxy for statistical estimation. Learning tasks for which ERM is computationally tractable call for optimization algorithms that scale as efficiently as possible as training set dimensions grow. Other tasks---namely, neural network learning and learning with indirect supervision---elude a general, tractable ERM algorithm in theory, motivating a reformulation of the training problem altogether. In this thesis, we first focus on efficient algorithms for empirical risk minimization in the overdetermined, convex setting with fully observed data, where ERM is tractable and the aim is optimal computational efficiency. We improve the guaranteed running time for a broad range of problems (including basic regression problems) in terms of the condition number induced by the underlying training set. Atop these methods, we develop an algorithm for principal component regression. Next, we move to settings where general tractability of ERM is not guaranteed but still guides algorithm design in practice. Specifically, we study two learning frameworks: convolutional neural network learning, and learning conditional random fields from indirect supervision. In both contexts, the prevalent replacement for global ERM is gradient descent. Since descent can converge to arbitrarily bad local optima, initialization becomes all the more relevant. In each setting, we develop lightweight training algorithms based on convex procedures. For neural networks, we consider replacing the optimization of hidden layers with randomization. For indirect supervision, we reduce estimation to solving a linear system followed by a convex maximum-likelihood step.
Through the role of initialization, both algorithms imply conditions under which local descent ought to fare at least as well as they do.
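The idea of fixing hidden layers at random and training only the output layer can be sketched with random ReLU features: once the hidden weights are frozen, fitting the output layer is a convex least-squares problem. The data, width, and loss below are illustrative assumptions, not the thesis's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: a nonlinear target of a random linear projection.
n, d, width = 200, 5, 100
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))

# "Randomize instead of optimize": draw hidden weights once, keep them fixed.
W = rng.standard_normal((d, width)) / np.sqrt(d)
H = np.maximum(X @ W, 0.0)  # random ReLU features (the frozen hidden layer)

# Training the output layer is now an ordinary convex least-squares problem.
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
train_mse = np.mean((H @ coef - y) ** 2)
```

Because the only trained parameters enter linearly, a global optimum of the output layer is guaranteed, which is the sense in which such a procedure can also serve as an initialization for local descent on the full network.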
2. Convex optimization for Monte Carlo [electronic resource] : stochastic optimization for importance sampling [2016] Online
- Book
- 1 online resource.
In Monte Carlo simulation it is often essential that a method is accompanied by an appropriate variance reduction technique. Reducing the variance of a Monte Carlo method is, at least conceptually, an optimization problem, and mathematical optimization has indeed been used as a theoretical and conceptual tool in this pursuit. However, traditional Monte Carlo methods have used numerical optimization only sparingly, and convex optimization even less. Numerical optimization is the study of algorithms for finding a solution to an optimization problem, as opposed to the study of analytical solutions. Convex optimization is the study of convex optimization problems, a subclass of optimization problems for which efficient algorithms for finding the global optimum exist. In this work we present a framework for using convex optimization for Monte Carlo. More specifically, we present a framework for using stochastic convex optimization for adaptive importance sampling, self-normalized importance sampling, and what-if simulations. The main idea is to perform importance sampling and numerical optimization simultaneously. In particular, the numerical optimization does not rely on black-box optimization solvers, which keeps the computational cost of each iteration cheap. Because the optimization is performed on a convex problem, we can establish convergence and optimality.
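The "sample and optimize simultaneously" idea can be sketched for a Gaussian location family, where the second moment of the importance-weighted estimator is convex in the proposal mean. The target quantity, proposal family, and step-size schedule below are illustrative choices, not the dissertation's actual framework: each draw both feeds the running estimate and supplies a cheap stochastic gradient step on the proposal parameter, with no solver in the loop.

```python
import numpy as np

rng = np.random.default_rng(1)

def p(x):            # target density: standard normal
    return np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)

def q(x, theta):     # proposal density: normal with mean theta, unit variance
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2.0 * np.pi)

def f(x):            # estimate the tail probability P(X > 2) under p
    return 1.0 if x > 2.0 else 0.0

theta, lr, total, samples = 0.0, 0.5, 0.0, 20000
for t in range(1, samples + 1):
    x = theta + rng.standard_normal()    # x ~ q_theta
    w = p(x) / q(x, theta)               # importance weight
    total += f(x) * w                    # running (always unbiased) IS estimate
    # Stochastic gradient of the convex second moment E_q[(f w)^2] in theta:
    g = -(f(x) * w) ** 2 * (x - theta)
    theta -= (lr / np.sqrt(t)) * g       # cheap per-iteration update, no solver
est = total / samples
```

The estimator remains unbiased at every step because each importance weight corrects for whatever proposal was current; adaptation only reduces the variance of later terms.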
3. Extensions of Bayesian sequential partitioning for multivariate density estimation [electronic resource] [2014] Online
- Book
- 1 online resource.
Density estimation is a fundamental problem in statistics that can be used as a building block in statistical methods such as classification. Traditional approaches, such as the kernel method, have yielded accurate results when dealing with a moderate amount of data in a low-dimensional space, but may perform poorly in higher dimensions. Bayesian sequential partitioning (BSP) was proposed to overcome this drawback of traditional methods by applying scalable algorithms to estimate the probability density in higher dimensions. The resulting estimate is a piecewise-constant function supported on a partition that is learned from the data. In this work, two extensions of BSP are proposed to broaden its area of application. First, a smoothed version of the estimate is obtained by optimizing over a tensor-spline expansion of the log-density, conditioned on the partition learned by BSP. Second, we develop a version of BSP that uses oriented cuts to produce non-rectangular partitions. By allowing more options for building the partition, our estimate is made invariant to the orientation of the data distribution. Our numerical experiments show that the resulting estimate can achieve a better approximation to the true density than standard BSP.
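The form of BSP's output, a piecewise-constant density on a partition of axis-aligned rectangles, can be illustrated with a fixed grid partition of the unit square; the real method learns the partition sequentially with a Bayesian score, which the sketch below deliberately omits. The data distribution and grid resolution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data on [0,1]^2, concentrated near the origin.
n, bins = 5000, 8
X = rng.beta(2.0, 5.0, size=(n, 2))

# Piecewise-constant estimate: density on each rectangle is the fraction of
# points it contains divided by its area (here the rectangles form a grid;
# BSP instead learns where to cut).
counts, xedges, yedges = np.histogram2d(
    X[:, 0], X[:, 1], bins=bins, range=[[0.0, 1.0], [0.0, 1.0]]
)
cell_area = (1.0 / bins) ** 2
density = counts / (n * cell_area)

# The estimate integrates to one over the unit square.
mass = density.sum() * cell_area
```

The oriented-cut extension described above replaces the axis-aligned rectangles with cells bounded by cuts in arbitrary directions, so the estimate no longer depends on how the data happen to be oriented relative to the coordinate axes.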
Special Collections
| Special Collections | Status |
|---|---|
| University Archives | Request on-site access |
| 3781 2014 K | In-library use |
- Book
- 1 online resource.
The "Big Data" revolution is spawning systems designed to make decisions from data. Statistics and machine learning have made great strides in prediction and estimation from any fixed dataset. However, if you want to learn to take actions where your choices can affect both the underlying system and the data you observe, you need reinforcement learning. Reinforcement learning builds upon learning from datasets, but also addresses the issues of partial feedback and long-term consequences. In a reinforcement learning problem, the decisions you make may affect the data you get, and even alter the underlying system for future timesteps. Statistically efficient reinforcement learning requires "deep exploration," or the ability to plan to learn. Previous approaches to deep exploration have not been computationally tractable beyond small-scale problems. For this reason, most practical implementations use statistically inefficient exploration methods such as epsilon-greedy dithering, which can lead to exponentially slower learning. In this dissertation we present an alternative approach to deep exploration through the use of randomized value functions. Our work is inspired by the Thompson sampling heuristic for multi-armed bandits, which suggests, at a high level, to "randomly select a policy according to the probability that it is optimal." We provide insight into why this algorithm can be simultaneously more statistically efficient and more computationally efficient than existing approaches. We leverage these insights to establish several state-of-the-art theoretical results and performance guarantees. Importantly, and unlike previous approaches to deep exploration, this approach also scales gracefully to complex domains with generalization. We complement our analysis with extensive empirical experiments; these include several didactic examples as well as a recommendation system, Tetris, and Atari 2600 games.
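The Thompson sampling heuristic that inspires the dissertation's randomized value functions is easiest to see in its original bandit form: maintain a posterior per arm, draw one sample from each, and pull the arm whose sample is largest, which selects each arm with exactly the probability that it is optimal. The arm means and horizon below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Bernoulli bandit with Beta posteriors (conjugate to Bernoulli rewards).
true_means = np.array([0.2, 0.5, 0.8])  # unknown to the learner
k = len(true_means)
alpha = np.ones(k)                      # Beta(1,1) uniform priors
beta = np.ones(k)
pulls = np.zeros(k, dtype=int)

for t in range(2000):
    sampled = rng.beta(alpha, beta)     # one posterior draw per arm
    arm = int(np.argmax(sampled))       # probability-matching arm choice
    reward = float(rng.random() < true_means[arm])
    alpha[arm] += reward                # conjugate posterior update
    beta[arm] += 1.0 - reward
    pulls[arm] += 1
```

Randomized value functions carry this sampled-then-greedy structure from arms to value estimates, which is what lets the same idea drive deep exploration in large reinforcement learning problems.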