1. The alternating direction method of multipliers for mixed-integer optimization applications [electronic resource] [2017] Online
- Book
- 1 online resource.
A very large number of practical optimization problems have been expressed as minimization of a convex objective function over a nonconvex set, specifically discrete sets such as the set of integer points. These problems arise in a variety of fields such as statistics and modeling, data analysis, and control. Two general approaches have been used for solving mixed integer optimization problems. One approach is to solve the problem globally using methods such as branch-and-bound, which suffer from exponential worst case time complexity. Another approach is to use heuristics such as semidefinite relaxation hierarchies which terminate in polynomial time, but do not guarantee finding the global solution. In this dissertation, we discuss a generic system of heuristic solutions for such problems. We describe an implementation of these methods in a package called NCVX, as an extension of CVXPY, a Python package for formulating and solving convex optimization problems. Our heuristics, which employ convex relaxations, convex restrictions, local neighbor search methods, and the alternating direction method of multipliers (ADMM), require the solution of a modest number of convex problems, and are meant to apply to general problems, without much tuning. Several numerical examples such as regressor selection, circle packing, the traveling salesman problem, and factor analysis modeling are studied. We also describe a fast optimization algorithm for approximately minimizing convex quadratic functions over the intersection of affine and separable constraints. We discuss the favorable computational aspects of our algorithm, which allow it to run quickly even on very modest computational platforms such as embedded processors. We study numerical examples including hybrid vehicle control and power converter control, and we compare our methods with existing open source and commercial alternatives. Finally we discuss how randomized linear programming heuristics can be used to solve graph isomorphism problems. We motivate our heuristics by showing guarantees under some conditions, and present numerical experiments that show effectiveness of these heuristics in the general case.
A very large number of practical optimization problems have been expressed as minimization of a convex objective function over a nonconvex set, specifically discrete sets such as the set of integer points. These problems arise in a variety of fields such as statistics and modeling, data analysis, and control. Two general approaches have been used for solving mixed integer optimization problems. One approach is to solve the problem globally using methods such as branch-and-bound, which suffer from exponential worst case time complexity. Another approach is to use heuristics such as semidefinite relaxation hierarchies which terminate in polynomial time, but do not guarantee finding the global solution. In this dissertation, we discuss a generic system of heuristic solutions for such problems. We describe an implementation of these methods in a package called NCVX, as an extension of CVXPY, a Python package for formulating and solving convex optimization problems. Our heuristics, which employ convex relaxations, convex restrictions, local neighbor search methods, and the alternating direction method of multipliers (ADMM), require the solution of a modest number of convex problems, and are meant to apply to general problems, without much tuning. Several numerical examples such as regressor selection, circle packing, the traveling salesman problem, and factor analysis modeling are studied. We also describe a fast optimization algorithm for approximately minimizing convex quadratic functions over the intersection of affine and separable constraints. We discuss the favorable computational aspects of our algorithm, which allow it to run quickly even on very modest computational platforms such as embedded processors. We study numerical examples including hybrid vehicle control and power converter control, and we compare our methods with existing open source and commercial alternatives. Finally we discuss how randomized linear programming heuristics can be used to solve graph isomorphism problems. We motivate our heuristics by showing guarantees under some conditions, and present numerical experiments that show effectiveness of these heuristics in the general case.
2. Lightweight statistical learning [electronic resource] : accelerating and avoiding empirical risk minimization [2017] Online
- Book
- 1 online resource.
In statistical machine learning, the goal is to train a model that, once deployed in the world, continues to predict accurately on fresh data. A unifying training principle is empirical risk minimization (ERM): globally minimizing the average prediction error incurred on a representative training set. Essentially, ERM prescribes optimization as a proxy for statistical estimation. Learning tasks for which ERM is computationally tractable call for optimization algorithms that scale as efficiently as possible as training set dimensions grow. Other tasks---namely, neural network learning and learning with indirect supervision---elude a general, tractable ERM algorithm in theory, motivating a reformulation of the training problem altogether. In this thesis, we first focus on efficient algorithms for empirical risk minimization in the overdetermined, convex setting with fully-observed data, where ERM is tractable and the aim is optimal computational efficiency. We improve the guaranteed running time for a broad range of problems (including basic regression problems) in terms of the condition number induced by the underlying training set. Atop these methods, we develop an algorithm for principal component regression. Next, we move to settings where general tractability of ERM is not guaranteed, but still guides algorithm design in practice. Specifically, we study two learning frameworks: convolutional neural network learning, and learning conditional random fields from indirect supervision. In either context, the prevalent replacement to global ERM is gradient descent. Since descent could converge to arbitrarily bad local optima, it renders initialization more relevant. In each setting, we develop lightweight training algorithms based on convex procedures. For neural networks, we consider replacing optimization of hidden layers with randomization. For indirect supervision, we reduce estimation to solving a linear system followed by a convex maximum-likelihood step. Via the role of initialization, both algorithms imply conditions under which local descent ought to fare at least as well as they do.
In statistical machine learning, the goal is to train a model that, once deployed in the world, continues to predict accurately on fresh data. A unifying training principle is empirical risk minimization (ERM): globally minimizing the average prediction error incurred on a representative training set. Essentially, ERM prescribes optimization as a proxy for statistical estimation. Learning tasks for which ERM is computationally tractable call for optimization algorithms that scale as efficiently as possible as training set dimensions grow. Other tasks---namely, neural network learning and learning with indirect supervision---elude a general, tractable ERM algorithm in theory, motivating a reformulation of the training problem altogether. In this thesis, we first focus on efficient algorithms for empirical risk minimization in the overdetermined, convex setting with fully-observed data, where ERM is tractable and the aim is optimal computational efficiency. We improve the guaranteed running time for a broad range of problems (including basic regression problems) in terms of the condition number induced by the underlying training set. Atop these methods, we develop an algorithm for principal component regression. Next, we move to settings where general tractability of ERM is not guaranteed, but still guides algorithm design in practice. Specifically, we study two learning frameworks: convolutional neural network learning, and learning conditional random fields from indirect supervision. In either context, the prevalent replacement to global ERM is gradient descent. Since descent could converge to arbitrarily bad local optima, it renders initialization more relevant. In each setting, we develop lightweight training algorithms based on convex procedures. For neural networks, we consider replacing optimization of hidden layers with randomization. For indirect supervision, we reduce estimation to solving a linear system followed by a convex maximum-likelihood step. Via the role of initialization, both algorithms imply conditions under which local descent ought to fare at least as well as they do.
3. Online active learning with linear models [electronic resource] [2017] Online
- Book
- 1 online resource.
In this thesis we address online decision making problems where an agent needs to collect optimal training data to fit statistical models. Decisions on whether to request the response of incoming stochastic observations or not must be made on the spot, and, when selected, the outcomes are immediately revealed. In particular, in the first part, we study scenarios where a single linear model is estimated given a limited budget: only k out of the n incoming observations can be labelled. In the second part, we focus on active learning in the presence of several linear models with heterogeneous unknown noise levels. The goal is to estimate all the models equally well. We design algorithms to efficiently solve the problems, extend them to sparse high-dimensional settings, and derive statistical guarantees in all cases. In addition, we validate our algorithms with synthetic and real-world data, where most of our technical assumptions are violated. Finally, we briefly explore active learning in pure exploration settings for reinforcement learning. In this case, an unknown Markov Decision Process needs to be learned under a fixed budget of episodes.
In this thesis we address online decision making problems where an agent needs to collect optimal training data to fit statistical models. Decisions on whether to request the response of incoming stochastic observations or not must be made on the spot, and, when selected, the outcomes are immediately revealed. In particular, in the first part, we study scenarios where a single linear model is estimated given a limited budget: only k out of the n incoming observations can be labelled. In the second part, we focus on active learning in the presence of several linear models with heterogeneous unknown noise levels. The goal is to estimate all the models equally well. We design algorithms to efficiently solve the problems, extend them to sparse high-dimensional settings, and derive statistical guarantees in all cases. In addition, we validate our algorithms with synthetic and real-world data, where most of our technical assumptions are violated. Finally, we briefly explore active learning in pure exploration settings for reinforcement learning. In this case, an unknown Markov Decision Process needs to be learned under a fixed budget of episodes.
4. Convex optimization for Monte Carlo [electronic resource] : stochastic optimization for importance sampling [2016] Online
- Book
- 1 online resource.
In Monte Carlo simulations it is often essential that a method is accompanied by an appropriate variance reduction method. Reducing the variance of a Monte Carlo method is, at least conceptually, an optimization problem, and mathematical optimization has indeed been used as a theoretical and conceptual tool in this pursuit. However, traditional Monte Carlo methods have only used numerical optimization sparingly, and convex optimization even less. Numerical optimization is study of algorithms for finding a solution to an optimization problem, as opposed to the study of analytical solutions of an optimization problem. Convex optimization is the study of convex optimization problems, a subclass of optimization problems for which efficient algorithms for finding the global optimum exists. In this work we present a framework for using convex optimization for Monte Carlo. More specifically, we present a framework for using stochastic convex optimization for adaptive importance sampling, self-normalized importance sampling, and what-if simulations. The main idea is to perform importance sampling and numerical optimization simultaneously. In particular, the numerical optimization does not rely on blackbox optimization solvers, and this allows the computational cost of each iteration to remain cheap. Because the optimization is performed on a convex problem, we can establish convergence and optimality.
In Monte Carlo simulations it is often essential that a method is accompanied by an appropriate variance reduction method. Reducing the variance of a Monte Carlo method is, at least conceptually, an optimization problem, and mathematical optimization has indeed been used as a theoretical and conceptual tool in this pursuit. However, traditional Monte Carlo methods have only used numerical optimization sparingly, and convex optimization even less. Numerical optimization is study of algorithms for finding a solution to an optimization problem, as opposed to the study of analytical solutions of an optimization problem. Convex optimization is the study of convex optimization problems, a subclass of optimization problems for which efficient algorithms for finding the global optimum exists. In this work we present a framework for using convex optimization for Monte Carlo. More specifically, we present a framework for using stochastic convex optimization for adaptive importance sampling, self-normalized importance sampling, and what-if simulations. The main idea is to perform importance sampling and numerical optimization simultaneously. In particular, the numerical optimization does not rely on blackbox optimization solvers, and this allows the computational cost of each iteration to remain cheap. Because the optimization is performed on a convex problem, we can establish convergence and optimality.
5. Extensions of Bayesian sequential partitioning for multivariate density estimation [electronic resource] [2014] Online
- Book
- 1 online resource.
Density estimation is a fundamental problem in statistic, which can be used as a building block in statistical methods such as classification. Traditional approaches, such as the kernel method, have yielded accurate results when dealing with a moderate amount of data in a low dimensional space, but may perform poorly in higher dimension. Bayesian sequential partitioning (BSP) was proposed to overcome the drawback of traditional methods by applying scalable algorithms to estimate the probability density in higher dimension. The resulting estimate is a piecewise constant function supported on a partition that is learned from the data. In this work, two extensions/enhancements of BSP are proposed to broaden the area of its application. First, a smoothed version of the estimate is obtained by optimizing over a tensor-spline expansion of the log-density condition on the partition learned by BSP. Second, we develop a version of BSP that uses oriented cuts to produce non-rectangular partitions. By allowing more options to build the partition, our estimate is made invariant to the direction of data point distribution. Our numerical experiments show that the resulting estimate can achieve better approximation to the true density compared to standard BSP.
Density estimation is a fundamental problem in statistic, which can be used as a building block in statistical methods such as classification. Traditional approaches, such as the kernel method, have yielded accurate results when dealing with a moderate amount of data in a low dimensional space, but may perform poorly in higher dimension. Bayesian sequential partitioning (BSP) was proposed to overcome the drawback of traditional methods by applying scalable algorithms to estimate the probability density in higher dimension. The resulting estimate is a piecewise constant function supported on a partition that is learned from the data. In this work, two extensions/enhancements of BSP are proposed to broaden the area of its application. First, a smoothed version of the estimate is obtained by optimizing over a tensor-spline expansion of the log-density condition on the partition learned by BSP. Second, we develop a version of BSP that uses oriented cuts to produce non-rectangular partitions. By allowing more options to build the partition, our estimate is made invariant to the direction of data point distribution. Our numerical experiments show that the resulting estimate can achieve better approximation to the true density compared to standard BSP.
Special Collections
Special Collections | Status |
---|---|
University Archives | Request on-site access |
3781 2014 K | In-library use |
- Book
- 1 online resource.
The "Big Data" revolution is spawning systems designed to make decisions from data. Statistics and machine learning has made great strides in prediction and estimation from any fixed dataset. However, if you want to learn to take actions where your choices can affect both the underlying system and the data you observe, you need reinforcement learning. Reinforcement learning builds upon learning from datasets, but also addresses the issues of partial feedback and long term consequences. In a reinforcement learning problem the decisions you make may affect the data you get, and even alter the underlying system for future timesteps. Statistically efficient reinforcement learning requires "deep exploration" or the ability to plan to learn. Previous approaches to deep exploration have not been computationally tractable beyond small scale problems. For this reason, most practical implementations use statistically inefficient methods for exploration such as epsilon-greedy dithering, which can lead to exponentially slower learning. In this dissertation we present an alternative approach to deep exploration through the use of randomized value functions. Our work is inspired by the Thompson sampling heuristic for multi-armed bandits which suggests, at a high level, to "randomly select a policy according to the probability that it is optimal". We provide insight into why this algorithm can be simultaneously more statistically efficient and more computationally efficient than existing approaches. We leverage these insights to establish several state of the art theoretical results and performance guarantees. Importantly, and unlike previous approaches to deep exploration, this approach also scales gracefully to complex domains with generalization. We complement our analysis with extensive empirical experiments; these include several didactic examples as well as a recommendation system, Tetris, and Atari 2600 games.
The "Big Data" revolution is spawning systems designed to make decisions from data. Statistics and machine learning has made great strides in prediction and estimation from any fixed dataset. However, if you want to learn to take actions where your choices can affect both the underlying system and the data you observe, you need reinforcement learning. Reinforcement learning builds upon learning from datasets, but also addresses the issues of partial feedback and long term consequences. In a reinforcement learning problem the decisions you make may affect the data you get, and even alter the underlying system for future timesteps. Statistically efficient reinforcement learning requires "deep exploration" or the ability to plan to learn. Previous approaches to deep exploration have not been computationally tractable beyond small scale problems. For this reason, most practical implementations use statistically inefficient methods for exploration such as epsilon-greedy dithering, which can lead to exponentially slower learning. In this dissertation we present an alternative approach to deep exploration through the use of randomized value functions. Our work is inspired by the Thompson sampling heuristic for multi-armed bandits which suggests, at a high level, to "randomly select a policy according to the probability that it is optimal". We provide insight into why this algorithm can be simultaneously more statistically efficient and more computationally efficient than existing approaches. We leverage these insights to establish several state of the art theoretical results and performance guarantees. Importantly, and unlike previous approaches to deep exploration, this approach also scales gracefully to complex domains with generalization. We complement our analysis with extensive empirical experiments; these include several didactic examples as well as a recommendation system, Tetris, and Atari 2600 games.
Articles+
Journal articles, e-books, & other e-resources
- Articles+ results include