- Book
- xiv, 183 p.
SAL3 (off-campus storage), Special Collections
| SAL3 (off-campus storage) | Status |
| --- | --- |
| Stacks | Request |
| 3781 2009 A | Available |

| Special Collections | Status |
| --- | --- |
| University Archives | Request on-site access |
| 3781 2009 A | In-library use |
- Book
- xvii, 134 leaves, bound.
SAL3 (off-campus storage), Special Collections
| SAL3 (off-campus storage) | Status |
| --- | --- |
| Stacks | Request |
| 3781 2009 C | Available |

| Special Collections | Status |
| --- | --- |
| University Archives | Request on-site access |
| 3781 2009 C | In-library use |
- Book
- xiv, 112 leaves bound.
SAL3 (off-campus storage), Special Collections
| SAL3 (off-campus storage) | Status |
| --- | --- |
| Stacks | Request |
| 3781 2007 B | Available |

| Special Collections | Status |
| --- | --- |
| University Archives | Request on-site access |
| 3781 2007 B | In-library use |
4. Mathematical models in homeland security [2006]
- Book
- xiii, 93 leaves, bound.
SAL3 (off-campus storage), Special Collections
| SAL3 (off-campus storage) | Status |
| --- | --- |
| Stacks | Request |
| 3781 2006 L | Available |

| Special Collections | Status |
| --- | --- |
| University Archives | Request on-site access |
| 3781 2006 L | In-library use |
- Book
- ix, 100 leaves.
SAL3 (off-campus storage), Special Collections
| SAL3 (off-campus storage) | Status |
| --- | --- |
| Stacks | Request |
| 3781 2011 J | Available |

| Special Collections | Status |
| --- | --- |
| University Archives | Request on-site access |
| 3781 2011 J | In-library use |
6. Data-sparse algorithms for structured matrices [electronic resource] [2017] Online
- Book
- 1 online resource.
In the first part of the dissertation, we present a method for updating certain hierarchical factorizations for solving linear integral equations with elliptic kernels. In particular, given a factorization corresponding to some initial geometry or material parameters, we can locally perturb the geometry or coefficients and update the initial factorization to reflect this change with asymptotic complexity that is polylogarithmic in the total number of unknowns and linear in the number of perturbed unknowns. In the second part, we consider the application of hierarchical factorizations to the problem of spatial Gaussian process maximum likelihood estimation, i.e., parameter fitting for kriging. We present a framework for scattered (quasi-)two-dimensional observations using skeletonization factorizations to quickly evaluate the Gaussian process log-likelihood and its gradient, which we use in the context of black-box numerical optimization for parameter fitting of low-dimensional Gaussian processes. Finally, we introduce the strong recursive skeletonization factorization (RS-S), a new approximate matrix factorization based on recursive skeletonization for solving discretizations of linear integral equations associated with elliptic partial differential equations in two and three dimensions (and other matrices with similar hierarchical rank structure). RS-S uses a simple modification of skeletonization, strong skeletonization, which compresses only far-field interactions. We further combine the strong skeletonization procedure with alternating near-field compression to obtain the hybrid recursive skeletonization factorization (RS-WS), a modification of RS-S that exhibits reduced storage cost in many settings.
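The Gaussian process log-likelihood evaluation that the skeletonization factorizations accelerate can be sketched with a dense Cholesky baseline. The squared-exponential kernel and the parameter values below are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

# Dense-baseline evaluation of the Gaussian process log-likelihood; the
# hierarchical factorizations in the abstract reduce the O(n^3) Cholesky
# cost.  Kernel choice and parameters here are illustrative assumptions.
def gp_loglik(X, y, length_scale, noise):
    """log N(y | 0, K + noise*I) for observations y at locations X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / length_scale**2) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)                            # dense factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                   # -1/2 log det K
            - 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))          # scattered quasi-2D observations
y = rng.standard_normal(50)
print(gp_loglik(X, y, length_scale=0.5, noise=0.1))
```

In the black-box optimization setting described above, this scalar (and its gradient) is what gets evaluated repeatedly at candidate parameter values.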
7. Dynamic panel data analytics and a martingale approach to evaluation of econometric forecasts and its applications [electronic resource] [2017] Online
- Book
- 1 online resource.
Dynamic panel data models are widely studied and used in econometrics and biostatistics. We first introduce mixed models, which have many applications in risk analytics in finance and insurance, to analyze dynamic panel data. We next develop a martingale approach to the evaluation of forecasting methods. We apply this approach to develop statistical inference procedures for a widely used measure, the accuracy ratio, for the evaluation of probability forecasts. A comprehensive asymptotic theory for the time-adjusted accuracy ratio is established, taking care of the time series aspects of the data. Our martingale approach does not require subjective modeling of the data, such as assuming whether time series are stationary, which is an obvious advantage in regulatory environments. We carry out an empirical study on default predictions of small and medium-sized enterprises. We propose using the generalized linear mixed model (GLMM) to model the default probabilities, whereas logistic regression is the most widely used method in the literature. We use the accuracy ratio to compare the forecasting methods, and the confidence intervals for the accuracy ratios show that the GLMM outperforms the logistic regression forecasts.
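The accuracy ratio used to compare the forecasts equals twice the area under the ROC curve minus one (the Gini coefficient of the CAP curve); a minimal rank-based computation, ignoring ties for brevity:

```python
import numpy as np

# Accuracy ratio (AR) of a probability forecast: AR = 2*AUC - 1, with AUC
# estimated from ranks (Mann-Whitney form).  Tie handling is omitted.
def accuracy_ratio(scores, defaults):
    """scores: predicted default probabilities; defaults: 0/1 outcomes."""
    scores, defaults = np.asarray(scores, float), np.asarray(defaults)
    ranks = scores.argsort().argsort() + 1
    n1 = defaults.sum()                    # number of defaulters
    n0 = len(defaults) - n1                # number of survivors
    auc = (ranks[defaults == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)
    return 2 * auc - 1

# A perfect forecast ranks every defaulter above every survivor: AR = 1.
print(accuracy_ratio([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # -> 1.0
```

The thesis's contribution is the martingale-based asymptotic theory and confidence intervals for this statistic, not the point estimate itself.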
8. Genome-wide chromatin accessibility prediction with deep neural networks [electronic resource] [2017] Online
- Book
- 1 online resource.
Transcriptional regulation is closely associated with chromatin accessibility. Critical regulatory elements such as enhancers and repressors undergo dynamic structural variation in chromatin to initiate gene regulation during cell differentiation. Large collaborative consortia have applied genome-wide chromatin accessibility assays to hundreds of human cell types to investigate the mechanism of cell differentiation. Many integrative analyses focus on identifying epigenetic changes in chromatin structure that can lead to differential gene expression. However, it remains a mystery how chromatin accessibility is governed in the first place. To address this problem, we present a standard pipeline for processing sequencing data from the two most commonly adopted chromatin accessibility assays, DNase-seq and ATAC-seq, and then introduce a deep neural network (DNN) approach to learn the underlying rules of chromatin regulation and predict open genomic regions in a wide variety of human cell types, using only the gene expression of transcription factors (TFs). We train DNNs on 682,425 enhancer-related regulatory regions with 620 TFs that have sequence motifs in 100 cell conditions, and demonstrate high prediction accuracy. Our model generalizes well to new cell conditions that it has not been trained on. The learned coefficients of our input features reveal useful insights into the functional roles of TFs. Therefore, our DNN offers an interpretable model of the transcriptional regulation of chromatin accessibility.
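The shape of the prediction task can be sketched as follows. This is a minimal one-hidden-layer forward pass with random weights, not the thesis architecture; only the input dimension of 620 TFs comes from the abstract, everything else is an illustrative assumption:

```python
import numpy as np

# Sketch of the task: map the expression of TFs that have motifs in a
# region to P(region is open).  Architecture and sizes are illustrative.
rng = np.random.default_rng(0)
n_tfs, n_hidden = 620, 32              # 620 TFs as in the abstract

W1 = rng.standard_normal((n_tfs, n_hidden)) * 0.05
b1 = np.zeros(n_hidden)
w2 = rng.standard_normal(n_hidden) * 0.05
b2 = 0.0

def predict_open(tf_expression):
    """tf_expression: (n_tfs,) vector for one region/cell-condition pair."""
    h = np.maximum(tf_expression @ W1 + b1, 0.0)      # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))       # sigmoid -> P(open)

p = predict_open(rng.uniform(size=n_tfs))
print(p)   # a probability in (0, 1)
```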
9. Sparse factorizations and scalable algorithms for differential and integral operators [electronic resource] [2017] Online
- Book
- 1 online resource.
Sparse factorizations and scalable algorithms for elliptic differential operators and Fourier integral operators (FIOs) are presented in this work. The former operators are solved by the distributed-memory hierarchical interpolative factorization (DHIF), whereas the latter are addressed by the butterfly factorization. By exploiting locality and certain low-rank properties of the elliptic differential operators, the hierarchical interpolative factorization achieves quasi-linear complexity for factorizing the discrete positive definite elliptic differential operator and linear complexity for solving the associated linear system. In this dissertation, the DHIF is introduced as a scalable and distributed-memory implementation of the hierarchical interpolative factorization. The DHIF organizes the processes in a hierarchical structure and keeps the communication as local as possible. The computational complexity is O(N/P log N) for constructing and O(N/P) for applying the DHIF, where N is the size of the problem and P is the number of processes. Extensive numerical examples are performed on the NERSC Edison system with up to 8192 processes. The numerical results agree with the complexity analysis and demonstrate the efficiency and scalability of the DHIF. The butterfly factorization is a data-sparse approximation for FIOs as well as other matrices that satisfy a complementary low-rank property. The factorization can be constructed efficiently if either fast algorithms for applying the matrix and its adjoint are available or the entries of the matrix can be sampled individually. For an N by N matrix, the resulting factorization is a product of O(log N) sparse matrices, each with O(N) non-zero entries. Hence, it can be applied rapidly in O(N log N) operations. Numerical results are provided to demonstrate the effectiveness of the butterfly factorization and its construction algorithms.
For the kernel matrices of multidimensional FIOs, for which the complementary low-rank property is usually not satisfied due to a singularity at the origin, we extend this factorization by combining it with a multiscale decomposition of the integration domain to overcome the singularity. Numerical results are provided to demonstrate the efficiency of the proposed algorithms as well.
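The complementary low-rank property can be checked empirically on the DFT matrix, a standard example of a butterfly-compressible operator: any contiguous block whose row count times column count is at most N has low numerical rank. The block size and tolerance below are illustrative choices:

```python
import numpy as np

# Complementary low-rank check on the N x N DFT matrix: a 32 x 32 block
# (32 * 32 = N) should have numerical rank well below min(32, 32).
N = 1024
rows = np.arange(N)[:, None]
cols = np.arange(N)[None, :]
F = np.exp(-2j * np.pi * rows * cols / N)

block = F[:32, :32]
s = np.linalg.svd(block, compute_uv=False)
numerical_rank = int((s > 1e-8 * s[0]).sum())
print(numerical_rank)    # noticeably below 32, despite F itself being full rank
```

It is this rank deficiency across a whole hierarchy of such blocks that the butterfly factorization turns into a product of O(log N) sparse factors.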
10. Tools for higher-order network analysis [electronic resource] [2017] Online
- Book
- 1 online resource.
Networks are a fundamental model of complex systems throughout the sciences, and network datasets are typically analyzed through lower-order connectivity patterns described at the level of individual nodes and edges. However, higher-order connectivity patterns captured by small subgraphs, also called network motifs, describe the fundamental structures that control and mediate the behavior of many complex systems. We develop three tools for network analysis that use higher-order connectivity patterns to gain new insights into network datasets: (1) a framework to cluster nodes into modules based on joint participation in network motifs; (2) a generalization of the clustering coefficient measurement to investigate higher-order closure patterns; and (3) a definition of network motifs for temporal networks and fast algorithms for counting them. Using these tools, we analyze data from biology, ecology, economics, neuroscience, online social networks, scientific collaborations, telecommunications, transportation, and the World Wide Web.
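The simplest higher-order pattern, the triangle motif, can be counted directly from the adjacency matrix; this standard identity is a minimal illustration of motif counting, not the thesis's temporal-motif algorithms:

```python
import numpy as np

# trace(A^3) counts closed walks of length 3; each triangle contributes
# 6 of them (3 starting vertices x 2 orientations).
def count_triangles(A):
    return int(np.trace(A @ A @ A)) // 6

# A 4-clique contains C(4,3) = 4 triangles.
A = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
print(count_triangles(A))  # -> 4
```

Motif-based clustering as described above generalizes this idea: instead of just counting motifs, nodes are grouped by how often they co-participate in them.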
11. Algebraic theory of probabilistic machines [electronic resource] [2016] Online
- Book
- 1 online resource.
A definition of a probabilistic automaton is formulated in which its prime decomposition follows as a direct consequence of the Krohn-Rhodes theorem. We characterize Green's relations on the monoid of stochastic matrices. The reduced holonomy monoid is induced by the holonomy decomposition of Eilenberg, which is a variant of the prime decomposition. We prove that its representation theory is determined by an iterated wreath product of its holonomy groups.
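The monoid structure underlying this analysis is easy to see concretely: row-stochastic matrices (the transition matrices of a probabilistic automaton) are closed under multiplication and contain the identity. A quick numerical check of this closure:

```python
import numpy as np

# Row-stochastic matrices form a monoid: products of row-stochastic
# matrices are row-stochastic, and the identity is row-stochastic.
def is_stochastic(M, tol=1e-12):
    return bool(np.all(M >= -tol) and np.allclose(M.sum(axis=1), 1.0))

rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(3), size=3)   # random 3x3 row-stochastic matrix
B = rng.dirichlet(np.ones(3), size=3)
print(is_stochastic(A @ B))  # -> True: closure under composition
```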
12. Convex optimization for Monte Carlo [electronic resource] : stochastic optimization for importance sampling [2016] Online
- Book
- 1 online resource.
In Monte Carlo simulations it is often essential that a method be accompanied by an appropriate variance reduction technique. Reducing the variance of a Monte Carlo method is, at least conceptually, an optimization problem, and mathematical optimization has indeed been used as a theoretical and conceptual tool in this pursuit. However, traditional Monte Carlo methods have only used numerical optimization sparingly, and convex optimization even less. Numerical optimization is the study of algorithms for finding a solution to an optimization problem, as opposed to the study of analytical solutions of an optimization problem. Convex optimization is the study of convex optimization problems, a subclass of optimization problems for which efficient algorithms for finding the global optimum exist. In this work we present a framework for using convex optimization for Monte Carlo. More specifically, we present a framework for using stochastic convex optimization for adaptive importance sampling, self-normalized importance sampling, and what-if simulations. The main idea is to perform importance sampling and numerical optimization simultaneously. In particular, the numerical optimization does not rely on black-box optimization solvers, and this allows the computational cost of each iteration to remain cheap. Because the optimization is performed on a convex problem, we can establish convergence and optimality.
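Self-normalized importance sampling, one of the settings treated above, estimates an expectation under a target density using weighted samples from a proposal, with weights known only up to a constant. A minimal example with an illustrative Gaussian target and proposal (the true mean is 2):

```python
import numpy as np

# Self-normalized importance sampling: estimate E_p[f(X)] with draws from
# a proposal q.  Target p = N(2, 1), proposal q = N(0, 2^2), f(x) = x.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 2.0, size=n)                  # draws from q
# log p - log q up to an additive constant (constants cancel on normalizing)
log_w = -0.5 * (x - 2.0) ** 2 + 0.5 * (x / 2.0) ** 2 + np.log(2.0)
w = np.exp(log_w - log_w.max())                   # stabilized weights
estimate = (w * x).sum() / w.sum()                # self-normalization
print(estimate)                                   # close to 2.0
```

The adaptive methods in the thesis go further: the proposal itself is tuned by stochastic convex optimization while sampling proceeds.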
13. Factor models, mean-reversion time, and statistical arbitrage [electronic resource] [2016] Online
- Book
- 1 online resource.
Financial markets are well-defined complex systems, which are continuously monitored and recorded, and generate tremendous amounts of data. This thesis is an attempt to analyze financial markets as high-dimensional and dynamic systems. In dealing with high-dimensional data sets, factor models are often useful for dimension reduction. In the first part of this thesis, we present a new approach to estimating high-dimensional factor models, using the empirical spectral density of residuals. The spectrum of covariance matrices from financial data typically exhibits two characteristic aspects: a few spikes and a bulk. The former represent factors that mainly drive the features, and the latter arises from idiosyncratic noise. Motivated by these two aspects, we consider a minimum distance between two spectra: one from a covariance structure model and the other from the real residuals of financial data, obtained by subtracting principal components. Our method simultaneously provides estimators of the number of factors and information about correlation structures in the residuals. Using free random variable techniques, the proposed algorithm can be implemented and controlled effectively. Monte Carlo simulations confirm that our method is robust to noise and to the presence of weak factors. Furthermore, the application to financial time series shows that our estimators capture essential aspects of market dynamics. In the second part of this thesis, we turn our attention to more practical applications using residual dynamics. In this part, we study the risk of mean-reversions in statistical arbitrage. The basic concept of statistical arbitrage is to exploit short-term deviations in returns from a long-term equilibrium across several assets. This kind of strategy relies heavily on the assumption that idiosyncratic returns are mean-reverting, returning to a long-term mean after a certain amount of time, but literature assessing the risk of this assumption is rare.
In this chapter, we propose a scheme that controls the risk on mean-reversions, via portfolio selections and screenings. Recognizing that each residual has a different mean-reversion time, the residuals that are fast mean-reverting are selected to form portfolios. In addition, further control is imposed by allowing trading activity only when the goodness-of-fit of the estimation for trading signals is sufficiently high. We formulate the dynamic asset allocation with the market- and dollar-neutrality conditions as a constrained optimization problem, which is solved numerically. The improved reliability and robustness of our strategy is demonstrated through back-testing with real data. It is found that the performance of this investment framework is robust to market changes. We further provide answers to puzzles regarding relations among the number of factors, the length of the estimation window, and transaction costs, which are crucial parameters that have direct impacts on the investment strategy. In the third part, we consider Wishart processes, which are good candidates for modeling a stochastic covariance matrix. Specifically, we use perturbation theory to derive the eigenvalues and eigenvectors of the Wishart processes. These processes have the interesting feature that each eigenvalue is part of a multi-dimensional stochastic system of non-colliding particles, which has been widely studied in mathematical physics. In addition, we implement a numerical scheme to effectively simulate eigenvalues of the square Ornstein-Uhlenbeck processes. In this scheme, the Yosida approximation is exploited to resolve the difficulty that arises from the singular drift terms. The results in this chapter contribute to the dynamic modeling of stochastic covariance matrices and their financial applications.
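Mean-reversion time can be estimated from discretely sampled residuals in a standard way: Ornstein-Uhlenbeck dynamics sampled at interval dt follow an AR(1) with slope b = exp(-kappa*dt), so the mean-reversion time 1/kappa is recoverable from an OLS fit. The parameters below are illustrative, not from the thesis:

```python
import numpy as np

# OU mean-reversion time from an AR(1) fit: x_{t+1} = b*x_t + noise,
# with b = exp(-kappa*dt).  Parameters are illustrative.
rng = np.random.default_rng(0)
kappa, dt, n = 5.0, 1 / 252, 100_000       # true rate, daily step, samples
b_true = np.exp(-kappa * dt)
x = np.zeros(n)
for t in range(n - 1):                     # exact AR(1) simulation of OU
    x[t + 1] = b_true * x[t] + np.sqrt(1 - b_true**2) * rng.standard_normal()

b_hat = np.polyfit(x[:-1], x[1:], 1)[0]    # OLS slope of x_{t+1} on x_t
kappa_hat = -np.log(b_hat) / dt
print(1 / kappa_hat)                       # mean-reversion time, ~0.2 years
```

A screening rule of the kind described above would keep only residuals whose estimated 1/kappa is short and whose regression fit is good.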
14. Mathematical modeling for public health policy in resource limited settings [electronic resource] [2016] Online
- Book
- 1 online resource.
The goal of the research presented in this dissertation is to present a data-driven policy-making approach in three different settings. In the first setting, we look at using data to inform policies to supply supplementary food aid for undernourished children in Guatemala. We fit a trivariate model of weight-for-age z score (WAZ), height-for-age z score (HAZ) and diarrhea status to data from an observational study of supplementary feeding in 17 Guatemalan communities. We estimate how the effect of supplementary food on WAZ, HAZ and diarrhea status varies with a child's age. We find that the effect of supplementary food on all 3 metrics decreases linearly with age from 6 to 20 mo and has little effect after 20 mo. We derive 2 food allocation policies that myopically (i.e., looking ahead 2 mo) minimize either the underweight or stunting severity -- i.e., the sum of squared WAZ or HAZ scores for all children with WAZ or HAZ < 0. A simulation study based on the statistical model predicts that in a low-dose (100 kCal/day) supplementary feeding setting in Guatemala, allocating food primarily to 6-12 mo infants can reduce the severity of underweight and stunting. In the second setting, we analyze how targeting interventions to the most undernourished would perform in reducing the malaria burden amongst children in sub-Saharan Africa. We construct a malaria model with superinfection and heterogeneous susceptibility, where a portion of this susceptibility is due to undernutrition (as measured by weight-for-age z scores); so as to isolate the impact of supplementary food on malaria from the influence of confounding factors, we estimate the portion of the total susceptibility that is due to undernutrition from a large randomized trial of supplementary feeding.
A simulation study based on the malaria model suggests that targeting insecticide-treated bed nets to undernourished children leads to fewer malaria deaths than the random distribution of bed nets in the hypoendemic and mesoendemic settings. In the third setting, we quantify tuberculosis (TB) transmission risk within and outside cells using high-resolution social contact data of prisoners in a large prison in the state of Mato Grosso do Sul, Brazil. We then develop and parameterize a model of TB natural history and transmission to evaluate the impact of TB control interventions in prisons on TB incidence amongst the prisoners. A simulation study using this model suggests that any single intervention (improved case detection, decrowding or active case detection) would be insufficient to bring TB incidence back to the levels of the early 2000s, and that a combination of interventions would be necessary. The results of these studies demonstrate that mathematical modeling can be a powerful tool in understanding the interplay of factors behind the policy in question and can help policy makers make informed decisions.
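The myopic severity-minimizing policy described above can be illustrated with a small greedy sketch: each ration goes to the child whose squared negative WAZ it reduces the most. Everything here (the linear-in-age `effect` function, the toy data, the per-ration gain) is a hypothetical stand-in for the fitted trivariate model, not the study's actual estimates.

```python
import numpy as np

def allocate_food(waz, ages, n_rations, effect):
    """Greedily assign food rations to minimize underweight severity,
    i.e. the sum of squared WAZ over children with WAZ < 0.
    `effect(age)` gives a hypothetical per-ration WAZ gain."""
    waz = waz.astype(float).copy()
    rations = np.zeros(len(waz), dtype=int)
    for _ in range(n_rations):
        # severity reduction if each child received one more ration
        gain = np.minimum(waz, 0.0) ** 2 - np.minimum(waz + effect(ages), 0.0) ** 2
        i = int(np.argmax(gain))
        waz[i] += effect(ages[i])
        rations[i] += 1
    return rations

# toy effect: benefit declines linearly from 6 to 20 mo, zero afterwards
effect = lambda age: np.clip((20 - age) / 14, 0.0, 1.0) * 0.1

waz = np.array([-2.5, -0.5, -1.8, 0.3])
ages = np.array([8, 8, 24, 10])
print(allocate_food(waz, ages, 5, effect))
```

Under this toy rule, rations concentrate on the youngest, most underweight children, consistent with the abstract's finding that allocating food primarily to young infants reduces severity the most.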
15. Modeling information flow in networks [electronic resource] : competition, evolution, and external influence [2016] Online
- Book
- 1 online resource.
In online social networks such as Twitter and Facebook, users are constantly sharing information with people they are connected to, as well as re-sharing information posted by others. Through this process, a single piece of information called a contagion can spread from user to user over the connections until it has reached a large portion of the network. In this thesis, we develop a series of probabilistic methods for modeling the spread of contagions in social networks in order to better understand the factors that affect the process. Our work examines several different phenomena that affect information flows through social networks. One such phenomenon is unobserved sources of information influencing members of the network. We present a model that not only quantifies these hidden information sources but also provides a more accurate view of information spread. We find that as much as 29% of all information spreading through social networks like Twitter originates from sources outside the network. Next, we examine how different contagions spreading through a network can interact with each other. We observe and model competition (when one contagion can decrease the spread of another contagion) and cooperation (when one contagion increases the spread of another). We find that contagion interaction can increase or decrease the probability of contagion spread by more than 70% on average. We also explore the dynamic nature of social network structure, and how these dynamics are affected by the spread of information. As social network users are exposed to new contagions, they are constantly forming new connections with other users as well as deleting connections. We find that the spread of contagions can cause sudden "bursts" in both the creation of new connections and the deletion of old connections. 
We also find that contagions can change the network structure on a global scale by moving like-minded members closer to each other as well as pushing less similar users farther away. Additionally, we consider the problem of inferring the structure of a hidden network when only patterns of information spread are known. Given only the timing of when each user adopted each piece of information, our method accurately predicts which users are connected to which other users by converting the maximum likelihood estimation of the network into a series of independent convex optimization problems that can be solved efficiently. Taken together, the results presented in this thesis make contributions to understanding how information flows through networks, and they provide insights into the nature of how people exchange information.
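Decomposing the network-inference likelihood into independent per-node convex problems can be illustrated with a small sketch in the style of continuous-time diffusion inference with exponential transmission rates (one common formulation of this idea; the thesis's exact model may differ). The function name, toy cascades, and step sizes below are illustrative assumptions.

```python
import numpy as np

def infer_incoming_rates(node, cascades, T, steps=2000, lr=0.05):
    """Projected gradient descent on the (convex) negative log-likelihood
    of exponential transmission rates alpha[j] into `node`. Each cascade
    maps node -> infection time (absent = uninfected by horizon T)."""
    others = sorted({j for c in cascades for j in c if j != node})
    idx = {j: k for k, j in enumerate(others)}
    alpha = np.full(len(others), 0.5)
    for _ in range(steps):
        grad = np.zeros_like(alpha)
        for c in cascades:
            ti = c.get(node, T)
            parents = [j for j, tj in c.items() if j != node and tj < ti]
            for j in parents:
                grad[idx[j]] += ti - c[j]                 # survival term
            if node in c and parents:
                hz = sum(alpha[idx[j]] for j in parents)  # total hazard at ti
                for j in parents:
                    grad[idx[j]] -= 1.0 / (hz + 1e-12)    # -log(hazard) term
        alpha = np.maximum(alpha - lr * grad, 0.0)        # project onto alpha >= 0
    return dict(zip(others, alpha))

# toy data: node 2 is infected shortly after node 0, never after node 1
cascades = [{0: 0.0, 2: 0.5}, {0: 0.0, 2: 0.5}, {0: 0.0, 2: 0.5},
            {1: 0.0}, {1: 0.0}]
print(infer_incoming_rates(2, cascades, T=10.0))
```

Because the problem for each node involves only that node's incoming rates, the per-node solves are independent and can run in parallel, which is what makes this decomposition efficient.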
16. Polynomial and rational approximation techniques for non-intrusive uncertainty quantification [electronic resource] [2016] Online
- Book
- 1 online resource.
With the ever-increasing power of computers and the advent of parallel computing in recent decades, scientists and engineers are relying more and more on numerical simulations in their studies of complex physical systems. Given the central role that simulations play in engineering design and decision making, it is crucial to assign confidence to their outputs. One important factor that leads to uncertainty in the output quantity of interest of a physical system is uncertainty in inputs such as material properties, manufacturing details, and initial and boundary conditions. Our goal in uncertainty quantification (UQ) is, by definition, to quantify the effects of these input uncertainties on the output quantity of interest. In other words, we are trying to describe the behavior of this output variable as a function of the input uncertainties. In intrusive UQ strategies, we solve the governing equations of the physical system in terms of both the physical and the uncertain variables. Solving these equations is usually significantly more challenging than solving the original (deterministic) equations, and often requires writing new code that is substantially different from the deterministic solver. In non-intrusive UQ, on the other hand, we run the deterministic code for various values of the input parameters and use the outputs of these simulations to construct an approximation of the behavior of the output quantity of interest as a function of the input uncertainties. Although non-intrusive methods come in many flavors, they are all based on some variation of the following fundamental problem in approximation theory: given the values of a function at a set of points in its domain, how can we efficiently and accurately approximate that function? In this dissertation, we study several variants of this problem. 
Sometimes we are free to choose the points at which the function is evaluated (i.e., the values of the input parameters for which we run the deterministic simulation). In this scenario, we need to choose the points so that, for a given budget of function evaluations, we obtain the best quality of approximation. As an instance of the problem in this setting, we look at a non-intrusive polynomial chaos expansion (PCE) technique, in which we use weighted least squares to construct a multivariate polynomial surrogate. We present a novel optimization-based method for finding the best points for this type of approximation. We are not always free to choose the grid points, however, and sometimes have to find the best approximation we can using a fixed set of points that is simply given to us. For these problems, in univariate settings, we present an efficient and accurate method based on Floater-Hormann rational interpolation. For multivariate settings, we present a generalization of nearest-neighbor interpolation based on L1 minimization. This method has convergence properties similar to those of the moving least squares method but, unlike moving least squares, it does not come with any tunable parameters. We also look at a hybrid setting, where some of the points are fixed and we are free to choose the rest. Suppose that we have found the polynomial interpolant of a function at a set of Chebyshev points and decide that we need a higher-order polynomial interpolant. We present a method for finding the best points to use for the higher-order interpolant, under the constraint that the previous set of points has to be reused.
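Barycentric evaluation is the standard machinery behind both Chebyshev and Floater-Hormann interpolants. As a simple illustration of the fixed-evaluation-point theme, here is a minimal sketch of barycentric interpolation at Chebyshev points of the second kind (textbook material, not the thesis's new method):

```python
import numpy as np

def cheb_interp(f, n):
    """Barycentric interpolation at the n+1 Chebyshev points of the
    second kind, x_k = cos(k*pi/n). Weights are (-1)^k, halved at the
    endpoints. Returns a callable degree-n interpolant on [-1, 1]."""
    x = np.cos(np.pi * np.arange(n + 1) / n)
    fx = f(x)
    w = (-1.0) ** np.arange(n + 1)
    w[0] *= 0.5
    w[-1] *= 0.5

    def p(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        out = np.empty_like(t)
        for i, ti in enumerate(t):
            d = ti - x
            j = int(np.argmin(np.abs(d)))
            if abs(d[j]) < 1e-14:          # query hits a node exactly
                out[i] = fx[j]
            else:
                c = w / d                   # barycentric formula
                out[i] = (c @ fx) / c.sum()
        return out

    return p

runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)
p = cheb_interp(runge, 50)
xs = np.linspace(-1.0, 1.0, 101)
print(np.max(np.abs(p(xs) - runge(xs))))
```

For the Runge function, which diverges under interpolation at equispaced points, Chebyshev points give fast geometric convergence; the same barycentric formula with different weights yields the Floater-Hormann family on arbitrary fixed grids.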
17. Probability distribution methods for nonlinear transport in heterogeneous porous media [electronic resource] [2016] Online
- Book
- 1 online resource.
Because geophysical data are inexorably sparse and incomplete, stochastic treatments of simulated responses are crucial to explore possible scenarios and assess risks in subsurface problems. In particular, nonlinear two-phase flow in porous media is essential, yet challenging, in reservoir simulation and hydrology. Adding highly heterogeneous and uncertain input, such as the permeability and porosity fields, transforms the estimation of the flow response into a tough stochastic problem for which computationally expensive Monte Carlo simulations remain the preferred option. In this thesis, we first propose an alternative approach to evaluate the probability distribution of the (water) saturation for nonlinear transport in strongly heterogeneous porous systems. We build a physics-based, computationally efficient and numerically accurate method to estimate the one-point probability density and cumulative distribution functions of the saturation. The distribution method draws inspiration from a Lagrangian approach to the stochastic transport problem and expresses the saturation probability density function and cumulative distribution function essentially in terms of a deterministic nonlinear mapping of scalar random fields. In a large class of applications these random fields are smooth and can be estimated at low computational cost (a few Monte Carlo runs), thus making the distribution method attractive. Once the saturation distribution is determined, any one-point statistics thereof can be obtained, especially the saturation average and standard deviation. More importantly, the probability of rare events and saturation quantiles (e.g., P10, P50 and P90) can be efficiently derived from the distribution method. These statistics can then be used for risk assessment, as well as data assimilation and uncertainty reduction in the prior knowledge of geophysical input distributions. 
We provide various examples and comparisons with existing methods to illustrate the performance and applicability of the newly developed method. In the second part of the thesis, we present a procedure to analytically obtain the multi-point cumulative distribution function of the saturation for the stochastic two-phase Buckley-Leverett model with a random total-velocity field. The multi-point distribution function is determined by first deriving a partial differential equation for the saturation raw cumulative distribution function at each point, then combining these equations into a single partial differential equation for the multi-point raw cumulative distribution function. This latter stochastic partial differential equation, linear in the space-time variables, can be solved in closed form and semi-analytically for one-dimensional spatial problems, or numerically for higher spatial dimensions. Finally, the ensemble average of its solution gives the saturation multi-point cumulative distribution function. We provide numerical results of distribution function profiles in one spatial dimension and for two points. In addition, we use the two-point distribution method to compute the saturation auto-covariance function, which is essential for data assimilation. We confirm the validity of the method by comparing covariance results obtained with the multi-point distribution method and Monte Carlo simulations.
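The core idea of the distribution method, pushing randomness through a deterministic nonlinear mapping instead of sampling, can be illustrated in a toy one-point setting: if a saturation-like quantity is a monotone map g of a standard normal variable, its quantiles follow directly from the inverse normal CDF with no Monte Carlo. The sigmoid map below is a hypothetical stand-in for the actual saturation mapping, used only to show the mechanism.

```python
import numpy as np
from statistics import NormalDist

def mapped_quantiles(g, probs):
    """One-point quantiles of S = g(Z), Z ~ N(0,1), for a monotone
    increasing map g, computed without sampling: S_p = g(Phi^{-1}(p))."""
    nd = NormalDist()
    return [g(nd.inv_cdf(p)) for p in probs]

# hypothetical saturation-like mapping into (0, 1)
g = lambda z: 1.0 / (1.0 + np.exp(-z))

p10, p50, p90 = mapped_quantiles(g, [0.1, 0.5, 0.9])

# Monte Carlo cross-check of the same quantiles
rng = np.random.default_rng(0)
s = g(rng.standard_normal(200_000))
mc = np.quantile(s, [0.1, 0.5, 0.9])
print(p10, p50, p90, mc)
```

The mapped quantiles match the Monte Carlo estimates at a tiny fraction of the cost, which is the appeal of the approach when the underlying random fields are cheap to characterize.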
- Book
- 1 online resource.
As the ability to generate datasets with billions of records in relatively automated ways continues to grow, new challenges are posed to modern large-scale data analysis from several points of view, such as scalability, feasibility, and interpretability. Thus, improved algorithms on large-scale data platforms are of interest. Recent years have seen great interest in Randomized Linear Algebra (RLA) algorithms. RLA is an interdisciplinary research area that exploits randomization as a computational resource for the development of improved algorithms for common matrix problems such as least-squares approximation, low-rank matrix approximation, and Laplacian-based linear equation solvers. In this thesis, our focus is on the underlying theory and practical implementation of RLA algorithms. In particular: Chapter 3 describes a novel sampling algorithm for large-scale over-determined quantile regression problems whose running time is roughly proportional to the number of non-zero elements in the matrix plus a term that depends only on the low dimension. Chapter 4 describes a hybrid algorithm named pwSGD (preconditioned weighted stochastic gradient descent) that combines RLA and SGD. We prove that pwSGD inherits faster convergence rates that depend only on the lower dimension of the linear system, while maintaining low computational complexity. Chapter 5 describes a class of randomized Newton-type algorithms that exploit non-uniform sub-sampling as well as inexact updates for a class of constrained optimization problems. We show that our methods are more robust to ill-conditioned problems than similar previous approaches. Chapter 6 presents results of implementing RLA algorithms for least-squares, quantile regression, and CX decomposition problems in parallel and distributed environments using Apache Spark, and discusses various tradeoffs in the implementations. 
In particular, we demonstrate that least-squares problems with up to terabyte-sized data can be solved to low, medium, or high precision on existing distributed systems. Chapter 7 demonstrates that applying CX/CUR decomposition to large-scale Mass Spectrometry Imaging (MSI) datasets returns interpretable results as it successfully identifies important ions and locations in complex biological samples. All in all, we show that RLA is a powerful tool for deriving scalable algorithms for large-scale regression and optimization problems, and we demonstrate that RLA algorithms are amenable to distributed computing platforms and are useful in scientific applications.
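The flavor of RLA algorithm discussed above can be illustrated with the classic sketch-and-solve approach to overdetermined least squares: compress the tall matrix with a random sketch, then solve the small problem exactly. This is a generic RLA illustration under a Gaussian sketch, not one of the thesis's specific algorithms.

```python
import numpy as np

def sketch_and_solve(A, b, sketch_rows, rng):
    """RLA sketch-and-solve for overdetermined least squares:
    compress A (n x d, n >> d) with a Gaussian sketch S (s x n),
    then solve the small s x d problem min ||S A x - S b||."""
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(20000)

x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)   # full 20000 x 10 solve
x_sketch = sketch_and_solve(A, b, 500, rng)       # 500 x 10 solve
print(np.linalg.norm(x_sketch - x_exact))
```

The sketched solve touches only a 500-row compressed system yet recovers the full least-squares solution to small error, the basic trade-off RLA exploits; production methods replace the dense Gaussian sketch with faster structured or sampling-based sketches.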
19. Spectral sequences and applied topology [electronic resource] [2016] Online
- Book
- 1 online resource.
In this thesis we investigate the utility of spectral sequences for persistent homology, a tool which characterizes the shape of data. We show how to use these spectral sequences to compute persistent homology in parallel.
20. Studies in covariance estimation and applications in finance [electronic resource] [2016] Online
- Book
- 1 online resource.
This thesis examines the estimation of covariance and correlation matrices. More specifically, in the first part we study dynamical properties of the top eigenvalue and eigenvector of sample estimators of covariance and correlation matrices. This is done under the assumption that the top eigenvalue is separated from the others, which is reasonable when the data come from financial returns. We show exactly how these quantities behave when the true covariance or correlation is stationary and derive theoretical values of related quantities that can be useful when quantifying the amount of non-stationarity in real data. We also validate the results using Monte Carlo simulations. A major contribution of the analysis is that it shows how, and under which regimes, correlation matrices differ from covariance matrices from a dynamic viewpoint. This effect has been observed in financial data but never explained. In the second part of the thesis we study modifications to covariance estimators that find the optimal estimator within a certain sub-class. Estimators of this type are generally known as shrinkage estimators, as they modify only the eigenvalues of the original estimator. We do this when the original estimator takes the form A^{1/2} X B X^T A^{1/2}, where A and B are matrices and X is a matrix of i.i.d. variables. The analysis is done in the asymptotic limit where the number of samples and the number of variables approach infinity jointly, so that random-matrix theory can be used. Our goal is to find the shrinkage estimator that minimizes the expected value of the Frobenius norm of the difference between the estimator and the true covariance matrix. To do this, we first derive a generalization of the Marchenko-Pastur equation for this class of estimators. This theorem allows us to calculate the asymptotic value of the projection of the sample eigenvectors onto the true covariance matrix. We then show how to use these projections to find the optimal covariance estimator. 
Finally, we show with simulations that these estimators are close to the optimal bound when used on finite data sets.
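A minimal member of the shrinkage class described above is linear shrinkage of the sample covariance toward a scaled identity, which modifies only the eigenvalues of the sample estimator. The fixed intensity `rho` below is an arbitrary illustrative choice, not the optimal Frobenius-norm intensity the thesis derives.

```python
import numpy as np

def linear_shrinkage(X, rho):
    """Shrink the sample covariance of X (n samples x p variables)
    toward a scaled identity: (1 - rho) * S + rho * mu * I, where
    mu = tr(S)/p. This keeps the sample eigenvectors and pulls the
    eigenvalues toward their mean, a simple rotation-equivariant
    shrinkage estimator."""
    _, p = X.shape
    S = np.cov(X, rowvar=False)
    mu = np.trace(S) / p
    return (1 - rho) * S + rho * mu * np.eye(p)

rng = np.random.default_rng(2)
p, n = 50, 100                                   # p/n = 0.5: high-dimensional regime
Sigma = np.diag(np.linspace(0.5, 2.0, p))        # true covariance
X = rng.standard_normal((n, p)) @ np.sqrt(Sigma)

S = np.cov(X, rowvar=False)
Shat = linear_shrinkage(X, 0.5)
err_sample = np.linalg.norm(S - Sigma)           # Frobenius errors
err_shrunk = np.linalg.norm(Shat - Sigma)
print(err_sample, err_shrunk)
```

When p is comparable to n, the sample eigenvalues are badly dispersed and even this crude shrinkage beats the raw sample covariance in Frobenius norm; the thesis's contribution is choosing the eigenvalue transformation optimally within the class.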