 Chin, Alex, author.
 [Stanford, California] : [Stanford University], 2019.
 Description
 Book — 1 online resource.
 Summary

This thesis presents new methodology for handling interference in randomized experiments. Interference, a phenomenon in which individuals interact with each other, is widely prevalent in the social and natural sciences, and has major implications for how experiments are optimally designed and analyzed. I first provide an introduction to interference, including examples and a brief history of relevant work in causal inference. Next, I demonstrate how researchers can use Stein's method to establish limiting distributional results for estimators under interference. The modern tools afforded by Stein's method allow one to analyze certain regimes of arbitrarily dense interference, which goes beyond the analysis capabilities of existing tools. In the subsequent chapter, I develop new model-based adjustment estimators for estimating the global average treatment effect. The adjustment variables can be constructed from functions of the treatment assignment vector, and the researcher can use any collection of functions correlated with the response, turning the problem of detecting interference into a feature engineering problem. The final chapter proposes new methods for designing and analyzing stochastic seeding strategies, which are an appealing way of leveraging network structure for marketing, public health, and behavioral interventions. New importance sampling estimators adapted to this setting can greatly improve precision over existing approaches. This thesis is interdisciplinary in nature. Stein's method (Chapter 2), regression adjustments (Chapter 3), and importance sampling (Chapter 4) all command spheres of influence in certain sectors of the literature, and are here repurposed in new domains. I hope that my work shows how existing statistical technology can arise in new arenas of application while simultaneously giving rise to new methodological questions and problems, and in this way I hope my work is useful for both practitioners and methodologists.
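
As a toy illustration of the adjustment idea summarized above (Chapter 3), the sketch below builds a covariate from the treatment assignment vector, here the fraction of treated neighbors on a hypothetical ring network, and applies a standard single-covariate regression adjustment to a difference-in-means estimator. The network, response model, and coefficients are invented for illustration and are not taken from the thesis.

```python
import random

random.seed(0)

# Hypothetical ring network of n units; each unit's adjustment covariate is
# the fraction of its two neighbors that are treated, a function of the
# treatment assignment vector z.
n = 500
z = [random.randint(0, 1) for _ in range(n)]
frac_nbr = [(z[(i - 1) % n] + z[(i + 1) % n]) / 2 for i in range(n)]
# Invented response model with interference through the neighbors.
y = [1.0 + 2.0 * z[i] + 1.5 * frac_nbr[i] + random.gauss(0, 1)
     for i in range(n)]

def mean(v):
    return sum(v) / len(v)

# Unadjusted difference-in-means estimator.
y1 = [yi for yi, zi in zip(y, z) if zi == 1]
y0 = [yi for yi, zi in zip(y, z) if zi == 0]
tau_dm = mean(y1) - mean(y0)

# Single-covariate regression adjustment: subtract the covariate imbalance
# between groups, scaled by the pooled least-squares slope of y on x.
xbar, ybar = mean(frac_nbr), mean(y)
slope = (sum((x - xbar) * (yi - ybar) for x, yi in zip(frac_nbr, y))
         / sum((x - xbar) ** 2 for x in frac_nbr))
x1 = [x for x, zi in zip(frac_nbr, z) if zi == 1]
x0 = [x for x, zi in zip(frac_nbr, z) if zi == 0]
tau_adj = tau_dm - slope * (mean(x1) - mean(x0))
```

Any function of the treatment vector correlated with the response could play the role of `frac_nbr`, which is the feature-engineering view the abstract describes.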

 Bi, Nan, author.
 [Stanford, California] : [Stanford University], 2019.
 Description
 Book — 1 online resource.
 Summary

This thesis addresses problems of statistical inference after model selection. The framework we adopt throughout is selective inference, which provides valid inference conditional on the model selection event. Chapter 1 gives a background introduction to the problem of interest and the guiding principle of selective inference, especially inference with randomization. Chapter 2 introduces the framework of inferactive data analysis, so named to emphasize inference after interactive data analysis. Chapter 3 discusses the problem of valid inference for the treatment effect after selecting invalid instrumental variables via a data-driven Lasso-type selection procedure called sisVIVE. Instrumental variables models are widely used in economics, as well as in Mendelian randomization in genetics, and our method is helpful for the practical use of instrumental variables when it is not certain whether they are all valid. Our approach is conditional inference via selective inference with randomization, and fits into the general data analysis framework discussed in Chapter 2. We demonstrate the inference method on a development economics dataset and on a Mendelian randomization dataset with only summary statistics. Chapter 4 discusses the problem of valid inference for the treatment effect after pretesting the strength of the instrumental variables via an F test. This is a widely used screening step in practical instrumental variables data analysis, in which researchers proceed to conduct inference and report results only if the dataset passes the pretest. We show that the common practice of ignoring the selection effect can result in significant bias in certain scenarios, while our inference method corrects for it. Again we adopt the conditional inference approach and demonstrate the method on two educational economics datasets.
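
The selection-adjustment idea can be shown in miniature (this toy example is ours, not from the thesis): if a statistic X that is N(0, 1) under the null is reported only when it exceeds a threshold c, a valid p-value must condition on the selection event {X > c}, which for a Gaussian gives a ratio of survival functions.

```python
import math

def normal_sf(x):
    # Survival function of the standard normal via the complementary
    # error function: P(Z > x).
    return 0.5 * math.erfc(x / math.sqrt(2))

c = 1.645      # selection threshold: the analyst reports X only when X > c
x_obs = 2.0    # observed statistic, which survived selection

p_naive = normal_sf(x_obs)                     # ignores the selection step
p_selective = normal_sf(x_obs) / normal_sf(c)  # conditions on {X > c}
```

The naive p-value is about 0.023, while the selection-adjusted p-value is about 0.46: once we account for having looked only at statistics that crossed the threshold, the apparent evidence against the null largely evaporates.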

Online 3. An approximation-based framework for post-selective inference [2018]
 Panigrahi, Snigdha, author.
 [Stanford, California] : [Stanford University], 2018.
 Description
 Book — 1 online resource.
 Summary

This thesis discusses an approximation-based framework for post-selective inference. Such approximations make it possible to bypass the intractability of randomized likelihoods in frequentist inference, and of posteriors formed by appending truncated likelihoods to priors in a Bayesian post-selective framework. The computational bottleneck in computing the conditional likelihood of data after a randomized selection strategy is that, after conditioning out nuisance parameters, such likelihoods no longer reduce to univariate truncated Gaussian laws as in Lee et al. (2016). In fact, the conditional likelihood with a randomized response is mostly not available in closed form, demanding tools to make inference based on such selection-modified laws tractable. Intractability of the conditional likelihood again poses major computational hurdles when providing inference based on a Bayesian model after mining the data, where exploration may allow the analyst to discover parameterizations of interest and elicit plausible models on the joint space of data and parameters. Adopting the ideas of Yekutieli (2012), where a Bayesian model post selection consists of a prior and a truncated likelihood, the resulting posterior distribution is affected by the very fact that selection was applied, unlike in the setup usually considered when performing Bayesian variable selection. At the core of the methodology introduced in this thesis is a convex approximation to the truncated likelihood, which facilitates sampling from an approximate adjusted posterior distribution to provide Bayesian inference post selection, and allows frequentist inference by a grid approximation to the conditional law of the target statistics after eliminating nuisance parameters. Prior work in selective inference focuses mainly on hypothesis testing and capitalizes on reductions achieved by conditioning out nuisance parameters.
However, the techniques developed in that vein are generally less appropriate for addressing other questions, such as point estimation. By relying instead on an approximation to the full truncated likelihood, the tools we develop allow for more versatility, including computation of the maximum likelihood estimator (MLE). The approximation to the intractable normalizer in the conditional likelihood leads to a convex approximation to the truncated likelihood, which is computationally easy to optimize numerically and to analyze. The guarantee associated with the proposed approximation to the normalizer is that it captures the large-deviations rate of decay of the exact selection probability. Replacing the genuine truncated likelihood by its approximation, we can approximate the MLE by solving a convex optimization problem. Moreover, the MLE is globally consistent after selection; that is, it converges to the target population parameter in probability when conditioned on selection. The work in this thesis develops methodologies based on this approximation toolbox and applies them to various real data settings to explore their use in different applications.

Special Collections
University Archives — Request on-site access
3781 2018 P — In-library use
 Arthur, Joseph G., author.
 [Stanford, California] : [Stanford University], 2018.
 Description
 Book — 1 online resource.
 Summary

The comparison of individual genome sequences is a key task for modern studies of population genetics, genotype-phenotype associations, and genome evolution. The problem is difficult in part because commonly used DNA sequencing hardware produces reads that are orders of magnitude smaller than the size of a single human chromosome. The detection of large genomic mutations known as structural variants (SVs) from these short sequencing reads has emerged as a particularly challenging problem. Numerous methods targeting this problem have been proposed, but it is difficult to assess their performance on real data since the ground truth is typically unknown. Moreover, complex SVs that escape detection by conventional algorithms are known to exist. We propose here a solution to both the complex SV detection problem and the issue of evaluating accuracy on real data.

Special Collections
University Archives — Request on-site access
3781 2018 A — In-library use
Online 5. Discovery and visualization of latent structure with applications to the microbiome [electronic resource] [2018]
 Sankaran, Kris.
 2018.
 Description
 Book — 1 online resource.
 Summary

Human microbiomes — the collections of bacteria living around and within the human body — are complex ecological systems, and describing their structure and function in different contexts is important from both basic scientific and medical perspectives. Viewed through a statistical lens, many microbiome analyses can be framed in terms of discovering and describing latent structure. For example, this structure might reflect sudden environmental shocks that affect certain subsets of species, or may illuminate gradual shifts in community composition. In this thesis, we survey and develop ideas from the data visualization and probabilistic modeling literatures that we have found useful in identifying and characterizing such structure in the microbiome. On the data visualization front, we describe the focus-plus-context and linking principles, and present new R packages that use these ideas to facilitate visualization of hierarchical collections of time series. These tools streamline the navigation of complex data, guiding researchers towards plausible statistical models. We then turn our attention to modeling, motivated by the fact that microbiome species abundance data often have effectively low-dimensional evolutionary, temporal, and count structure. We characterize and review methods appropriate for three classes of common microbiome data analysis problems: dimensionality reduction, multi-table integration, and regime detection. For dimensionality reduction, we explore basic probabilistic latent variable models, focusing on mixed-membership and matrix factorization techniques. For multi-table integration, we contrast nonparametric ordination, structured regularization, and probabilistic modeling approaches. For regime detection, we compare variants of hidden Markov, dynamical systems, and change-point models, along with baselines that do not take time structure into account. Throughout, we illustrate visualization and modeling techniques using real human gut microbiome data.
Code and data for all experiments are publicly available online.

Special Collections
University Archives — Request on-site access
3781 2018 S — In-library use
Online 6. Eigenvalues in multivariate random effects models [2018]
 Fan, Zhou, author.
 [Stanford, California] : [Stanford University], 2018.
 Description
 Book — 1 online resource.
 Summary

We study principal component analysis in multivariate random and mixed effects linear models. These models are commonly used in quantitative genetics to decompose the variation of phenotypic traits into constituent variance components. Applications arising in evolutionary biology require understanding the eigenvalues and eigenvectors of these components in high-dimensional multivariate settings. However, these quantities may be difficult to estimate from limited samples when the number of traits is large. We describe several phenomena concerning sample eigenvalues and eigenvectors of classical MANOVA estimators in the presence of high-dimensional noise, including dispersion of the bulk eigenvalue distribution, bias and aliasing of outlier eigenvalues and eigenvectors, and Tracy-Widom fluctuations at the spectral edges. A common theme is that the spectral properties of the MANOVA estimate for one component may be influenced by the other components. In the setting of a simple spiked covariance model, we introduce alternative estimators for the leading eigenvalues and eigenvectors that correct for this problem in a high-dimensional asymptotic regime.

Special Collections
University Archives — Request on-site access
3781 2018 F — In-library use
Online 7. Generating structures by editing prototypes [2018]
 Guu, Kelvin, author.
 [Stanford, California] : [Stanford University], 2018.
 Description
 Book — 1 online resource.
 Summary

Methods for structured prediction underlie many successful applications of machine learning, including machine translation, speech synthesis, image generation, protein structure prediction, and many other problems. However, producing high-quality structures is challenging because the individual components of a structure (e.g., the words in a sentence or the pixels in an image) constrain and depend on each other in complex ways. A wide variety of approaches have been proposed to tackle this problem. At one extreme, retrieval-based methods sidestep the difficulty of modeling the internal consistency of structures by simply selecting from a prepopulated repository of "good" structures. However, this may sacrifice flexibility and coverage, as the repository could fail to contain every possible structure that might be required. At the other extreme, generation-based methods synthesize structures from scratch, often building them up one unit at a time. This offers extreme flexibility, but faces the difficult problem of modeling the complex dependencies that exist between units, as well as the challenge of searching over all possible structures for good configurations. We aim to combine the best of both worlds using a new approach called retrieve-then-edit: first, a structure is retrieved from a repository of high-quality candidates (as in retrieval-based methods), and then it is edited into a new structure using a synthesis-based editor. The retrieval step ensures that we start in the neighborhood of an already high-quality structure, while the editing step gives us the flexibility to customize the structure in a fine-grained way. We demonstrate that this approach can be successfully applied across diverse structured prediction problems spanning text generation, executable code generation, and reinforcement learning to interact with a web browser (where an agent's sequence of actions is treated as a structure).

Online 8. Hypothesis testing using multiple data splitting [2018]
 DiCiccio, Cyrus J., author.
 [Stanford, California] : [Stanford University], 2018.
 Description
 Book — 1 online resource.
 Summary

Data splitting, a tool that is well studied and commonly used for estimation problems such as assessing prediction error, can also be useful in testing problems: a portion of the data can be allocated to make the testing problem easier in some sense, say by estimating or even eliminating nuisance parameters, or by reducing dimension. In single or multiple testing problems involving a large number of parameters, there can be a dramatic increase in power from reducing the number of parameters tested, particularly when the non-null parameters are relatively sparse. While there is some loss of power associated with testing on only a fraction of the available data, carefully selecting a test statistic may in turn improve power, though it remains unclear whether the reduction in the number of parameters under consideration can outweigh the loss of power from splitting the data. To combat the inherent loss of power seen with data splitting, methods of combining inference across several splits of the data are developed. The power of these methods is compared with the power of full-data tests, as well as tests using only a single split of the data.
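
A minimal sketch of the single-split idea, combined across splits with one standard rule for dependent p-values (twice the average of the p-values, capped at 1, remains a valid p-value under arbitrary dependence). The data-generating model, dimension, and number of splits are invented for illustration.

```python
import math
import random

random.seed(1)

def z_pvalue(sample_mean, n):
    # One-sided p-value for H0: mean = 0, with known unit variance.
    return 0.5 * math.erfc(sample_mean * math.sqrt(n) / math.sqrt(2))

# Invented example: 10 parameters, only coordinate 0 is non-null.
n, d = 200, 10
data = [[random.gauss(0.5 if j == 0 else 0.0, 1) for j in range(d)]
        for _ in range(n)]

def split_test(rows):
    random.shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    # The first half selects the single most promising coordinate...
    j = max(range(d), key=lambda k: sum(r[k] for r in train))
    # ...and the held-out half tests only that coordinate.
    m = sum(r[j] for r in test) / len(test)
    return z_pvalue(m, len(test))

# Combine several splits: 2 * (average p-value), capped at 1, is valid
# no matter how the splits' p-values depend on each other.
ps = [split_test(list(data)) for _ in range(20)]
p_combined = min(1.0, 2 * sum(ps) / len(ps))
```

Testing one selected coordinate instead of all ten avoids a ten-fold multiplicity correction, which is the power trade-off the abstract describes.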

Special Collections
University Archives — Request on-site access
3781 2018 D — In-library use
Online 9. Topics in statistical learning with a focus on large-scale data [2018]
 Le, Ya, author.
 [Stanford, California] : [Stanford University], 2018.
 Description
 Book — 1 online resource.
 Summary

The spread of modern information technologies to all spheres of society has led to a dramatic increase in data flow, including the emergence of the "big data" phenomenon. Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of traditional statistical methods and standard tools. When the size of the data becomes extremely large, computations may take too long to run, and it may even be infeasible to store all of the data on a single computer. Therefore, it is necessary to turn to distributed architectures and scalable statistical methods. Big data vary in shape and call for different approaches. One type of big data is tall data, i.e., a very large number of samples but not too many features. Chapter 1 describes a general communication-efficient algorithm for distributed statistical learning on this type of big data. Our algorithm distributes the samples uniformly to multiple machines and uses a common reference data set to improve the performance of local estimates. Our algorithm enables potentially much faster analysis, at a small cost to statistical performance. Another type of big data is wide data, i.e., too many features but a limited number of samples. It is also called high-dimensional data, to which many classical statistical methods are not applicable. Chapter 2 discusses a method of dimensionality reduction for high-dimensional classification. Our method partitions features into independent communities and splits the original classification problem into separate smaller ones. It enables parallel computing and produces more interpretable results. For unsupervised learning methods like principal component analysis and clustering, the key challenges are choosing the optimal tuning parameter and evaluating method performance. Chapter 3 proposes a general cross-validation approach for unsupervised learning methods. This approach randomly partitions the data matrix into K unstructured folds.
For each fold, it fits a matrix completion algorithm to the remaining K − 1 folds and evaluates the prediction on the held-out fold. Our approach provides a unified framework for parameter tuning in unsupervised learning and shows strong performance in practice.

Online 10. Evaluating diagnostics under dependency [electronic resource] [2017]
 Michael, Haben.
 2017.
 Description
 Book — 1 online resource.
 Summary

Though estimation and inference procedures for the receiver operating characteristic (ROC) curve are well studied in the cross-sectional setting, there is less research on the case where both biomarker measurements and disease statuses are observed longitudinally. In a motivating example, we are interested in characterizing the value of longitudinally measured CD4 counts for predicting the presence or absence of a transient spike in HIV viral load, which is also time-dependent. The most common existing method neither appropriately characterizes the diagnostic value of observed CD4 counts nor efficiently uses status history in predicting current spike status. We propose parametric and nonparametric procedures to estimate the ROC curve in the longitudinal setting. Extensive simulations have been conducted to examine the small-sample operating characteristics of the proposed methods.
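
For reference, in the cross-sectional case the empirical ROC curve and its area reduce to a few lines. This toy sketch uses invented biomarker distributions (not the CD4/HIV data) and the probabilistic definition of the AUC, P(diseased score > healthy score).

```python
import random

random.seed(2)

# Invented biomarker: diseased subjects tend to score higher.
healthy = [random.gauss(0.0, 1.0) for _ in range(300)]
diseased = [random.gauss(1.5, 1.0) for _ in range(300)]

def roc_point(threshold):
    # One (false positive rate, true positive rate) point on the ROC curve.
    tpr = sum(x > threshold for x in diseased) / len(diseased)
    fpr = sum(x > threshold for x in healthy) / len(healthy)
    return fpr, tpr

# AUC as P(diseased > healthy), counting ties as one half.
auc = sum((d > h) + 0.5 * (d == h) for d in diseased for h in healthy) \
      / (len(diseased) * len(healthy))
```

The longitudinal setting studied in the thesis is harder precisely because both the scores and the disease statuses entering these formulas change over time.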

Special Collections
University Archives — Request on-site access
3781 2017 M — In-library use
Online 11. Financial markets and trading networks [electronic resource] [2017]
 Wang, Chaojun.
 2017.
 Description
 Book — 1 online resource.
 Summary

In this doctoral dissertation, I study the market structure and efficiency of over-the-counter (OTC) financial markets. Specifically, I model how financial trading networks endogenously form, and how trading networks and trade protocols affect pricing, trading behavior, liquidity, and efficiency. This dissertation consists of two chapters. The first chapter addresses the endogenous structure of financial networks. Here, I model how core-periphery trading networks arise endogenously in over-the-counter markets as an equilibrium balance between trade competition and inventory efficiency. A small number of firms emerge as core dealers to intermediate trades among a large number of peripheral firms. The equilibrium number of dealers depends on two countervailing forces: (i) competition among dealers in their pricing of immediacy to peripheral firms, and (ii) the benefits of concentrated intermediation for lowering dealer inventory risk through dealers' ability to quickly net purchases against sales. For an asset with a lower frequency of trade demand, intermediation is concentrated among fewer dealers, and inter-dealer trades account for a greater fraction of total trade volume. I show, partly in separate work, that these two predictions are strongly supported by evidence from the markets for German sovereign bonds and U.S. corporate bonds. From a welfare viewpoint, I show that there are too few dealers for assets with frequent trade demands, and too many dealers for assets with infrequent trade demands. The second chapter models bargaining in over-the-counter network markets over the terms and prices of contracts. Of concern is whether bilateral non-cooperative bargaining is sufficient to achieve efficiency in this multilateral setting. For example, will market participants assign insolvency-based seniority in a socially efficient manner, or should bankruptcy laws override contractual terms with an automatic stay?
The model provides conditions under which bilateral bargaining over contingent contracts is efficient for a network of market participants. Examples include seniority assignment, close-out netting and collateral rights, secured debt liens, and leverage-based covenants. Given the ability to use covenants and other contingent contract terms, central market participants efficiently internalize the costs and benefits of their counterparties through the pricing of contracts. We provide counterexamples to efficiency for less contingent forms of bargaining coordination.

Special Collections
University Archives — Request on-site access
3781 2017 W — In-library use
Online 12. Large-scale inference with block structure [electronic resource] [2017]
 Kou, Jiyao.
 2017.
 Description
 Book — 1 online resource.
 Summary

The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, e.g. if the signal is clustered in many small blocks, as is the case in relevant applications. We derive the detection boundary in such a situation, where we allow both the number of blocks and the block length to grow polynomially with the sample size. This result recovers as special cases the heterogeneous mixture detection problem (1), where there is no structure in the signal, as well as the scan problem (2), where the signal comprises a single interval. We develop methodology that allows optimal adaptive detection in the general setting, thus exploiting the structure if it is present without incurring a penalty in the case where there is no structure. The advantage of this methodology can be considerable: in the latter case the means need to increase at the rate of the square root of the log of the problem size to ensure detection, while in the former case the means may decrease at a polynomial rate. The identification version of this problem is also considered in this thesis, where the block length is allowed to grow polynomially with the sample size while the number of blocks is assumed to grow at most logarithmically with the sample size. This setting greatly generalizes previous results. The multivariate version of this problem is also considered, in which we try to identify the support of the rectangular signal(s) in a hyperrectangle. A lower bound below which identification is impossible is presented, and an asymptotically optimal and computationally efficient procedure is proposed under Gaussian white noise.
This signal identification problem is shown to have the same statistical difficulty as the corresponding detection problem, in the sense that whenever we can detect the signal, we can identify its support. We also discuss the signal identification problem under the exponential family setting, and the robust identification problem where the noise distribution is unspecified.
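
The block-structured alternative can be pictured with a simple scan over fixed-length windows. This is a toy sketch with invented parameters and a known block length; the procedures in the thesis adapt over unknown block lengths, which this sketch does not attempt.

```python
import math
import random

random.seed(3)

n, L = 10000, 50           # sample size and (here, known) block length
x = [random.gauss(0, 1) for _ in range(n)]
start = 4000
for i in range(start, start + L):
    x[i] += 1.0            # per-coordinate mean too small to detect one by one

# Scan statistic: maximum standardized sum over all length-L windows,
# computed from cumulative sums.
s = [0.0]
for xi in x:
    s.append(s[-1] + xi)
best = max(range(n - L + 1), key=lambda i: s[i + L] - s[i])
scan_stat = (s[best + L] - s[best]) / math.sqrt(L)

# Bonferroni-style threshold calibrated to the number of windows.
threshold = math.sqrt(2 * math.log(n))
```

Each coordinate of the block is individually far below the per-coordinate multiple-testing threshold, yet pooling over the block makes the signal stand out, which is the gain from structure the abstract describes.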

Special Collections
University Archives — Request on-site access
3781 2017 K — In-library use
Online 13. Leveraging similarity in statistical learning [electronic resource] [2017]
 Powers, Scott Stephen.
 2017.
 Description
 Book — 1 online resource.
 Summary

The machine learning literature has provided a toolbox of techniques for general use by the modern applied statistician. For some data analyses, these methods can be improved to yield better results based on the properties of the data in question. In this manuscript, we develop methodology for three different problems in which, in some sense, similarity can be leveraged to make better predictions, as demonstrated in baseball and epidemiology applications. First, we propose adding a penalty on the nuclear norm of the regression coefficient matrix in multinomial regression to learn which outcomes are similar. Second, we propose adding a clustering step to l1-penalized regression, to build customized training sets of observations that are similar to the test set. Third, we propose and evaluate several approaches to mining electronic medical records to support physician decision-making with data on patients similar to a patient in question. This last problem is especially challenging because of the observational and high-dimensional nature of the data.
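
The "customized training set" idea in miniature: for a given query, fit only on the most similar training observations. This hypothetical one-dimensional example substitutes simple nearest-neighbor selection and a plain least-squares line for the clustering step and l1-penalized regression used in the thesis.

```python
import random

random.seed(6)

# Invented nonlinear ground truth: a single global fit is poor, but a fit
# restricted to observations similar to the query does well.
def truth(x):
    return x * x

train = [(x, truth(x) + random.gauss(0, 0.2))
         for x in (random.uniform(-3, 3) for _ in range(400))]

def customized_fit(query, k=40):
    # Customized training set: the k observations nearest the query.
    near = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    # Plain least-squares line fitted on the customized set only.
    xbar = sum(x for x, _ in near) / k
    ybar = sum(y for _, y in near) / k
    slope = (sum((x - xbar) * (y - ybar) for x, y in near)
             / sum((x - xbar) ** 2 for x, _ in near))
    return ybar + slope * (query - xbar)

global_mean = sum(y for _, y in train) / len(train)  # ignores similarity
local_pred = customized_fit(2.0)                     # truth(2.0) == 4
```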

Special Collections
University Archives — Request on-site access
3781 2017 P — In-library use
Online 14. Measuring sample quality with Stein's method [electronic resource] [2017]
 Gorham, Jackson.
 2017.
 Description
 Book — 1 online resource.
 Summary

As the size of datasets has grown, classical methods like Markov chain Monte Carlo have become increasingly burdensome from a computational perspective. Practitioners have been turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. Unfortunately, the diagnostics previously used to aid these methods are insufficient for assessing the asymptotic bias incurred. We first introduce a new computable quality measure based on Stein's method that quantifies the maximum discrepancy between sample and target expectations over a large class of test functions. Our first main theoretical contribution is to show that our measure converges to zero only if a sample converges to its target distribution. Empirically, we show this discrepancy avoids the problems faced by previous diagnostics, e.g., effective sample size. Our next step is to generalize these ideas to cover a larger class of target distributions. By studying Itô diffusions with fast mixing rates, we are able to extend the purview of acceptable target distributions from distributions with strongly log-concave densities with bounded third and fourth derivatives to a much larger class that includes multimodal and heavy-tailed distributions. Finally, in the last chapter, we study a variation of our previous methods that is computationally feasible for much larger samples. This variation combines our ideas with those from reproducing kernel Hilbert spaces to define a closed-form expression for our measure. While other authors have also recommended this faster variation, we show that most common choices of the kernel function are insufficient for controlling convergence in the multivariate setting. A special class of kernel functions that do control convergence is proposed and shown empirically to dominate the other traditional kernel functions.
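
A minimal sketch of the kernelized (closed-form) variant for a one-dimensional N(0, 1) target, using an inverse multiquadric base kernel; the kernel choice, bandwidth, and sample sizes are illustrative assumptions, not the thesis's recommended configuration.

```python
import math
import random

random.seed(4)

def score(x):
    # Score of the N(0, 1) target: d/dx log p(x) = -x.
    return -x

def stein_kernel(x, y):
    # Inverse multiquadric base kernel k(x, y) = (1 + (x - y)^2)^(-1/2)
    # and the associated Stein-modified kernel.
    d = x - y
    g = 1.0 + d * d
    k = g ** -0.5
    dk_dx = -d * g ** -1.5
    dk_dy = d * g ** -1.5
    d2k_dxdy = g ** -1.5 - 3.0 * d * d * g ** -2.5
    return (d2k_dxdy + score(x) * dk_dy + score(y) * dk_dx
            + score(x) * score(y) * k)

def ksd(sample):
    # V-statistic estimate of the kernel Stein discrepancy.
    n = len(sample)
    return math.sqrt(sum(stein_kernel(a, b)
                         for a in sample for b in sample)) / n

good = [random.gauss(0, 1) for _ in range(200)]  # matches the target
bad = [random.gauss(2, 1) for _ in range(200)]   # off-target sample
```

Only the score function of the target is needed, so the (typically unknown) normalizing constant of the target density never enters the computation.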

Special Collections
University Archives — Request on-site access
3781 2017 G — In-library use
 Janson, Lucas Beck.
 2017.
 Description
 Book — 1 online resource.
 Summary

Consider trying to understand the genetic basis of a disease. A natural first step would be to sequence the genomes of a large group of people and note whether each person has the disease or not. With such a data set one might hope to answer questions such as which mutations make the disease more likely, or how much all the mutations together explain disease contraction as opposed to environmental factors. Unfortunately, due to the large number (hundreds of thousands or more) of locations on the genome with potential mutations, classical statistical techniques cannot be used to answer such questions. Similar problems with a response variable of interest and many potential explanatory variables (known as 'high-dimensional' problems) abound in modern statistical applications, including in medicine, political science, advertising, and many more. The driving force behind the recent surge in such high-dimensional problems is that it has become easier and less expensive to collect, store, and process increasing amounts of information about individuals, such as entire genomes, medical records, or online behavior. The limitations of classical methods in high-dimensional settings demand innovation in the statistical field of high-dimensional inference. In addition to requiring creative mathematical insight, most high-dimensional inference problems are non-starters without some further assumptions about the underlying process generating the data. As such, a constant challenge and source of debate concerns the best way to make assumptions that are realistic, verifiable, and allow for fast and powerful methods. This thesis contributes to the discussion by surveying existing methods along with their assumptions, proposing a different perspective on how assumptions are made, and highlighting the benefits of that perspective by detailing two novel methods (developed jointly by the author and his collaborators) for high-dimensional inference that embody it.

Special Collections
University Archives — Request on-site access
3781 2017 J — In-library use
Online 16. Monotone interactions of random walks and graphs [electronic resource] [2017]
 Huang, Ruojun.
 2017.
 Description
 Book — 1 online resource.
 Summary

This thesis deals with the behavior of random walks on monotonically time-varying graphs, where due to this monotonicity we can establish certain universality properties. In Chapter 1, we consider normally reflected Brownian motion (RBM) and simple random walk (SRW) on independently growing-in-time d-dimensional domains, d ≥ 3. We establish a sharp criterion for recurrence versus transience in terms of the growth rate. For more general growing subgraphs of an infinite graph, we use evolving sets to establish heat kernel bounds that yield sufficient transience/recurrence criteria. In contrast, we demonstrate rich and non-universal behavior of certain non-Markovian models of random walks, in which monotone interaction enforces domain growth as a result of visits by the walk (or probes it sends) to the neighborhood of the domain boundary. This is complemented by Chapter 2, where we address stability issues for random walks among time-dependent conductances. For both the discrete-time uniformly lazy and continuous-time constant-speed random walks (DTRW/CSRW), we show that Gaussian heat kernel estimates are not stable under perturbations. As a byproduct, we refute an open question about inhomogeneous merging of finite Markov chains. We establish matching Gaussian upper and lower transition density bounds for the CSRW among time-increasing conductances on any graph satisfying a uniform-in-time Poincaré inequality and volume growth regularity. In contrast, stability is known for the variable-speed random walk (VSRW), for which the counting measure is the reversing measure.
3781 2017 H  In-library use
 Fukuyama, Julia.
 2017.
 Description
 Book — 1 online resource.
 Summary

We describe methods for incorporating information about variable structure into multivariate analysis. We are motivated by the example of microbiome data, where we have species abundances in addition to information about the evolutionary history of those species. However, the methods we develop apply to more general types of structure on the variables, and we expect to find many other applications both in biology and in other fields. We develop a method for structured dimensionality reduction, adaptive gPCA. This method allows for tunable incorporation of the variable structure, so that the fine-scale structure of the variables, the global structure, or anything in between can be brought out. We then move to the problem of incorporating sparsity in addition to structure, obtaining sparse and structured PCA as well as sparse and structured discriminant analysis for classification problems. The primary motivation for incorporating sparsity and structure together is more interpretable results, but we also show that combining these two elements can substantially improve classification accuracy in the supervised setting.
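Adaptive gPCA itself is defined in the thesis; as a hedged sketch of the general idea of structure-aware PCA, one can run PCA in an inner product on the variables given by a positive semi-definite similarity matrix Q (for microbiome data, Q might be derived from the phylogenetic tree). The function name and this particular formulation are illustrative assumptions, not the thesis's estimator:

```python
import numpy as np

def structured_pca(X, Q, k=2):
    """PCA of data X (n x p) in the variable inner product defined by
    a positive semi-definite similarity matrix Q (p x p).
    Q = I recovers ordinary PCA; a structured Q pulls related
    variables toward shared principal directions."""
    # symmetric square root of Q
    w, V = np.linalg.eigh(Q)
    Q_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
    Xc = X - X.mean(axis=0)          # center the columns
    # ordinary SVD in the transformed coordinates
    U, s, Vt = np.linalg.svd(Xc @ Q_half, full_matrices=False)
    return U[:, :k] * s[:k]          # sample scores

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
scores_plain = structured_pca(X, np.eye(5))   # Q = I: plain PCA scores
```

The "tunable incorporation" in adaptive gPCA can be thought of as interpolating between Q = I and a fully structured Q, though the exact interpolation used in the thesis is not reproduced here.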
3781 2017 F  In-library use
 Sepehri, Amir.
 2017.
 Description
 Book — 1 online resource.
 Summary

The current state of statistical practice is dominated by the use of model-based methods. A natural basic question is whether the model fits the data. This question is central and has been studied for over a century; a vast literature is available, but the majority of it focuses on the univariate case. Far less work has been done on the general case, where the theory is sparse. This thesis introduces a general methodology, based on (non-commutative) Fourier analysis, for constructing goodness-of-fit tests for distributions on relatively general spaces. The procedure is developed for a simple null hypothesis and carried out for several examples, including the normal distribution, the uniform distribution, and the uniform distribution on high-dimensional spheres. The method is used to construct two families of tests of uniformity on the compact classical groups. Carrying out the program for the compact groups involves substantial use of the representation theory of Lie groups, including the derivation of new group-theoretic formulas. These tests are used to numerically study the mixing time of a recently introduced Markov chain Monte Carlo sampler on the orthogonal group. The method is then extended to the composite null hypothesis of parametric families of distributions; asymptotic properties are studied and computational aspects are discussed. This has been carried out for several examples: testing for multivariate normality and for the beta, gamma, chi-square, and exponential families. Motivated by an application in materials science, the method has also been carried out for several parametric families of distributions on the group of three-dimensional rotations. The power function against local alternatives is studied in detail and various properties are established; in particular, these tests are asymptotically admissible. The tests have been tried on several examples and show favorable performance compared to existing methods.
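As a toy instance of the Fourier approach to goodness of fit (far simpler than the compact-group tests developed in this thesis), the classical Rayleigh test checks uniformity on the circle by comparing the empirical first Fourier coefficient with its null value of zero:

```python
import numpy as np

def rayleigh_statistic(theta):
    """Rayleigh test statistic for uniformity on the circle:
    2n * |empirical first Fourier coefficient|^2, which is
    asymptotically chi-squared with 2 degrees of freedom under
    the uniform null.  Large values reject uniformity."""
    theta = np.asarray(theta)
    n = len(theta)
    z = np.exp(1j * theta).mean()    # empirical first Fourier coefficient
    return 2 * n * abs(z) ** 2

rng = np.random.default_rng(0)
uniform_sample = rng.uniform(0, 2 * np.pi, size=1000)
clustered_sample = rng.normal(loc=0.0, scale=0.3, size=1000) % (2 * np.pi)

t_null = rayleigh_statistic(uniform_sample)   # small: consistent with uniformity
t_alt = rayleigh_statistic(clustered_sample)  # large: uniformity rejected
```

The thesis's tests generalize this idea, replacing the single Fourier coefficient on the circle with representation-theoretic Fourier coefficients on compact groups.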
3781 2017 S  In-library use
Online 19. Optimization, random graphs, and spin glasses [electronic resource] [2017]
 Sen, Subhabrata.
 2017.
 Description
 Book — 1 online resource.
 Summary

This thesis studies a class of optimization problems on sparse random (hyper)graphs, with central focus on the optimum value of these problems on graphs with a large number of vertices. The first part introduces a general framework for studying the typical value of some combinatorial optimization problems on sparse random graphs, via a connection with mean-field spin glasses. As an application, we derive novel bounds on various combinatorial quantities such as MaxCut, min-bisection, and max-XORSAT. The second part of the thesis studies a semidefinite relaxation of the min-bisection problem on sparse random graphs. As a consequence, we provide near-optimal recovery guarantees for a natural semidefinite-programming-based algorithm for the community detection problem.
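The spin-glass bounds in this thesis are analytic, but the objects involved are easy to experiment with numerically. For instance, a simple local search on a sparse Erdos-Renyi graph is guaranteed to cut at least half the edges, since at a local optimum every vertex has at least half its incident edges crossing the cut. A sketch of that baseline (not a method from the thesis):

```python
import numpy as np

def local_search_maxcut(adj, seed=0):
    """Greedy local search for MaxCut: start from a random bipartition
    and flip any vertex whose switch strictly increases the cut,
    until no single flip helps."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    side = rng.integers(0, 2, size=n)
    improved = True
    while improved:
        improved = False
        for v in range(n):
            neighbors = np.flatnonzero(adj[v])
            cross = int(np.sum(side[neighbors] != side[v]))
            same = len(neighbors) - cross
            if same > cross:              # flipping v gains same - cross edges
                side[v] = 1 - side[v]
                improved = True
    cut = sum(1 for u in range(n) for w in range(u + 1, n)
              if adj[u, w] and side[u] != side[w])
    return side, cut

# sparse Erdos-Renyi graph with average degree about 3
rng = np.random.default_rng(1)
n = 60
adj = (rng.random((n, n)) < 3.0 / n).astype(int)
adj = np.triu(adj, 1)
adj = adj + adj.T
m = int(adj.sum()) // 2                   # number of edges
side, cut = local_search_maxcut(adj)
```

The thesis's contribution is to pin down how far above m/2 the true optimum sits on such graphs, in the large-n limit, via the spin-glass connection.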
3781 2017 S  In-library use
Online 20. Prediction and dimension reduction methods in computer experiments [electronic resource] [2017]
 Lee, Minyong R.
 2017.
 Description
 Book — 1 online resource.
 Summary

In many fields of engineering and science, computer experiments have become essential tools for studying physical processes. This dissertation reviews standard prediction and dimension reduction methods in the analysis of computer experiments and proposes new approaches. Response surface modeling is the starting point of the analysis of computer experiments, and kriging (Gaussian process regression) is widely used in constructing response surfaces. We propose Single Nugget Kriging, a method with better predictions at extreme values than standard kriging. Our predictor is robust to model mismatch in the covariance parameters, a desirable feature for computer simulations with a restricted number of data points. For high-dimensional computer experiments, dimension reduction methods in regression are essential for solving optimization problems and inverse problems. We compare model-free sufficient dimension reduction methods with the active subspace for computer experiments, and propose a modification of the active subspace. We further discuss the analysis of dimension reduction methods in computer experiments using projected Gaussian processes.
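Kriging's core computation is a linear solve against the covariance matrix of the training runs. The sketch below is a minimal Gaussian-process (kriging) interpolator with a squared-exponential covariance, shown in its standard form rather than the Single Nugget variant proposed in the thesis; the hyperparameter values are illustrative:

```python
import numpy as np

def krige(x_train, y_train, x_test, length_scale=0.3, nugget=1e-6):
    """Kriging (GP regression) predictor with zero prior mean and a
    squared-exponential covariance.  The small nugget regularizes
    the covariance matrix; with noise-free simulator output the
    predictor (nearly) interpolates the training points."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length_scale) ** 2)
    K = k(x_train, x_train) + nugget * np.eye(len(x_train))
    weights = np.linalg.solve(K, y_train)     # the core linear solve
    return k(x_test, x_train) @ weights

# toy "computer experiment": 8 noise-free runs of a smooth simulator
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x)
pred = krige(x, y, x)   # nearly reproduces the training responses
```

The thesis's Single Nugget Kriging modifies this standard predictor to behave better at extreme response values; that modification is not reproduced here.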
3781 2017 L  In-library use