articles+ search results
1,507 articles+ results
2. MMIL: A novel algorithm for disease associated cell type discovery

Craig, Erin, Keyes, Timothy, Sarno, Jolanda, Zaslavsky, Maxim, Nolan, Garry, Davis, Kara, Hastie, Trevor, and Tibshirani, Robert
 Subjects

Quantitative Biology - Quantitative Methods, Computer Science - Machine Learning, and Statistics - Methodology
 Abstract

Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.
Comment: Erin Craig and Timothy Keyes contributed equally to this work
 Full text View this record from Arxiv
3. A Fast and Scalable Pathwise-Solver for Group Lasso and Elastic Net Penalized Regression via Block-Coordinate Descent

Yang, James and Hastie, Trevor
 Subjects

Statistics - Computation, Computer Science - Machine Learning, Computer Science - Mathematical Software, and Computer Science - Software Engineering
 Abstract

We develop fast and scalable algorithms based on block-coordinate descent to solve the group lasso and the group elastic net for generalized linear models along a regularization path. Special attention is given when the loss is the usual least squares loss (Gaussian loss). We show that each block-coordinate update can be solved efficiently using Newton's method and further improved using an adaptive bisection method, solving these updates with a quadratic convergence rate. Our benchmarks show that our package adelie performs 3 to 10 times faster than the next fastest package on a wide array of both simulated and real datasets. Moreover, we demonstrate that our package is a competitive lasso solver as well, matching the performance of the popular lasso package glmnet.
 Full text View this record from Arxiv
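Illustrative note on record 3: the abstract says each block-coordinate update is solved with Newton's method. The sketch below is one possible reading of such an update for the least squares case, not the adelie implementation. It assumes the block subproblem is min over beta of 0.5 * beta' A beta - v' beta + lam * ||beta||_2 with A symmetric positive definite and lam > 0; the function name group_update and all argument names are hypothetical. The adaptive bisection refinement mentioned in the abstract is not included.

```python
import numpy as np

def group_update(A, v, lam, tol=1e-12, max_iter=100):
    """Solve min_b 0.5*b'Ab - v'b + lam*||b||_2 for one group (hypothetical helper)."""
    # If the gradient at zero is within the penalty ball, the whole group is zero.
    if np.linalg.norm(v) <= lam:
        return np.zeros_like(v, dtype=float)
    # Diagonalize A so the stationarity condition decouples per coordinate.
    d, Q = np.linalg.eigh(A)
    u = Q.T @ v
    # Newton's method on phi(h) = sum_i u_i^2 / (d_i*h + lam)^2 - 1, where h = ||b||.
    # phi is convex and decreasing for h >= 0, so starting at h = 0 the iterates
    # increase monotonically toward the root.
    h = 0.0
    for _ in range(max_iter):
        denom = d * h + lam
        phi = np.sum(u**2 / denom**2) - 1.0
        if abs(phi) < tol:
            break
        dphi = -2.0 * np.sum(u**2 * d / denom**3)
        h -= phi / dphi
    # Recover the solution b = (A + (lam/h) I)^{-1} v in the original basis.
    return Q @ (u * h / (d * h + lam))
```

A nonzero solution returned by this sketch should satisfy the stationarity condition A b - v + lam * b / ||b|| = 0 up to numerical tolerance, which is a convenient check when experimenting with it.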
4. Using Pretraining and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Menestrel, Thomas Le, Craig, Erin, Tibshirani, Robert, Hastie, Trevor, and Rivas, Manuel
 Subjects

Computer Science - Machine Learning, Quantitative Biology - Quantitative Methods, Statistics - Applications, and Statistics - Computation
 Abstract

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an underrepresentation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pretraining can enhance prediction accuracy but for a limited set of diseases and moderate improvements in accuracy.
 Full text View this record from Arxiv
5. The mosaic permutation test: an exact and nonparametric goodness-of-fit test for factor models

Spector, Asher, Barber, Rina Foygel, Hastie, Trevor, Kahn, Ronald N., and Candès, Emmanuel
 Subjects

Statistics - Methodology and 62H25 (Primary) 62G10, 62G09 (Secondary)
 Abstract

Financial firms often rely on factor models to explain correlations among asset returns. These models are important for managing risk, for example by modeling the probability that many assets will simultaneously lose value. Yet after major events, e.g., COVID-19, analysts may reassess whether existing models continue to fit well: specifically, after accounting for the factor exposures, are the residuals of the asset returns independent? With this motivation, we introduce the mosaic permutation test, a nonparametric goodness-of-fit test for preexisting factor models. Our method allows analysts to use nearly any machine learning technique to detect model violations while provably controlling the false positive rate, i.e., the probability of rejecting a well-fitting model. Notably, this result does not rely on asymptotic approximations and makes no parametric assumptions. This property helps prevent analysts from unnecessarily rebuilding accurate models, which can waste resources and increase risk. We illustrate our methodology by applying it to the Blackrock Fundamental Equity Risk (BFRE) model. Using the mosaic permutation test, we find that the BFRE model generally explains the most significant correlations among assets. However, we find evidence of unexplained correlations among certain real estate stocks, and we show that adding new factors improves model fit. We implement our methods in the python package mosaicperm.
Comment: 38 pages, 13 figures
 Full text View this record from Arxiv
6. Temporal dynamics of the multi-omic response to endurance exercise training

Bae, Dam, Dasari, Surendra, Dennis, Courtney, Evans, Charles R, Gaul, David A, Ilkayeva, Olga, Ivanova, Anna A, Kachman, Maureen T, Keshishian, Hasmik, Lanza, Ian R, Lira, Ana C, Muehlbauer, Michael J, Nair, Venugopalan D, Piehowski, Paul D, Rooney, Jessica L, Smith, Kevin S, Stowe, Cynthia L, Zhao, Bingqing, Clark, Natalie M, Jimenez-Morales, David, Lindholm, Malene E, Many, Gina M, Sanford, James A, Smith, Gregory R, Vetr, Nikolai G, Zhang, Tiantian, Almagro Armenteros, Jose J, Avila-Pacheco, Julian, Bararpour, Nasim, Ge, Yongchao, Hou, Zhenxin, Marwaha, Shruti, Presby, David M, Natarajan Raja, Archana, Savage, Evan M, Steep, Alec, Sun, Yifei, Wu, Si, Zhen, Jimmy, Bodine, Sue C, Esser, Karyn A, Goodyear, Laurie J, Schenk, Simon, Montgomery, Stephen B, Fernández, Facundo M, Sealfon, Stuart C, Snyder, Michael P, Adkins, Joshua N, Ashley, Euan, Burant, Charles F, Carr, Steven A, Clish, Clary B, Cutter, Gary, Gerszten, Robert E, Kraus, William E, Li, Jun Z, Miller, Michael E, Nair, K Sreekumaran, Newgard, Christopher, Ortlund, Eric A, Qian, Wei-Jun, Tracy, Russell, Walsh, Martin J, Wheeler, Matthew T, Dalton, Karen P, Hastie, Trevor, Hershman, Steven G, Samdarshi, Mihir, Teng, Christopher, Tibshirani, Rob, Cornell, Elaine, Gagne, Nicole, May, Sandy, Bouverat, Brian, Leeuwenburgh, Christiaan, Lu, Ching-ju, Pahor, Marco, Hsu, Fang-Chi, Rushing, Scott, Walkup, Michael P, Nicklas, Barbara, Rejeski, W Jack, Williams, John P, Xia, Ashley, Albertson, Brent G, Barton, Elisabeth R, Booth, Frank W, Caputo, Tiziana, Cicha, Michael, De Sousa, Luis Gustavo Oliveira, Farrar, Roger, Hevener, Andrea L, Hirshman, Michael F, Jackson, Bailey E, Ke, Benjamin G, Kramer, Kyle S, Lessard, Sarah J, Makarewicz, Nathan S, Marshall, Andrea G, and Nigro, Pasquale
 Nature. 629(8010)
 Subjects

Health Sciences, Sports Science and Exercise, Prevention, Genetics, Human Genome, Physical Activity, Behavioral and Social Science, Cardiovascular, Aetiology, 1.1 Normal biological development and functioning, Underpinning research, 2.1 Biological and endogenous factors, Inflammatory and immune system, Generic health relevance, Good Health and Well Being, Animals, Female, Humans, Male, Rats, Acetylation, Blood, Cardiovascular Diseases, Databases, Factual, Endurance Training, Epigenome, Inflammatory Bowel Diseases, Internet, Lipidomics, Metabolome, Mitochondria, Multiomics, Nonalcoholic Fatty Liver Disease, Organ Specificity, Phosphorylation, Physical Conditioning, Animal, Physical Endurance, Proteome, Proteomics, Time Factors, Transcriptome, Ubiquitination, Wounds and Injuries, MoTrPACStudy Group, Lead Analysts, MoTrPAC Study Group, and General Science & Technology
 Abstract

Regular exercise promotes whole-body health and prevents disease, but the underlying molecular mechanisms are incompletely understood [1-3]. Here, the Molecular Transducers of Physical Activity Consortium [4] profiled the temporal transcriptome, proteome, metabolome, lipidome, phosphoproteome, acetylproteome, ubiquitylproteome, epigenome and immunome in whole blood, plasma and 18 solid tissues in male and female Rattus norvegicus over eight weeks of endurance exercise training. The resulting data compendium encompasses 9,466 assays across 19 tissues, 25 molecular platforms and 4 training time points. Thousands of shared and tissue-specific molecular alterations were identified, with sex differences found in multiple tissues. Temporal multi-omic and multi-tissue analyses revealed expansive biological insights into the adaptive responses to endurance training, including widespread regulation of immune, metabolic, stress response and mitochondrial pathways. Many changes were relevant to human health, including non-alcoholic fatty liver disease, inflammatory bowel disease, cardiovascular health and tissue injury and recovery. The data and analyses presented in this study will serve as valuable resources for understanding and exploring the multi-tissue molecular effects of endurance training and are provided in a public repository (https://motrpac-data.org/).
 Full text View record at eScholarship
7. Factor Fitting, Rank Allocation, and Partitioning in Multilevel Low Rank Matrices

Parshakova, Tetiana, Hastie, Trevor, Darve, Eric, and Boyd, Stephen
 Subjects

Statistics - Machine Learning, Computer Science - Machine Learning, Computer Science - Mathematical Software, and Mathematics - Optimization and Control
 Abstract

We consider multilevel low rank (MLR) matrices, defined as a row and column permutation of a sum of matrices, each one a block diagonal refinement of the previous one, with all blocks low rank given in factored form. MLR matrices extend low rank matrices but share many of their properties, such as the total storage required and complexity of matrix-vector multiplication. We address three problems that arise in fitting a given matrix by an MLR matrix in the Frobenius norm. The first problem is factor fitting, where we adjust the factors of the MLR matrix. The second is rank allocation, where we choose the ranks of the blocks in each level, subject to the total rank having a given value, which preserves the total storage needed for the MLR matrix. The final problem is to choose the hierarchical partition of rows and columns, along with the ranks and factors. This paper is accompanied by an open source package that implements the proposed methods.
 Full text View this record from Arxiv
8. A Statistical View of Column Subset Selection

Sood, Anav and Hastie, Trevor
 Subjects

Statistics - Methodology, Computer Science - Data Structures and Algorithms, and Computer Science - Machine Learning
 Abstract

We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
 Full text View this record from Arxiv
9. Scalable solution to crossed random effects model with random slopes

Ghandwani, Disha, Ghosh, Swarnadip, Hastie, Trevor, and Owen, Art B.
 Subjects

Statistics - Methodology
 Abstract

The crossed random effects model is widely used, finding applications in various fields such as longitudinal studies, e-commerce, and recommender systems, among others. However, these models encounter scalability challenges, as the computational time for standard algorithms grows superlinearly with the number N of observations in the data set, commonly $\Omega(N^{3/2})$ or worse. Recent work has developed scalable methods for crossed random effects in linear models and some generalized linear models, but those works only allow for random intercepts. In this paper we devise scalable algorithms for models that include random slopes. This problem brings a substantial difficulty in estimating the random effect covariance matrices in a scalable way. We address that issue by using a variational EM algorithm. In simulations, we see that the proposed method is faster than standard methods. It is also more efficient than ordinary least squares which also has a problem of greatly underestimating the sampling uncertainty in parameter estimates. We illustrate the new method on a large dataset (five million observations) from the online retailer Stitch Fix.
 Full text View this record from Arxiv
10. RbX: Region-based explanations of prediction models

Lemhadri, Ismael, Li, Harrison H., and Hastie, Trevor
 Subjects

Statistics - Machine Learning and Computer Science - Machine Learning
 Abstract

We introduce region-based explanations (RbX), a novel, model-agnostic method to generate local explanations of scalar outputs from a black-box prediction model using only query access. RbX is based on a greedy algorithm for building a convex polytope that approximates a region of feature space where model predictions are close to the prediction at some target point. This region is fully specified by the user on the scale of the predictions, rather than on the scale of the features. The geometry of this polytope, specifically the change in each coordinate necessary to escape the polytope, quantifies the local sensitivity of the predictions to each of the features. These "escape distances" can then be standardized to rank the features by local importance. RbX is guaranteed to satisfy a "sparsity axiom," which requires that features which do not enter into the prediction model are assigned zero importance. At the same time, real data examples and synthetic experiments show how RbX can more readily detect all locally relevant features than existing methods.
Comment: 13 pages, 4 figures
 Full text View this record from Arxiv
11. Smooth multi-period forecasting with application to prediction of COVID-19 cases

Tuzhilina, Elena, Hastie, Trevor J., McDonald, Daniel J., Tay, J. Kenneth, and Tibshirani, Robert
 Subjects

Statistics - Methodology
 Abstract

Forecasting methodologies have always attracted a lot of attention and have become an especially hot topic since the beginning of the COVID-19 pandemic. In this paper we consider the problem of multi-period forecasting that aims to predict several horizons at once. We propose a novel approach that forces the prediction to be "smooth" across horizons and apply it to two tasks: point estimation via regression and interval prediction via quantile regression. This methodology was developed for real-time distributed COVID-19 forecasting. We illustrate the proposed technique with the CovidCast dataset as well as a small simulation example.
 Full text View this record from Arxiv
12. Confidence Intervals for the Generalisation Error of Random Forests

Rajanala, Samyak, Bates, Stephen, Hastie, Trevor, and Tibshirani, Robert
 Subjects

Statistics - Methodology
 Abstract

Out-of-bag error is commonly used as an estimate of generalisation error in ensemble-based learning models such as random forests. We present confidence intervals for this quantity using the delta-method-after-bootstrap and the jackknife-after-bootstrap techniques. These methods do not require growing any additional trees. We show that these new confidence intervals have improved coverage properties over the naive confidence interval, in real and simulated examples.
Comment: 25 pages, 8 tables, 8 figures
 Full text View this record from Arxiv
13. Publisher Correction: Principal component analysis

Greenacre, Michael, Groenen, Patrick J. F., Hastie, Trevor, D’Enza, Alfonso Iodice, Markos, Angelos, and Tuzhilina, Elena
 Nature Reviews Methods Primers. 3(1)
 Full text View on content provider's site
14. A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the USA

Walters, William A., Ley, Catherine, Hastie, Trevor, Ley, Ruth E., and Parsonnet, Julie
 BMC Medical Research Methodology. February 1, 2024, Vol. 24 Issue 1
 Full text View/download PDF
16. Weighted Low Rank Matrix Approximation and Acceleration

Tuzhilina, Elena and Hastie, Trevor
 Subjects

Statistics - Machine Learning, Computer Science - Machine Learning, and Statistics - Methodology
 Abstract

Low-rank matrix approximation (LRMA) is one of the central concepts in machine learning, with applications in dimension reduction, denoising, multivariate statistical methodology, and many more. A recent extension to LRMA is called low-rank matrix completion (LRMC). It solves the LRMA problem when some observations are missing and is especially useful for recommender systems. In this paper, we consider an element-wise weighted generalization of LRMA. The resulting weighted low-rank matrix approximation (WLRMA) technique therefore covers LRMC as a special case with binary weights. WLRMA has many applications. For example, it is an essential component of GLM optimization algorithms, where an exponential family is used to model the entries of a matrix, and the matrix of natural parameters admits a low-rank structure. We propose an algorithm for solving the weighted problem, as well as two acceleration techniques. Further, we develop a non-SVD modification of the proposed algorithm that is able to handle extremely high-dimensional data. We compare the performance of all the methods on a small simulation example as well as a real-data application.
 Full text View this record from Arxiv
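Illustrative note on record 16: the abstract does not spell out the proposed algorithm, so the sketch below uses a different, well-known baseline for the same weighted problem, namely an iterative fill-in scheme that alternates between blending the data with the current fit and taking a truncated SVD. It assumes weights in [0, 1] and minimizes the sum over entries of W * (M - X)^2 subject to a rank constraint on X; with binary weights this reduces to a simple matrix completion routine. All function names are hypothetical, and none of the paper's acceleration techniques are included.

```python
import numpy as np

def truncated_svd(Y, rank):
    """Best rank-`rank` approximation of Y in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def weighted_objective(M, W, X):
    """Weighted squared error: sum of W * (M - X)^2 over all entries."""
    return float(np.sum(W * (M - X) ** 2))

def wlrma_baseline(M, W, rank, n_iter=200):
    """Iterative fill-in baseline for weighted low-rank approximation.

    Assumes every weight lies in [0, 1]. Each step replaces the down-weighted
    part of M with the current fit, then projects onto rank-`rank` matrices.
    """
    X = np.zeros_like(M, dtype=float)
    history = [weighted_objective(M, W, X)]
    for _ in range(n_iter):
        Y = W * M + (1.0 - W) * X      # blend observed data with the current fit
        X = truncated_svd(Y, rank)     # project onto the low-rank set
        history.append(weighted_objective(M, W, X))
    return X, history
```

Each iteration minimizes a quadratic upper bound on the weighted objective that is tight at the current fit, so the recorded objective values should never increase. Convergence can be slow when many weights are small, which is one reason acceleration is of interest.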
17. LinCDE: Conditional Density Estimation via Lindsey's Method

Gao, Zijun and Hastie, Trevor
 Subjects

Statistics - Methodology
 Abstract

Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and nonnegative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE's efficacy through extensive simulations and three real data examples.
Comment: 50 pages, 20 figures
 Full text View this record from Arxiv
18. Feature-weighted elastic net: Using 'features of features' for 'better prediction'

Tay, J. Kenneth, Aghaeepour, Nima, Hastie, Trevor, and Tibshirani, Robert
 Statistica Sinica. 33(1):259-280
 Full text View on content provider's site
19. Scalable logistic regression with crossed random effects

Ghosh, Swarnadip, Hastie, Trevor, and Owen, Art B.
 Subjects

Statistics - Methodology, Mathematics - Statistics Theory, and Statistics - Computation
 Abstract

The cost of both generalized least squares (GLS) and Gibbs sampling in a crossed random effects model can easily grow faster than $N^{3/2}$ for $N$ observations. Ghosh et al. (2020) develop a backfitting algorithm that reduces the cost to $O(N)$. Here we extend that method to a generalized linear mixed model for logistic regression. We use backfitting within an iteratively reweighted penalized least squares algorithm. The specific approach is a version of penalized quasi-likelihood due to Schall (1991). A straightforward version of Schall's algorithm would also cost more than $N^{3/2}$ because it requires the trace of the inverse of a large matrix. We approximate that quantity at cost $O(N)$ and prove that this substitution makes an asymptotically negligible difference. Our backfitting algorithm also collapses the fixed effect with one random effect at a time in a way that is analogous to the collapsed Gibbs sampler of Papaspiliopoulos et al. (2020). We use a symmetric operator that facilitates efficient covariance computation. We illustrate our method on a real dataset from Stitch Fix. By properly accounting for crossed random effects we show that a naive logistic regression could underestimate sampling variances by several hundred fold.
Comment: 32 pages, 5 figures
 Full text View this record from Arxiv
20. Cross-validation: what does it estimate and how well does it do it?

Bates, Stephen, Hastie, Trevor, and Tibshirani, Robert
 Subjects

Statistics - Methodology, Mathematics - Statistics Theory, Statistics - Computation, and Statistics - Machine Learning
 Abstract

Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.
 Full text View this record from Arxiv
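Illustrative note on record 20: the "standard confidence intervals" that the abstract warns about can be sketched as follows for ordinary least squares with squared-error loss. This is a minimal sketch of the naive interval only, treating the held-out errors as if they were independent; it is not the nested cross-validation scheme the paper proposes. The function name and arguments are hypothetical.

```python
import numpy as np

def naive_cv_interval(X, y, n_folds=10, z=1.96, seed=0):
    """K-fold CV estimate of squared prediction error with the naive normal interval."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit OLS on the training folds, then score the held-out fold.
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    est = errors.mean()
    # Naive standard error: treats all n held-out errors as independent,
    # which is the assumption the abstract says makes the variance too small.
    se = errors.std(ddof=1) / np.sqrt(n)
    return est, est - z * se, est + z * se
```

Because the folds share training data, the per-point errors are correlated, so an interval built this way may be narrower than its nominal level suggests.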