Online 1. Raman Spectroscopy Deconvolution Using Machine Learning [2019]
- Wang, Winston (Author)
- August 15, 2022; May 2019
- Description
- Book
- Summary
-
Proper characterization of a tumor is essential for informing treatment and assessing prognosis for the patient. The concentration of certain biomarkers, presence or absence of specific mutations, and even the pattern of distribution of biomarkers throughout the tumor can be extremely important in determining the aggressiveness of the tumor. For example, tumors that are more homogeneous tend to be less aggressive and have a better prognosis than those that are more heterogeneous. Therefore, we seek to image multiple biomarkers in vivo using targeted dyes. However, imaging the tumor multiple times in series without an invasive biopsy is much too time consuming. Usually, for these targeted dyes, imaging occurs 5-7 days after the imaging agent is administered, meaning that imaging any more than one or two biomarkers would be prohibitively lengthy, possibly affecting timely treatment of the tumor. Therefore, we can utilize unique surface-enhanced resonance Raman scattering (SERRS) particles to target distinct biomarkers. In order to determine the concentration of each particle at each point, the individual spectra need to be deconvolved from the multiplexed spectra. Conventional methods for separating the spectra, such as nonnegative least squares (NNLS), have been successful for low numbers of spectra. However, NNLS must calculate the pseudoinverse, and as the number of spectra increases, the condition number for that matrix increases quickly. Thus, past five spectra or so, using NNLS to deconvolve the spectra becomes untenable. We therefore aim to use machine learning as an alternative to NNLS, with the potential to expand spectral deconvolution to more spectra accurately and quickly during run time.
- Digital collection
- Stanford Theses and Dissertations
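A minimal sketch of the NNLS deconvolution step described in the abstract above (the sizes and reference spectra here are invented for illustration, not taken from the thesis): recover per-particle concentrations from a multiplexed spectrum, and inspect the growing condition number that motivates a machine learning alternative.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_channels, n_particles = 1024, 5                       # hypothetical sizes
A = rng.random((n_channels, n_particles))               # one reference spectrum per particle
true_conc = np.array([0.8, 0.0, 0.3, 1.2, 0.5])
mixed = A @ true_conc + 0.01 * rng.standard_normal(n_channels)

conc, resid = nnls(A, mixed)          # argmin ||A c - mixed||_2 subject to c >= 0
print(np.round(conc, 2))              # recovered concentrations per particle

# The instability the abstract describes: the condition number of the
# reference matrix grows as more spectra are multiplexed together.
print(np.linalg.cond(A))
```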
Online 2. A complexity-theoretic perspective on fairness [2020]
- Kim, Michael Pum-Shin, author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
Algorithms make predictions about people constantly. The spread of such prediction systems---from precision medicine to targeted advertising to predictive policing---has raised concerns that algorithms may perpetrate unfair discrimination, especially against individuals from minority groups. While it's easy to speculate on the risks of unfair prediction, devising an effective definition of algorithmic fairness is challenging. Most existing definitions tend toward one of two extremes---individual fairness notions provide theoretically-appealing protections but present practical challenges at scale, whereas group fairness notions are tractable but offer marginal protections. In this thesis, we propose and study a new notion---multi-calibration---that strengthens the guarantees of group fairness while avoiding the obstacles associated with individual fairness. Multi-calibration requires that predictions be well-calibrated, not simply on the population as a whole but simultaneously over a rich collection of subpopulations C. We specify this collection---which parameterizes the strength of the multi-calibration guarantee---in terms of a class of computationally-bounded functions. Multi-calibration protects every subpopulation that can be identified within the chosen computational bound. Despite such a demanding requirement, we show a generic reduction from learning a multi-calibrated predictor to (agnostic) learning over the chosen class C. This reduction establishes the feasibility of multi-calibration: taking C to be a learnable class, we can achieve multi-calibration efficiently (both statistically and computationally). To better understand the requirement of multi-calibration, we turn our attention from fair prediction to fair ranking. We establish an equivalence between a semantic notion of domination-compatibility in rankings and the technical notion of multi-calibration in predictors---while conceived from different vantage points, these concepts encode the same notion of evidence-based fairness. This alternative characterization illustrates how multi-calibration affords qualitatively different protections than standard group notions
- Also online at
-
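A toy illustration of the multi-calibration requirement described in the record above (a check on synthetic data, not Kim's algorithm; the features, predictor, and collection C are invented): within every subpopulation in the collection C, the mean prediction in each bin should match the empirical outcome rate.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((5000, 3))                                 # synthetic features
y = (rng.random(5000) < x[:, 0]).astype(float)            # outcome depends only on x0
pred = np.clip(x[:, 0] + 0.1 * (x[:, 1] > 0.5), 0, 1)     # predictor biased on one subgroup

# C: subpopulations identified by simple (computationally bounded) functions.
C = {"everyone": np.ones(len(y), dtype=bool),
     "x1 > 0.5": x[:, 1] > 0.5,
     "x2 < 0.3": x[:, 2] < 0.3}

bins = np.linspace(0, 1, 11)
for name, mask in C.items():
    b = np.digitize(pred[mask], bins)
    gaps = [abs(pred[mask][b == k].mean() - y[mask][b == k].mean())
            for k in np.unique(b) if (b == k).sum() > 50]
    print(f"{name}: worst calibration gap = {max(gaps):.3f}")  # large gap -> violation
```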
Online 3. Data science for social equality [2020]
- Pierson, Emma, author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
Recent work in algorithmic fairness has highlighted the ways in which machine learning and data science can exacerbate already profound social inequalities. While invaluable, this work should not cause us to lose sight of the more optimistic counterpoint: that machine learning and data science have the potential to also reduce social inequality if properly applied. This dissertation explores this potential. In the first half of the dissertation, we provide two examples illustrating how data science and machine learning can improve healthcare for underserved populations. We first develop a deep learning algorithm that identifies pain-relevant features in knee osteoarthritis x-rays which conventional severity measures overlook, but which help explain higher pain levels in black, lower-income, and lower-education patients. Second, we use data from a women's health app to decompose women's mood, behavior, and vital signs into four simultaneous cycles --- daily, weekly, seasonal, and menstrual --- and reveal that the menstrual cycle, though often invisible in past analyses, is the largest of the four cycles. In the second half of the dissertation, we provide two examples illustrating how data science and machine learning can detect bias in human decision-making, focusing on policing as an application domain. We first describe a new family of probability distributions and use them to accelerate a Bayesian test for discrimination by two orders of magnitude, allowing it to scale to much larger datasets. We then apply this test to a national dataset of traffic stops which we collect via public records requests and publicly release. The methods we develop are more broadly applicable to assessing bias in many other human decisions.
- Also online at
-
Online 4. Deep learning in computational biology : from predictive modeling to knowledge extraction [2022]
- Wu, Zhenqin, author.
- [Stanford, California] : [Stanford University], 2022
- Description
- Book — 1 online resource
- Summary
-
The rapid development of deep learning methods has transformed concepts and pipelines in the analysis of large-scale data cohorts. In parallel, datasets of unprecedented size and diversity stemming from novel biological experimental techniques have largely exceeded the capacity of conventional human-engineered tools. Driven by the versatility and expressive power of deep neural networks, the past few years have witnessed a burst in efforts to incorporate deep learning-based techniques to model the rich information from experimental data. In addition to the need for accurate predictive modeling, biological research problems place great emphasis on model interpretability, aiming to unravel the underlying mechanism by extracting model-learned knowledge. With these challenges posed by the new techniques and datasets in mind, I present three works in this thesis that developed deep learning-based tools to model, analyze and understand various types of molecular and cellular data. In the first project, we summarized methods and datasets for molecular machine learning and proposed a large-scale benchmark MoleculeNet to facilitate the comparison of model efficacy. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high-quality open-source implementations of multiple molecular featurization and learning algorithms. MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance, though learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. We further recognized that for quantum mechanical and biophysical datasets, the use of physics-aware featurization can be more important than the choice of modeling algorithm. In the second project, we proposed an automated analysis tool: DynaMorph for quantitative live-cell imaging. DynaMorph is composed of multiple modules sequentially applied to perform cell segmentation, tracking, and self-supervised morphology encoding. We employed DynaMorph to learn the cellular morphodynamics of live microglia through label-free measurements of optical density and anisotropy. These cells show complex behavior and have varied responses to disease-relevant perturbations. DynaMorph generates quantitative morphodynamic representations that can be used to compare the effects of the perturbations. Furthermore, by analyzing DynaMorph representations we identify distinct morphodynamic states of microglia polarization and detect rare transition events between states. In the third project, I studied spatial cellular community structures based on multiplex immunofluorescence imaging. By parsing high-resolution immunofluorescence images as graphical representations of cellular communities, we developed SPAtial CEllular Graphical Modeling (SPACE-GM), a geometric deep learning framework that models tumor microenvironments as cellular graphs. We applied SPACE-GM to human head-and-neck and colorectal cancer samples assayed with 40-plex immunofluorescence imaging to identify spatial motifs associated with patient survival and recurrence outcomes after immunotherapy. SPACE-GM achieves substantially higher accuracy in predicting patient outcomes than previous approaches based on neighborhood cell-type compositions. 
Computational interpretation of the disease-relevant microenvironments identified by SPACE-GM generates insights into the effect of spatial dispersion of tumor cells and granulocytes on patient prognosis
- Also online at
-
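The MoleculeNet benchmark summarized in the record above is distributed through the DeepChem library; here is a brief sketch of the standard workflow (the dataset, featurizer, and model choices are illustrative, and the calls reflect recent DeepChem releases rather than the thesis code).

```python
import numpy as np
import deepchem as dc

# Load one MoleculeNet dataset with circular-fingerprint features and a scaffold split.
tasks, (train, valid, test), _ = dc.molnet.load_tox21(featurizer="ECFP",
                                                      splitter="scaffold")

model = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                      n_features=train.X.shape[1],
                                      layer_sizes=[512])
model.fit(train, nb_epoch=10)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(test, [metric]))   # mean ROC-AUC over the Tox21 tasks
```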
- Ginart, Antonio Alejandro, author.
- [Stanford, California] : [Stanford University], 2022
- Description
- Book — 1 online resource
- Summary
-
In this work, we explore theory and algorithms that improve the efficiency of various aspects of machine learning systems. First, we investigate algorithmic principles that enable efficient machine unlearning. We propose two unsupervised learning algorithms which achieve an over 100-fold improvement in online data deletion, while producing clusters of comparable statistical quality to a canonical k-means++ baseline. Second, we explore mixed dimension embeddings, an embedding layer architecture in which a particular embedding vector's dimension scales with its query frequency. Through theoretical analysis and systematic experiments, we demonstrate that using mixed dimensions can drastically reduce memory usage, while maintaining and even improving predictive performance. Mixed dimension layers improve accuracy by 0.1% using half as many parameters or maintain it using 16 times fewer parameters for click-through rate prediction on the Criteo Kaggle dataset. They also train over 2 times faster on a GPU. Finally, we propose a novel approach, MLDemon, for ML deployment monitoring. MLDemon integrates both unlabeled data and a small number of on-demand labels to produce a real-time estimate of a deployed model's current accuracy on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. MLDemon compares favorably to prior methods on benchmarks. We also provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.
- Also online at
-
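A minimal PyTorch sketch of the mixed dimension embedding idea summarized above (the block sizes, widths, and class name are mine, not the thesis implementation): frequent items get a wide embedding table, rare items a narrow one, and each block is projected to a shared output width.

```python
import torch
import torch.nn as nn

class MixedDimEmbedding(nn.Module):
    def __init__(self, block_sizes, block_dims, out_dim):
        super().__init__()
        self.out_dim = out_dim
        # offsets[i] is the first id handled by block i
        self.offsets = [0]
        for n in block_sizes[:-1]:
            self.offsets.append(self.offsets[-1] + n)
        self.embs = nn.ModuleList(nn.Embedding(n, d) for n, d in zip(block_sizes, block_dims))
        self.projs = nn.ModuleList(nn.Linear(d, out_dim, bias=False) for d in block_dims)

    def forward(self, ids):
        out = torch.empty(len(ids), self.out_dim)
        for emb, proj, off in zip(self.embs, self.projs, self.offsets):
            mask = (ids >= off) & (ids < off + emb.num_embeddings)
            if mask.any():
                out[mask] = proj(emb(ids[mask] - off))
        return out

# ids 0..999 are frequent items (16-dim block); ids 1000..100999 are rare (4-dim block).
layer = MixedDimEmbedding(block_sizes=[1000, 100000], block_dims=[16, 4], out_dim=16)
print(layer(torch.tensor([3, 50000])).shape)   # torch.Size([2, 16])
```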
Online 6. Sleep and death : the relationship between REM sleep and mortality [2020]
- Leary, Eileen B., author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
Sleep is a non-negotiable requirement for a happy, healthy life. In the last 70 years, our understanding of sleep has grown exponentially. However, in our busy society, sleep is often overlooked and undervalued. This is surprising given that sleep disorders and sleep dysregulation have been linked to multiple systemic and brain-based diseases, including cardiovascular disease, type 2 diabetes, dementia, and major depressive disorder. Additionally, sleep disorders and sleep characteristics (e.g. sleep duration) have been linked to higher rates of mortality. Despite the emerging evidence of a sleep-mortality association, the mechanisms underlying the relationship are not well understood. Little is known about how the proportion of time spent in each sleep stage relates to the timing or cause of death. This dissertation is an in-depth investigation of the relationship between rapid eye movement (REM) sleep and risk of mortality. Specific aim one combines traditional and machine learning analytic approaches to evaluate whether lower levels of REM sleep would be associated with an increased rate of mortality. Sleep is composed of multiple sleep stages that by nature are highly correlated. Therefore, it is necessary to tease apart whether another sleep stage could be a better predictor of mortality. Aim two used supervised machine learning to rank the four sleep stages from most to least predictive in the context of one another. The hypotheses were that increased mortality rates would be associated with lower quantities of REM sleep and that, compared to other sleep stages, REM would be the best predictor of mortality. Specific aim three was to evaluate the validity, consistency, and generalizability of the findings. To do this, the final models were validated in two independent cohorts and the results from all three cohorts were combined in a meta-analysis.

Materials and Methods: Three longitudinal, population-based cohorts were used in this project. The Osteoporotic Fractures in Men (MrOS) sample included 2,675 older men (mean age 76.3 years ± 5.5 years) recruited from 2003 to 2005 and followed for a median of 12.1 years. The Wisconsin Sleep Cohort (WSC) started in 1988 and followed 1,386 participants (45.7% women, mean age 51.5 years ± 8.5) for a median of 20.8 years. The Sleep Heart Health Study (SHHS) comprised 5,550 participants (52.4% women, mean age 63.0 years ± 11.2) recruited between 1995 and 1998 and monitored for a median of 11.9 years. The exposure was percent of total sleep time spent in REM sleep and was evaluated at baseline using polysomnography. The main outcomes included all-cause and cause-specific (cardiovascular, cancer, other) mortality confirmed with death certificates. Cox proportional hazards regression models were used to evaluate the association between percent REM and mortality. The first model contained a core set of covariates selected a priori based on existing literature and clinical experience. Additional covariates commonly associated with sleep architecture were evaluated using 6-fold cross-validation with a forward step-wise feature selection algorithm to obtain the best candidates for the final multivariate regression models. A threshold effect was suspected based on Kaplan-Meier curves, so separate models were run with percent REM as a binary variable using 15% as the cut-point.
Conditional inference survival tree and random survival forest analyses were conducted to identify which sleep stage(s) were driving the significance of the finding and to evaluate relevant cut-points. Several sensitivity analyses were completed to rule out alternative explanations for the findings. The findings were replicated using data from the Wisconsin Sleep Cohort (WSC) and Sleep Heart Health Study (SHHS). A meta-analysis pooled and weighted the results from all three studies to provide a global quantification of the hazard ratio.

Results: MrOS participants had a 13% higher mortality rate for every 5% reduction in REM sleep (percent REM standard deviation = 6.6%) after adjusting for multiple demographic, sleep, and health covariates including study site, age at sleep visit, race, education, medication use, smoking status, caffeine intake, respiratory disturbance index, and actigraphy measures (age-adjusted hazard ratio [HR] = 1.12, fully adjusted HR = 1.13, 95% CI, 1.08--1.19). The association was also present for cardiovascular disease-related mortality (CVD) (HR = 1.18, 95% CI, 1.09--1.28), cancer-related mortality (HR = 1.14, 95% CI, 1.03--1.26), and non-cardiovascular, non-cancer related mortality (HR = 1.19, 95% CI, 1.10--1.28). Individuals with < 15% REM had a higher mortality rate relative to individuals with ≥ 15% for each mortality outcome, with odds ratios ranging from 1.20 to 1.35. The random forest model identified REM as the most important sleep stage for predicting survival. In the WSC, the effect size for a 5% reduction in REM on risk of all-cause mortality was similar despite the younger age, inclusion of women, and longer follow-up period (HR = 1.17, 95% CI, 1.03--1.34). When stratified by gender, lower percent REM was associated with all-cause mortality in women (HR = 1.34, 95% CI, 1.07--1.68) but was not statistically significant in men (HR = 1.09, 95% CI, 0.92--1.30). In the SHHS, results were consistent with the other cohorts, with a 13% increase in all-cause mortality rate for every 5% reduction in REM (HR = 1.13, 95% CI, 1.07--1.18) and a 7% increase in cardiovascular mortality rate (HR = 1.07, 95% CI, 0.97--1.17). Unlike WSC, when stratified by gender, the hazard ratio was higher in men (HR = 1.16, 95% CI, 1.08--1.17) than women (HR = 1.09, 95% CI, 1.02--1.17). Meta-analysis of the three cohorts yielded an overall hazard ratio of 1.13 (95% CI 1.10--1.17) for all-cause mortality and 1.10 (95% CI 1.03--1.16) for cardiovascular mortality.

Conclusion and Relevance: There was a robust association between lower levels of REM sleep and mortality in three independent cohorts, which persisted across different causes of death and multiple sensitivity analyses. Given the complex underlying biological functions, further studies are required to understand whether the relationship is causal. Accelerated brain aging may result in reduced REM sleep, making it a marker of disease, frailty, or biologic aging rather than a direct mortality risk factor. Mechanistic studies are needed, and strategies to preserve REM may influence clinical therapies and reduce mortality risk, particularly for adults with < 15% REM.
- Also online at
-
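For readers unfamiliar with the modeling described in the record above, here is a compact Cox proportional hazards example using the lifelines library on synthetic data (the variables mirror the abstract, but the data are random stand-ins, not the MrOS/WSC/SHHS cohorts).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "pct_rem": rng.normal(19.0, 6.6, n),          # % of total sleep time in REM
    "age": rng.normal(76.3, 5.5, n),
    "followup_years": rng.exponential(12.0, n),   # time to death or censoring
    "died": rng.integers(0, 2, n),                # 1 = death observed
})

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="died")
cph.print_summary()
# exp(coef) for pct_rem is the hazard ratio per 1% REM; the abstract reports
# HRs per 5% reduction in REM (e.g. 1.13 for all-cause mortality in MrOS).
```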
Online 7. Toward faster and more data-efficient computational biology [2019]
- Zhang, Jinye, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
Limited computing power and limited sample size are two central challenges in computational biology. While next-generation sequencing technology offers a highly scalable way to measure genomic information, the size of the data for a single biological sample can be as large as tens of gigabytes and presents a tremendous challenge for data storage and processing. At the same time, the data is usually ultra-high-dimensional, e.g., tens of thousands of genes or millions of mutations, requiring a large number of samples for effective inference. In this thesis, I address these two challenges by designing fundamentally better algorithms for several key applications. First, we address the limited computing power problem by presenting two works on algorithm acceleration using a strategy that we call adaptive Monte Carlo computation. This strategy first converts the deterministic computational problem into a statistical estimation problem and then accelerates the process by adaptive sampling. Then we move on to the limited sample size problem and consider two aspects, i.e., optimizing the experimental design and borrowing information from other datasets. For the former, we present work on the optimal experimental design for single-cell RNA-Seq. For the latter, we consider multiple hypothesis testing using side information and dimensionality reduction guided by additional datasets.
- Also online at
-
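A toy sketch of the "adaptive Monte Carlo computation" strategy named above (not the thesis algorithms): recast a deterministic computation, here a mean over a very large array, as an estimation problem and sample adaptively until the confidence interval is tight enough.

```python
import numpy as np

def adaptive_mean(x, eps=0.001, batch=1000, z=2.58):
    """Estimate x.mean() to within +/- eps (approx. 99% CI) by adaptive sampling."""
    rng = np.random.default_rng(0)
    samples = rng.choice(x, size=batch)
    while True:
        half_width = z * samples.std(ddof=1) / np.sqrt(len(samples))
        if half_width < eps or len(samples) >= len(x):
            return samples.mean(), half_width, len(samples)
        samples = np.append(samples, rng.choice(x, size=batch))

x = np.random.default_rng(1).random(10_000_000)   # stand-in for an expensive exact pass
est, err, used = adaptive_mean(x)
print(f"estimate {est:.4f} +/- {err:.4f} from {used} of {len(x)} values")
```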
Online 8. Deep learning for inverse design of photonic devices [2022]
- Jiang, Jiaqi (Researcher of photonic devices) author.
- [Stanford, California] : [Stanford University], 2022
- Description
- Book — 1 online resource
- Summary
-
Inverse design of photonic devices uses optimization algorithms to discover optical structures with desired functional characteristics. However, most of these inverse design problems are non-convex in a very high-dimensional space. This thesis will discuss the use of deep learning as an efficient tool for the inverse design of photonic devices. First, I apply Generative Adversarial Networks (GANs) to 3D metagrating design to augment high-performing device patterns. Next, I introduce Global Optimization Networks (GLOnets), which replace the discriminator in GANs with an electromagnetic solver and then train the generator directly from the solver by backpropagating gradients calculated by the adjoint variable method. Then we apply GLOnets to optical multi-layer thin-film stack design. Next, we analyze the mathematical principles behind GLOnets and explain their advantages. Finally, I discuss the application of neural networks as surrogate simulators to speed up simulations in the inverse design. Overall, we envision that combining the modeling capability of deep neural networks and existing physics knowledge could transform the way photonic systems are simulated and designed.
- Also online at
-
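A heavily simplified, illustrative GLOnet-style training loop (the solver below is a differentiable toy stand-in, and the exponential reweighting is my simplification of the published loss, so treat this purely as a sketch of the idea): a generator proposes device patterns, a solver scores their efficiency, and the generator is updated to favor high-efficiency designs.

```python
import torch
import torch.nn as nn

def efficiency(pattern):
    # Toy differentiable stand-in for the electromagnetic solver plus adjoint gradients.
    target = torch.linspace(-1.0, 1.0, pattern.shape[-1])
    return -((pattern - target) ** 2).mean(dim=-1)

gen = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128), nn.Tanh())
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
sigma = 0.5                                   # temperature of the reweighting

for step in range(500):
    z = torch.randn(32, 16)                   # noise -> a batch of candidate patterns
    eff = efficiency(gen(z))
    loss = -torch.exp(eff / sigma).mean()     # weight high-efficiency candidates more
    opt.zero_grad()
    loss.backward()
    opt.step()

print("best efficiency in a fresh batch:", efficiency(gen(torch.randn(256, 16))).max().item())
```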
Online 9. Machine learning for clinical trials and precision medicine [2022]
- Liu, Ruishan, author.
- [Stanford, California] : [Stanford University], 2022
- Description
- Book — 1 online resource
- Summary
-
Machine learning (ML) has been widely applied in biomedicine and healthcare. The growing abundance of medical data and the advance of biological technologies (e.g. next-generation sequencing) have offered great opportunities for using ML in computational biology and health. In this thesis, I present my works contributing to this emerging field in three aspects --- using large-scale datasets to advance medical studies, developing algorithms to solve biological challenges, and building analysis tools for new technologies. In the first part, I present two works applying ML to large-scale real-world data: one for clinical trial design and one for precision medicine. Overly restrictive eligibility criteria have been a key barrier for clinical trials. In the thesis, I introduce a powerful computational framework, Trial Pathfinder, which enables inclusive criteria and data valuation for clinical trials. A critical goal for precision medicine is to characterize how patients with specific genetic mutations respond to therapies. In the thesis, I present a systematic pan-cancer analysis of mutation-treatment interactions using large real-world clinico-genomics data. In the second part, I introduce my work on developing algorithms to solve a biological challenge --- aligning multiple datasets with subset correspondence information. In many biological and medical applications, we have multiple related datasets from different sources or domains, and learning efficient computational mappings between these datasets is an important problem. In the thesis, I present an end-to-end optimal transport framework that effectively leverages side information to align datasets. Finally, I present my work on developing analysis tools for new technologies --- spatial transcriptomics and RNA velocity. Recently, high-throughput image-based transcriptomic methods were developed and enabled researchers to spatially resolve gene expression variation at the molecular level for the first time. In the thesis, I describe a general analysis tool to quantitatively study the spatial correlations of gene expression in fixed tissue sections. Recent development in inferring RNA velocity from single-cell RNA-seq opens up exciting new vistas into developmental lineage and cellular dynamics. In the thesis, I introduce a principled computational framework that extends RNA velocity to quantify systems-level dynamics and improve single-cell data analysis.
- Also online at
-
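A small sketch of aligning two related datasets with entropic optimal transport via the POT library; the subset correspondence side information central to the thesis is not modeled here, so this only shows the basic alignment step on synthetic data.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, (200, 5))
target = rng.normal(0.5, 1.0, (300, 5))          # a shifted version of the same domain

a = np.full(len(source), 1.0 / len(source))      # uniform marginal weights
b = np.full(len(target), 1.0 / len(target))
M = ot.dist(source, target)                      # pairwise squared Euclidean costs
plan = ot.sinkhorn(a, b, M, reg=0.05)            # entropically regularized coupling

# Barycentric projection: map each source point onto the target domain.
aligned = (plan / plan.sum(axis=1, keepdims=True)) @ target
print(aligned.shape)                             # (200, 5)
```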
Online 10. Developing new expression parts and biosynthetic pathways for yeast synthetic biology [2019]
- Kotopka, Benjamin John, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
Baker's yeast (Saccharomyces cerevisiae) has been used in bioproduction for millennia. Current efforts in synthetic biology seek to further extend yeast's capabilities, enabling applications such as producing useful and valuable plant natural products in a cheaper and less resource-intensive manner than current methods, which rely on cultivation of the native producer plants. This dissertation opens with a review (Chapter 1) of current synthetic biology strategies for producing plant natural products in heterologous hosts: model plants, bacteria, and yeast. As a case study of heterologous phytochemical production, we describe a functional reconstitution of the dhurrin pathway from sorghum (Sorghum bicolor) in yeast (Chapter 2). We produced this plant defense compound at titers of over 80 mg/L, and demonstrated a workflow using our dhurrin-producing strain to explore the activities of other S. bicolor genes. Further, we developed a method for model-driven generation of artificial yeast promoters (Chapter 3). Promoters - DNA sequences that appear 5' to genes and modulate their expression - play a central role in controlling gene regulation; however, a small set of native promoters is used for most genetic construct design in S. cerevisiae, limiting engineers' ability to control gene expression in this organism. The ability to generate and utilize models that accurately predict protein expression from promoter sequence may enable rapid generation of novel useful promoters, facilitating synthetic biology efforts in yeast. We measured the activity of over 675,000 unique sequences in a constitutive promoter library, and over 327,000 sequences in a library of inducible promoters. Training an ensemble of convolutional neural networks jointly on the two datasets enabled very high predictive accuracies on multiple prediction tasks. We developed model-guided design strategies which yielded large, sequence-diverse sets of novel promoters exhibiting activities similar to current best-in-class sequences. In addition to providing large sets of new promoters, our results show the value of model-guided design as an approach for generating DNA parts. The final chapter discusses the outlook for the field and possible extensions to the work presented here (Chapter 4). Taken together, our work shows the value of yeast as a heterologous host for producing plant natural products, and offers a means to develop novel genetic parts to further expand its usefulness.
- Also online at
-
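A minimal convolutional model of the kind the record above describes, predicting expression from a one-hot encoded promoter sequence (the architecture, sizes, and sequence length are illustrative; the thesis trains an ensemble on hundreds of thousands of measured sequences).

```python
import torch
import torch.nn as nn

class PromoterCNN(nn.Module):
    def __init__(self, seq_len=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=8, padding=4), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=8, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):                      # x: (batch, 4, seq_len), one-hot A/C/G/T
        return self.head(self.conv(x)).squeeze(-1)

model = PromoterCNN()
onehot = torch.zeros(16, 4, 80)
onehot[:, 0, :] = 1.0                          # a dummy all-"A" batch
print(model(onehot).shape)                     # torch.Size([16]) predicted expression
```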
Online 11. Rethinking single-cell RNA-Seq analysis [2019]
- Zhang, Jesse Min, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
Since the Human Genome Project was completed in 2003, scientists have developed technologies for measuring the RNA content of a single cell. In the last decade, the number of individual cells profiled per study has grown exponentially to over 1,000,000 cells. In this thesis, I will discuss some of the computational and statistical challenges associated with the analysis of such large single-cell datasets. After introducing background information, the thesis covers three main works. The first work introduces a novel, interpretable framework with the biologist end user in mind. The framework also addresses the clustering subjectivity issue by justifying its results based on a rigorous definition of cell type. This allows us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. The second work considers a novel approach for representing single-cell RNA-Seq data. We argue that gene or transcript expression vectors, while intuitive, are not the optimal way to represent single-cell genomic profiles. Rather than counting the number of reads that come from each transcript, which requires resolving the ambiguity associated with read multimapping, we decide to count the number of reads that come from each transcript set. We show that these new representations are both more computationally efficient to obtain and more information-rich. The third and perhaps most interesting work first observes a post-selection inference problem in standard single-cell computational pipelines. Standard pipelines perform differential analysis after clustering on the same dataset, and this reusing of the same dataset generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework which corrects for this problem. In summary, we discuss multiple works for drawing key insights from single-cell RNA-Seq data: a clustering method that emphasizes interpretability of results, a representation of single cells that retains more information from read data, and a framework for correcting the selection bias from standard analysis pipelines.
- Also online at
-
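A tiny simulation of the post-selection inference problem the third work above points out (synthetic data, not the thesis's corrected framework): cluster pure-noise data, then test for a difference between the discovered clusters on the same data, and the p-values come out spuriously small.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pvals = []
for _ in range(200):
    x = rng.normal(size=(100, 1))            # one "gene", no true groups
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)
    t = stats.ttest_ind(x[labels == 0, 0], x[labels == 1, 0])
    pvals.append(t.pvalue)

print("median naive p-value under the null:", float(np.median(pvals)))  # far below 0.05
```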
Online 12. Model interpretation and data valuation for machine learning [2021]
- Ghorbani, Amirata, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Machine learning is being applied in various critical applications like healthcare. In order to be able to trust a machine learning model and to repair it once it malfunctions, it is important to be able to interpret its decision-making. For example, if a model's performance is poor on a specific subgroup (gender, race, etc), it is important to find out why and fix it. In this thesis, we examine the drawbacks of existing interpretability methods and introduce new ML interpretability algorithms that are designed to tackle some of the shortcomings. Data is the labor that trains machine learning models. It is not possible to interpret an ML model's behavior without going back to the data that trained it in the first place. A fundamental challenge is how to quantify the contribution of each source of data to the model's performance. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what an equitable valuation for an individual datum would be. In this thesis, we discuss principled frameworks for equitable valuation of data; that is, given a learning algorithm and a performance metric that quantifies the performance of the resulting model, we try to find the contribution of each individual datum. This thesis is divided into three sections: machine learning interpretability and fairness, data valuation, and machine learning for healthcare - all linked by the common goal of making the use of machine learning more responsible for the benefit of human beings.
- Also online at
-
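A short Monte Carlo sketch of the data valuation idea described above (a simplified cousin of the Shapley-style valuation in the thesis, run on a synthetic task): a training point's value is its average marginal contribution to test accuracy over random orderings of the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=80, n_features=5, random_state=0)
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

def acc(idx):
    if len(set(y_tr[idx])) < 2:
        return 0.5                                        # degenerate subset -> chance level
    return LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)

rng = np.random.default_rng(0)
values, n_perm = np.zeros(len(X_tr)), 25
for _ in range(n_perm):
    order = rng.permutation(len(X_tr))
    prev = 0.5
    for k in range(1, len(order) + 1):
        cur = acc(order[:k])
        values[order[k - 1]] += (cur - prev) / n_perm     # marginal contribution
        prev = cur

print("indices of the five most valuable training points:", np.argsort(values)[-5:])
```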
Online 13. Information processing : of humans, by humans and for humans [2019]
- Fischer-Hwang, Irena Tammy, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
The core mission of information science is to design information processing tools that aid human communication and scientific discovery. Naturally, the development of such tools is shaped by concurrent understandings of human biology and behavior. In our 21st century world, myriad human-focused investigations have opened up rich potential markets for tool design. My research explores such intersections between human biology and information processing tool design, and the bidirectional influence between the two. In this dissertation, I detail three genres of information processing: of humans, by humans and for humans. For information processing of humans--that is, of humans' genomic sequencing data--I present two bioinformatics pipelines. To demonstrate information processing by humans, I describe a novel lossy image compression framework that is rooted in the human abilities to recognize, describe and generate images. Finally, I describe a science communication effort that employs computational tools to process information and produce journalistic media for human consumption. Together, these projects support the thesis of this dissertation: that information processing tools can be used to improve our understanding of human communication needs, and that an improved understanding of human communication can, in turn, be used to design better information processing tools.
- Also online at
-
Online 14. Causal aggregation of heterogeneous datasets and stabilization of feature selection procedures [2022]
- Roquero Gimenez, Jaime, author.
- [Stanford, California] : [Stanford University], 2022
- Description
- Book — 1 online resource
- Summary
-
Variable selection is increasingly becoming a key step in any data analysis pipeline. Identifying true relationships between a large number of covariates and a response of interest is a challenging problem that often requires strong modeling assumptions. In this thesis we develop new variable selection methodologies in two different directions. Our contributions build on the recently developed knockoff procedure as well as the causal invariance framework. We show how to aggregate data originating from heterogeneous datasets generated through different experimental settings to estimate causal effects and identify the relevant covariates in a causal sense. Our methodology efficiently uses all available information, and we propose an extension to high dimensions where the number of samples available per environment is small. Finally, we develop a Causal Boosting algorithm that efficiently recovers non-linear causal response functions from multiple datasets where different subsets of covariates are randomized. On a different topic, we propose some improvements on the knockoff procedure by extending the scope of datasets where such methodology can be applied. Going beyond Gaussian distributions, we propose a Bayesian Network knockoff sampling procedure that fits a much larger class of distributions. Also, we identify sources of instability in the procedure and devise an entropy based multi-knockoff procedure to mitigate the variability of the selected set of variables induced by the randomized nature of the procedure
- Also online at
-
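To make the knockoff machinery mentioned in the record above concrete, here is an illustrative equi-correlated Gaussian knockoff construction with a simple importance statistic on synthetic data (this is the classical model-X construction, not the Bayesian Network sampler or the entropy-based multi-knockoff procedure proposed in the thesis).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 500, 20, 0.25
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1) correlation
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p); beta[:3] = 1.0
y = X @ beta + rng.normal(size=n)

# Equi-correlated knockoffs: Xk = X (I - Sigma^{-1} diag(s)) + noise.
s = min(2 * np.linalg.eigvalsh(Sigma).min(), 1.0) * np.ones(p)
Sinv_S = np.linalg.solve(Sigma, np.diag(s))
cov = 2 * np.diag(s) - np.diag(s) @ Sinv_S
Xk = X - X @ Sinv_S + rng.multivariate_normal(np.zeros(p), (cov + cov.T) / 2, size=n)

# Importance statistic: gap in absolute correlation with y, original vs. knockoff.
W = np.abs(X.T @ y) - np.abs(Xk.T @ y)
print("top variables by W:", np.argsort(-W)[:5])   # the three true signals should lead
```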
Online 15. Two frameworks for reliable machine learning in biology and medicine [2021]
- Abid, Abubakar, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Machine learning models are being deployed to biological and clinical settings, including here at Stanford, e.g. to analyze ultrasounds automatically or map ancestries from genomics data. However, machine learning models suffer from issues of reliability: even models with good test performance often fail in unpredictable ways when deployed to real-world settings. In this thesis, I present two frameworks for more reliable machine learning: one for supervised learning and one for unsupervised learning. In the case of supervised learning, I present Gradio (www.gradio.app), an open-source Python framework for interactively testing models on real-world data. Gradio is being used to run the first real-time clinical trial of a machine learning model in the Stanford Department of Dermatology, and has been used to validate models at Google, Siemens, Amazon, Mercy Hospital, and Harvard. In the thesis, I describe the core questions that led to the development of Gradio, and showcase applications that demonstrate the usefulness of the framework. I further describe a novel explanation method that we have developed that allows debugging of faulty models with Gradio. On the unsupervised side, I introduce the framework of contrastive datasets, which provides a more reliable way to find patterns in unlabeled data. Our framework is quite general and has been adopted for purposes such as denoising images and mapping ancestries of admixed populations. Together, these frameworks provide a way to do more reliable unsupervised and supervised machine learning
- Also online at
-
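A minimal Gradio interface of the kind described in the record above: wrapping a prediction function so that collaborators can probe it on real-world inputs from a browser. The "model" here is a toy placeholder, not one of the cited clinical models.

```python
import gradio as gr

def classify(image):
    # Toy stand-in for a model: score by mean pixel intensity.
    score = float(image.mean()) / 255.0
    return {"class A (toy)": 1.0 - score, "class B (toy)": score}

demo = gr.Interface(fn=classify, inputs=gr.Image(), outputs=gr.Label())
demo.launch()   # serves a local (optionally shareable) web UI for interactive testing
```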
Online 16. Pharmacogenomics at scale : population analysis and machine learning applications in pharmacogenomics [2021]
- McInnes, Gregory Madden, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Pharmacogenomics promises the ability to provide personalized therapeutic guidance for patients based on their genetics. Understanding how genetic variation leads to heterogeneity in drug response could dramatically improve patient outcomes by increasing efficacy and minimizing adverse drug reactions. Pharmacogenomics research has historically suffered from small sample sizes, which limits our understanding of global allele frequencies, drug-gene associations, and the identification and functional assessment of rare variants. In recent years, biobanks containing phenotype-linked genetic data for hundreds of thousands of participants have become available, presenting an unprecedented opportunity for population-scale analysis of the effect of genetics on drug response. Simultaneously, the capabilities of deep learning algorithms have advanced significantly, enabling powerful predictions about properties of DNA sequence data. This dissertation illustrates how biobanks can be used to study pharmacogenetics and how deep neural networks can be used to predict metabolic function of haplotypes in pharmacogenes. I present results from the largest pharmacogenetic study to date, analyzing pharmacogenetic allele and phenotype frequencies in a cohort of 500,000 individuals. I demonstrate how these data can be used to discover drug-gene associations and drug-gene-response associations through genome-wide and candidate gene studies by integrating clinical records and prescription data. Finally, I present a novel deep learning approach to predicting haplotype function in an important drug-metabolizing enzyme, CYP2D6.
- Also online at
-
Online 17. Recommendations for algorithmic fairness assessments of predictive models in healthcare : evidence from large-scale empirical analyses [2021]
- Pfohl, Stephen Robert, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
The use of machine learning to develop predictive models that inform clinical decision making has the potential to introduce and exacerbate health inequity. A growing body of work has framed these issues as ones of algorithmic fairness, seeking to develop techniques to anticipate and proactively mitigate harms. The central aim of my work is to provide and justify practical recommendations for the development and evaluation of clinical predictive models in alignment with these principles. Using evidence derived from large-scale empirical studies, I demonstrate that, when it is assumed that the predicted outcome is not subject to differential measurement error across groups and threshold selection is unconstrained, approaches that aim to incorporate fairness considerations into the learning objective used for model development typically do not improve model performance or confer greater net benefit for any of the studied patient populations compared to standard learning paradigms. For evaluation in this setting, I advocate for the use of criteria that assess the calibration properties of predictive models across groups at clinically-relevant decision thresholds. To contextualize the interplay between measures of model performance, fairness, and benefit, I present a case study for models that estimate the ten-year risk of atherosclerotic cardiovascular disease to inform statin initiation. Finally, I caution that standard observational analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential for technical approaches to counteract those mechanisms, and argue for refocusing algorithmic fairness efforts in healthcare on participatory design, transparent model reporting, auditing, and reasoning about the impact of model-enabled interventions in context
- Also online at
-
Online 18. Development and deployment of machine learning in medicine [2023]
- He, Bryan Dawei, author.
- [Stanford, California] : [Stanford University], 2023
- Description
- Book — 1 online resource
- Summary
-
Recent advances in machine learning have enabled important applications in medicine, where many critical tasks are tedious and time-consuming for clinicians to perform. This dissertation presents work on using machine learning for cardiology, pathology, and RNA sequencing. This dissertation begins with several applications of machine learning in cardiology, focusing on echocardiograms, or ultrasounds of the heart. Conventional assessment of echocardiograms requires tedious annotation by a human expert. First, I introduce EchoNet-Dynamic, an algorithm for assessing cardiac function from echocardiograms. EchoNet-Dynamic is then integrated into a clinical system and evaluated with a blinded randomized clinical trial. Extensions of the algorithm to pediatric patients and emergency department point-of-care echocardiograms are then presented. This dissertation then presents work applying machine learning to pathology and RNA sequencing. First, I present in silico-IHC, which predicts immunohistochemical stains from commonly available histochemically-stained tissue samples. Next, I present ST-Net, which combines RNA sequencing and pathology by estimating spatial transcriptomics measurements from microscopy images. Finally, I present CloudPred, which predicts patient phenotypes from single-cell RNA sequencing data
- Also online at
-
Online 19. Transcribing real-valued sequences with deep neural networks [electronic resource] [2018]
- Hannun, Awni.
- 2018.
- Description
- Book — 1 online resource.
- Summary
-
Speech recognition and arrhythmia detection from electrocardiograms are examples of problems which can be formulated as transcribing real-valued sequences. These problems have traditionally been solved with frameworks like the Hidden Markov Model. To generalize well, these models rely on carefully hand engineered building blocks. More general, end-to-end neural networks capable of learning from much larger datasets can achieve lower error rates. However, getting these models to work well in practice has other challenges. In this work, we present end-to-end models for transcribing real-valued sequences and discuss several applications of these models. The first is detecting abnormal heart activity in electrocardiograms. The second is large vocabulary continuous speech recognition. Finally, we investigate the tasks of keyword spotting and voice activity detection. In all cases we show how to scale high capacity models to unprecedentedly large datasets. With these techniques we can achieve performance comparable to that of human experts for both arrhythmia detection and speech recognition and state-of-the-art error rates in speech recognition for multiple languages.
- Also online at
-
Special Collections
Special Collections | Status |
---|---|
University Archives | Request via Aeon |
3781 2018 H | In-library use |
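A small PyTorch sketch of the end-to-end transcription setup the record above describes: a recurrent network over a real-valued input sequence (audio frames or ECG samples) trained with CTC loss, so no hand-engineered frame alignment is needed. The sizes are illustrative, not those of the cited systems.

```python
import torch
import torch.nn as nn

n_classes, feat_dim = 30, 40                 # e.g. output characters plus blank; illustrative
rnn = nn.GRU(feat_dim, 128, batch_first=True, bidirectional=True)
head = nn.Linear(256, n_classes)
ctc = nn.CTCLoss(blank=0)

x = torch.randn(8, 200, feat_dim)            # batch of 200-frame real-valued inputs
targets = torch.randint(1, n_classes, (8, 20))
input_lens = torch.full((8,), 200, dtype=torch.long)
target_lens = torch.full((8,), 20, dtype=torch.long)

logits = head(rnn(x)[0])                     # (batch, time, classes)
log_probs = logits.log_softmax(-1).transpose(0, 1)   # CTC expects (time, batch, classes)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
print(loss.item())
```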
Online 20. Inference on the generalization error of machine learning algorithms and the design of hierarchical medical term embeddings [2023]
- Cai, Bryan, author.
- [Stanford, California] : [Stanford University], 2023.
- Description
- Book — 1 online resource.
- Summary
-
This dissertation comprises three papers that address important challenges in applying statistical and machine learning techniques in biomedical research, ranging from valid statistical inference on evaluating algorithm performance via general cross-validation, to a new embedding method for biomedical terms based on their hierarchical structure for better downstream applications, to improving model performance by balancing the prediction accuracy and the cost of collecting relevant prediction features. The first paper introduces a novel fast bootstrap method to estimate the standard error of cross-validation estimates. Cross-validation helps avoid the optimism bias in error estimates, which can be significant for models built using complex statistical learning algorithms. However, since the cross-validation estimate is a random value dependent on observed data, it is essential to accurately quantify the uncertainty associated with this estimate. This is especially important when comparing the performance of two models, as one must determine whether differences in error estimates are a result of chance fluctuations. Although various methods have been developed for making inferences on cross-validation estimates, they often have many limitations, such as stringent model assumptions or constraints on the form of the loss function. This paper proposes an accelerated bootstrap method that quickly estimates the standard error of the cross-validation estimate and produces valid confidence intervals for a population parameter measuring average model performance. Our method overcomes the computational challenge inherent in bootstrapping a cross-validation estimate by estimating the variance component via fitting a random effects model. To showcase the effectiveness of our approach, we employ comprehensive simulations and real data analysis across three diverse applications. The second paper presents a novel biomedical term representation model fine-tuned on hierarchical structures. Electronic health records contain narrative notes that provide extensive details on the medical condition and management of patients. Natural language processing of clinical notes can use observed frequencies of clinical terms as predictive features for various downstream applications such as clinical decision making and patient trajectory prediction. However, due to the vast number of highly similar and related clinical concepts, a more effective modeling strategy is to represent clinical terms as semantic embeddings via representation learning and use the low dimensional embeddings as more informative feature vectors for those applications. Fine-tuning pre-trained language models with biomedical knowledge graphs may generate better embeddings for biomedical terms than those from standard language models alone. These embeddings can effectively discriminate synonymous pairs from those that are unrelated. However, they often fail to capture different degrees of similarity or relatedness for concepts that are hierarchical in nature. To overcome this limitation, we propose HiPrBERT, a biomedical term representation model trained on additional data sources containing hierarchical structures for various biomedical terms. We modify existing contrastive loss functions to extract information from these hierarchies. Our numerical experiments demonstrate that HiPrBERT effectively learns the pair-wise distance from hierarchical information, resulting in substantially more informative embeddings for further biomedical applications. 
The third paper proposes a dynamic prediction rule for clinical decision-making, aiming to optimize the order of acquiring prediction features. Physicians today have access to a wide array of tests for diagnosing and prognosticating medical conditions. Ideally, they would apply a high-quality prediction model, utilizing all relevant features as input, to facilitate appropriate decision-making regarding treatment selection or risk assessment. However, not all features used in these prediction models are readily available without incurring some costs. In practice, predictors are typically gathered as needed in a sequential manner, while the physician dynamically evaluates this information. This process continues until sufficient information is acquired, and the physician gains reasonable confidence in making a decision. Importantly, the prospective information to collect may differ for each patient and depend on the predictor values already known. Our method aims to address these challenges, with the objective of maximizing the prediction accuracy while minimizing the costs associated with measuring prediction features for individual subjects. To achieve this, we employ a reinforcement learning algorithm, where the agent must decide on the best action at each step: either making a clinical decision with available information or continuing to collect new predictors based on the current state of knowledge. To evaluate the efficacy of the proposed dynamic prediction strategy, we've conducted extensive simulation studies. Additionally, we provide two real data examples to illustrate the practical application of our method.
- Also online at
-
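To illustrate the inference problem the first paper above addresses, here is a sketch of the naive (computationally heavy) approach that it accelerates: bootstrap the dataset, recompute the cross-validated error on each resample, and read off the standard error of the CV estimate. The model and data are synthetic stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

def cv_error(Xb, yb):
    scores = cross_val_score(Ridge(), Xb, yb, cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()

rng = np.random.default_rng(0)
boot = []
for _ in range(200):                                   # the expensive part the paper speeds up
    idx = rng.integers(0, len(y), len(y))              # resample rows with replacement
    boot.append(cv_error(X[idx], y[idx]))

print(f"CV error {cv_error(X, y):.2f}, bootstrap SE {np.std(boot, ddof=1):.2f}")
```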