Unsupervised learning across multiple datasets [electronic resource]
- Katie Planey.
- Physical description
- 1 online resource.
- Planey, Katie.
- Gevaert, Olivier Michel Simonne, primary advisor.
- Musen, Mark A., advisor.
- Salzman, Julia, advisor.
- Stanford University. Program in Biomedical Informatics.
- Subtypes define distinctive subgroups of objects within a larger cohort; these subtypes can help domain experts define actionable recommendations for each subgroup to improve outcomes. With the relatively recent explosion of large datasets with large numbers of features, a popular way to define subtypes is to use unsupervised learning, or clustering, algorithms. Unfortunately, unsupervised learning algorithms have a serious drawback: there is no ground truth. While a set of clusters may correlate strongly with an outcome, or response, variable, no such variable is used in an unsupervised learning algorithm; this means that the accuracy of clusters derived from such algorithms, by nature, cannot be quantified. One way to ensure subtypes represent true signal is to conduct the clustering analysis on multiple datasets. However, there is a lack of methods for unsupervised learning across multiple datasets. In this dissertation, I propose novel methods for unsupervised clustering across multiple datasets that find a consensus across the clusters derived from each individual dataset. I propose an algorithm, COINCIDE, that encompasses these novel methods; COINCIDE interprets each cluster as a node in a network. I apply COINCIDE to cancer gene expression and pathology datasets, and finally to sepsis gene expression datasets, to illustrate its ability to conduct unsupervised learning across multiple datasets and discover robust subtypes.
- Publication date
- Submitted to the Program in Biomedical Informatics.
- Thesis (Ph.D.)--Stanford University, 2015.