Efficient permutation P-value estimation for gene set tests [electronic resource]
- Responsibility
- Yu He.
- Imprint
- 2016.
- Physical description
- 1 online resource.
At the library
Special Collections
Limited on-site access
Researchers in the Stanford community can request to view these materials in the Special Collections Reading Room. Entry to the Reading Room is by appointment only.
| Call number | Status |
|---|---|
| 3781 2016 H | In-library use |
Description
Creators/Contributors
- Author/Creator
- He, Yu.
- Contributor
- Owen, Art B. primary advisor.
- Hastie, Trevor advisor.
- Wong, Wing Hung advisor.
- Stanford University. Department of Statistics.
Contents/Summary
- Summary
- In a genome-wide expression study, gene set testing is often used to find gene sets that correlate with a treatment (disease, drug, phenotype, etc.). A gene set may contain tens to thousands of genes, and genes within a gene set are generally correlated. Permutation tests are the standard approach to obtaining p-values for these gene set tests. Plain Monte Carlo methods that generate random permutations can be computationally infeasible for small p-values. Ackermann and Strimmer (2009) find two families of test statistics that achieve the best overall performance: a linear family and a quadratic family. This dissertation first reviews the relevant background of gene set testing and permutation tests, and then provides three alternative approaches to estimate small permutation p-values efficiently. The first approach focuses on the linear statistic. Because the p-value can be written as the proportion of points lying in a spherical cap, it is approximated by the volume of that cap. Error estimates can be derived from a generalized Stolarsky invariance principle, and alternative probabilistic proofs are provided. The second approach focuses on the quadratic statistic. Importance sampling is used to estimate the area of the (continuous) significant region on the sphere, and the volume of the region is used as an approximation for the (discrete proportion) p-value. Different proposal distributions are studied and compared. The third approach estimates the p-value with nested sampling. It may work for both the linear and the quadratic statistic. Similar ideas can be found in literature spanning combinatorics, sequential Monte Carlo, Bayesian computation, rare event estimation, and network reliability, and bear different names, e.g. approximate counting, nested sampling, subset simulation, and multilevel splitting. We give a thorough review of the literature in these different areas, and apply the technique to gene set testing with the quadratic test statistic. Finally, we compare the proposed methods with plain Monte Carlo and saddlepoint approximation on three expression studies in Parkinson's Disease patients. This work was supported by the US National Science Foundation under grant DMS-1521145.
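As a rough illustration of the first approach described in the summary, the sketch below compares a plain Monte Carlo permutation p-value with a spherical cap approximation for a linear statistic. It is a minimal sketch under stated assumptions and not the dissertation's implementation: the gene set is reduced to a single hypothetical summed expression profile `x`, the statistic is the inner product of the centered, unit-norm vectors `x0` and `y0`, and the function names, toy data, and midpoint-rule integration are all choices made here for illustration. Centered unit vectors of length n lie on the unit sphere of an (n-1)-dimensional subspace, so the cap fraction is computed on that sphere (this assumes n >= 4).

```python
import math
import random


def center_and_normalize(v):
    """Subtract the mean and scale to unit Euclidean norm."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    norm = math.sqrt(sum(a * a for a in centered))
    return [a / norm for a in centered]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def plain_mc_pvalue(x0, y0, n_perm, rng):
    """One-sided plain Monte Carlo permutation p-value for the statistic x0 . y0."""
    observed = dot(x0, y0)
    y_perm = list(y0)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # random relabelling of the samples
        if dot(x0, y_perm) >= observed:
            hits += 1
    # Count the observed labelling itself so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def cap_fraction(rho, n, n_grid=20000):
    """Fraction of the unit sphere in R^(n-1) whose first coordinate is >= rho.

    The first coordinate of a uniform point on that sphere has density
    proportional to (1 - t^2)^((n - 4) / 2); we integrate it with a
    midpoint rule (an illustrative choice, assuming n >= 4).
    """
    power = (n - 4) / 2.0

    def integral(lo, hi):
        h = (hi - lo) / n_grid
        return h * sum(
            (1.0 - (lo + (i + 0.5) * h) ** 2) ** power for i in range(n_grid)
        )

    return integral(rho, 1.0) / integral(-1.0, 1.0)


# Toy data (hypothetical): 6 treated and 6 control samples.
rng = random.Random(0)
y = [1.0] * 6 + [0.0] * 6
x = [rng.gauss(0.8 * t, 1.0) for t in y]

x0, y0 = center_and_normalize(x), center_and_normalize(y)
rho = dot(x0, y0)
p_mc = plain_mc_pvalue(x0, y0, n_perm=2000, rng=rng)
p_cap = cap_fraction(rho, n=len(y))
print(f"rho={rho:.3f}  plain MC p={p_mc:.4f}  cap approximation p={p_cap:.4f}")
```

The plain Monte Carlo estimate cannot go below 1/(n_perm + 1), which is why very small p-values require very many permutations; the cap fraction has no such floor, at the price of an approximation error that the dissertation studies via the generalized Stolarsky invariance principle.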
Bibliographic information
- Publication date
- 2016
- Note
- Submitted to the Department of Statistics.
- Note
- Thesis (Ph.D.)--Stanford University, 2016.