Model selection procedures for high dimensional genomic data

Allan J. Motyer*, Sally Galbraith, Susan R. Wilson

*Corresponding author for this work

Research output: Contribution to journal › Conference paper › peer-review

Abstract

Many complex diseases are thought to be caused by multiple genetic variants. Recent advances in genotyping technology have allowed investigators of a complex disease to obtain data for a massive number of candidate genetic variants. Typically each candidate variant is tested individually for an association with the disease. We approach the problem as one of model selection for high dimensional data. We propose a method whereby penalised maximum likelihood estimation provides a reasonably sized set of variants for inclusion in our model. We then perform stepwise regression on this set of variants to arrive at our model. Penalised maximum likelihood estimation is performed with both the lasso and a more recently developed method known as the hyperlasso, with smoothing parameters chosen by cross-validation. Compared to the lasso, the hyperlasso has a penalty function that favours sparser solutions but applies less shrinkage to those variables that are included in the model; however, this comes at extra computational cost. We apply the above method to a large genomic data set from a previously published mouse obesity study and use resample model averaging to assess model performance.
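
For orientation, the penalised maximum likelihood step described in the abstract can be written in a generic form. The notation here is assumed for illustration and is not taken from the paper: \ell is the log-likelihood, \beta_1, ..., \beta_p are the coefficients of the p candidate variants, \lambda is the smoothing parameter, and p_\lambda is a generic penalty function.

```latex
% Generic penalised maximum likelihood estimate (assumed notation)
\hat{\beta}(\lambda)
  = \arg\max_{\beta} \Big\{ \ell(\beta) - \sum_{j=1}^{p} p_{\lambda}\big(|\beta_j|\big) \Big\}

% Lasso: the penalty is linear in the absolute coefficient
p_{\lambda}\big(|\beta_j|\big) = \lambda \, |\beta_j|
```

Under the lasso penalty, larger values of \lambda set more coefficients exactly to zero, which is what yields a reduced set of variants for the subsequent stepwise regression. The hyperlasso replaces the linear penalty with a different function; its exact form is not given in the abstract and is not reproduced here.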

Original language: English
Pages (from-to): C364-C378
Number of pages: 15
Journal: ANZIAM Journal
Volume: 52
Publication status: Published - 2010
Externally published: Yes
Event: Biennial Computational Techniques and Applications Conference (CTAC2010) (15th : 2010) - University of New South Wales, Sydney, Australia
Duration: 28 Nov 2010 → 1 Dec 2010
Conference number: 15th
