Integrative exploration of large high-dimensional datasets

Christopher Pardy, Sally Galbraith, Susan R. Wilson

Research output: Contribution to journal › Article › peer-review

9 Citations (Scopus)

Abstract

Large, high-dimensional datasets containing different types of variables are becoming increasingly common, and exploring such data calls for integrated methods. For example, a single genomic experiment can contain large quantities of different types of data (including clinical data), which makes it a challenge to coherently describe the patterns of variability within and between the inter-related datasets. Mutual information (MI) is a widely used information-theoretic dependency measure that can also identify nonlinear and nonmonotonic associations. First, we develop a computationally efficient implementation of MI between a discrete and a continuous variable. This implementation allows us to apply a coherent approach to all comparisons arising from continuous and categorical data. As commonly applied, MI can have high levels of bias, so we present a novel development of MI that reduces the bias, which we term bias corrected mutual information (BCMI). Further, BCMI is useful as an association measure that can be incorporated in subsequent analyses such as clustering and visualisation procedures. To demonstrate our approach, a genomic dataset is re-examined. This dataset contains single nucleotide polymorphisms (SNPs, a discrete variable), gene expression levels and clinical data (all continuous variables). Our approach allows us to integrate these different types of data by exploring associations both within and between these types of variables.
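As a rough illustration of the two ideas named in the abstract (MI between a discrete and a continuous variable, and reducing the bias of a naive estimate), the following is a minimal sketch only. It is not the authors' implementation and not the BCMI estimator from the paper. It assumes a simple plug-in estimate with equal-width binning of the continuous variable, and a generic permutation-based correction that subtracts the mean MI under shuffled labels. The function names `plugin_mi` and `permutation_corrected_mi`, and the parameters `bins`, `n_perm` and `seed`, are hypothetical.

```python
import numpy as np


def plugin_mi(labels, values, bins=8):
    """Plug-in MI estimate (in nats) between a discrete and a continuous variable.

    Assumption of this sketch: the continuous variable is discretised into
    equal-width bins; this is not the estimator used in the paper.
    """
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)

    # Bin the continuous variable; interior edges give bin indices 0..bins-1.
    edges = np.histogram_bin_edges(values, bins=bins)
    binned = np.digitize(values, edges[1:-1])

    # Map the discrete labels to integer codes 0..k-1.
    _, label_idx = np.unique(labels, return_inverse=True)

    # Joint probability table from the contingency counts.
    joint = np.zeros((label_idx.max() + 1, bins))
    np.add.at(joint, (label_idx, binned), 1)
    joint /= joint.sum()

    # Product of the marginals, then sum p * log(p / (px * py)) over non-empty cells.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    expected = px @ py
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / expected[nz])))


def permutation_corrected_mi(labels, values, bins=8, n_perm=200, seed=0):
    """Observed plug-in MI minus its mean under label permutation.

    A generic bias-reduction heuristic, labelled here as an assumption;
    it is not the BCMI definition from the paper.
    """
    rng = np.random.default_rng(seed)
    observed = plugin_mi(labels, values, bins)
    null = [plugin_mi(rng.permutation(labels), values, bins) for _ in range(n_perm)]
    return observed - float(np.mean(null))
```

In a setting like the one described, `labels` might hold SNP genotypes and `values` an expression level for the same samples. Under this sketch a corrected value near zero suggests no detectable association, while the uncorrected plug-in estimate is typically positive even for independent variables.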

Original language: English
Pages (from-to): 178-199
Number of pages: 22
Journal: Annals of Applied Statistics
Volume: 12
Issue number: 1
Publication status: Published - Mar 2018
Externally published: Yes

Keywords

  • Categorical
  • Continuous
  • Data integration
  • Exploration
  • Mixed-types of variables
  • Mutual information
