VariantSpark: Applying Spark-based machine learning methods to genomic information

Aidan R. O'Brien, Denis C. Bauer*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Genomic information is increasingly being used for medical research, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. To meet this need, we developed VariantSpark, a framework for applying machine learning algorithms in MLlib to genomic variant data using the efficient in-memory Spark compute engine. We demonstrate a speedup compared to our earlier Hadoop-based implementation as well as a published Spark-based approach using the ADAM framework. Adapting emerging methodologies for fast, efficient analysis of large, diverse data volumes to process genomic information will be the cornerstone for precision genome medicine, which promises tailored healthcare for diagnosis and treatment.

Original language: English
Pages (from-to): 1-6
Number of pages: 6
Journal: CEUR Workshop Proceedings
Publication status: Published - 2015
Externally published: Yes
Event: Big Data in Health Analytics 2015 - Sydney, Australia
Duration: 20 Oct 2015 to 21 Oct 2015


  • Clustering
  • Genomics
  • Spark

