Gene-disease association with literature based enrichment

Guy Tsafnat, Dennis Jasch, Agam Misra, Miew Keen Choong, Frank P Y Lin, Enrico Coiera

Research output: Contribution to journalArticleResearchpeer-review

Abstract

Motivation: Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality. Methods: Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated. 1. By using Entrez Gene to associate all recorded human genes with PubMed article IDs. 2. Using the genes mentioned in each PubMed article and associating each with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC was then annotated using CMeSH and CUMLS and for comparison, also with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database). Ranking was done using, a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring area under the receiver operating characteristic curve (AUC). Results: Comparison of ranking using the literature enrichment catalogs, the pre-existing catalogs as well as five randomly generated catalogs show the literature derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing. Discussion: Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date. Conclusion: Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs. Availability: The literature based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.

LanguageEnglish
Pages221-226
Number of pages6
JournalJournal of Biomedical Informatics
Volume49
DOIs
Publication statusPublished - 2014
Externally publishedYes

Fingerprint

Genes
Association reactions
Area Under Curve
Colorectal Neoplasms
Gene expression
Breast Neoplasms
Microarrays
Gene Expression
PubMed
Unified Medical Language System
Chemical Databases
Literature
ROC Curve
Biomedical Research
Availability

Cite this

Tsafnat, Guy ; Jasch, Dennis ; Misra, Agam ; Choong, Miew Keen ; Lin, Frank P Y ; Coiera, Enrico. / Gene-disease association with literature based enrichment. In: Journal of Biomedical Informatics. 2014 ; Vol. 49. pp. 221-226.
@article{32c8e15ea9f34894aee08f47821bdd51,
title = "Gene-disease association with literature based enrichment",
abstract = "Motivation: Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality. Methods: Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated. 1. By using Entrez Gene to associate all recorded human genes with PubMed article IDs. 2. Using the genes mentioned in each PubMed article and associating each with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC was then annotated using CMeSH and CUMLS and for comparison, also with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database). Ranking was done using, a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring area under the receiver operating characteristic curve (AUC). Results: Comparison of ranking using the literature enrichment catalogs, the pre-existing catalogs as well as five randomly generated catalogs show the literature derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing. Discussion: Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date. Conclusion: Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs. Availability: The literature based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.",
author = "Guy Tsafnat and Dennis Jasch and Agam Misra and Choong, {Miew Keen} and Lin, {Frank P Y} and Enrico Coiera",
year = "2014",
doi = "10.1016/j.jbi.2014.03.007",
language = "English",
volume = "49",
pages = "221--226",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",

}

Gene-disease association with literature based enrichment. / Tsafnat, Guy; Jasch, Dennis; Misra, Agam; Choong, Miew Keen; Lin, Frank P Y; Coiera, Enrico.

In: Journal of Biomedical Informatics, Vol. 49, 2014, p. 221-226.

Research output: Contribution to journalArticleResearchpeer-review

TY - JOUR

T1 - Gene-disease association with literature based enrichment

AU - Tsafnat, Guy

AU - Jasch, Dennis

AU - Misra, Agam

AU - Choong, Miew Keen

AU - Lin, Frank P Y

AU - Coiera, Enrico

PY - 2014

Y1 - 2014

N2 - Motivation: Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality. Methods: Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated. 1. By using Entrez Gene to associate all recorded human genes with PubMed article IDs. 2. Using the genes mentioned in each PubMed article and associating each with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC was then annotated using CMeSH and CUMLS and for comparison, also with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database). Ranking was done using, a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring area under the receiver operating characteristic curve (AUC). Results: Comparison of ranking using the literature enrichment catalogs, the pre-existing catalogs as well as five randomly generated catalogs show the literature derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing. Discussion: Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date. Conclusion: Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs. Availability: The literature based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.

AB - Motivation: Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality. Methods: Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated. 1. By using Entrez Gene to associate all recorded human genes with PubMed article IDs. 2. Using the genes mentioned in each PubMed article and associating each with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC was then annotated using CMeSH and CUMLS and for comparison, also with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database). Ranking was done using, a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring area under the receiver operating characteristic curve (AUC). Results: Comparison of ranking using the literature enrichment catalogs, the pre-existing catalogs as well as five randomly generated catalogs show the literature derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing. Discussion: Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date. Conclusion: Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs. Availability: The literature based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.

UR - http://www.scopus.com/inward/record.url?scp=84902543256&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2014.03.007

DO - 10.1016/j.jbi.2014.03.007

M3 - Article

VL - 49

SP - 221

EP - 226

JO - Journal of Biomedical Informatics

T2 - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

ER -