Prediction of novel mouse TLR9 agonists using a random forest approach

Varun Khanna, Lei Li, Johnson Fung, Shoba Ranganathan, Nikolai Petrovsky

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Background: Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODNs) containing unmethylated cytosine-guanine (CpG) motifs. Because ODNs contain a considerable number of rotatable bonds, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening approaches is challenging. In the current study, we present a machine-learning-based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including the count and position of motifs, the distance between motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house, experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling.
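
The motif-based features mentioned above (count, position, and distance between motifs) can be illustrated with a minimal Python sketch. The function names, the choice of 'CG' as the default motif, the feature names, and the example sequence are all assumptions made for illustration; the abstract does not specify how the authors computed these features, and the graphically derived features (radius of gyration, moment of inertia) are not covered here.

```python
from typing import Dict, List


def motif_positions(seq: str, motif: str) -> List[int]:
    """Return 0-based start indices of every occurrence, overlaps included."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)  # step by 1 so overlaps are found
    return positions


def motif_features(seq: str, motif: str = "CG") -> Dict[str, float]:
    """Hypothetical feature set: motif count, first position, mean spacing."""
    pos = motif_positions(seq.upper(), motif)
    gaps = [b - a for a, b in zip(pos, pos[1:])]  # distance between neighbours
    return {
        "count": len(pos),
        "first_pos": pos[0] if pos else -1,  # -1 marks "motif absent"
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Arbitrary illustrative sequence, not taken from the study's dataset.
features = motif_features("TCCATGACGTTCCTGACGTT")
print(features)  # {'count': 2, 'first_pos': 7, 'mean_gap': 9.0}
```

Each ODN would yield one such feature dictionary per motif of interest, and the dictionaries could then be flattened into a fixed-length vector for a classifier.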

Results: Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed the other algorithms on our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including 'CC', 'GG', 'AG', 'CCCG' and 'CGGC' were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked, and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay; 91 of the 100 selected ODNs showed high activity, confirming the accuracy of the model in predicting mTLR9 activity.
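
For readers unfamiliar with the two metrics reported above, the following sketch computes the Matthews correlation coefficient and balanced accuracy from binary labels using their standard definitions. The helper names and the toy label vectors are assumptions for illustration only and do not reproduce the study's data or evaluation code.

```python
import math


def confusion(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 if the denominator is 0."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (1 = active, 0 = inactive), invented for this example.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.75
print(mcc(y_true, y_pred))                # 0.5
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is presumably why they are the ones reported for an imbalanced dataset.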

Conclusion: We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
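
The repeated random down-sampling idea can be sketched as follows, assuming a binary label vector: each of 20 rounds draws a balanced subset by sampling every class down to the size of the smallest one, and one model would then be trained per subset. The function names, the fixed seed, the majority-vote combination rule, and the toy class sizes are assumptions; the abstract does not state how the 20 random forest models were combined, and the model-training step itself is omitted here.

```python
import random
from collections import Counter


def balanced_subsets(labels, n_models=20, seed=0):
    """Return n_models index lists, each holding equal numbers of every class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())  # minority class size
    subsets = []
    for _ in range(n_models):
        subset = []
        for idx in by_class.values():
            # Sample without replacement; the minority class is kept whole.
            subset.extend(rng.sample(idx, n_min))
        subsets.append(sorted(subset))
    return subsets


def majority_vote(predictions):
    """Combine per-model 0/1 predictions for one sample; ties go to class 0."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0


# Toy imbalanced labels (10 of one class, 90 of the other), not the study's data.
labels = [1] * 10 + [0] * 90
subsets = balanced_subsets(labels)
print(len(subsets), Counter(labels[i] for i in subsets[0]))
```

Because every subset reuses all minority-class samples but a different random draw of the majority class, the ensemble as a whole sees most of the majority data while each individual model trains on balanced classes.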

Language: English
Article number: 56
Pages: 1-14
Number of pages: 14
Journal: BMC Molecular and Cell Biology
Volume: 20
Issue number: Suppl 2
DOI: 10.1186/s12860-019-0241-0
Publication status: Published - 20 Dec 2019

Fingerprint

Toll-Like Receptor 9
Single-Stranded DNA
Cytosine
Guanine
Discriminant Analysis
Oligonucleotides
Computer Simulation
Communicable Diseases
Immune System
Learning
Machine Learning
Datasets
Neoplasms
Forests
Recognition (Psychology)
Support Vector Machine

Bibliographical note

Copyright the Author(s) 2019. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • CpG
  • Imbalanced data
  • Machine learning
  • Oligonucleotides
  • Random Forest
  • Toll-like receptor 9

Cite this

Khanna, Varun ; Li, Lei ; Fung, Johnson ; Ranganathan, Shoba ; Petrovsky, Nikolai. / Prediction of novel mouse TLR9 agonists using a random forest approach. In: BMC Molecular and Cell Biology. 2019 ; Vol. 20, No. Suppl 2. pp. 1-14.
@article{49ce0b8a6f344e27a1fc4ae3e75cba8f,
title = "Prediction of novel mouse TLR9 agonists using a random forest approach",
keywords = "CpG, Imbalanced data, Machine learning, Oligonucleotides, Random Forest, Toll-like receptor 9",
author = "Varun Khanna and Lei Li and Johnson Fung and Shoba Ranganathan and Nikolai Petrovsky",
note = "Copyright the Author(s) 2019. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.",
year = "2019",
month = "12",
day = "20",
doi = "10.1186/s12860-019-0241-0",
language = "English",
volume = "20",
pages = "1--14",
journal = "BMC Molecular and Cell Biology",
issn = "2661-8850",
publisher = "Springer, Springer Nature",
number = "Suppl 2",

}

Prediction of novel mouse TLR9 agonists using a random forest approach. / Khanna, Varun; Li, Lei; Fung, Johnson; Ranganathan, Shoba; Petrovsky, Nikolai.

In: BMC Molecular and Cell Biology, Vol. 20, No. Suppl 2, 56, 20.12.2019, p. 1-14.

TY - JOUR

T1 - Prediction of novel mouse TLR9 agonists using a random forest approach

AU - Khanna, Varun

AU - Li, Lei

AU - Fung, Johnson

AU - Ranganathan, Shoba

AU - Petrovsky, Nikolai

N1 - Copyright the Author(s) 2019. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

PY - 2019/12/20

Y1 - 2019/12/20

KW - CpG

KW - Imbalanced data

KW - Machine learning

KW - Oligonucleotides

KW - Random Forest

KW - Toll-like receptor 9

UR - http://www.scopus.com/inward/record.url?scp=85076944701&partnerID=8YFLogxK

U2 - 10.1186/s12860-019-0241-0

DO - 10.1186/s12860-019-0241-0

M3 - Article

VL - 20

SP - 1

EP - 14

JO - BMC Molecular and Cell Biology

T2 - BMC Molecular and Cell Biology

JF - BMC Molecular and Cell Biology

SN - 2661-8850

IS - Suppl 2

M1 - 56

ER -