
Viral genome prediction from raw human DNA sequence samples by combining natural language processing and machine learning techniques

Mohammad H. Alshayeji, Silpa ChandraBhasi Sindhu, Sa'ed Abed

Research output: Contribution to journal, Article, peer-review

Abstract

Viral infection can lead to a range of illnesses in humans, including cancer. When viruses infect a host, they may disrupt normal host function and cause deadly diseases. Understanding complicated viral illnesses requires novel viral genome prediction. Because many sequences in contigs assembled from human samples are not identical to known genomes, conventional alignment labels many of these contigs "unknown". In this study, sequences from 19 metagenomic investigations, examined and classified using BLAST, were used to create the proposed model. We implemented k-mer counting and the bag-of-words technique using CountVectorizer. In the last stage, binary classification of deoxyribonucleic acid (DNA) into normal and viral genomes was performed using traditional machine learning (ML) classifiers. As far as we are aware, this work represents the first framework that combines natural language processing (NLP) with traditional ML classification approaches on raw metagenomic contigs to automatically identify viruses in a variety of human biospecimens. The proposed models are general rather than specialized to a particular viral family. Because the proposed methodology is precise and simple, it could be incorporated into computer-aided diagnosis (CAD) systems to make day-to-day hospital activities easier. With the k-nearest neighbors (KNN) classifier, the proposed model achieved 98.6% classification accuracy, 98.5% precision, 98.6% recall, an F1 score of 0.984, a Matthews correlation coefficient of 0.896, a kappa of 0.895, a classification success index of 0.97, and a detection rate of 98.6% for the prediction of viral genomes in DNA. Compared to previously developed ML techniques, the model achieved significantly higher performance for viral genome prediction.
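
The pipeline named in the abstract (k-mer counting, a bag-of-words representation, then a KNN classifier) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: the abstract says they used CountVectorizer, whereas the hypothetical helpers below (`kmers`, `bag_of_words`, `vectorize`, `knn_predict`) stand in for it using only the standard library. The k-mer length, the Euclidean distance, the number of neighbors, and the toy sequences and labels are all assumptions, since the abstract does not state them.

```python
from collections import Counter
import math


def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, treated as 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(seqs, k=3):
    """Build a sorted k-mer vocabulary and one count vector per sequence."""
    docs = [Counter(kmers(s, k)) for s in seqs]
    vocab = sorted(set().union(*docs))
    return vocab, [[d[w] for w in vocab] for d in docs]


def vectorize(seq, vocab, k=3):
    """Count vector for a new sequence; k-mers outside the vocabulary are ignored."""
    counts = Counter(kmers(seq, k))
    return [counts[w] for w in vocab]


def knn_predict(train_X, train_y, x, n_neighbors=3):
    """Majority vote among the n_neighbors closest training vectors (Euclidean)."""
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration (0 = non-viral, 1 = viral).
train_seqs = ["ATATATATATAT", "TATATATATATA", "GCGCGCGCGCGC", "CGCGCGCGCGCG"]
train_labels = [0, 0, 1, 1]

vocab, train_X = bag_of_words(train_seqs, k=3)
query = vectorize("GCGCGCGC", vocab, k=3)
prediction = knn_predict(train_X, train_labels, query, n_neighbors=3)
print(vocab)       # ['ATA', 'CGC', 'GCG', 'TAT']
print(prediction)  # 1
```

With scikit-learn available, the two counting helpers would typically be replaced by a character-level n-gram vectorizer and the voting function by a library KNN classifier; the sketch only makes the data flow explicit.
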
Original language: English
Article number: 119641
Pages (from-to): 1-10
Number of pages: 10
Journal: Expert Systems with Applications
Volume: 218
Early online date: 1 Feb 2023
Publication status: Published - 15 May 2023
Externally published: Yes

Bibliographical note

Copyright the Author(s) 2023. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Metagenome
  • Machine learning
  • Human DNA
  • NLP
  • K-mer counting
  • Bag of words
