Abstract
Infection with a virus can lead to a range of illnesses in humans, including cancer. When viruses infect a host, they may disrupt normal host function and cause deadly diseases. Understanding complicated viral illnesses requires novel viral genome prediction. Because many of the sequences in contigs assembled from human samples are not identical to known genomes, conventional alignment labels many assembled contigs as “unknown”. In this study, sequences from 19 metagenomic investigations were used to build the proposed model, and these sequences were examined and classified using BLAST. We implemented k-mer counting and the bag-of-words technique using CountVectorizer. In the last stage, binary classification of deoxyribonucleic acid (DNA) sequences into normal and viral genomes was performed using traditional machine learning (ML) classifiers. As far as we are aware, this work represents the first framework that combines natural language processing (NLP) with traditional ML classification approaches on raw metagenomic contigs to automatically identify viruses in a variety of human biospecimens. The proposed models are general rather than specialized to a particular viral family. Because the proposed methodology is precise and simple, it could be incorporated into computer-aided diagnosis (CAD) systems to make day-to-day hospital activities easier. Using the KNN classifier, the proposed model achieved 98.6% classification accuracy, 98.5% precision, 98.6% recall, an F1 score of 0.984, a Matthews correlation coefficient of 0.896, a kappa of 0.895, a classification success index of 0.97, and a detection rate of 98.6% for the prediction of viral genomes in DNA. Compared to previously developed ML techniques, the model achieved significantly greater performance for viral genome prediction.
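The pipeline named in the abstract (k-mer counting, a bag-of-words representation via CountVectorizer, and a KNN classifier) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the helper `kmer_sentence`, the k-mer length of 4, the single-neighbour setting, and the toy contigs are all assumptions made for the example and do not come from the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def kmer_sentence(sequence: str, k: int = 4) -> str:
    """Turn a DNA string into a space-separated 'sentence' of overlapping k-mers.

    Hypothetical helper; k=4 is an arbitrary choice for this sketch.
    """
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Made-up toy contigs (NOT data from the study): 1 = viral, 0 = non-viral.
contigs = [
    "ATGCGTATGCGTATGCGTAA",
    "ATGCGTATGCGAATGCGTAT",
    "GGCCTTAAGGCCTTAAGGCC",
    "GGCCTTAAGGCCTAAAGGCC",
]
labels = [1, 1, 0, 0]

# Each contig becomes a "document" whose "words" are its k-mers.
documents = [kmer_sentence(seq) for seq in contigs]

# CountVectorizer builds the bag-of-words matrix of k-mer counts;
# KNN then classifies contigs by their nearest neighbours in that space.
# n_neighbors=1 is chosen only because the toy data set is tiny.
model = make_pipeline(CountVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(documents, labels)

# Classify an unseen toy contig.
new_contig = "ATGCGTATGCGTATGCGTAT"
prediction = model.predict([kmer_sentence(new_contig)])
print("predicted label:", prediction[0])

# Two of the metrics reported in the abstract, computed here on the toy
# training data purely to show the calls; a real evaluation would use
# held-out contigs.
train_pred = model.predict(documents)
print("MCC:", matthews_corrcoef(labels, train_pred))
print("kappa:", cohen_kappa_score(labels, train_pred))
```

Treating each contig as a sentence of k-mers is what lets a text-oriented tool such as CountVectorizer be applied to DNA without alignment against a reference genome; the k-mer length and the number of neighbours would need to be tuned on real data.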
| Original language | English |
|---|---|
| Article number | 119641 |
| Pages (from-to) | 1-10 |
| Number of pages | 10 |
| Journal | Expert Systems with Applications |
| Volume | 218 |
| Early online date | 1 Feb 2023 |
| Publication status | Published - 15 May 2023 |
| Externally published | Yes |
Bibliographical note
Copyright the Author(s) 2023. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
Keywords
- Metagenome
- Machine learning
- Human DNA
- NLP
- K-mer counting
- Bag of words
Fingerprint
Dive into the research topics of 'Viral genome prediction from raw human DNA sequence samples by combining natural language processing and machine learning techniques'. Together they form a unique fingerprint.