Byte2vec: malware representation and feature selection for Android

Mahmood Yousefi-Azar, Len Hamey*, Vijay Varadharajan, Shiping Chen

*Corresponding author for this work

Research output: Contribution to journal › Article

Abstract

Malware detection based on static features and without code disassembling is a challenging path of research. Obfuscation makes the static analysis of malware even more challenging. This paper extends static malware detection beyond byte level n-grams and detecting important strings. We propose a model (Byte2vec) with the capabilities of both binary file feature representation and feature selection for malware detection. Byte2vec embeds the semantic similarity of byte level codes into a feature vector (byte vector) and also into a context vector. The learned feature vectors of Byte2vec, using skip-gram with negative-sampling topology, are combined with byte-level term-frequency (tf) for malware detection. We also show that the distance between a feature vector and its corresponding context vector provides a useful measure to rank features. The top ranked features are successfully used for malware detection. We show that this feature selection algorithm is an unsupervised version of mutual information (MI). We test the proposed scheme on four freely available Android malware datasets including one obfuscated malware dataset. The model is trained only on clean APKs. The results show that the model outperforms MI in a low-dimensional feature space and is competitive with MI and other state-of-the-art models in higher dimensions. In particular, our tests show very promising results on a wide range of obfuscated malware with a false negative rate of only 0.3% and a false positive rate of 2.0%. The detection results on obfuscated malware show the advantage of the unsupervised feature selection algorithm compared with the MI-based method.
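The two byte-level ingredients named in the abstract, term-frequency features and a ranking of features by the distance between each feature vector and its context vector, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `byte_tf` and `rank_by_vector_distance` are hypothetical, the Euclidean metric and the largest-distance-first ordering are assumptions (the abstract specifies neither), and the small hand-written vectors stand in for embeddings that a skip-gram model with negative sampling would learn.

```python
import math
from collections import Counter


def byte_tf(data: bytes) -> list:
    """Return a 256-dimensional term-frequency vector over byte values."""
    counts = Counter(data)          # iterating bytes yields ints in 0..255
    total = len(data) or 1          # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def rank_by_vector_distance(feature_vecs, context_vecs, largest_first=True):
    """Rank feature indices by the distance between each feature vector
    and its corresponding context vector.

    Assumption: Euclidean distance, and largest distance ranked first.
    The abstract does not state the metric or the ordering.
    """
    dists = [math.dist(f, c) for f, c in zip(feature_vecs, context_vecs)]
    return sorted(range(len(dists)), key=lambda i: dists[i],
                  reverse=largest_first)


# Term frequencies of a 4-byte toy input.
tf = byte_tf(b"\x00\x00\x01\xff")

# Hand-written stand-ins for learned feature and context vectors.
feature_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
context_vecs = [[0.0, 1.0], [1.0, 0.0], [3.0, 2.0]]
ranking = rank_by_vector_distance(feature_vecs, context_vecs)
```

In this toy input the three distances are 1, 0 and 3, so `ranking` is `[2, 0, 1]`. A detector along the lines the abstract describes would keep only the top-ranked features before classification; how many to keep is a tuning choice the abstract does not fix.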

Original language: English
Pages (from-to): 1125-1138
Number of pages: 14
Journal: Computer Journal
Volume: 63
Issue number: 8
Publication status: Published - Aug 2020

Keywords

  • malware detection
  • feature learning
  • unsupervised feature selection
  • byte-level feature
  • Byte2Vec
