Learning latent byte-level feature representation for malware detection

Mahmood Yousefi-Azar*, Len Hamey, Vijay Varadharajan, Shiping Chen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

3 Citations (Scopus)


This paper proposes two different byte-level feature representations of binary files for malware detection. The proposed static feature representations do not need any third-party tools and are independent of the operating system because they operate on the raw file bytes. Sparse term-frequency simhashing (s-tf-simhashing) is a faster variant of tf-simhashing: it requires less computation and outperforms the original dense tf-simhashing. The binary word2vec (Bword2vec) representation embeds the semantic relationships of the n-grams into the code vectors. Bword2vec employs a binary-to-word2vec representation that yields a lower-dimensional feature space than s-tf-simhashing, thus further reducing the computation of the classifier. We show that the proposed techniques can successfully be used for analyzing both full malware apps and infected files. The experiments are conducted on real Android and PDF malware datasets.
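To make the idea of a term-frequency simhash over raw bytes concrete, the following is a minimal sketch of a generic tf-weighted simhash on byte n-grams. It is not the paper's s-tf-simhashing or Bword2vec; the function names (`byte_ngrams`, `tf_simhash`, `fingerprint`), the choice of BLAKE2b as the hash, and the defaults `n=2` and `bits=64` are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def byte_ngrams(data, n=2):
    """Yield every overlapping n-gram of the raw byte string."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def tf_simhash(data, n=2, bits=64):
    """Return a list of `bits` signed sums, one per hash bit.

    Each distinct n-gram is hashed to `bits` bits; a 1 bit adds the
    n-gram's term frequency to that position and a 0 bit subtracts it.
    Assumption: `bits` is a multiple of 8 and at most 512, which is the
    largest digest BLAKE2b can produce.
    """
    counts = Counter(byte_ngrams(data, n))
    acc = [0] * bits
    for gram, tf in counts.items():
        digest = hashlib.blake2b(gram, digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for j in range(bits):
            acc[j] += tf if (h >> j) & 1 else -tf
    return acc


def fingerprint(acc):
    """Collapse the signed sums into a 0/1 vector (classic simhash bits)."""
    return [1 if v > 0 else 0 for v in acc]
```

A file would be read with `open(path, "rb").read()` and passed to `tf_simhash`; either the signed sums or the 0/1 fingerprint could then serve as a fixed-length input to a classifier. Because only raw bytes are consumed, no file-format parser or operating-system-specific tool is involved, which matches the independence property claimed in the abstract.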

Original language: English
Title of host publication: Neural Information Processing
Subtitle of host publication: 25th International Conference, ICONIP 2018, Proceedings, Part IV
Editors: Long Cheng, Andrew Chi Sing Leung, Seiichi Ozawa
Place of Publication: Switzerland
Publisher: Springer-VDI-Verlag GmbH & Co. KG
Number of pages: 11
ISBN (Electronic): 9783030042127
ISBN (Print): 9783030042110
Publication status: Published - 16 Dec 2018
Event: 25th International Conference on Neural Information Processing, ICONIP 2018 - Siem Reap, Cambodia
Duration: 13 Dec 2018 → 16 Dec 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11304 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 25th International Conference on Neural Information Processing, ICONIP 2018
City: Siem Reap


  • Binary Word2vec
  • Binary-level feature representation
  • Malware detection
  • Sparse term-frequency simhashing
