Learning latent byte-level feature representation for malware detection

Mahmood Yousefi-Azar, Len Hamey, Vijay Varadharajan, Shiping Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > Research > peer-review

Abstract

This paper proposes two byte-level feature representations of binary files for malware detection. The proposed static feature representations do not need any third-party tools and are independent of the operating system because they operate on the raw file bytes. Sparse term-frequency simhashing (s-tf-simhashing) is a faster variant of tf-simhashing: it requires less computation and outperforms the original dense tf-simhashing. The binary word2vec (Bword2vec) representation embeds the semantic relationships of the n-grams into the code vectors. Bword2vec employs a binary-to-word2vec representation that yields a lower-dimensional feature space than s-tf-simhashing, further reducing the computation of the classifier. We show that the proposed techniques can be used successfully to analyze both full malware apps and infected files. The experiments are conducted on real Android and PDF malware datasets.
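The abstract does not spell out how the hashing is computed, so the following is only a minimal sketch of the general idea: count byte n-grams in the raw file and fold their term frequencies into a fixed-length vector through a hash. The function names, the BLAKE2b hash, the n-gram length, the vector size and, in particular, the way sparsity is introduced (each n-gram updates only a few coordinates) are assumptions made for illustration; they are not taken from the paper.

```python
import hashlib
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams (the 'terms') in a raw file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def _digest(gram: bytes, size: int) -> bytes:
    """Deterministic hash of an n-gram, `size` bytes long (1 <= size <= 64)."""
    return hashlib.blake2b(gram, digest_size=size).digest()


def dense_tf_simhash(data: bytes, n: int = 2, dim: int = 64) -> list:
    """Dense variant: every n-gram adds +tf or -tf to every coordinate.

    Assumes `dim` is a multiple of 8 and at most 512 (BLAKE2b digest limit).
    """
    vec = [0] * dim
    for gram, tf in byte_ngrams(data, n).items():
        h = int.from_bytes(_digest(gram, dim // 8), "big")
        for j in range(dim):
            vec[j] += tf if (h >> j) & 1 else -tf
    return vec


def sparse_tf_simhash(data: bytes, n: int = 2, dim: int = 64,
                      nonzeros: int = 4) -> list:
    """Assumed sparse variant: each n-gram updates only `nonzeros` coordinates.

    Assumes `dim` <= 256 so that one hash byte can select a coordinate.
    """
    vec = [0] * dim
    for gram, tf in byte_ngrams(data, n).items():
        d = _digest(gram, 2 * nonzeros)
        for k in range(nonzeros):
            j = d[2 * k] % dim                      # coordinate from one hash byte
            sign = 1 if d[2 * k + 1] & 1 else -1    # sign from the next hash byte
            vec[j] += sign * tf
    return vec


if __name__ == "__main__":
    sample = b"%PDF-1.4 illustrative bytes only"
    print(dense_tf_simhash(sample)[:8])
    print(sparse_tf_simhash(sample)[:8])
```

In this sketch the dense variant performs `dim` updates per distinct n-gram and the sparse variant performs `nonzeros`, which is one way to picture how a sparse scheme could need less computation. Either vector would then be passed to a classifier.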

Language: English
Title of host publication: Neural Information Processing
Subtitle of host publication: 25th International Conference, ICONIP 2018, Proceedings, Part IV
Editors: Long Cheng, Andrew Chi Sing Leung, Seiichi Ozawa
Place of Publication: Switzerland
Publisher: Springer-VDI-Verlag GmbH & Co. KG
Pages: 568-578
Number of pages: 11
ISBN (Electronic): 9783030042127
ISBN (Print): 9783030042110
DOIs: https://doi.org/10.1007/978-3-030-04212-7_50
Publication status: Published - 16 Dec 2018
Event: 25th International Conference on Neural Information Processing, ICONIP 2018 - Siem Reap, Cambodia
Duration: 13 Dec 2018 to 16 Dec 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11304 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 25th International Conference on Neural Information Processing, ICONIP 2018
Country: Cambodia
City: Siem Reap
Period: 13/12/18 to 16/12/18

Fingerprint

Malware
Binary
Application programs
N-gram
Classifiers
Term
Feature Space
Semantics
Operating Systems
Classifier
Learning
Experiments
Experiment

Keywords

  • Binary Word2vec
  • Binary-level feature representation
  • Malware detection
  • Sparse term-frequency simhashing

Cite this

Yousefi-Azar, M., Hamey, L., Varadharajan, V., & Chen, S. (2018). Learning latent byte-level feature representation for malware detection. In L. Cheng, A. C. S. Leung, & S. Ozawa (Eds.), Neural Information Processing: 25th International Conference, ICONIP 2018, Proceedings, Part IV (pp. 568-578). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11304 LNCS). Switzerland: Springer-VDI-Verlag GmbH & Co. KG. https://doi.org/10.1007/978-3-030-04212-7_50
Yousefi-Azar, Mahmood ; Hamey, Len ; Varadharajan, Vijay ; Chen, Shiping. / Learning latent byte-level feature representation for malware detection. Neural Information Processing: 25th International Conference, ICONIP 2018, Proceedings, Part IV. editor / Long Cheng ; Andrew Chi Sing Leung ; Seiichi Ozawa. Switzerland : Springer-VDI-Verlag GmbH & Co. KG, 2018. pp. 568-578 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{f4c05c4c036845aa901b48b30bd18825,
title = "Learning latent byte-level feature representation for malware detection",
abstract = "This paper proposes two byte-level feature representations of binary files for malware detection. The proposed static feature representations do not need any third-party tools and are independent of the operating system because they operate on the raw file bytes. Sparse term-frequency simhashing (s-tf-simhashing) is a faster variant of tf-simhashing: it requires less computation and outperforms the original dense tf-simhashing. The binary word2vec (Bword2vec) representation embeds the semantic relationships of the n-grams into the code vectors. Bword2vec employs a binary-to-word2vec representation that yields a lower-dimensional feature space than s-tf-simhashing, further reducing the computation of the classifier. We show that the proposed techniques can be used successfully to analyze both full malware apps and infected files. The experiments are conducted on real Android and PDF malware datasets.",
keywords = "Binary Word2vec, Binary-level feature representation, Malware detection, Sparse term-frequency simhashing",
author = "Mahmood Yousefi-Azar and Len Hamey and Vijay Varadharajan and Shiping Chen",
year = "2018",
month = "12",
day = "16",
doi = "10.1007/978-3-030-04212-7_50",
language = "English",
isbn = "9783030042110",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer-VDI-Verlag GmbH & Co. KG",
pages = "568--578",
editor = "Long Cheng and Leung, {Andrew Chi Sing} and Seiichi Ozawa",
booktitle = "Neural Information Processing",
address = "Switzerland",

}

Yousefi-Azar, M, Hamey, L, Varadharajan, V & Chen, S 2018, Learning latent byte-level feature representation for malware detection. in L Cheng, ACS Leung & S Ozawa (eds), Neural Information Processing: 25th International Conference, ICONIP 2018, Proceedings, Part IV. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11304 LNCS, Springer-VDI-Verlag GmbH & Co. KG, Switzerland, pp. 568-578, 25th International Conference on Neural Information Processing, ICONIP 2018, Siem Reap, Cambodia, 13/12/18. https://doi.org/10.1007/978-3-030-04212-7_50

Learning latent byte-level feature representation for malware detection. / Yousefi-Azar, Mahmood; Hamey, Len; Varadharajan, Vijay; Chen, Shiping.

Neural Information Processing: 25th International Conference, ICONIP 2018, Proceedings, Part IV. ed. / Long Cheng; Andrew Chi Sing Leung; Seiichi Ozawa. Switzerland : Springer-VDI-Verlag GmbH & Co. KG, 2018. p. 568-578 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11304 LNCS).

TY - GEN

T1 - Learning latent byte-level feature representation for malware detection

AU - Yousefi-Azar, Mahmood

AU - Hamey, Len

AU - Varadharajan, Vijay

AU - Chen, Shiping

PY - 2018/12/16

Y1 - 2018/12/16

N2 - This paper proposes two byte-level feature representations of binary files for malware detection. The proposed static feature representations do not need any third-party tools and are independent of the operating system because they operate on the raw file bytes. Sparse term-frequency simhashing (s-tf-simhashing) is a faster variant of tf-simhashing: it requires less computation and outperforms the original dense tf-simhashing. The binary word2vec (Bword2vec) representation embeds the semantic relationships of the n-grams into the code vectors. Bword2vec employs a binary-to-word2vec representation that yields a lower-dimensional feature space than s-tf-simhashing, further reducing the computation of the classifier. We show that the proposed techniques can be used successfully to analyze both full malware apps and infected files. The experiments are conducted on real Android and PDF malware datasets.

KW - Binary Word2vec

KW - Binary-level feature representation

KW - Malware detection

KW - Sparse term-frequency simhashing

UR - http://www.scopus.com/inward/record.url?scp=85058984513&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-04212-7_50

DO - 10.1007/978-3-030-04212-7_50

M3 - Conference proceeding contribution

SN - 9783030042110

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 568

EP - 578

BT - Neural Information Processing

A2 - Cheng, Long

A2 - Leung, Andrew Chi Sing

A2 - Ozawa, Seiichi

PB - Springer-VDI-Verlag GmbH & Co. KG

CY - Switzerland

ER -

Yousefi-Azar M, Hamey L, Varadharajan V, Chen S. Learning latent byte-level feature representation for malware detection. In Cheng L, Leung ACS, Ozawa S, editors, Neural Information Processing: 25th International Conference, ICONIP 2018, Proceedings, Part IV. Switzerland: Springer-VDI-Verlag GmbH & Co. KG. 2018. p. 568-578. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-030-04212-7_50