Mining source code topics through topic model and words embedding

Wei Emma Zhang*, Quan Z. Sheng, Ermyas Abebe, M. Ali Babar, Andi Zhou

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

2 Citations (Scopus)

Abstract

Developers can now build their own applications on top of existing systems. However, a lack of documentation hinders the reuse of software systems. We examine the problem of mining topics (i.e., topic extraction) from source code, which can facilitate comprehension of software systems. We propose a topic extraction method, Embedded Topic Extraction (EmbTE), that leverages word embedding techniques to take word semantics into account, something not previously considered when mining topics from source code. We also adopt Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to extract topics from source code. Moreover, we propose an automated term selection algorithm that identifies the most contributory terms in source code for the topic extraction task. Empirical studies on GitHub (https://github.com/) Java projects show that EmbTE outperforms the other methods by providing more coherent topics. The results also indicate that method names, method comments, class names and class comments are the most contributory types of terms for source code topic extraction.
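The abstract names the term types that feed topic extraction but does not show how they are obtained. The following is a minimal sketch, not the authors' term selection algorithm, of how class names, method names and comments might be pulled from Java source and split into words before being passed to a topic model such as LDA or NMF. The sample Java snippet, the regular expressions and the function names (`split_identifier`, `extract_terms`) are all assumptions made for illustration; for simplicity the sketch does not separate class comments from method comments.

```python
import re
from collections import Counter

# Hypothetical Java snippet, used only to illustrate the idea.
JAVA_SOURCE = '''
/** Parses HTTP requests. */
public class HttpRequestParser {
    // read the header lines
    public String parseHeader(String rawInput) {
        return rawInput.trim();
    }
}
'''

# Simplified patterns (assumptions): they cover common declarations only,
# not the full Java grammar.
COMMENT_RE = re.compile(r'/\*.*?\*/|//[^\n]*', re.DOTALL)
CLASS_RE = re.compile(r'\bclass\s+([A-Za-z_]\w*)')
METHOD_RE = re.compile(
    r'\b(?:public|protected|private)\s+(?:static\s+)?[\w<>\[\]]+\s+'
    r'([A-Za-z_]\w*)\s*\('
)


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', name)
    return [p.lower() for p in parts]


def extract_terms(source):
    """Return words grouped by term type: class names, method names, comments."""
    comments = COMMENT_RE.findall(source)
    # Remove comments so that declarations are matched against code only.
    code_only = COMMENT_RE.sub(' ', source)
    terms = {'class_names': [], 'method_names': [], 'comments': []}
    for name in CLASS_RE.findall(code_only):
        terms['class_names'].extend(split_identifier(name))
    for name in METHOD_RE.findall(code_only):
        terms['method_names'].extend(split_identifier(name))
    for comment in comments:
        terms['comments'].extend(w.lower() for w in re.findall(r'[A-Za-z]+', comment))
    return terms


terms = extract_terms(JAVA_SOURCE)
# Bag of words over all term types; one such bag per file or project
# could form a row of the document-term matrix given to a topic model.
bag_of_words = Counter(w for words in terms.values() for w in words)
```

Keeping the term types in separate lists makes it possible to include or exclude a type (for example, comments) and compare the resulting topics, which is the kind of comparison the term selection step in the abstract implies.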

Original language: English
Title of host publication: Advanced data mining and applications
Subtitle of host publication: 12th International Conference, ADMA 2016, Gold Coast, QLD, Australia, December 12–15, 2016, proceedings
Place of Publication: Berlin; New York
Publisher: Springer, Springer Nature
Pages: 664-676
Number of pages: 13
ISBN (Print): 9783319495859
Publication status: Published - 2016
Externally published: Yes
Event: 12th International Conference on Advanced Data Mining and Applications, ADMA 2016 - Gold Coast, Australia
Duration: 12 Dec 2016 – 15 Dec 2016

Publication series

Name: Lecture Notes in Artificial Intelligence
Volume: 10086
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 12th International Conference on Advanced Data Mining and Applications, ADMA 2016
Country/Territory: Australia
City: Gold Coast
Period: 12/12/16 – 15/12/16

Keywords

  • Source code mining
  • Topic model
  • Word embedding
