Feature analysis for duplicate detection in programming QA communities

Wei Emma Zhang*, Quan Z. Sheng, Yanjun Shu, Vanh Khuyen Nguyen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

8 Citations (Scopus)

Abstract

In community question answering (CQA), duplicate questions are questions that have already been asked and answered but are posted again. They add noise to CQA websites and make it harder for users to find answers efficiently. Programming CQA (PCQA), a branch of CQA that hosts programming-related questions, suffers from the same problem. Existing work on duplicate detection in PCQA websites frames the task as supervised learning over question pairs and relies on a number of features extracted from those pairs. However, these approaches extract only textual features and ignore the source code in the questions, which is linguistically very different from natural language. Our work focuses on developing novel features for PCQA duplicate detection. We leverage continuous word vectors from the deep learning literature, probabilistic models from information retrieval, and association pairs mined from duplicate questions using machine translation. We provide an extensive empirical analysis of the performance of these features and their combinations across a range of learning models. Our work could be helpful for both research and practical applications that require extracting features from text that is not entirely natural language.

Original language: English
Title of host publication: 13th International Conference on Advanced Data Mining and Applications: proceedings
Subtitle of host publication: ADMA 2017
Editors: Gao Cong, Chengliang Li, Wen-Chih Peng, Aixin Sun, Wei Emma Zhang
Place of Publication: Cham, Switzerland
Publisher: Springer, Springer Nature
Pages: 623-638
Number of pages: 16
ISBN (Electronic): 9783319691794
ISBN (Print): 9783319691787
Publication status: Published - 2017
Event: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017 - Singapore, Singapore
Duration: 5 Nov 2017 to 6 Nov 2017

Publication series

Name: Lecture Notes in Computer Science
Volume: 10604
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
Name: Lecture Notes in Artificial Intelligence
Volume: 10604

Conference

Conference: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017
Country/Territory: Singapore
City: Singapore
Period: 5/11/17 to 6/11/17

Keywords

  • Duplicate detection
  • Feature analysis
  • Question answering
