Detecting duplicate posts in programming QA communities via latent semantics and association rules

Wei Emma Zhang, Quan Z. Sheng, Jey Han Lau, Ermyas Abebe

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

26 Citations (Scopus)
127 Downloads (Pure)


Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicate questions that have already been answered are frequently created. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and it leaves many duplicate questions undetected. Existing duplicate detection methodologies from traditional community-based question-answering (CQA) websites are difficult to adopt directly for PCQA, as PCQA posts often contain source code, which is linguistically very different from natural language. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features, and phrase pairs that co-occur frequently in duplicate questions, mined using machine translation systems. These features capture semantic similarities between questions and produce strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well, in some cases achieving over 30% improvement compared to state-of-the-art benchmarks. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and made the dataset available to the public.
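The pair-classification framing in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors, the association table, the weights, and every function name below are hypothetical toy values chosen for illustration, the topic model feature is omitted, and a hand-set logistic rule stands in for whatever trained classifier the paper uses.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy word vectors; a real system would load trained embeddings.
TOY_VECTORS: Dict[str, List[float]] = {
    "sort": [0.9, 0.1, 0.0],
    "order": [0.8, 0.2, 0.1],
    "list": [0.1, 0.9, 0.0],
    "array": [0.2, 0.8, 0.1],
    "parse": [0.0, 0.1, 0.9],
    "json": [0.1, 0.0, 0.9],
}

# Hypothetical association scores for phrase pairs that co-occur in duplicates.
TOY_ASSOCIATIONS: Dict[Tuple[str, str], float] = {
    ("sort", "order"): 0.8,
    ("list", "array"): 0.6,
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def mean_vector(tokens: List[str]) -> List[float]:
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vecs) for component in zip(*vecs)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def association_score(t1: List[str], t2: List[str]) -> float:
    """Mean association score over all cross-question token pairs."""
    if not t1 or not t2:
        return 0.0
    total = 0.0
    for a in t1:
        for b in t2:
            total += TOY_ASSOCIATIONS.get((a, b), TOY_ASSOCIATIONS.get((b, a), 0.0))
    return total / (len(t1) * len(t2))


def pair_features(q1: str, q2: str) -> List[float]:
    """Feature vector for a question pair: [vector similarity, association]."""
    t1, t2 = tokenize(q1), tokenize(q2)
    return [cosine(mean_vector(t1), mean_vector(t2)), association_score(t1, t2)]


def predict_duplicate(features: List[float],
                      weights: Tuple[float, float] = (4.0, 4.0),
                      bias: float = -3.5) -> bool:
    """Logistic decision rule with hand-set (not learned) parameters."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5
```

Calling `predict_duplicate(pair_features("sort list", "order array"))` scores one candidate pair. In practice the weights would be learned from question pairs labeled as duplicates or non-duplicates rather than set by hand.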
Original language: English
Title of host publication: Proceedings of the 26th International Conference on World Wide Web
Number of pages: 9
Publication status: Published - 2017
Event: International World Wide Web Conference Committee (26th : 2017) - Perth, Australia
Duration: 3 Apr 2017 to 7 Apr 2017


Conference: International World Wide Web Conference Committee (26th : 2017)

Bibliographical note

Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.


  • community-based question answering
  • latent semantics
  • association rules
  • question quality
  • classification
