Citation enrichment improves deduplication of primary evidence

Miew Keen Choong*, Sarah Thorning, Guy Tsafnat

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review


Objective: To automatically detect duplicate citations in a bibliographical database.

Background: Citations retrieved from multiple search databases have different forms, making manual and automatic detection of duplicates difficult. Existing methods rely on fuzzy-similarity measures, which are error-prone.

Methods: We analysed four pairs of original search results from MEDLINE and EMBASE that were used to create systematic reviews. An automatic tool deduplicated citations by first enriching them with Digital Object Identifiers (DOIs) and/or other unique identifiers. Duplication of records was then determined by comparing these unique identifiers. We compared our method with the duplicate detection function of a popular citation management desktop application in several configurations.

Results: Citation enrichment identified 93 % (range 86 %–100 %) of the duplicates indexed online and erroneously marked 3 % (range 0 %–6 %) of documents as duplicates. The citation management application found 68 % (range 64 %–72 %) of duplicates without error using its default settings. When set for highest deduplication, the citation management application found 94 % of duplicates (range 77 %–100 %) with a 4 % error rate (range 0 %–8 %).

Conclusion: Citation enrichment using unique identifiers enhances automatic deduplication. On its own, the approach seems slightly superior to tools that compare citations without enrichment. Methods that combine citation enrichment with existing fuzzy-matching may substantially reduce resource requirements of evidence synthesis.
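The identifier-comparison step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' tool: it assumes each citation has already been enriched with a DOI where one could be found, and the function names, record IDs, and example DOIs are hypothetical. DOIs are compared case-insensitively after common prefixes are stripped.

```python
from collections import defaultdict

# Common ways a DOI is written in exported citation records.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

def normalize_doi(doi):
    """Lower-case a DOI and strip a leading URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()

def find_duplicates(citations):
    """Group record IDs that share a normalized DOI.

    `citations` is an iterable of (record_id, doi) pairs; records without
    a DOI are never grouped, since there is no identifier to compare.
    """
    groups = defaultdict(list)
    for record_id, doi in citations:
        if doi:
            groups[normalize_doi(doi)].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical records: (record ID, DOI found during enrichment or None).
records = [
    ("medline-1", "10.1000/ABC123"),
    ("embase-7", "https://doi.org/10.1000/abc123"),
    ("medline-2", "10.1000/xyz789"),
    ("embase-9", None),
]

print(find_duplicates(records))  # [['medline-1', 'embase-7']]
```

Records for which enrichment finds no identifier remain unmatched in this sketch, which is where the fuzzy matching mentioned in the Conclusion could be combined with it.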

Original language: English
Title of host publication: Trends and Applications in Knowledge Discovery and Data Mining, PAKDD 2015 Workshops: BigPMA, VLSP, QIMIE, DAEBH, Revised Selected Papers
Editors: Xiao-Li Li, Tru Cao, Ee-Peng Lim, Zhi-Hua Zhou, Tu-Bao Ho, David Cheung, Hiroshi Motoda
Place of Publication: Cham
Publisher: Springer, Springer Nature
Number of pages: 8
ISBN (Electronic): 9783319256603
ISBN (Print): 9783319256597
Publication status: Published - 2015
Event: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015 - Ho Chi Minh City, Viet Nam
Duration: 19 May 2015 → 19 May 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN (Print): 03029743
ISSN (Electronic): 16113349


Other: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015
Country/Territory: Viet Nam
City: Ho Chi Minh City

