Record matching with non-key attribute values

Qiang Yang, Zhi Xu Li*, Jun Jiang, Peng Peng Zhao, Guan Feng Liu, An Liu, Xiao Fang Zhou

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Record matching (RM) identifies instances in different data sources that refer to the same entity. Existing work relies mainly on the similarity between the key attribute values of instances, and little work uses non-key attribute values. As a result, two instances that refer to the same entity but do not have similar key attribute values may be missed as a matching pair. However, particular non-key attribute values shared by the two instances can reveal the relationship between them. Based on this intuition, we propose a novel RM method based on non-key attribute values. Compared with key attributes, non-key attributes can be noisier and less consistent. Moreover, there are usually many more non-key attributes than key attributes, so RM based on non-key attributes faces a significant efficiency problem. To address these challenges, we propose a rule-based algorithm built on a tree-like structure. This structure not only handles noisy and missing values but also greatly improves the efficiency of the method by identifying matched instances or filtering out unmatched instances as early as possible. Experimental results on several data sets show that our method outperforms existing RM methods, achieving higher precision and recall. In addition, the proposed techniques improve the efficiency of a baseline algorithm by more than 10 times.
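
The abstract does not spell out the rules or the tree-like structure, so the following is only a minimal illustrative sketch of the general idea it describes: comparing non-key attribute values while tolerating noisy or missing values and stopping as soon as a pair can be accepted or rejected. It is not the authors' algorithm. The function names, the attribute ordering, and the thresholds `accept_at` and `reject_at` are all assumptions made for illustration.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def normalize(value: Optional[str]) -> Optional[str]:
    """Lowercase and collapse whitespace; None or empty is treated as missing."""
    if value is None:
        return None
    cleaned = " ".join(value.lower().split())
    return cleaned or None


def match_by_non_key(a: Record, b: Record, attrs: list[str],
                     accept_at: int = 2, reject_at: int = 2) -> Optional[bool]:
    """Compare non-key attributes in order, stopping at the first decision.

    Returns True (match), False (non-match), or None (undecided).
    The thresholds are hypothetical, not taken from the paper.
    """
    agree = disagree = 0
    for attr in attrs:
        va, vb = normalize(a.get(attr)), normalize(b.get(attr))
        if va is None or vb is None:
            continue  # missing value: evidence neither for nor against
        if va == vb:
            agree += 1
        else:
            disagree += 1
        if agree >= accept_at:
            return True   # early accept: enough shared values
        if disagree >= reject_at:
            return False  # early reject: enough conflicting values
    return None
```

Ordering `attrs` so that the most discriminative attributes come first lets most pairs be decided after only a few comparisons, which is the kind of early termination the abstract credits for the efficiency gain.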

Original language: English
Pages (from-to): 2075-2087
Number of pages: 13
Journal: Jisuanji Xuebao/Chinese Journal of Computers
Volume: 39
Issue number: 10
Publication status: Published - 1 Oct 2016
Externally published: Yes

Keywords

  • algorithm
  • data quality
  • non-key attribute
  • performance
  • record matching
