Abstract
Record Matching (RM) identifies instances that refer to the same entity across different data sources. Existing work mainly relies on the similarity between the key attribute values of instances, and few studies use non-key attribute values. As a result, two instances that refer to the same entity but do not have similar key attribute values may be missed as a matching pair. However, particular non-key attribute values shared by the two instances may reveal the relationship between them. Based on this intuition, we propose a novel RM method that uses non-key attribute values. Compared with key attributes, non-key attributes tend to be noisier and less consistent. In addition, there are usually many more non-key attributes than key attributes, so RM based on non-key attributes faces a significant efficiency problem. To address these challenges, we propose a rule-based algorithm built on a tree-like structure. This structure not only handles noisy and missing values, but also greatly improves efficiency by identifying matched instances or filtering out unmatched instances as early as possible. Experimental results on several data sets show that our method outperforms existing RM methods, achieving higher precision and recall. Moreover, the proposed techniques speed up a baseline algorithm by more than 10 times.
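The abstract does not specify the tree-like structure or its rules, so the following Python sketch is only an illustration of the general idea, not the paper's algorithm. It assumes a hypothetical rule tree in which each node compares one non-key attribute and branches on agreement, disagreement, or a missing value; the names `Node`, `normalize`, `decide`, and the attributes `birth_year`, `email`, and `city` are all invented for this example.

```python
from dataclasses import dataclass
from typing import Union

MATCH, NON_MATCH = "match", "non-match"


@dataclass
class Node:
    """Hypothetical rule-tree node: tests one non-key attribute."""
    attribute: str
    on_agree: Union["Node", str]      # subtree or final decision
    on_disagree: Union["Node", str]
    on_missing: Union["Node", str]    # taken when either value is absent


def normalize(value):
    """Crude noise handling (assumption): trim, lowercase, empty -> None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return text or None


def decide(tree, a, b):
    """Walk the tree and stop at the first leaf.

    Returns the decision and the number of attributes compared, so the
    effect of early acceptance or early filtering is visible.
    """
    node, checked = tree, 0
    while isinstance(node, Node):
        checked += 1
        va = normalize(a.get(node.attribute))
        vb = normalize(b.get(node.attribute))
        if va is None or vb is None:
            node = node.on_missing
        elif va == vb:
            node = node.on_agree
        else:
            node = node.on_disagree
    return node, checked


# Illustrative rules only: a birth-year conflict filters the pair at once.
tree = Node(
    "birth_year",
    on_agree=Node("email", MATCH, NON_MATCH,
                  Node("city", MATCH, NON_MATCH, NON_MATCH)),
    on_disagree=NON_MATCH,
    on_missing=Node("email", MATCH, NON_MATCH, NON_MATCH),
)

a = {"name": "J. Smith", "birth_year": 1980, "email": "JS@example.org "}
b = {"name": "John Smith", "birth_year": "1980", "email": "js@example.org"}
c = {"name": "J. Smith", "birth_year": 1975, "email": "js@example.org"}
d = {"name": "Jon Smyth", "email": "js@example.org"}

print(decide(tree, a, b))  # names differ, non-key values agree
print(decide(tree, a, c))  # filtered after a single comparison
print(decide(tree, a, d))  # missing birth_year falls back to email
```

In this sketch, the pair `a` and `c` is rejected after one comparison, which is the kind of early filtering the abstract credits for the efficiency gain; a real rule tree would presumably be derived from data rather than written by hand.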
Original language | English |
---|---|
Pages (from-to) | 2075-2087 |
Number of pages | 13 |
Journal | Jisuanji Xuebao/Chinese Journal of Computers |
Volume | 39 |
Issue number | 10 |
Publication status | Published - 1 Oct 2016 |
Externally published | Yes |
Keywords
- algorithm
- data quality
- non-key attribute
- performance
- record matching