CTextEM: using consolidated textual data for entity matching

Qiang Yang, Zhixu Li*, Binbin Gu, An Liu, Guanfeng Liu, Pengpeng Zhao, Lei Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing methods that use only structured attribute values (such as numeric, date or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. More and more databases now have an unstructured textual attribute containing extra Consolidated Textual information (CText for short) about the record, but little work has been done on using the CText information for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CTexts, since each CText contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between the various sub-topics in a CText. In this paper, we work on employing CText in EM. A baseline algorithm that identifies important phrases with high IDF scores from CTexts and then measures the similarity between CTexts based on these phrases does not work well, since it estimates the similarity in one dimension and neglects that these phrases belong to different topics. To this end, we propose a novel co-occurrence-based topic model to identify the various sub-topics in each CText, and then measure the similarity between CTexts on the multiple sub-topic dimensions. Our empirical study on two real-world data sets shows that our method outperforms the state-of-the-art EM methods and Text Understanding models, reaching higher EM precision and recall.
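The baseline described in the abstract (pick high-IDF phrases from each CText, then compare CTexts on those phrases) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how phrases are extracted or which set similarity is used, so the sketch assumes pre-tokenised single words in place of phrases, a fixed top-k cut-off, and Jaccard overlap. The function names `idf_scores`, `top_terms`, `jaccard` and `baseline_similarity` are hypothetical.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Compute IDF = log(N / document frequency) for every term.

    docs is a list of token lists, one per CText (assumed pre-tokenised).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}


def top_terms(tokens, idf, k):
    """Return the k distinct terms of one text with the highest IDF.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(set(tokens), key=lambda t: (-idf[t], t))
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def baseline_similarity(tokens_a, tokens_b, idf, k=3):
    """One-dimensional similarity: overlap of the two high-IDF term sets."""
    return jaccard(top_terms(tokens_a, idf, k), top_terms(tokens_b, idf, k))
```

Because every selected term goes into a single set, the score is one number per record pair. That is the limitation the abstract attributes to the baseline: terms from different sub-topics are mixed into one dimension.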

Original language: English
Title of host publication: Database Systems for Advanced Applications
Subtitle of host publication: 21st International Conference, DASFAA 2016, Proceedings, Part I
Editors: Shamkant B. Navathe, Weili Wu, Shashi Shekhar, Xiaoyong Du, X. Sean Wang, Hui Xiong
Publisher: Springer, Springer Nature
Pages: 117-132
Number of pages: 16
Volume: 9642
ISBN (Electronic): 9783319320250
ISBN (Print): 9783319320243
Publication status: Published - 2016
Externally published: Yes
Event: 21st International Conference on Database Systems for Advanced Applications, DASFAA 2016 - Dallas, United States
Duration: 16 Apr 2016 to 19 Apr 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9642
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 21st International Conference on Database Systems for Advanced Applications, DASFAA 2016
Country/Territory: United States
City: Dallas
Period: 16/04/16 to 19/04/16

Keywords

  • Consolidated textual data
  • CTextEM
  • Entity Matching
  • IDF score
  • Interaction
  • Sub-topic
