Efficient interactive training selection for large-scale entity resolution

Qing Wang*, Dinusha Vatsalan, Peter Christen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

16 Citations (Scopus)

Abstract

Entity resolution (ER) has widespread applications in many areas, including e-commerce, health care, the social sciences, and crime and fraud detection. A crucial step in ER is the accurate classification of pairs of records into matches (assumed to refer to the same entity) and non-matches (assumed to refer to different entities). In most practical ER applications it is difficult and costly to obtain training data of high quality and sufficient size, which impedes the learning of an ER classifier. We tackle this problem using an interactive learning algorithm that exploits the cluster structure in similarity vectors calculated from compared record pairs. We select informative training examples to assess the purity of clusters, and recursively split clusters until clusters pure enough for training are found. We consider two aspects of active learning that are significant in practical applications: a limited budget for the number of manual classifications that can be done, and a noisy oracle whose manual labels might be incorrect. Experiments using several real data sets show that manual labeling efforts can be significantly reduced for training an ER classifier without compromising matching quality.
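
The loop described in the abstract (sample a cluster, estimate its purity, then either accept it as training data or split it, all within a labeling budget) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the evenly spaced sampling, the median split on mean similarity, and the default `purity` and `sample_size` values are all assumptions chosen for illustration. A purity threshold below 1.0 is one simple way to tolerate a few wrong oracle answers.

```python
from collections import deque


def spread_sample(sorted_ids, k):
    """Pick k ids spread evenly across a cluster sorted by similarity (assumed heuristic)."""
    n = len(sorted_ids)
    if k >= n:
        return list(sorted_ids)
    if k == 1:
        return [sorted_ids[n // 2]]
    return [sorted_ids[i * (n - 1) // (k - 1)] for i in range(k)]


def select_training(vectors, oracle, budget, purity=0.9, sample_size=5):
    """Return ({record pair index: is_match}, number of oracle queries used).

    vectors: similarity vectors, one per compared record pair.
    oracle:  callable returning True (match) or False (non-match); may be noisy.
    budget:  maximum number of oracle queries (manual classifications).
    """
    def mean(i):
        return sum(vectors[i]) / len(vectors[i])

    # Start with one cluster holding every pair, ordered by mean similarity.
    queue = deque([sorted(range(len(vectors)), key=mean)])
    training, used = {}, 0
    while queue and used < budget:
        cluster = queue.popleft()
        if not cluster:
            continue
        want = min(sample_size, len(cluster))
        k = min(want, budget - used)
        sample = spread_sample(cluster, k)
        answers = {i: oracle(vectors[i]) for i in sample}
        used += k
        match_frac = sum(answers.values()) / k
        if k == want and max(match_frac, 1 - match_frac) >= purity:
            # Pure enough: give the whole cluster the majority label.
            label = match_frac >= 0.5
            training.update({i: label for i in cluster})
        elif k < want or k == len(cluster):
            # Budget ran out or the cluster is fully hand-labeled: keep only those labels.
            training.update(answers)
        else:
            # Impure: split at the median (a stand-in for a real clustering step).
            mid = len(cluster) // 2
            queue.append(cluster[:mid])
            queue.append(cluster[mid:])
    return training, used
```

A call such as `select_training(vectors, oracle, budget=20)` returns the selected training labels together with the number of manual classifications actually spent, which never exceeds `budget`.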
Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining
Subtitle of host publication: 19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19–22, 2015, Proceedings, Part II
Editors: Tru Cao, Ee-Peng Lim, Zhi-Hua Zhou, Tu-Bao Ho, David Cheung, Hiroshi Motoda
Place of Publication: Cham, Switzerland
Publisher: Springer, Springer Nature
Pages: 562-573
Number of pages: 12
ISBN (Electronic): 9783319180328
ISBN (Print): 9783319180311
Publication status: Published - 2015
Externally published: Yes
Event: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015 - Ho Chi Minh City, Viet Nam
Duration: 19 May 2015 to 19 May 2015

Other

Other: 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2015
Country/Territory: Viet Nam
City: Ho Chi Minh City
Period: 19/05/15 to 19/05/15

Keywords

  • Data matching
  • Record linkage
  • Deduplication
  • Active learning
  • Noisy oracle
  • Hierarchical clustering
  • Interactive labeling
