
Efficient record linkage using a compact hamming space

Dimitrios Karapiperis, Dinusha Vatsalan, Vassilios S. Verykios, Peter Christen

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review


Abstract

Record linkage, the process of identifying similar records that correspond to the same real-world entities across databases, is a well-established research problem in the database, data mining, and information retrieval communities. Computing distances between the string values of records is the key step in determining the similarity of the entities they represent. Because record volumes are typically large, a two-step process is followed: a blocking mechanism first groups similar records together, and a matching mechanism then compares the records that have been inserted into the same block. However, no efficient blocking/matching mechanism exists that provides theoretical guarantees for identifying similar records that consist of strings. To this end, we put forth the novel notion of embedding string-based records into a Hamming space, where such a mechanism exists. The size of these embeddings is kept as small as needed to guarantee that distances in that space correspond to the types of errors that occur between strings, e.g., a missing or a modified character. We build embeddings of 120 bits that accurately represent four fields of a publicly available data set. We also present a distance threshold-aware blocking technique that achieves higher accuracy than blocking approaches which ignore the specified threshold. Our empirical study, conducted on real-world data sets, shows the efficacy of our embedding method compared to several existing solutions.
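
The two-step pipeline described in the abstract (embed strings as bit vectors, block, then match by Hamming distance) can be illustrated with a minimal sketch. This is not the authors' construction: the bigram hashing in `embed`, the bit-sampling blocking in `build_blocks`, and all names and parameter values (`size`, `num_tables`, `bits_per_key`, `threshold`) are assumptions chosen for illustration, and the blocking step here does not take the distance threshold into account.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations


def bigrams(s):
    """Return the set of character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def embed(s, size=32):
    """Hypothetical embedding: set one bit per bigram in a `size`-bit vector.

    The bit position comes from a deterministic hash (CRC32), so one edited
    character changes at most a handful of bits.
    """
    bits = 0
    for g in bigrams(s):
        bits |= 1 << (zlib.crc32(g.encode("utf-8")) % size)
    return bits


def hamming(a, b):
    """Number of bit positions in which two integer-encoded vectors differ."""
    return bin(a ^ b).count("1")


def build_blocks(vectors, size, num_tables=10, bits_per_key=6, seed=0):
    """Blocking by bit sampling: each table keys records on a random subset
    of bit positions, so vectors that are close tend to share a block."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        positions = rng.sample(range(size), bits_per_key)
        table = defaultdict(list)
        for rec_id, v in vectors.items():
            key = tuple((v >> p) & 1 for p in positions)
            table[key].append(rec_id)
        tables.append(table)
    return tables


def candidate_pairs(tables):
    """Collect every pair of record ids that share a block in any table."""
    pairs = set()
    for table in tables:
        for ids in table.values():
            for a, b in combinations(ids, 2):
                pairs.add(tuple(sorted((a, b))))
    return pairs


def link(records, size=32, threshold=4):
    """Matching step: keep candidate pairs within the Hamming threshold."""
    vectors = {rid: embed(text, size) for rid, text in records.items()}
    tables = build_blocks(vectors, size)
    return sorted(
        p for p in candidate_pairs(tables)
        if hamming(vectors[p[0]], vectors[p[1]]) <= threshold
    )
```

In this sketch a single substituted character alters at most two bigrams on each side, so the two embeddings differ in at most four bits, which is why `threshold=4` is used as the illustrative default. Calling `link({"a": "smith", "b": "smith", "c": "jones"})` always returns a list containing `("a", "b")`, since identical strings fall into the same block in every table.
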
Original language: English
Title of host publication: Advances in Database Technology - EDBT 2016
Subtitle of host publication: 19th International Conference on Extending Database Technology, Bordeaux, France, March 15–18, 2016, Proceedings
Editors: Evaggelia Pitoura, Sofian Maabout, Georgia Koutrika, Amelie Marian, Letizia Tanca, Ioana Manolescu, Kostas Stefanidis
Place of Publication: Konstanz, Germany
Publisher: OpenProceedings.org, University of Konstanz, University Library
Pages: 209-220
Number of pages: 12
ISBN (Electronic): 9783893180707
Publication status: Published - 2016
Externally published: Yes
Event: 19th International Conference on Extending Database Technology, EDBT 2016 - Bordeaux, France
Duration: 15 Mar 2016 to 18 Mar 2016

Publication series

Name: Extended Database Technology (EDBT) Conference Proceedings

Conference

Conference: 19th International Conference on Extending Database Technology, EDBT 2016
Country/Territory: France
City: Bordeaux
Period: 15/03/16 to 18/03/16

Bibliographical note

Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
