Hyper-parameter optimization for privacy-preserving record linkage

Joyce Yu*, Jakub Nabaglo, Dinusha Vatsalan, Wilko Henecka, Brian Thorne

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

2 Citations (Scopus)

Abstract

Linkage of records that refer to the same entity across different databases finds applications in several areas, including healthcare, business, national security, and government services. In the absence of unique identifiers, quasi-identifiers (e.g. name, age, address) must be used to identify records of the same entity in different databases. These quasi-identifiers (QIDs) contain personally identifiable information (PII). Therefore, record linkage must be conducted in a way that preserves privacy. Cryptographic Long-term Key (CLK)-based encoding is a popular privacy-preserving record linkage (PPRL) technique in which different QIDs are encoded independently into a representation that preserves records' similarity but obscures PII. To achieve accurate results, the parameters of a CLK encoding must be tuned to suit the data. To this end, we study a Bayesian optimization method for effectively tuning hyper-parameters for CLK-based PPRL. Moreover, ground-truth labels (match or non-match) would be useful for evaluating linkage quality in the optimization, but they are often difficult to access. We address this by proposing an unsupervised method that uses heuristics to estimate linkage quality. Finally, we investigate the information leakage risk posed by the iterative nature of optimization methods and recommend ways to mitigate it. Experimental results show that our method requires fewer iterations to achieve good linkage results than two baseline optimization methods. It not only improves linkage quality and the computational efficiency of hyper-parameter optimization, but also reduces the privacy risk.
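
As a rough illustration of the kind of encoding the abstract describes, the following minimal sketch hashes the character bigrams of each QID into a single Bloom filter using a keyed hash, then compares two encodings with the Dice coefficient. It is not the authors' implementation: the function names (`bigrams`, `clk_encode`, `dice`), the use of keyed BLAKE2b, and the defaults `num_bits=1024` and `hashes_per_token=10` are all assumptions chosen for illustration. The last two stand in for the sort of encoding parameters that would need tuning; the abstract does not list the paper's actual hyper-parameters.

```python
import hashlib


def bigrams(value):
    """Split a field value into padded, lower-cased character bigrams."""
    padded = f" {value.strip().lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def clk_encode(record, secret, num_bits=1024, hashes_per_token=10):
    """Encode a record (list of QID strings) as the set of Bloom filter
    bit positions that are switched on.

    Assumption: every field is hashed into the same filter, and
    `hashes_per_token` keyed hashes are inserted per bigram.
    """
    bits = set()
    for field in record:
        for token in bigrams(field):
            for i in range(hashes_per_token):
                digest = hashlib.blake2b(
                    f"{i}:{token}".encode("utf-8"),
                    key=secret,          # shared secret between data owners
                    digest_size=8,
                ).digest()
                bits.add(int.from_bytes(digest, "big") % num_bits)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"hypothetical-shared-secret"
rec_a = clk_encode(["john", "smith"], secret)
rec_b = clk_encode(["jon", "smith"], secret)    # near-duplicate of rec_a
rec_c = clk_encode(["mary", "brown"], secret)   # different person

print(dice(rec_a, rec_b), dice(rec_a, rec_c))
```

In this sketch the near-duplicate pair shares most of its bigrams and therefore most of its bit positions, while the unrelated pair overlaps only on a few bits. Changing `num_bits` or `hashes_per_token` shifts both scores, which is why such parameters have to be chosen to suit the data.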

Original language: English
Title of host publication: ECML PKDD 2020 Workshops
Subtitle of host publication: Workshops of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020): SoGood 2020, PDFL 2020, MLCS 2020, NFMCP 2020, DINA 2020, EDML 2020, XKDD 2020 and INRA 2020, Ghent, Belgium, September 14–18, 2020, Proceedings
Editors: Irena Koprinska, Annalisa Appice, Luiza Antonie, Riccardo Guidotti, Rita P. Ribeiro, João Gama, Yamuna Krishnamurthy, Donato Malerba, Michelangelo Ceci, Elio Masciari, Peter Christen, Erich Schubert, Anna Monreale, Salvatore Rinzivillo, Andreas Lommatzsch, Michael Kamp, Corrado Loglisci, Albrecht Zimmermann, Özlem Özgöbek, Ricard Gavaldà, Linara Adilova, Pedro M. Ferreira, Ibéria Medeiros, Giuseppe Manco, Zbigniew W. Ras, Eirini Ntoutsi, Arthur Zimek, Przemyslaw Biecek, Benjamin Kille, Jon Atle Gulla
Place of Publication: Cham, Switzerland
Publisher: Springer, Springer Nature
Pages: 281-296
Number of pages: 16
ISBN (Electronic): 9783030659653
ISBN (Print): 9783030659646
DOIs
Publication status: Published - 2020
Externally published: Yes
Event: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2020 - Ghent, Belgium
Duration: 14 Sept 2020 to 18 Sept 2020

Publication series

Name: Communications in Computer and Information Science
Volume: 1323
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2020
Country/Territory: Belgium
City: Ghent
Period: 14/09/20 to 18/09/20

Keywords

  • Bayesian optimization
  • Bloom filters
  • Heuristic measures
  • Hyper-parameters
  • Information leakage risk
  • Unsupervised
