TY - GEN
T1 - Hyper-parameter optimization for privacy-preserving record linkage
AU - Yu, Joyce
AU - Nabaglo, Jakub
AU - Vatsalan, Dinusha
AU - Henecka, Wilko
AU - Thorne, Brian
PY - 2020
Y1 - 2020
N2 - Linkage of records that refer to the same entity across different databases finds applications in several areas, including healthcare, business, national security, and government services. In the absence of unique identifiers, quasi-identifiers (e.g. name, age, address) must be used to identify records of the same entity in different databases. These quasi-identifiers (QIDs) contain personally identifiable information (PII). Therefore, record linkage must be conducted in a way that preserves privacy. Cryptographic Long-term Key (CLK)-based encoding is a popular privacy-preserving record linkage (PPRL) technique in which different QIDs are encoded independently into a representation that preserves records’ similarity but obscures PII. To achieve accurate results, the parameters of a CLK encoding must be tuned to suit the data. To this end, we study a Bayesian optimization method for effectively tuning hyper-parameters for CLK-based PPRL. Moreover, ground-truth labels (match or non-match) would be useful for evaluating linkage quality in the optimization, but they are often difficult to access. We address this by proposing an unsupervised method that uses heuristics to estimate linkage quality. Finally, we investigate the information leakage risk with the iterative approach of optimization methods and discuss recommendations to mitigate the risk. Experimental results show that our method requires fewer iterations to achieve good linkage results compared to two baseline optimization methods. It not only improves linkage quality and computational efficiency of hyper-parameter optimization, but also reduces the privacy risk.
AB - Linkage of records that refer to the same entity across different databases finds applications in several areas, including healthcare, business, national security, and government services. In the absence of unique identifiers, quasi-identifiers (e.g. name, age, address) must be used to identify records of the same entity in different databases. These quasi-identifiers (QIDs) contain personally identifiable information (PII). Therefore, record linkage must be conducted in a way that preserves privacy. Cryptographic Long-term Key (CLK)-based encoding is a popular privacy-preserving record linkage (PPRL) technique in which different QIDs are encoded independently into a representation that preserves records’ similarity but obscures PII. To achieve accurate results, the parameters of a CLK encoding must be tuned to suit the data. To this end, we study a Bayesian optimization method for effectively tuning hyper-parameters for CLK-based PPRL. Moreover, ground-truth labels (match or non-match) would be useful for evaluating linkage quality in the optimization, but they are often difficult to access. We address this by proposing an unsupervised method that uses heuristics to estimate linkage quality. Finally, we investigate the information leakage risk with the iterative approach of optimization methods and discuss recommendations to mitigate the risk. Experimental results show that our method requires fewer iterations to achieve good linkage results compared to two baseline optimization methods. It not only improves linkage quality and computational efficiency of hyper-parameter optimization, but also reduces the privacy risk.
KW - Bayesian optimization
KW - Bloom filters
KW - Heuristic measures
KW - Hyper-parameters
KW - Information leakage risk
KW - Unsupervised
UR - http://www.scopus.com/inward/record.url?scp=85101313024&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-65965-3_18
DO - 10.1007/978-3-030-65965-3_18
M3 - Conference proceeding contribution
SN - 9783030659646
T3 - Communications in Computer and Information Science
SP - 281
EP - 296
BT - ECML PKDD 2020 Workshops
A2 - Koprinska, Irena
A2 - Appice, Annalisa
A2 - Antonie, Luiza
A2 - Guidotti, Riccardo
A2 - Ribeiro, Rita P.
A2 - Gama, João
A2 - Krishnamurthy, Yamuna
A2 - Malerba, Donato
A2 - Ceci, Michelangelo
A2 - Masciari, Elio
A2 - Christen, Peter
A2 - Schubert, Erich
A2 - Monreale, Anna
A2 - Rinzivillo, Salvatore
A2 - Lommatzsch, Andreas
A2 - Kamp, Michael
A2 - Loglisci, Corrado
A2 - Zimmermann, Albrecht
A2 - Özgöbek, Özlem
A2 - Gavaldà, Ricard
A2 - Adilova, Linara
A2 - Ferreira, Pedro M.
A2 - Medeiros, Ibéria
A2 - Manco, Giuseppe
A2 - Ras, Zbigniew W.
A2 - Ntoutsi, Eirini
A2 - Zimek, Arthur
A2 - Biecek, Przemyslaw
A2 - Kille, Benjamin
A2 - Gulla, Jon Atle
PB - Springer, Springer Nature
CY - Cham, Switzerland
T2 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2020
Y2 - 14 September 2020 through 18 September 2020
ER -