TY - JOUR
T1 - Incremental clustering techniques for multi-party Privacy-Preserving Record Linkage
AU - Vatsalan, Dinusha
AU - Christen, Peter
AU - Rahm, Erhard
PY - 2020/7
Y1 - 2020/7
N2 - Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, most prominently in the healthcare domain. PPRL techniques tackle this problem by conducting linkage on masked (encoded) values. Employing PPRL on records from multiple (more than two) parties/sources (multi-party PPRL, MP-PPRL) is an increasingly important but challenging problem that so far has not been sufficiently solved. Existing MP-PPRL approaches are limited to finding only those entities that are present in all parties, thereby missing entities that match only in a subset of parties. Furthermore, previous MP-PPRL approaches face substantial scalability limitations due to the need for a large number of comparisons between masked records. We thus propose and evaluate new MP-PPRL approaches that find matches in any subset of parties and still scale to many parties. Our approaches maintain all matches within clusters, where these clusters are incrementally extended or refined by considering records from one party after the other. An empirical evaluation using multiple real datasets ranging from 3 to 26 parties, each containing up to 5 million records, validates that our protocols are efficient and significantly outperform existing MP-PPRL approaches in terms of linkage quality and scalability.
AB - Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, most prominently in the healthcare domain. PPRL techniques tackle this problem by conducting linkage on masked (encoded) values. Employing PPRL on records from multiple (more than two) parties/sources (multi-party PPRL, MP-PPRL) is an increasingly important but challenging problem that so far has not been sufficiently solved. Existing MP-PPRL approaches are limited to finding only those entities that are present in all parties, thereby missing entities that match only in a subset of parties. Furthermore, previous MP-PPRL approaches face substantial scalability limitations due to the need for a large number of comparisons between masked records. We thus propose and evaluate new MP-PPRL approaches that find matches in any subset of parties and still scale to many parties. Our approaches maintain all matches within clusters, where these clusters are incrementally extended or refined by considering records from one party after the other. An empirical evaluation using multiple real datasets ranging from 3 to 26 parties, each containing up to 5 million records, validates that our protocols are efficient and significantly outperform existing MP-PPRL approaches in terms of linkage quality and scalability.
KW - Data linkage
KW - Graph matching
KW - Multiple databases
KW - Privacy
KW - Scalability
KW - Subset matching
UR - http://www.scopus.com/inward/record.url?scp=85081903251&partnerID=8YFLogxK
UR - https://dataportal.arc.gov.au/NCGP/Web/Grant/Grant/DP130101801
UR - http://purl.org/au-research/grants/arc/DP160101934
U2 - 10.1016/j.datak.2020.101809
DO - 10.1016/j.datak.2020.101809
M3 - Article
SN - 0169-023X
VL - 128
SP - 1
EP - 19
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
M1 - 101809
ER -