Abstract
Scalable data processing platforms built on cloud computing are becoming increasingly attractive as infrastructure for supporting big data applications, but privacy concerns remain one of the major obstacles to using public cloud platforms. Multidimensional anonymisation, a global-recoding generalisation scheme for privacy-preserving data publishing, has been a recent focus because of its ability to balance data obfuscation and usability. Existing multidimensional anonymisation methods suffer from scalability problems when handling big data because of the impractical cost of serial I/O. Given the recursive nature of multidimensional anonymisation, parallelisation is an ideal solution to these scalability issues. However, it remains a challenge to use existing distributed and parallel paradigms directly for recursive computation. In this paper, we propose a scalable approach to multidimensional anonymisation of big data based on MapReduce, a state-of-the-art data processing paradigm. Our basic idea is to partition a data set recursively into smaller partitions using MapReduce until every partition fits in the memory of a computing node. A tree indexing structure is proposed to achieve the recursive computation. Moreover, we show the applicability of our approach to differential privacy. Experimental results on real-life data demonstrate that our approach can significantly improve the scalability of multidimensional anonymisation over existing methods.
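The recursive partitioning idea summarised in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's MapReduce implementation: the names `Node`, `build_tree`, `leaves` and `capacity` are hypothetical, the binary path labels are only one possible way to index a partition tree, and the split rule (widest-range attribute, cut at the lower median) is an assumed Mondrian-style heuristic that the abstract does not specify. The sketch also stops once a partition is small enough and does not enforce any privacy constraint such as k-anonymity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Record = Tuple[float, ...]  # one record = a tuple of numeric quasi-identifiers


@dataclass
class Node:
    """Node of a hypothetical partition tree; only leaves hold records."""
    path: str                       # root is "", children append "0" or "1"
    dim: Optional[int] = None       # split attribute (internal nodes only)
    cut: Optional[float] = None     # split value (internal nodes only)
    records: List[Record] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(records: List[Record], capacity: int, path: str = "") -> Node:
    """Split recursively until a partition holds at most `capacity` records.

    `capacity` stands in for "fits in the memory of one computing node".
    """
    if len(records) <= capacity:
        return Node(path=path, records=records)

    # Assumed heuristic: split on the attribute with the widest value range.
    n_dims = len(records[0])
    spans = [max(r[d] for r in records) - min(r[d] for r in records)
             for d in range(n_dims)]
    dim = max(range(n_dims), key=spans.__getitem__)
    if spans[dim] == 0:             # all records identical: cannot split
        return Node(path=path, records=records)

    values = sorted(r[dim] for r in records)
    cut = values[(len(values) - 1) // 2]        # lower median
    if cut == values[-1]:                       # ties at the maximum:
        cut = max(v for v in values if v < cut)  # step down so both sides are non-empty

    node = Node(path=path, dim=dim, cut=cut)
    node.left = build_tree([r for r in records if r[dim] <= cut], capacity, path + "0")
    node.right = build_tree([r for r in records if r[dim] > cut], capacity, path + "1")
    return node


def leaves(node: Node) -> Iterator[Node]:
    """Yield the leaf partitions from left to right."""
    if node.left is None or node.right is None:
        yield node
    else:
        yield from leaves(node.left)
        yield from leaves(node.right)


# Toy data: (age, income) pairs, invented for illustration only.
data = [(25, 50000), (32, 64000), (47, 58000),
        (51, 91000), (38, 72000), (29, 43000)]
tree = build_tree(data, capacity=2)
for leaf in leaves(tree):
    print(repr(leaf.path), leaf.records)
```

In a distributed setting, each level of such a tree would correspond to one or more jobs that route records to child partitions, and each leaf could then be anonymised independently in memory; the sketch only shows the shape of the recursion.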
Original language | English |
---|---|
Pages (from-to) | 125-139 |
Number of pages | 15 |
Journal | IEEE Transactions on Big Data |
Volume | 8 |
Issue number | 1 |
Early online date | 27 Dec 2017 |
Publication status | Published - Feb 2022 |
Externally published | Yes |
Keywords
- Data privacy
- Privacy
- Scalability
- Cloud computing
- Partitioning algorithms
- Indexing
- Big data
- Privacy preservation
- MapReduce
- Data anonymisation
- Differential privacy