Abstract
Synthetic data methods enable sharing sensitive datasets while safeguarding privacy. This paper introduces the Local Resampler (LR) framework, implemented in the open-source Python package synloc, which offers two neighbourhood-selection strategies for synthetic data generation: k-nearest neighbour (kNN-LR) and constrained K-Means clustering (C-LR). The central theoretical contribution is the formal demonstration that kNN-LR systematically underrepresents outliers through an inherent statistical property, providing algorithmic disclosure control without manual intervention. Comprehensive empirical evaluations demonstrate that while kNN-LR achieves stronger privacy protection, C-LR excels at preserving marginal distributions and correlational structures. Simulations across multimodal, non-convex-support and skewed distributions show that LR methods match or exceed the performance of established synthesisers at minimal computational cost. Both LR variants offer regulatory compliance through parametric modelling aligned with the k-anonymity framework. The explicit utility–privacy trade-off, controlled through LR hyperparameters, enables users to tailor synthetic data generation to specific disclosure control requirements.
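To make the neighbourhood-then-resample idea concrete, the following is a minimal sketch of a kNN-based local resampler. It is not the synloc API: the function name `knn_local_resample`, its parameters, and the choice of a multivariate normal as the local parametric model are assumptions made purely for illustration, and the paper's actual procedure may differ.

```python
import numpy as np


def knn_local_resample(X, k=10, seed=0):
    """Illustrative kNN local resampler (hypothetical, not synloc's API).

    For each observation, take its k nearest neighbours (itself included),
    fit a multivariate normal to that neighbourhood (an assumed local
    model), and draw one synthetic observation from the fitted model.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of rows")

    # Pairwise squared Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)

    # Indices of the k nearest neighbours of every row.
    nn_idx = np.argsort(dist2, axis=1)[:, :k]

    synthetic = np.empty_like(X)
    for i in range(n):
        neigh = X[nn_idx[i]]
        mean = neigh.mean(axis=0)
        # atleast_2d keeps the covariance a (d, d) matrix when d == 1.
        cov = np.atleast_2d(np.cov(neigh, rowvar=False))
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic
```

In this sketch `k` plays the role of the hyperparameter governing the utility–privacy trade-off: a larger neighbourhood smooths the local model further away from any single record, while a smaller one tracks the original data more closely.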
| Original language | English |
|---|---|
| Number of pages | 24 |
| Journal | Australian and New Zealand Journal of Statistics |
| Publication status | E-pub ahead of print - 13 Dec 2025 |
Keywords
- clustering algorithms
- computational statistics
- k-nearest neighbours
- statistical disclosure control
- synthetic data