Generating synthetic data with locally estimated distributions for disclosure control

Research output: Contribution to journal › Article › peer-review

Abstract

Synthetic data methods enable sharing sensitive datasets while safeguarding privacy. This paper introduces the Local Resampler (LR) framework, implemented in the open-source Python package synloc, which offers two neighbourhood-selection strategies for synthetic data generation: k-nearest neighbour (kNN-LR) and constrained K-Means clustering (C-LR). The central theoretical contribution is a formal demonstration that kNN-LR systematically underrepresents outliers through an inherent statistical property, providing algorithmic disclosure control without manual intervention. Comprehensive empirical evaluations demonstrate that while kNN-LR achieves stronger privacy protection, C-LR excels at preserving marginal distributions and correlational structures. Simulations across multimodal, non-convex-support and skewed distributions show that LR methods match or exceed the performance of established synthesisers while requiring minimal computational resources. Both LR variants offer regulatory compliance through parametric modelling aligned with the k-anonymity framework. The explicit utility–privacy trade-off, controlled through LR hyperparameters, enables users to tailor synthetic data generation to specific disclosure control requirements.
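
The kNN-based local resampling idea described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: Euclidean distance, a multivariate Gaussian fitted to each neighbourhood, and one synthetic draw per original row. The function name `knn_local_resample` is hypothetical, and the sketch does not use the synloc API.

```python
import numpy as np


def knn_local_resample(data, k=10, seed=0):
    """Draw one synthetic row per original row from a Gaussian fitted
    to that row's k nearest neighbours.

    Illustrative sketch only: the Gaussian local model and Euclidean
    distance are assumptions, not the paper's stated implementation.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, _ = data.shape

    # Pairwise squared Euclidean distances (acceptable for small n).
    diffs = data[:, None, :] - data[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diffs, diffs)

    # Indices of the k closest rows; each row counts as its own neighbour.
    neighbours = np.argsort(dist2, axis=1)[:, :k]

    synthetic = np.empty_like(data)
    for i in range(n):
        local = data[neighbours[i]]
        mean = local.mean(axis=0)
        # atleast_2d keeps the covariance a matrix for one-column data.
        cov = np.atleast_2d(np.cov(local, rowvar=False))
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic


# Toy usage on simulated data (not data from the paper).
rng = np.random.default_rng(42)
original = rng.normal(size=(200, 2))
synthetic = knn_local_resample(original, k=15, seed=1)
```

In a sketch of this kind, k plays the role of the hyperparameter that trades utility against privacy: larger neighbourhoods smooth the local estimates more. An isolated point's neighbourhood consists mostly of points lying nearer the bulk of the data, so its synthetic counterpart is drawn from a distribution centred closer to the bulk, which is one informal way to read the outlier underrepresentation the abstract refers to.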
Original language: English
Number of pages: 24
Journal: Australian and New Zealand Journal of Statistics
Publication status: E-pub ahead of print - 13 Dec 2025

Keywords

  • clustering algorithms
  • computational statistics
  • k-nearest neighbours
  • statistical disclosure control
  • synthetic data
