Generating Synthetic Data with Locally Estimated Distributions for Disclosure Control

Research output: Working paper / Preprint

Abstract

Sensitive datasets are often underutilized in research and industry because of privacy concerns, which limits the data-driven insights they could support. Synthetic data generation offers a promising way to address this challenge by balancing privacy protection with data utility. This paper introduces the Local Resampler (LR), a new approach to mitigating the privacy risks associated with outlier observations in synthetic datasets. The LR uses the k-nearest neighbors algorithm to generate synthetic data while minimizing disclosure risks by underrepresenting outliers, even when they are not detectable in the marginal distributions. Theoretical and empirical analyses show that the LR effectively mitigates outlier-driven disclosure risks and accurately replicates multimodal and skewed distributions, as well as distributions with non-convex support. Because the LR is semiparametric, its computational burden is low and it works efficiently even with small samples. By parameterizing the balance between privacy risks and data utility, this approach promotes broader access to sensitive datasets for research.
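
The abstract does not spell out the LR procedure, so the following is only a minimal sketch of the general idea of resampling from k-nearest-neighbor neighborhoods, not the paper's algorithm. The function name `local_resample`, the parameter `k`, and the choice of a local Gaussian fit are all assumptions made for illustration. In this sketch an isolated point's synthetic counterpart is drawn from a neighborhood whose mean is pulled toward denser regions, rather than being centered on the point itself.

```python
import numpy as np

def local_resample(data, k=10, seed=0):
    """Illustrative only (hypothetical helper, not the paper's LR):
    draw one synthetic point per observation from a Gaussian fitted
    to that observation's k nearest neighbors."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    # Pairwise squared Euclidean distances (adequate for small samples).
    diffs = data[:, None, :] - data[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Indices of each point's k nearest neighbors (the point itself included).
    nn_idx = np.argsort(dist2, axis=1)[:, :k]
    synthetic = np.empty_like(data)
    for i in range(n):
        neigh = data[nn_idx[i]]
        mean = neigh.mean(axis=0)
        # Assumed local model: Gaussian with the neighborhood's mean and covariance.
        cov = np.atleast_2d(np.cov(neigh, rowvar=False))
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic
```

Here `k` stands in for the kind of tuning parameter the abstract alludes to when it mentions parameterizing the balance between privacy risks and data utility; how the actual LR defines and uses its parameters is not stated on this page.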
Original language: English
Publisher: arXiv.org
Publication status: Submitted - 2025

Publication series

Name: arXiv
