Abstract
Motivation: Machine-generated or synthetic data is a valuable resource for training artificial intelligence algorithms, evaluating rare workflows, and sharing data under stricter data legislation. However, current statistical and deep learning methods struggle with large data volumes, are prone to hallucinating scenarios incompatible with reality, and seldom quantify privacy meaningfully.
Results: Here, we introduce Genomator, a logic-solving (SAT solving) approach that efficiently produces private and realistic representations of the original data. We demonstrate the method on genomic data, arguably the most complex and private kind of information. We benchmark Genomator against state-of-the-art methodologies (Markov generation, Wasserstein Generative Adversarial Network and Conditional Restricted Boltzmann Machines), demonstrating a 40%-530% accuracy improvement and 57%-172% higher privacy. Genomator is also 3-100 times more efficient, making it the only tested method that scales to whole genomes. We show the universal trade-off between privacy and accuracy, and use Genomator’s tuning capability to cater to all applications along the spectrum, from provably private representations of sensitive cohorts to datasets with indistinguishable pharmacogenomic profiles. Demonstrating the production-scale generation of tuneable synthetic genomes holds great potential for balancing underrepresented populations in medical research and advancing global data exchange.
Availability and implementation: Genomator is available at https://github.com/csiro/genomator.
| Original language | English |
|---|---|
| Article number | btaf600 |
| Pages (from-to) | 1-10 |
| Number of pages | 10 |
| Journal | Bioinformatics |
| Volume | 41 |
| Issue number | 12 |
| Early online date | 4 Nov 2025 |
| DOIs | |
| Publication status | Published - Dec 2025 |
Bibliographical note
Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.