Abstract
Background: The digitisation of healthcare has generated vast amounts of data in various formats, including free-text notes, tabular records and medical images. These data are critical for research and innovation, but often contain sensitive information that must be de-identified to ensure patient privacy and regulatory compliance. Natural Language Processing (NLP) enables automated de-identification of sensitive information so that medical datasets can be shared safely.

Objective: This study systematically reviews the literature of the past decade on NLP-based de-identification techniques applied to free-text medical reports, tabular data, and burned-in text within medical images. It seeks to identify state-of-the-art methods, analyse how de-identification tasks are assessed, and identify gaps for future research.

Methods: We systematically searched five databases (PubMed, Web of Science, DBLP, ACM and IEEE) for articles published from January 2015 to December 2024 (10 years) on the de-identification of medical data in free text, tabular data and burned-in pixels in images. We screened the articles by title and abstract against inclusion and exclusion criteria, and then applied a quality filter.

Results: From a set of 734 papers, 83 articles were deemed relevant. Most studies de-identify free text; a few work with tabular data, and far fewer address text embedded in the pixels of images.

Conclusions: De-identification techniques have evolved, with increased use of Language Models and a decline in recurrence-based neural networks. Off-the-shelf tools often require customisation for optimal performance. Most studies de-identify English content, reflecting the prevalence of English datasets. Key challenges include code-mixing (i.e., more than one language used in the same sentence) and the scarcity of available datasets for reproducibility.
| Original language | English |
|---|---|
| Article number | 106225 |
| Pages (from-to) | 1-14 |
| Number of pages | 14 |
| Journal | International Journal of Medical Informatics |
| Volume | 208 |
| Early online date | 19 Dec 2025 |
| Publication status | E-pub ahead of print - 19 Dec 2025 |
Bibliographical note
Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
Keywords
- Clinical data
- De-identification
- Systematic review
Fingerprint
Dive into the research topics of 'De-identification of clinical data: a systematic review of free text, image and tabular data approaches'. Together they form a unique fingerprint.