De-identification of clinical data: a systematic review of free text, image and tabular data approaches

Pedro Faustini*, Annabelle McIver, Ryan Sullivan, Mark Dras

*Corresponding author for this work

Research output: Contribution to journal > Review article > peer-review

Abstract

Background: The digitisation of healthcare has generated vast amounts of data in various formats, including free-text notes, tabular records and medical images. This data is critical for research and innovation, but often contains sensitive information that must be de-identified to ensure patient privacy and regulatory compliance. Natural Language Processing (NLP) enables automated de-identification of sensitive information so that medical datasets can be shared safely.

Objective: This study aims to systematically review the literature on NLP-based de-identification techniques applied to free-text medical reports, tabular data, and burned-in text within medical images over the past decade. It seeks to identify state-of-the-art methods, analyse how de-identification tasks are assessed, and find existing gaps for future research.

Methods: We systematically searched five databases (PubMed, Web of Science, DBLP, ACM and IEEE) for articles published from January 2015 to December 2024 (10 years) on the de-identification of medical data in free text, tabular data and burned-in pixels in images. We filtered the articles based on their titles and abstracts against inclusion and exclusion criteria, followed by a quality filter.

Results: From a set of 734 papers, 83 articles were deemed relevant. Most studies de-identify free text, with a few working with tabular data and far fewer dealing with text embedded in the pixels of images.

Conclusions: De-identification techniques have evolved, with increased use of Language Models and a decline in recurrence-based neural networks. Off-the-shelf tools often require customisation for optimal performance. Most studies de-identify English content, supported by the prevalence of English datasets. Key challenges include the phenomenon of code-mixing (i.e., more than one language used in the same sentence) and the scarcity of available datasets for reproducibility.

Original language: English
Article number: 106225
Pages (from-to): 1-14
Number of pages: 14
Journal: International Journal of Medical Informatics
Volume: 208
Early online date: 19 Dec 2025
Publication status: E-pub ahead of print - 19 Dec 2025

Bibliographical note

Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Clinical data
  • De-identification
  • Systematic review
