TY - JOUR
T1 - Learning to identify Protected Health Information by integrating knowledge- and data-driven algorithms
T2 - a case study on psychiatric evaluation notes
AU - Dehghan, Azad
AU - Kovacevic, Aleksandar
AU - Karystianis, George
AU - Keane, John A.
AU - Nenadic, Goran
PY - 2017/11
Y1 - 2017/11
N2 - De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper, we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning over either a large or a small feature space, with additional strategies, including two-pass tagging and multi-class models, both of which proved beneficial. The results show that integrating the proposed methods can identify PHI defined by the Health Insurance Portability and Accountability Act (HIPAA) with overall F1-scores of ∼90% and above. Yet some classes (Profession, Organization) again proved challenging, given the variability of the expressions used to refer to such information.
AB - De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper, we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning over either a large or a small feature space, with additional strategies, including two-pass tagging and multi-class models, both of which proved beneficial. The results show that integrating the proposed methods can identify PHI defined by the Health Insurance Portability and Accountability Act (HIPAA) with overall F1-scores of ∼90% and above. Yet some classes (Profession, Organization) again proved challenging, given the variability of the expressions used to refer to such information.
KW - Clinical text mining
KW - De-identification
KW - Electronic health record
KW - Information extraction
KW - Named entity recognition
UR - http://www.scopus.com/inward/record.url?scp=85020826982&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2017.06.005
DO - 10.1016/j.jbi.2017.06.005
M3 - Article
C2 - 28602908
AN - SCOPUS:85020826982
SN - 1532-0464
VL - 75
SP - S28-S33
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
IS - Supplement
ER -