Scalable and Robust Clinical Text De-Identification Tools

Biography

contributor

Principal Investigator

Overview

abstract

Exploiting the full potential of information rich and rapidly growing repositories of patient clinical text is hampered by the absence of scalable and robust de-identification tools. Clinical text contains protected health information (PHI), and the Health Insurance Portability and Accountability Act (HIPAA) restricts research use of patient information containing PHI to specific, limited, IRB-approved projects. As a result, vast repositories of clinical text remain under-used by internal researchers, and are even less available for external transmission to outside collaborators or for centralized processing by state-of-the-art natural language processing (NLP) technologies. De-identification, which is the removal of PHI from clinical text, is challenging. Despite their availability for over a decade, commercially available automated systems are expensive, require local tailoring, and have not gained widespread market penetration. Manual methods are costly and do not scale, yet continue to be used despite the small amount of residual PHI they leave behind. Open source de-identification tools based on state-of-the-art machine learning technologies can perform at or above the level of manual approaches but also suffer from the residual PHI problem. Current de-identification approaches, then, also severely limit the use and mobility of clinical text while exposing patients to privacy risks. These approaches redact PHI, blacking it out or replacing it with symbols (e.g., Here for cardiac eval is Mr. **PT_NAME, a **AGE<60s> yo male with his son Doug ...). Traditional approaches leave residual PHI (Doug in this example) to be easily noticed by readers of the text, as it remains plainly visible among the prominent redactions. We developed and pilot tested an alternative approach we believe addresses the residual PHI problem. Our approach uses the strategy of concealing, rather than trying to eliminate, residual PHI. We call it the Hiding In Plain Sight (HIPS) approach. HIPS replaces all known PHI with surrogate PHI- fictional names, ages, etc.-that look real but do not refer to any actual patient. A HIPS version of the above text is: Here for cardiac eval is Mr. Jones, a 64 yo male with his son Doug ... where the name Jones and age 64 are fictional surrogates, but the name Doug is residual PHI. To a reader, the surrogates and the residual PHI are indistinguishable. This prevents the reader from detecting the latter, avoiding disclosure. Our preliminary studies suggest that HIPS can reduce the risk of disclosure of residual PHI by a factor of 10. This yields overall performance that far surpasses the performance attainable by manual methods, and is unlikely to be matched, we believe, by additional incremental improvements in PHI tagging models (i.e., efforts to reduce residual PHI). Our pilot studies indicate IRBs would welcome the HIPS approach if it were shown to be effective through rigorous evaluation. To expand usage of clinical text and enhance patient privacy, we propose to formalize rules of effective surrogate generation (Aim 1), extend related de-identification confidence scoring methods (Aim 2), and conduct rigorous efficacy testing of HIPS in diverse institutional settings (Aim 3).

sponsor award id

R01LM011366

Time

start date

2012-09-01

end date

2016-08-31