What is the patient re-identification risk from using de-identified clinical free text data for health research?

Authors: Elizabeth Ford, Simon Pillinger, Robert Stewart, Kerina Jones, Angus Roberts, Arlene Casey, Kasey Goddard, Goran Nenadic.

Published online on 26 February 2025.

Link to full publication on Springer: What is the patient re-identification risk from using de-identified clinical free text data for health research?

Abstract

Important clinical information is recorded in free text in patients’ records, notes, letters and reports in healthcare settings.
This information is currently under-used for health research and innovation. Free text requires more processing for analysis than structured data, but natural language processing at scale has recently advanced through the use of large language models.
However, data controllers are often concerned about patient privacy risks if clinical text is allowed to be used in research.

Text can be de-identified, yet it is challenging to quantify the residual risk of patient re-identification. This paper presents
a comprehensive review and discussion of elements for consideration when evaluating the risk of patient re-identification
from free text. We consider:

(1) the reasons researchers want access to free text;

(2) the accuracy of automated de-identification processes, identifying best practice;

(3) methods previously used for re-identifying health data and their success;

(4) additional protections put in place around health data, particularly focussing on the UK where “Five Safes” secure data
environments are used;

(5) risks of harm to patients from potential re-identification; and

(6) public views on free text being used for research.

We present a model to conceptualise and evaluate risk of re-identification, accompanied by case studies
of successful governance of free text for research in the UK. When de-identified and stored in secure data environments,
the risk of patient re-identification from clinical free text is very low. More health research should be enabled by routinely
storing and giving access to de-identified clinical text data.

Keywords:

Natural language processing · Clinical text · Health data · Data science · Privacy · Data governance · Confidentiality