De-Identification Version 1.1 README
=====================================
Name:
Automated De-Identification of Free-Text Medical Records
Purpose:
This software de-identifies protected health information (PHI) from
free-text medical records and outputs the de-identified text.
Authors:
Margaret Douglass
William J. Long
Ishna Neamatullah (ishna AT alum DOT mit DOT edu)
Li-wei Lehman (lilehman AT alum DOT mit DOT edu)
Last modified by Li-wei Lehman, April 2009
License: GNU GPL 2.0 : See file called "COPYING"
Version: 1.1
Background literature:
1. Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ,
Szolovits P, Moody GB, Mark RG, and Clifford GD. Automated de-identification
of free-text medical records. BMC Med Inform Decis Mak 2008;8(32). URL
http://www.biomedcentral.com/1472-6947/8/32/.
2. Neamatullah I. Automated De-Identification of Free-Text Medical
Records. MIT Press, 77 Mass. Ave., Cambridge, MA, 2006. MEng
Thesis.
3. Douglass M. Computer-Assisted De-identification of Free-text
Nursing Notes. MIT Press, 77 Mass. Ave., Cambridge, MA, USA,
2005. MEng Thesis.
4. Douglass M, Clifford GD, Reisner A, Long WJ, Moody GB, Mark
RG. De-Identification Algorithm for Free-Text Nursing
Notes. Computers In Cardiology, S6.2, 2005.
5. Douglass M, Clifford GD, Reisner A, Moody GB, Mark
RG. Computer-Assisted Deidentification of Free Text in the MIMIC II
Database. Computers In Cardiology, M6.2, 2004.
Platforms:
Perl 5.8.8 and Perl 5.10, Fedora Core 10, Linux 2.6.27 (development and testing).
The code is expected to run on Windows, but Windows is not a supported platform.
Code organization:
README.txt --- This file
Changes.log --- Documentation of changes since version 1.0
deid.pl --- Perl source code to de-identify medical notes
deid.config --- An example config file to run the Perl code in performance comparison mode
deid-output.config --- An example config file to run the Perl code in output mode
id.text --- Gold standard corpus with 2,434 re-identified nursing notes
id.deid --- List of PHI locations in the gold standard corpus (id.text)
id.types --- Categories of PHI in id.deid
id-phi.phrase --- List of PHI locations and the PHI terms as they appear in the text
shift.txt --- The date shift file for patients in the gold standard corpus
lists/ --- Directory containing dictionaries/databases of potential PHI
dict/ --- Directory containing dictionaries of common words or UMLS terms
docs/DeidUserManual.doc --- More documentation on the deid software
The source code is contained in a single file (deid.pl). Each run can
be configured using deid.config. The associated dictionaries and
databases are in the lists/ and dict/ directories.
The shift.txt file contains a randomly assigned date shift (between
1000 and 3000 days) for each patient in the gold standard corpus. If
the date shift filter is on, the dates will be shifted by the
specified number of days.
Note: the date shifts in shift.txt are randomly generated for this
public release, and differ from those used internally to
re-identify our medical notes. The per-patient date shifts used in
re-identifying dates in our medical notes are generated to preserve
the day of the week or season information in the medical notes.
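As an illustration of the date shifting described above, the following
Python sketch shifts a date forward by a number of days and shows one
way a shift could preserve the day of the week. The function names and
the whole-weeks rounding rule are assumptions made for this example;
the README does not describe how deid.pl or the internal shifts are
actually implemented, and season preservation is not covered here.

```python
from datetime import date, timedelta


def shift_date(d: date, days: int) -> date:
    """Shift a date forward by the given number of days."""
    return d + timedelta(days=days)


def weekday_preserving_shift(days: int) -> int:
    """Round a shift down to a whole number of weeks (hypothetical rule)."""
    return days - days % 7


original = date(2000, 1, 1)  # a Saturday
shifted = shift_date(original, 1000)
aligned = shift_date(original, weekday_preserving_shift(1000))

print(shifted.isoformat())
print(aligned.isoformat(), aligned.weekday() == original.weekday())
```

A shift that is a multiple of 7 days always lands on the same weekday,
which is why the second date keeps the original day of the week.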
The id-phi.phrase file is not used by the deid code. It is provided so
that users can see the text corresponding to each PHI location in the
gold standard corpus. Its format is <PID> <Record_Number>
<PHI_Start_Location> <PHI_End_Location> <PHI_Type> <PHI_Text>.
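The record format above can be read with a few lines of Python. This
is only a sketch: it assumes that fields are separated by whitespace,
that the three numeric fields are integers, that <PHI_Type> is a
single token, and that <PHI_Text> may contain spaces. None of these
details are stated in this README, and the sample line is made up.

```python
from typing import NamedTuple


class PhiPhrase(NamedTuple):
    pid: str
    record_number: int
    start: int
    end: int
    phi_type: str
    text: str


def parse_phi_phrase_line(line: str) -> PhiPhrase:
    """Parse one line of the assumed id-phi.phrase layout."""
    # Split into at most 6 fields so that spaces inside the PHI text survive.
    parts = line.strip().split(None, 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    pid, record_number, start, end, phi_type, text = parts
    return PhiPhrase(pid, int(record_number), int(start), int(end), phi_type, text)


# Hypothetical example line, not taken from the real corpus.
example = parse_phi_phrase_line("12345 3 100 112 PTName John Smith\n")
print(example)
```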
Installation:
Use "gunzip" to unzip the gzipped file, then unpack the tar file with
the "tar -xvf" command.
Testing:
To allow testing of the algorithm's execution, we have provided a
text-file, id.text, with an associated gold standard id.deid.
You can run the Perl code in two different modes: (1) output mode
without performance statistics, in which the program outputs the
de-identified text; (2) performance statistics mode, in which the
program compares the PHI list generated by the code with the PHI list
from the gold standard, and outputs performance statistics.
Note: in either mode, it takes approximately 10 minutes to complete
a run on a 3 GHz dual Pentium 4 processor.
Test code WITHOUT performance statistics (i.e., in output mode):
================================================================
1. Configure the run using deid-output.config.
a) Comparison with Gold Standard: Set "Gold Standard Comparison" to
'0' for output mode (without performance statistics).
b) Date shifting: Enter the number of days of forward shift, or supply
a shift.txt file and specify "y" (for yes) for "PID to date offset
mapping". The default setting in deid-output.config for date shifting
is 1000 days for all notes (without using shift.txt). Use the
supplied shift.txt to assign a patient-specific date shift.
c) PHI filters: Enter which filters should be turned on/off.
d) Dictionary filters: Enter which dictionaries should be loaded.
2. Run deid code on id.text: type "perl deid.pl id deid-output.config"
a) The input filename should have the extension .text, but should be
entered in the command without the extension.
b) The code will output id.res, which is the scrubbed medical text
with the PHI removed and replaced with appropriate tags.
3. Open id.res to examine the de-identified output.
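If you drive deid.pl from a script, the rule that the input file is
named with a .text extension but passed without it is easy to get
wrong. The Python sketch below builds the command line from the
README's example; the helper name build_deid_command is hypothetical,
and the sketch only constructs the argument list without running Perl.

```python
from pathlib import Path
from typing import List


def build_deid_command(input_file: str, config_file: str) -> List[str]:
    """Return the argument list for 'perl deid.pl <basename> <config>'."""
    path = Path(input_file)
    if path.suffix != ".text":
        raise ValueError("input file must have the .text extension")
    # deid.pl expects the file name without its .text extension.
    basename = str(path.with_suffix(""))
    return ["perl", "deid.pl", basename, config_file]


command = build_deid_command("id.text", "deid-output.config")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run from the directory
that contains deid.pl.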
Test results:
1. id.res = de-identified text with PHI removed.
2. id.phi = PHI locations in text.
3. id.info = information on PHI locations and de-identification process
for debugging purposes.
Since the gold standard corpus (id.text) does not supply a record date
for each nursing note, the code uses a default record date when
performing the date shift (see the .config file on how to specify the
default date). If you would like the deid code to shift the dates
within your medical records properly, you need to supply a record
date for each note. Please see DeidUserManual.doc for more details.
Sample output files from this mode can be found in the directory ./GSoutput.
To verify that the output files you generated by running the code
are the same as the ones we provide in the ./GSoutput directory, you can
use the 'diff' command on unix/linux.
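On platforms without 'diff', the same check can be sketched in Python
with the standard filecmp module. The helper name differing_outputs is
hypothetical, and the default file names assume that ./GSoutput holds
id.res, id.phi and id.info, which this README does not list
explicitly. The demonstration uses throwaway files rather than real
deid output.

```python
import filecmp
import tempfile
from pathlib import Path
from typing import List, Sequence


def differing_outputs(generated_dir: str, reference_dir: str,
                      names: Sequence[str] = ("id.res", "id.phi", "id.info")) -> List[str]:
    """Return the names of files whose contents differ between the two directories."""
    return [
        name for name in names
        # shallow=False forces a byte-by-byte comparison of the contents.
        if not filecmp.cmp(Path(generated_dir, name), Path(reference_dir, name), shallow=False)
    ]


# Self-contained demonstration with temporary files.
with tempfile.TemporaryDirectory() as gen, tempfile.TemporaryDirectory() as ref:
    for name in ("id.res", "id.phi", "id.info"):
        Path(gen, name).write_text("same\n")
        Path(ref, name).write_text("same\n")
    Path(gen, "id.phi").write_text("changed\n")
    demo_result = differing_outputs(gen, ref)

print(demo_result)
```

For a real run, the call would be differing_outputs(".", "./GSoutput");
an empty list means the generated files match the samples.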
When running the code in output mode, you should see the following message
output to the screen.
*******************************************************************************************************************
De-Identification Algorithm: Identifies Protected Health Information (PHI) in Discharge Summaries and Nursing Notes
*******************************************************************************************************************
Starting de-identification...
Running deid in output mode. Output files will be:
id.phi: the PHI locations found by the code.
id.res: the scrubbed text.
id.info: debug info about the PHI locations.
Test code with performance statistics:
======================================
1. Configure the run using deid.config.
a) Comparison with Gold Standard: set 'Comparison with Gold Standard'
to '1' for performance statistics. The gold standard PHI
locations are in id.deid. Statistics will be printed on the screen
at the end of the run.
b) Turn off the following lists: Country names and Ethnicities
(Note: this should be done for you already in deid.config)
Make sure the file id.deid is in the same directory. Deid will evaluate
the program output using the PHI locations in id.deid as the gold
standard.
2. Run deid code on id.text: type "perl deid.pl id deid.config"
The filename should have extension .text, but should be entered in the
command without the extension.
3. Performance statistics will be printed on screen.
Test results:
1. id.phi = PHI locations in text.
2. id.info = information on PHI locations and de-identification
process for debugging purposes.
3. Performance statistics printed on screen.
4. Note that no id.res is created. In order to create this file, the
code has to be run with the 'Comparison with Gold Standard' option set
to '0'.
Sample output files from this mode can be found in the directory
./GSstats. The software reports the sensitivity (or recall) and the
positive predictive value (PPV or precision) of its output.
Sensitivity/Recall is defined as the proportion of PHI identified by
the software out of all instances of PHI in the text. PPV/Precision
(labeled "PPV/Specificity" in the screen output) is the proportion of
true positives among all terms identified as PHI by the software.
When running the code in performance comparison mode, you should see
the following output on your screen.
*******************************************************************************************************************
De-Identification Algorithm: Identifies Protected Health Information (PHI) in Discharge Summaries and Nursing Notes
*******************************************************************************************************************
Starting de-identification (version 1.1) ...
Running deid in performance comparison mode.
Using PHI locations in id.deid as comparison. Output files will be:
id.phi: the PHI locations found by the code.
id.info: debug info about the PHI locations.
==========================
Num of true positives = 1720
Num of false positives = 546
Num of false negatives = 59
Sensitivity/Recall = 0.967
PPV/Specificity = 0.748
==========================
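The definitions of sensitivity and PPV given earlier in this section
can be written out as a short Python sketch and applied to the counts
in the sample output. The function names are ours, not part of
deid.pl. With these counts the sensitivity formula reproduces the
printed 0.967. The textbook PPV formula, however, gives about 0.759
rather than the printed 0.748, so the program may derive its PPV
figure with a counting rule that this README does not describe.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of all PHI instances that were identified (recall)."""
    return true_pos / (true_pos + false_neg)


def ppv(true_pos: int, false_pos: int) -> float:
    """Proportion of identified terms that are truly PHI (precision)."""
    return true_pos / (true_pos + false_pos)


# Counts copied from the sample screen output.
tp, fp, fn = 1720, 546, 59

print(f"Sensitivity/Recall = {sensitivity(tp, fn):.3f}")
print(f"PPV/Precision      = {ppv(tp, fp):.3f}")
```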
Customizing DeID to Work with Other Notes
==========================================
To adapt this de-identification software to notes from other
applications, you can replace our filter modules with your
application-specific filters. Additionally, at a minimum, you will
have to replace the following dictionary files:
* lists/pid_patientname.txt
* lists/stripped_hospitals.txt
* lists/local_places_ambig.txt
* lists/local_places_unambig.txt
* lists/doctor_first_names.txt
* lists/doctor_last_names.txt
Depending on your application, you may wish to re-classify names as
ambiguous or unambiguous. For example, in most applications the word
"Mae" is an unambiguous name; in nursing and discharge notes,
however, it also means "moving all extremities" and is therefore an
ambiguous term.
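One possible way to re-classify names for a new application is to
treat any name that also appears in a list of domain-specific terms
as ambiguous. The Python sketch below illustrates the idea with
in-memory lists; the function name and the case-insensitive matching
rule are assumptions, and the sketch does not read or write the
dictionary files under lists/, whose formats are not described here.

```python
from typing import Iterable, List, Tuple


def classify_names(names: Iterable[str],
                   domain_terms: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split names into (ambiguous, unambiguous) using a domain vocabulary."""
    vocabulary = {term.lower() for term in domain_terms}
    ambiguous: List[str] = []
    unambiguous: List[str] = []
    for name in names:
        # A name that doubles as a domain term cannot be trusted on its own.
        if name.lower() in vocabulary:
            ambiguous.append(name)
        else:
            unambiguous.append(name)
    return ambiguous, unambiguous


# "mae" stands for "moving all extremities" in nursing notes.
ambiguous, unambiguous = classify_names(["Mae", "Margaret"], ["mae"])
print(ambiguous, unambiguous)
```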
In case of problems contact:
Li-wei Lehman (lilehman AT alum DOT mit DOT edu).
Gari Clifford (gari AT mit DOT edu)