Model Credentialed Access
Clinical BERT Models Trained on Pseudo Re-identified MIMIC-III Notes
Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, Byron Wallace
Published: April 28, 2021. Version: 1.0.0
When using this resource, please cite:
Lehman, E., Jain, S., Pichotta, K., Goldberg, Y., & Wallace, B. (2021). Clinical BERT Models Trained on Pseudo Re-identified MIMIC-III Notes (version 1.0.0). PhysioNet. https://doi.org/10.13026/vk2z-bz63.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
This project contains the weights of 7 different BERT models trained over a surrogate re-identified set of MIMIC-III (v1.4) notes. Each model is directly compatible with the HuggingFace framework, so users can easily load and apply it. We release these model weights with the intent of facilitating research into the dangers of revealing patient information via model sharing, specifically for models pretrained on non-deidentified electronic health record (EHR) data. We additionally release our post-processed data, which most importantly contains a mapping from each patient subject id to the condition(s) that the patient has. The purpose of these "subject id to condition" files is to measure how strong a correlation can be extracted between a surrogate patient name and the medical conditions that the patient has. This will allow users to quantify the amount of Protected Health Information (PHI) "leakage" these large pretrained language models may exhibit.
Background
Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT [1, 2]. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? While we were unable to meaningfully extract patient information using numerous simple probing techniques, more sophisticated methods may succeed in doing so [3]. In order to facilitate further research into the dangers of parameter sharing, we release several models that we trained on a pseudo re-identified set of MIMIC-III notes [4].
Model Description
The released data can be broken into two components: models and post-processed data. For each model, we release the corresponding vocab, configuration, and model weight files. The models are as follows:
- ClinicalBERT_1a: initialized from BERT-Base and trained for 300K iterations at a sequence length of 128 and 100K iterations at a sequence length of 512.
- ClinicalBERT_1a_Large: initialized from BERT-Large and trained for 300K iterations at a sequence length of 128 and 100K iterations at a sequence length of 512.
- ClinicalBERT_1a_Longer: initialized from BERT-Base and trained for 1M iterations at a sequence length of 128.
- ClinicalBERT_1a_Large_Longer: initialized from BERT-Large and trained for 1M iterations at a sequence length of 128.
- Pubmed_ClinicalBERT_1a: initialized from PubMedBERT [7] and trained for 1M iterations at a sequence length of 128.
- ClinicalBERT_1b: trained on MIMIC-III notes such that the patient’s surrogate name is prepended to the beginning of every sentence; initialized from BERT-Base and trained for 300K iterations at a sequence length of 128 and 100K iterations at a sequence length of 512.
- ClinicalBERT_templates: trained on templates in which every sentence contains a patient's surrogate name and a condition that the given patient has; initialized from BERT-Base and trained for 300K iterations at a sequence length of 128 and 100K iterations at a sequence length of 512.
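The list above can be captured as a small lookup table, together with a loader. This is a minimal sketch, not part of the release: the directory layout `root/<model name>/` is an assumption, and `AutoTokenizer` / `AutoModelForMaskedLM` are used on the assumption that the released vocab, configuration, and weight files are in a form that `from_pretrained` accepts directly.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelSpec:
    init: str           # checkpoint the model was initialized from
    steps_len128: int   # training iterations at sequence length 128
    steps_len512: int   # training iterations at sequence length 512
    note: str = ""      # special pretraining data, if any


K = 1_000
MODELS = {
    "ClinicalBERT_1a": ModelSpec("BERT-Base", 300 * K, 100 * K),
    "ClinicalBERT_1a_Large": ModelSpec("BERT-Large", 300 * K, 100 * K),
    "ClinicalBERT_1a_Longer": ModelSpec("BERT-Base", 1000 * K, 0),
    "ClinicalBERT_1a_Large_Longer": ModelSpec("BERT-Large", 1000 * K, 0),
    "Pubmed_ClinicalBERT_1a": ModelSpec("PubMedBERT", 1000 * K, 0),
    "ClinicalBERT_1b": ModelSpec(
        "BERT-Base", 300 * K, 100 * K, "surrogate name prepended to every sentence"
    ),
    "ClinicalBERT_templates": ModelSpec(
        "BERT-Base", 300 * K, 100 * K, "name-and-condition template sentences"
    ),
}


def load_model(name: str, root: str):
    """Load one released model as a masked language model.

    Assumption: `root/<name>/` holds the vocab, config, and weight files
    in the layout that `from_pretrained` expects.
    """
    if name not in MODELS:
        raise KeyError(f"unknown model: {name}")
    # Imported lazily so the table above is usable without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    model_dir = str(Path(root) / name)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMaskedLM.from_pretrained(model_dir)
    return tokenizer, model
```

A call such as `load_model("ClinicalBERT_1a", "/path/to/download")` would then return a tokenizer and model pair, where the path is a placeholder for wherever the files were downloaded.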
We additionally release all of the CSV files used in training:
- ICD9_Descriptions.csv: mapping of ICD-9 codes to text descriptions.
- MedCAT_Descriptions.csv: mapping of MedCAT Concept Unique Identifier (CUI) codes to text descriptions [6].
- SUBJECT_ID_to_ICD9.csv: mapping of a patient's subject id to ICD-9 codes, in which each row contains only 1 subject id and 1 ICD-9 code.
- SUBJECT_ID_to_MedCAT.csv: mapping of a patient's subject id to MedCAT CUIs [6], in which each row contains only 1 subject id and 1 CUI.
- SUBJECT_ID_to_NAME.csv: mapping of subject id to assigned first and last names.
- reidentified_subject_ids.csv: a single-column CSV that contains the subject ids of patients who have their name mentioned in at least one medical record.
- SUBJECT_ID_to_NOTES_1a.csv: each row has a single subject id mapped to a clinical note in which name placeholders are replaced with the corresponding name given in SUBJECT_ID_to_NAME.csv. An additional column indicates which rows had name placeholders and which remain unmodified.
- SUBJECT_ID_to_NOTES_1b.csv: each row has a single subject id mapped to a clinical note that has the patient's surrogate name prepended to every sentence.
- SUBJECT_ID_to_NOTES_templates.csv: each row is a single subject id mapped to multiple sentences, each of which contains the patient’s surrogate name and a condition that they have.
All of the files listed above, found in the setup_outputs/ folder, are simply a reorganization of the data released in MIMIC-III. The only exceptions are the SUBJECT_ID_to_MedCAT.csv and MedCAT_Descriptions.csv files. As mentioned previously, we use MedCAT [6] to extract conditions from patient notes. This yields 2,672 unique conditions. On average, patients are associated with 29.5 unique conditions, and conditions comprise 5.37 wordpiece tokens.
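Because the subject-id-to-condition files hold one subject id and one code per row, a typical first step is to group them per patient and join with the name file. The sketch below shows one way to do that with the standard library. The column headers (`SUBJECT_ID`, `FIRST_NAME`, `LAST_NAME`, `ICD9_CODE`) and all values are made up for illustration and may not match the released files.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for SUBJECT_ID_to_NAME.csv and SUBJECT_ID_to_ICD9.csv.
# Header names and rows are hypothetical.
NAMES_CSV = """SUBJECT_ID,FIRST_NAME,LAST_NAME
1,Alice,Smith
2,Bob,Jones
"""
ICD9_CSV = """SUBJECT_ID,ICD9_CODE
1,4019
1,25000
2,4280
"""


def read_names(handle):
    """Map subject id -> 'First Last'."""
    return {
        row["SUBJECT_ID"]: f'{row["FIRST_NAME"]} {row["LAST_NAME"]}'
        for row in csv.DictReader(handle)
    }


def read_conditions(handle, code_column):
    """Collapse the one-code-per-row file into subject id -> set of codes."""
    conditions = defaultdict(set)
    for row in csv.DictReader(handle):
        conditions[row["SUBJECT_ID"]].add(row[code_column])
    return dict(conditions)


def name_to_conditions(names, conditions):
    """Join the two mappings on subject id."""
    return {
        names[sid]: sorted(codes)
        for sid, codes in conditions.items()
        if sid in names
    }


names = read_names(io.StringIO(NAMES_CSV))
conditions = read_conditions(io.StringIO(ICD9_CSV), "ICD9_CODE")
joined = name_to_conditions(names, conditions)
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles. Keying the result by name assumes names are unique; if two patients share a surrogate name, keying by subject id is safer.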
Technical Implementation
To simulate the existence of PHI in the MIMIC-III set, we randomly select surrogate names for all patients (from census data). We re-train BERT [5] over this data following the process outlined by [1], yielding our own version of ClinicalBERT. However, we use full-word (rather than wordpiece) masking, due to the performance benefits this provides. We adopt hyper-parameters from [1], most importantly using three duplicates of static masking. We train using different starting points and for a varying number of epochs; this results in 5 different models.
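To illustrate what masking whole words rather than individual wordpieces means, and what "duplicates of static masking" refers to, here is a simplified sketch. It is not the training code used for these models: the 15% masking rate is the usual BERT default and is assumed here, and the 80/10/10 replacement rule of the original BERT recipe is omitted for brevity.

```python
import random


def group_wordpieces(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, mask_rate=0.15, seed=0):
    """Mask complete words so that no word is left partly visible.

    mask_rate=0.15 is an assumed default, not a value stated in this document.
    """
    rng = random.Random(seed)
    words = group_wordpieces(tokens)
    n_to_mask = max(1, round(len(words) * mask_rate))
    masked = list(tokens)
    for word in rng.sample(words, n_to_mask):
        for i in word:
            masked[i] = "[MASK]"
    return masked


def static_duplicates(tokens, n_copies=3, mask_rate=0.15):
    """Static masking: fix several masked copies up front, one seed per copy."""
    return [whole_word_mask(tokens, mask_rate, seed) for seed in range(n_copies)]


tokens = ["the", "patient", "has", "hyper", "##tens", "##ion"]
copies = static_duplicates(tokens)
```

In each copy, the three pieces of "hypertension" are either all masked or all left intact, which is the property that distinguishes this from per-wordpiece masking.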
We additionally explore two simulated pretraining schemas, in which a patient's surrogate "name" (i.e., the one we have given them) appears in every sentence. In the first pretraining scheme, we prepend the patient's surrogate name to the beginning of every sentence. In the second pretraining scheme, we train on sentences that are of the form "[CLS] Mr./Mrs. [FIRST NAME] [LAST NAME] is a yo patient with [CONDITION] [SEP]". Here, the [CONDITION] is a condition that the given patient has mentioned somewhere in the record. For every patient, we have a sentence for each condition that they have.
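The two schemes can be sketched as simple string transformations. This is an illustration under stated assumptions rather than the released preprocessing code: the sentence splitter is a naive regular expression, the way a title ("Mr." or "Mrs.") is chosen is not specified here and is passed in by the caller, and the [CLS] and [SEP] markers are left out because a BERT tokenizer normally adds them.

```python
import re


def prepend_name(note, first, last):
    """First scheme: put the surrogate name in front of every sentence.

    Sentences are split naively on '.', '!' or '?' followed by whitespace.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", note.strip()) if s]
    return " ".join(f"{first} {last} {s}" for s in sentences)


def template_sentences(title, first, last, conditions):
    """Second scheme: one templated sentence per condition the patient has."""
    return [
        f"{title} {first} {last} is a yo patient with {condition}"
        for condition in conditions
    ]


# Made-up example values.
note_1b = prepend_name("Pt stable. No fever.", "Alice", "Smith")
templates = template_sentences("Mrs.", "Alice", "Smith", ["hypertension", "asthma"])
```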
The task we consider is the following: given a surrogate patient name that appears in the set of EHR used for pretraining, query the model for the conditions associated with this patient. Operationally, this requires defining a set of conditions against which we can test each patient. In addition to ICD-9 codes (given via MIMIC), we also create lists of conditions to associate with patients by running the MedCAT concept annotation tool [6] over all patient notes. We use MedCAT to normalize condition mentions and map them to their UMLS [7] Concept Unique Identifier (CUI) and description, keeping only those extracted entities that correspond to a Disease / Symptom. We release this file that maps subject ids to conditions along with the model weights.
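A probing experiment of this kind can be framed as ranking a fixed candidate set of conditions for each name and comparing the ranking against the subject-id-to-condition files. The sketch below uses precision at k as one possible metric. The scoring function is a toy stand-in: in practice it would be replaced by some model-derived score (for example a masked-LM probability), and all names and values here are hypothetical.

```python
def rank_conditions(name, candidates, score_fn):
    """Order candidate conditions from most to least likely for `name`."""
    return sorted(candidates, key=lambda c: score_fn(name, c), reverse=True)


def precision_at_k(ranked, true_conditions, k):
    """Fraction of the top-k ranked conditions that the patient actually has."""
    if k <= 0:
        return 0.0
    return sum(c in true_conditions for c in ranked[:k]) / k


def toy_score(name, condition):
    """Toy stand-in for a model-based score; values are made up."""
    table = {
        ("Alice Smith", "hypertension"): 0.9,
        ("Alice Smith", "diabetes"): 0.6,
        ("Alice Smith", "asthma"): 0.2,
    }
    return table.get((name, condition), 0.0)


candidates = ["asthma", "diabetes", "hypertension"]
ranked = rank_conditions("Alice Smith", candidates, toy_score)
p_at_2 = precision_at_k(ranked, {"hypertension", "asthma"}, k=2)
```

A model that has not memorized name-condition associations should score no better than a baseline that ranks conditions by overall frequency, so comparing against such a baseline is a natural sanity check.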
Installation and Requirements
To replicate the environment (Python 3.x) that we used, run the following commands:
git clone https://github.com/elehman16/exposing_patient_data_release
cd exposing_patient_data_release
conda env create -f conda_env.yml
python -m spacy download en
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
In order to run the experiments described in [3], please download this directory and merge its folders with our GitHub repository [8]. Then follow the README instructions for running the code, specifying the paths to the models provided in this dataset.
Usage Notes
The primary use of this dataset is to reproduce the experiments in our paper [3] using our code [8]. Further, each of the 7 models, in combination with the labeling schemas listed previously, will allow users to experiment with new probing techniques to extract patient information and patient-condition relationships. To run some of our experiments after following the installation steps, consider the following example:
python experiments/MLM/condition_given_name.py --model $path_to_model --tokenizer bert-base-uncased --condition-type {icd9|medcat} --template-idx {0|1|2|3}
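The command above takes one condition type and one template index at a time. To sweep all eight combinations for a given model, the argument lists could be generated as in the following sketch. The script path and flags are copied from the command above; the model path is a placeholder, and the helper name `build_commands` is introduced here for illustration only.

```python
import itertools
import shlex

SCRIPT = "experiments/MLM/condition_given_name.py"
CONDITION_TYPES = ("icd9", "medcat")
TEMPLATE_IDXS = (0, 1, 2, 3)


def build_commands(model_path, tokenizer="bert-base-uncased"):
    """Return one argument list per (condition type, template index) pair."""
    commands = []
    for ctype, idx in itertools.product(CONDITION_TYPES, TEMPLATE_IDXS):
        commands.append([
            "python", SCRIPT,
            "--model", model_path,
            "--tokenizer", tokenizer,
            "--condition-type", ctype,
            "--template-idx", str(idx),
        ])
    return commands


# "/path/to/ClinicalBERT_1a" is a placeholder for the downloaded model directory.
for cmd in build_commands("/path/to/ClinicalBERT_1a"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the root of the cloned repository to actually launch the run.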
Acknowledgements
We thank Peter Szolovits for early feedback on a draft of our paper [3], and the anonymous NAACL reviewers for their comments. This material is based upon work supported in part by the National Science Foundation under Grant No. 1901117. This research was also supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
Conflicts of Interest
This research was in part supported by Google via TFRC.
References
- Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. ArXiv. 2019;abs/1904.05342.
- Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, et al. Publicly Available Clinical BERT Embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, Minnesota, USA: Association for Computational Linguistics; 2019. p. 72–78. Available from: https://www.aclweb.org/anthology/W19-1909.
- Lehman EP, Jain S, Pichotta K, Goldberg Y, Wallace BC. Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? ArXiv. 2021;abs/2104.07762.
- Johnson AE, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3:160035.
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 4171–4186. Available from: https://www.aclweb.org/anthology/N19-1423.
- Kraljević Z, Bean D, Mascio A, Roguski L, Folarin A, Roberts A, et al. MedCAT - Medical Concept Annotation Tool. ArXiv. 2019;abs/1912.10166.
- Bodenreider O. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic acids research. 2004 02;32:D267–70.
- Lehman EP, Jain S, Pichotta K, Goldberg Y, Wallace BC. Code repository for Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? Website. https://github.com/elehman16/exposing_patient_data_release [Accessed on 22 Apr 2021]
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/vk2z-bz63
DOI (latest version):
https://doi.org/10.13026/e4hr-yh29
Project Website:
https://github.com/elehman16/exposing_patient_data_release
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project