Model Credentialed Access
Shareable Artificial Intelligence to Extract Cancer Outcomes from Electronic Health Records for Precision Oncology Research
Kenneth Kehl, Pavel Trukhanov, Christopher Fong, Justin Jee, Karl Pichotta, Morgan Paul, Chelsea Nichols, Michele Waters, Nikolaus Schultz, Deborah Schrag
Published: Oct. 24, 2024. Version: 1.0.0
When using this resource, please cite:
Kehl, K., Trukhanov, P., Fong, C., Jee, J., Pichotta, K., Paul, M., Nichols, C., Waters, M., Schultz, N., & Schrag, D. (2024). Shareable Artificial Intelligence to Extract Cancer Outcomes from Electronic Health Records for Precision Oncology Research (version 1.0.0). PhysioNet. https://doi.org/10.13026/h2nj-p344.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Databases that link molecular data to clinical outcomes can inform precision cancer research into novel prognostic and predictive biomarkers. However, outside of clinical trials, cancer outcomes are typically recorded only in text form within electronic health records (EHRs). Artificial intelligence (AI) models have been trained to extract outcomes from individual EHRs. However, patient privacy restrictions have historically precluded dissemination of these models beyond the centers at which they were trained. In this study, the vulnerability of text classification models trained directly on protected health information to membership inference attacks was confirmed. A teacher-student distillation approach was applied to develop shareable models for annotating outcomes from imaging reports and medical oncologist notes. ‘Teacher’ models trained on EHR data from Dana-Farber Cancer Institute (DFCI) were used to label imaging reports and discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. ‘Student’ models were trained to use these MIMIC documents to predict the labels assigned by teacher models and sent to Memorial Sloan Kettering (MSK) for evaluation. The student models exhibited high discrimination across outcomes in both the DFCI and MSK test sets. These student models, “DFCI-imaging-student” and “DFCI-medonc-student,” are shared here.
Background
The incorporation of electronic health records (EHRs) into routine cancer care delivery could accelerate precision cancer research by unlocking data on patient phenotypes and clinical outcomes at scale [1]. For example, Project GENIE [2] from the American Association for Cancer Research (AACR) has gathered deidentified tumor genomic next-generation sequencing panels from tens of thousands of patients to inform observational precision oncology research. However, key phenotypic variables needed for this purpose, such as cancer type, external treatment history, performance status, and clinical outcomes, are generally recorded only in unstructured text in routine practice.
To address this challenge, the AACR GENIE Biopharma Collaborative [3] (BPC) project is gathering clinical data at scale via extensive annotation of EHR documents for a subset of GENIE patients with solid tumors to facilitate patient-relevant research questions about the impact of tumor molecular characteristics on clinical outcomes. Records for approximately 16,000 patients are undergoing highly granular manual annotation; approximately 150,000 imaging reports, pathology reports, and clinical notes have already been curated. Still, this dataset is an order of magnitude smaller than the number of tumor specimens for which genomic data are available in GENIE [4]. The manual medical records review needed to obtain these clinical variables is too slow and resource-intensive to scale further. To make full use of clinico-genomic datasets, natural language processing (NLP) and artificial intelligence (AI) methods are needed to extract key validated oncologic phenotypes from longitudinal EHR data.
Several cancer centers and research groups [5–8] have developed AI models that can rapidly extract cancer features and endpoints from their own unstructured data [9–11]. Large language models have also been prompted to directly extract variables from unstructured text [12]. However, central questions regarding AI-based feature extraction include (a) how well these models and processes generalize across populations and cancer centers, and (b) whether the models can be safely shared without unintended breaches in patient privacy. When a neural network model is trained on protected health information (PHI), there is a risk that the model might encode or ‘memorize’ the training data within its weights [13]. That PHI could theoretically be revealed if the model is later used for text generation or exposed to adversarial attacks. It has been demonstrated that neural networks, even those used for classification, can memorize and expose unique features such as names and numeric identifiers from a single training example [14]. One example of such an attack is a membership inference attack, in which an attack model is trained to predict whether a given observation was included in a target model’s training data [15]. This is akin to trying to determine if a given patient’s clinical note was included in model training, which might yield information about the patient’s history. This creates ethical concerns and regulatory barriers to sharing models or even performing federated learning [16] for decentralized training and feature extraction across sites.
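The membership inference risk described above can be illustrated with a minimal sketch. This toy example (scikit-learn, with synthetic features standing in for document embeddings) uses the simplest attack signal, the target model's confidence in the true label, rather than the full shadow-model approach of [15]; it is illustrative only and not the attack used in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy "target" classifier trained on a small "member" set.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_member, y_member = X[:200], y[:200]        # included in training
X_nonmember, y_nonmember = X[200:], y[200:]  # never seen in training

target = LogisticRegression(max_iter=1000).fit(X_member, y_member)

def confidence(model, X, y):
    """Target model's predicted probability of the true label.

    Overfit models tend to be more confident on training members,
    which is the signal a membership inference attack exploits.
    """
    probs = model.predict_proba(X)
    return probs[np.arange(len(y)), y]

attack_scores = np.concatenate([
    confidence(target, X_member, y_member),
    confidence(target, X_nonmember, y_nonmember),
])
is_member = np.concatenate([np.ones(200), np.zeros(200)])

# An AUROC above 0.5 indicates the confidence scores leak membership.
auc = roc_auc_score(is_member, attack_scores)
print(f"membership-inference AUROC: {auc:.3f}")
```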
Methods for privacy-preserving deep phenotype extraction in cancer are limited. One strategy could be to simply limit a model’s vocabulary to words or tokens that are present in public, PHI-free datasets [17]. Still, as language models and tokenizers become more complex, there could still be privacy risks inherent to combinations of words, especially for fully unstructured documents such as clinical progress notes. Another approach to the model-sharing challenge could be a “teacher-student” paradigm, akin to model distillation [18]. This would involve training a “teacher” model on PHI to extract clinical outcomes, then applying that model to a publicly available, PHI-free text dataset to generate labels for that public database. Alternatively, a large language model could serve as the “teacher” by prompting it to label such a public dataset. A “student” model could then be trained using the public text to predict the labels assigned by the teacher(s). In the current study, a teacher-student framework was used to train AI/NLP models to extract clinical outcomes from imaging reports and medical oncologist notes at one academic cancer center, for evaluation at a second center on patients in a multi-institutional clinico-genomic cohort. The resulting student models, DFCI-imaging-student and DFCI-medonc-student, are submitted for publication on PhysioNet.
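The teacher-student pipeline can be sketched end-to-end with simple stand-ins: linear classifiers in place of the transformer models, and synthetic feature vectors in place of the private (PHI) and public (MIMIC) documents. This is a conceptual illustration of the workflow, not the study's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins: a "private" labeled corpus (PHI) and an
# unlabeled "public" corpus (e.g., MIMIC documents).
X_private, y_private = make_classification(
    n_samples=500, n_features=30, random_state=1)
X_public, _ = make_classification(
    n_samples=500, n_features=30, random_state=2)

# 1) Train the teacher on the private, PHI-containing data.
teacher = LogisticRegression(max_iter=1000).fit(X_private, y_private)

# 2) Apply the teacher to label the public corpus.
public_labels = teacher.predict(X_public)

# 3) Train the student only on public text + teacher-assigned labels;
#    the student's weights never touch the private documents, so the
#    student can be shared.
student = LogisticRegression(max_iter=1000).fit(X_public, public_labels)

agreement = accuracy_score(public_labels, student.predict(X_public))
print(f"student/teacher agreement on public data: {agreement:.2f}")
```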
Model Description
Overview
This submission includes two artificial intelligence (AI) models, DFCI-imaging-student and DFCI-medonc-student, developed to extract cancer outcomes from electronic health records (EHRs). These models were created using a teacher-student distillation approach to ensure privacy-preserving sharing of AI models across institutions. The teacher models were trained on protected health information (PHI) from Dana-Farber Cancer Institute (DFCI) and used to label publicly available data from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. The student models were then trained to predict the teachers’ labels on the MIMIC data and evaluated at Memorial Sloan Kettering Cancer Center (MSK).
Files and Directories
The submission package includes the following directories and files:
- README.md: Provides an overview of the project and usage guidelines.
- data/: Contains small synthetic datasets for testing the DFCI-imaging-student and DFCI-medonc-student models.
- src/: Includes example code for running inference with the models. To run, change to the imaging or medonc folder, run pip install -r requirements.txt, then run either python imaging_inference.py or python medonc_inference.py, respectively. This will run inference on the small synthetic datasets in the data/ folder.
- models/: Contains the pre-trained weights for the DFCI-imaging-student and DFCI-medonc-student models.
Model Descriptions
DFCI-imaging-student
- Purpose: Extracts cancer outcomes from imaging reports.
- Architecture: BERT-base-uncased model fine-tuned on MIMIC radiology reports labeled by the DFCI-imaging-teacher model.
- Outcomes: Extracts the outcomes of any cancer, progression, response, and metastases to various organs (brain, bone, liver, adrenal, lung, lymph node, and peritoneum).
DFCI-medonc-student
- Purpose: Extracts cancer outcomes from medical oncologist notes.
- Architecture: Clinical-Longformer model fine-tuned on MIMIC discharge summaries labeled by the DFCI-medonc-teacher model.
- Outcomes: Extracts the outcomes of any cancer, progression, and response.
Data
- MIMIC-IV Dataset: The MIMIC data used to train the student models are available to researchers from PhysioNet.
- Synthetic Data: Synthetic datasets are provided for testing the models without exposing PHI.
Code
- Training and Evaluation: The source code (Jupyter notebooks) for training and evaluating the models is available on GitHub:
- DFCI-imaging-student: https://github.com/prissmmnlp/dfci_msk_teacher_student_imaging
- DFCI-medonc-student: https://github.com/prissmmnlp/dfci_msk_teacher_student_medonc
Evaluation
- Metrics: The models were evaluated using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the best F1 score.
- Test Sets: Evaluation was performed on held-out test sets from DFCI and MSK.
Technical Implementation
The process for training and evaluating the models is described below.
Datasets
The following datasets were used for training and evaluation:
- AACR Project GENIE and GENIE BPC datasets: GENIE has collected genomic data and basic clinical data, such as cancer type, patient age at sequencing, and demographics, from well over 100,000 tumor specimens. The BPC project involves annotation of EHRs to extract exposure variables and outcomes. This includes outcome annotation of each imaging report and one medical oncologist note per month using the PRISSMM data model.
- MIMIC-IV dataset: The Medical Information Mart for Intensive Care (MIMIC) dataset consists of deidentified structured and unstructured EHR data for patients who have been hospitalized in the intensive care unit at Beth Israel Deaconess Medical Center. The current version of the dataset, MIMIC-IV, includes unstructured radiology reports and discharge summaries for this cohort [19, 20]. The MIMIC data are available on request to credentialed researchers on PhysioNet [21].
Model Training
Teacher models:
- DFCI-imaging-teacher: A BERT-base-uncased [22] model was fine-tuned to predict annotations of any cancer, progression, response, and metastases to specific organs (brain, bone, liver, adrenal gland, lung, lymph nodes, and peritoneum) from imaging reports.
- DFCI-medonc-teacher: A Clinical-Longformer [23] model was fine-tuned to predict annotations of any cancer, progression, and response from medical oncologist notes.
Student Models:
- DFCI-imaging-student: A BERT-base-uncased model was fine-tuned to predict labels generated by DFCI-imaging-teacher on MIMIC radiology reports.
- DFCI-medonc-student: A Clinical-Longformer model was fine-tuned to predict labels generated by DFCI-medonc-teacher on MIMIC discharge summaries.
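The distillation objective behind the student fine-tuning can be illustrated with a short numpy sketch: the student is penalized with a cross-entropy loss against the teacher's predicted probabilities ("soft labels"). This is a simplified stand-in for the actual fine-tuning code, which is linked below.

```python
import numpy as np

def soft_label_bce(teacher_probs, student_probs, eps=1e-7):
    """Binary cross-entropy of student predictions against the
    teacher's soft labels, averaged over documents."""
    t = np.clip(teacher_probs, eps, 1 - eps)
    s = np.clip(student_probs, eps, 1 - eps)
    return float(np.mean(-(t * np.log(s) + (1 - t) * np.log(1 - s))))

# Hypothetical per-document teacher probabilities for one outcome
# (e.g., "progression"), and two candidate students.
teacher_probs = np.array([0.95, 0.10, 0.80, 0.30])
good_student = np.array([0.90, 0.15, 0.75, 0.35])
bad_student = np.array([0.10, 0.90, 0.20, 0.70])

# A student that tracks the teacher's probabilities incurs lower loss.
assert soft_label_bce(teacher_probs, good_student) < \
       soft_label_bce(teacher_probs, bad_student)
```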
Model Evaluation
Model evaluation was carried out on held-out test sets from the DFCI and MSK GENIE BPC datasets. The following metrics were evaluated:
- Area under the receiver operating characteristic curve (AUROC)
- Area under the precision-recall curve (AUPRC)
- Best F1 score
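These metrics can be computed with scikit-learn; "best F1" here denotes the maximum F1 over decision thresholds on the precision-recall curve. The labels and scores below are a toy example, not study data.

```python
import numpy as np
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)

# Toy binary labels and model scores for one outcome.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9])

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)

# Best F1: sweep the thresholds on the precision-recall curve and
# take the maximum F1 value.
precision, recall, _ = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best_f1 = float(np.max(f1))

print(f"AUROC={auroc:.3f} AUPRC={auprc:.3f} best F1={best_f1:.3f}")
```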
Installation and Requirements
The src/ folder includes example code for running inference with the models. To run within a Python environment, change to the src/imaging or src/medonc folder, then run pip install -r requirements.txt to install the required packages. Next, run either python imaging_inference.py or python medonc_inference.py, respectively, to run inference on the small synthetic datasets in the data/ folder.
Usage Notes
A manuscript describing the training and evaluation process and results for these models in more detail is currently under review. The study team has evaluated the models at DFCI and MSK, as described above, but not at other institutions; external research teams should validate the models on their own datasets before using them for any research purpose. No guarantees regarding model performance are provided or implied. The models have not been reviewed or approved by any clinical regulatory agency and should not be used for clinical decision support or in routine clinical care.
Release Notes
This is version 1.0.0.
Ethics
This research was conducted under Dana-Farber/Harvard Cancer Center IRB protocol #16-360. Patients whose tumors underwent next-generation sequencing (NGS) at Dana-Farber Cancer Institute on a research basis provided written informed consent. Patients whose tumors underwent NGS on a clinical basis were included under a waiver of informed consent, given the minimal risk of this research. The underlying protected health information used to train and evaluate teacher models and to evaluate student models is not shared here. Student models were trained using the MIMIC-IV data made available on PhysioNet.
Acknowledgements
This work was funded by the National Institutes of Health/National Cancer Institute (R00CA245899; P30CA008748) and the United States Department of Defense (W81XWH-22-1-0086). The content was developed or derived using the PRISSMM™ system licensed and enhanced by Memorial Sloan-Kettering Cancer Center, Memorial Hospital for Cancer and Allied Diseases, and Sloan-Kettering Institute for Cancer Research (collectively “MSK”). Original system and improvements © 2019-2022 Dana-Farber Cancer Institute, Inc. Additional functionality and enhancements © 2023 MSK. All rights reserved. Memorial Sloan-Kettering Cancer Center, MSK, PRISSMM, and all associated logos are trademarks™ or registered® trademarks of MSK.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Hernandez-Boussard T, Blayney DW, Brooks JD. Leveraging Digital Data to Inform and Improve Quality Cancer Care. Cancer Epidemiol Biomarkers Prev [Internet]. 2020 Apr;29(4):816–22. Available from: http://dx.doi.org/10.1158/1055-9965.EPI-19-0873
- AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov [Internet]. 2017 Aug;7(8):818–31. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC5611790
- Project GENIE Announces Biopharma Collaboration. Cancer Discov [Internet]. 2020 Feb;10(2):OF2. Available from: http://dx.doi.org/10.1158/2159-8290.CD-NB2019-144
- Pugh TJ, Bell JL, Bruce JP, Doherty GJ, Galvin M, Green MF, et al. AACR Project GENIE: 100,000 Cases and Beyond. Cancer Discov [Internet]. 2022 Sep 2;12(9):2044–57. Available from: http://dx.doi.org/10.1158/2159-8290.CD-21-1547
- Kehl KL, Xu W, Gusev A, Bakouny Z, Choueiri TK, Riaz IB, et al. Artificial intelligence-aided clinical annotation of a large multi-cancer genomic dataset. Nat Commun [Internet]. 2021 Dec 15;12(1):7304. Available from: http://dx.doi.org/10.1038/s41467-021-27358-6
- Kehl KL, Elmarakeby H, Nishino M, Van Allen EM, Lepisto EM, Hassett MJ, et al. Assessment of Deep Natural Language Processing in Ascertaining Oncologic Outcomes From Radiology Reports. JAMA Oncol [Internet]. 2019 Oct 1;5(10):1421–9. Available from: http://dx.doi.org/10.1001/jamaoncol.2019.1800
- Kehl KL, Xu W, Lepisto E, Elmarakeby H, Hassett MJ, Van Allen EM, et al. Natural Language Processing to Ascertain Cancer Outcomes From Medical Oncologist Notes. JCO Clin Cancer Inform [Internet]. 2020 Aug;4:680–90. Available from: http://dx.doi.org/10.1200/CCI.20.00020
- Kehl KL, Groha S, Lepisto EM, Elmarakeby H, Lindsay J, Gusev A, et al. Clinical Inflection Point Detection on the Basis of EHR Data to Identify Clinical Trial-Ready Patients With Cancer. JCO Clin Cancer Inform [Internet]. 2021 Jun;5:622–30. Available from: http://dx.doi.org/10.1200/CCI.20.00184
- Jiang LY, Liu XC, Nejatian NP, Nasir-Moin M, Wang D, Abidin A, et al. Health system-scale language models are all-purpose prediction engines. Nature [Internet]. 2023 Jun 7; Available from: http://dx.doi.org/10.1038/s41586-023-06160-y
- Arbour KC, Luu AT, Luo J, Rizvi H, Plodkowski AJ, Sakhi M, et al. Deep learning to estimate RECIST in patients with NSCLC treated with PD-1 blockade. Cancer Discov [Internet]. 2020 Sep 21; Available from: http://dx.doi.org/10.1158/2159-8290.CD-20-0419
- Rahman P, Ye C, Mittendorf KF, Lenoue-Newton M, Micheel C, Wolber J, et al. Accelerated curation of checkpoint inhibitor-induced colitis cases from electronic health records. JAMIA Open [Internet]. 2023 Apr;6(1):ooad017. Available from: http://dx.doi.org/10.1093/jamiaopen/ooad017
- Huang J, Yang DM, Rong R, Nezafati K, Treager C, Chi Z, et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med [Internet]. 2024 May 1;7(1):106. Available from: http://dx.doi.org/10.1038/s41746-024-01079-8
- Lehman E, Jain S, Pichotta K, Goldberg Y, Wallace BC. Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? 2021 Apr 15; Available from: http://arxiv.org/abs/2104.07762
- Hartley J, Sanchez PP, Haider F, Tsaftaris SA. Neural networks memorise personal information from one sample. Sci Rep [Internet]. 2023 Dec 4;13(1):21366. Available from: http://dx.doi.org/10.1038/s41598-023-48034-3
- Shokri R, Stronati M, Song C, Shmatikov V. Membership inference attacks against machine learning models [Internet]. arXiv [cs.CR]. 2016 [cited 2024 Jul 3]. Available from: http://arxiv.org/abs/1610.05820
- Rajendran S, Obeid JS, Binol H, D`Agostino R, Foley K, Zhang W, et al. Cloud-Based Federated Learning Implementation Across Medical Centers. JCO Clinical Cancer Informatics [Internet]. 2021 Jan;(5):1–11. Available from: https://ascopubs.org/doi/10.1200/CCI.20.00060
- Alawad M, Yoon H-J, Gao S, Mumphrey B, Wu X-C, Durbin EB, et al. Privacy-Preserving Deep Learning NLP Models for Cancer Registries. IEEE Trans Emerg Top Comput [Internet]. 2021 Jul-Sep;9(3):1219–30. Available from: http://dx.doi.org/10.1109/tetc.2020.2983404
- Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network [Internet]. arXiv [stat.ML]. 2015. Available from: http://arxiv.org/abs/1503.02531
- Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data [Internet]. 2023 Jan 3;10(1):1. Available from: http://dx.doi.org/10.1038/s41597-022-01899-x
- Johnson, A., Bulgarelli, L., Pollard, T., Gow, B., Moody, B., Horng, S., Celi, L. A., & Mark, R. (2024). MIMIC-IV (version 3.1). PhysioNet. https://doi.org/10.13026/kpb9-mt58.
- Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation [Internet]. 2000 Jun 13;101(23):E215-20. Available from: http://dx.doi.org/10.1161/01.cir.101.23.e215
- Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018 Oct 10; Available from: http://arxiv.org/abs/1810.04805
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/h2nj-p344
DOI (latest version):
https://doi.org/10.13026/mjv2-8h15
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project