Database Credentialed Access
MIMIC-IV-Ext-GPT-3_5-Generated-Discharge-Summaries-for-Low-Resource-Codes
Matúš Falis , Aryo Pradipta Gema , Hang Dong , Luke Daines , Siddharth Basetti , Michael Holder , Rose Penfold , Alexandra Birch , Beatrice Alex
Published: Dec. 16, 2024. Version: 1.0.0
When using this resource, please cite:
(show more options)
Falis, M., Gema, A. P., Dong, H., Daines, L., Basetti, S., Holder, M., Penfold, R., Birch, A., & Alex, B. (2024). MIMIC-IV-Ext-GPT-3_5-Generated-Discharge-Summaries-for-Low-Resource-Codes (version 1.0.0). PhysioNet. https://doi.org/10.13026/09ng-2614.
Please include the standard citation for PhysioNet:
(show more options)
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
This dataset comprises 9,606 Synthetic Discharge Summaries generated by GPT-3.5 based on combinations of ICD-10-code descriptions associated with real discharge summaries in MIMIC-IV. As part of the generation process, GPT-3.5 was also tasked to code the discharge summaries (presented in square brackets in the Discharge sections) and predict a patient discharge status of 'DEAD' or 'ALIVE'.
Each generated discharge summary is presented with the following data – a list of ICD-10-CM and ICD-10-PCS codes upon which the prompt was based, the textual descriptions of said ICD-10-CM and ICD-10-PCS codes which are used in the prompt, the prompt itself, the raw output (including the predicted ICD-10 codes assigned to descriptions). We further provide the processed version of these discharge summaries we used in our augmentation experiments.
Background
The data was created as part of the exploration of the Large Language Model (LLM) GPT-3.5's ability to generate and code discharge summaries with the primary aim of using the synthetic data to address concept sparsity within training local neural ICD-10 coding models. MIMIC-IV [1] suffers from a big-head long-tail distribution of ICD-10 codes, with a large number of the codes present within the dataset appearing very few times, or even once only. Given a dataset split, this is likely to result in some codes being few-shot or zero-shot with respect to the evaluation sets. This data sparsity in turn results in poor performance of locally-trained neural ICD coding models, especially on the rare labels. The data was generated in order to explore the possibility of addressing concept sparsity and improving performance on low-resource labels through LLM-generated discharge summaries based on real combinations of ICD-10 codes from the dataset.
As there exist co-relations among labels (e.g., different complications of type 1 diabetes, cancer co-relating with the presence of chemotherapy), rather than creating random combinations of ICD codes we opted to work our way back from existing real scenarios.
For zero-shot scenarios, gold standard sets with related existing ICD codes with an "unspecified" aetiology were found, and the relatively more frequent "unspecified" codes were replaced with their zero-shot siblings for generation.
Methods
We have chosen to work with the dataset split of Nguyen et al. ([2]). In Nguyen's MIMIC-IV training set, we found documents with the 98 relevant few-shot codes (the 16 zero-shot were by-definition absent). Some documents contained multiple relevant codes. We have collected documents for each of the relevant codes and cloned them to bring their population up to 100. To increase variety, we have randomly dropped up to 5 of the assigned non-relevant labels within clones to create the new set of labels for generation (referred to as the silver standard).
We identified documents containing the siblings of the 16 zero-shot labels. A silver standard set was created for each of these documents substituting the sibling code with the zero-shot code (similar to the zero-shot approach in [3]). If multiple siblings were present, a random one was replaced with the zero-shot code. This resulted in 9,606 input sets of labels – 6,779 unique and 2,827 duplicated.
We used the model "gpt-3.5-turbo-0613" given its wide recognition in the field of LLMs, relative cost-effectiveness, and time efficiency (compared to gpt-4). We utilised a temperature (parameter in the 0-1 range controlling randomness) of 0 to produce deterministic outputs. We set the temperature to 0.1 for duplicates, allowing output variation.
Within the prompt, we specified to write a discharge summary for a patient with a list of standard descriptions of their conditions and procedures based on our silver standard. We added further specifications, such as the inclusion of social and family history, avoiding explicit mentions of ICD-10 codes in the main body of text, or the omission of the keyword "unspecified" present in standard descriptions (opting for a more natural-sounding surface form). Furthermore, a request for anonymisation was included for personal and location data was included in the prompt (due to uncertainty of the anonymity of the data used in training of GPT-3.5), maintaining numeric information when relevant. For further details refer to [4].
Data Description
The unprocessed output is provided alongside the codes translated into descriptions within the prompts (the file gpt35_generated_discharge_summaries_raw.csv). Within the raw data file, for each datapoint each code columns (be it procedure, conditions, or "target" combining the two) contains a string of ICD codes separated by the ';' character. These are further accompanied by description columns where each entry contains a string with the corresponding code descriptions separated by the '|' character. Furthermore, each row contains the prompt used and the generated text
The processed text (including removal of explicit mentions of ICD-10 codes, lowercasing, and removal of numeric strings based on the preprocessing of [5]) – as used in our experiments with local neural networks – is presented in the file gpt35_generated_discharge_summaries_processed.csv
Beyond the text and target codes, the file has dummy values for other data found in MIMIC-IV, such as patient, document, and admission IDs (for each of these, there are no duplicates). For convenience in order to easily identify them, the numeric information within these IDs was set to be lower or equal to 0.
Usage Notes
The data has been generated in the work of [4]. As part of this work the data was used for (1) data augmentation when training local neural ICD-coding models; (2) analysis of GPT-3.5's coding ability based standard descriptions of ICD-10 codes; (3) analysis of synthetic discharge summaries by clinical staff with experience in ICD coding and comparison against real MIMIC-IV discharge summaries based on similar sets of labels.
Further suggested use of data: supplementing a training set similar to (1); analysis of GPT-3.5's predictions of a patient's survival (the discharge status stated in the generated document) given the input set of conditions and procedures (comorbidities); as a baseline for the development of prompts for generating synthetic discharge summaries based on ICD-10 codes; as a baseline for further development of synthetic discharge summary generation on other large language models.
A known limitation of the dataset is the use or a regular expression to extract the predicted ICD-10 codes from the generated data. While the regular expression was tuned using a sample of documents which presented some variety in presentation of the predicted codes, we did not investigate its performance on each individual document and hence some may have been missed.
Release Notes
Version 1.0.0: First release.
Ethics
No data from human/animal subjects was collected, only synthetic data produced by GPT-3.5 based on labels from MIMIC-IV.
Acknowledgements
This work is supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. HD is supported by the Engineering and Physical Sciences Research Council (EPSRC, grant EP/V050869/1), Concur: Knowledge Base Construction and Curation. RSP is a fellow on the Multimorbidity Doctoral Training Programme for Health Professionals, which is supported by the Wellcome Trust [223499/Z/21/Z]. BA is supported by: 1) the National Institute for Health Research (NIHR) Artificial Intelligence and Multimorbidity: Clustering in Individuals, Space and Clinical Context (AIM-CISC) grant NIHR202639 and by 2) Legal and General Group as part of their corporate social responsibility (CSR) programme, providing a research grant to establish the independent Advanced Care Research Centre at the University of Edinburgh.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, Lehman LW. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data. 2023 Jan 3;10(1):1.
- Nguyen TT, Schlegel V, Kashyap A, Winkler S, Huang SS, Liu JJ, Lin CJ. Mimic-iv-icd: A new benchmark for extreme multilabel classification. arXiv preprint arXiv:2304.13998. 2023 Apr 27.
- Falis M, Dong H, Birch A, Alex B. Horses to Zebras: Ontology-Guided Data Augmentation and Synthesis for ICD-9 Coding. InProceedings of the 21st Workshop on Biomedical Language Processing 2022 May (pp. 389-401).
- Falis M, Gema AP, Dong H, Daines L, Basetti S, Holder M, Penfold RS, Birch A, Alex B. Can GPT-3.5 Generate and Code Discharge Summaries?. arXiv preprint arXiv:2401.13512. 2024 Jan 24.
- Mullenbach J, Wiegreffe S, Duke J, Sun J, Eisenstein J. Explainable Prediction of Medical Codes from Clinical Text. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) 2018 Jun (pp. 1101-1111).
Parent Projects
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/09ng-2614
DOI (latest version):
https://doi.org/10.13026/bnc2-1a81
Corresponding Author
Files
- be a credentialed user
- complete required training:
- CITI Data or Specimens Only Research You may submit your training here.
- sign the data use agreement for the project