MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization

Asad Aali, Dave Van Veen, Yamin Arefeen, Jason Hom, Christian Bluethgen, Eduardo Pontes Reis, Sergios Gatidis, Namuun Clifford, Joseph Daws, Arash Tehrani, Jangwon Kim, Akshay Chaudhari

Published: Sept. 10, 2024. Version: 1.0.0


When using this resource, please cite:
Aali, A., Van Veen, D., Arefeen, Y., Hom, J., Bluethgen, C., Reis, E. P., Gatidis, S., Clifford, N., Daws, J., Tehrani, A., Kim, J., & Chaudhari, A. (2024). MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization (version 1.0.0). PhysioNet. https://doi.org/10.13026/vr29-4d92.

Additionally, please cite the original publication:

Aali, A., Van Veen, D., Arefeen, Y., Hom, J., Bluethgen, C., Reis, E. P., Gatidis, S., Clifford, N., Daws, J., Tehrani, A., Kim, J., & Chaudhari, A. (2024). A Dataset and Benchmark for Hospital Course Summarization with Adapted Large Language Models. arXiv preprint arXiv:2403.05720.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

This dataset presents a curated collection of preprocessed and labeled clinical notes derived from the MIMIC-IV-Note database. The primary aim of this resource is to facilitate the development and training of machine learning models focused on summarizing brief hospital courses (BHC) from clinical discharge notes.

The dataset contains 270,033 cleaned and standardized clinical notes with an average length of 2,267 tokens, making it directly usable for machine learning (ML) applications. Each clinical note is paired with its corresponding BHC summary, providing a foundation for supervised learning tasks. The preprocessing pipeline uses regular expressions to address common issues in the raw clinical text, such as special characters, extraneous whitespace, inconsistent formatting, and irrelevant text, and produces a structured dataset in which clinical note sections are separated by standardized headings.

By offering this resource, we aim to support healthcare professionals and researchers in their efforts to enhance patient care through the automation of BHC summarization. This dataset is ideal for exploring various NLP techniques, developing predictive models, and improving the efficiency and accuracy of clinical documentation practices. We invite the research community to utilize this dataset to advance the field of medical informatics and contribute to better health outcomes.


Background

Free-text clinical notes, including discharge summaries, play a crucial role in the continuum of care in healthcare systems. While structured digital information systems have become widespread, healthcare providers continue to rely on free-text documentation to communicate effectively, relay critical health information to patients, and maintain comprehensive care plans. These notes are not only vital for day-to-day patient management but also serve as a rich source of information for researchers seeking to understand patients' clinical journeys [1-3].

Discharge summaries, in particular, encapsulate a patient's entire hospital stay, providing a detailed account of the major events, treatments, and outcomes. The brief hospital course (BHC) section within these summaries succinctly describes the patient's progress and key interventions, offering a snapshot of the hospitalization. For researchers and healthcare professionals, access to structured and labeled discharge summaries can significantly enhance the development of predictive models and improve clinical documentation practices [4].

However, the potential of free-text clinical notes for advancing research and improving patient outcomes has been hampered by challenges in access and processing. Deidentifying clinical text to protect patient privacy is a complex and resource-intensive task. Despite these challenges, datasets like MIMIC-III [5] and MIMIC-IV [6, 7] have demonstrated the immense value of deidentified clinical notes for natural language processing (NLP) and other research applications.

Recent advancements in NLP, particularly in the area of deidentification, have made it feasible to share clinical text in a way that ensures patient privacy. These advancements have paved the way for the creation of datasets that can be used to train machine-learning models for various tasks, including the summarization of brief hospital courses. By leveraging these technologies, the MIMIC-IV-Ext-BHC dataset aims to support the development of tools that can automate and improve the summarization process, ultimately enhancing the quality and efficiency of healthcare delivery.


Methods

Data Source

The MIMIC-IV-Ext-BHC dataset was derived from the publicly available MIMIC-IV-Note dataset. The original MIMIC-IV-Note dataset contains free-text clinical notes from the Beth Israel Deaconess Medical Center between 2008 and 2019. These notes include discharge summaries, nursing notes, and other documentation integral to patient care. For this project, we focused specifically on discharge summaries to create a labeled dataset for brief hospital course (BHC) summarization.

Data Selection and Inclusion Criteria

All inclusion criteria for MIMIC-IV also apply to this dataset. Specifically, only notes occurring within one year of a patient encounter are included, where an encounter is defined as an emergency department visit or a hospital stay. The selected notes contain detailed documentation of the patient's hospitalization, including the BHC section. To ensure meaningful, high-quality BHC summaries, we excluded note-BHC pairs in which the BHC contained fewer than 100 characters.
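The length criterion above can be illustrated with a minimal sketch. The function name, the toy pairs, and the choice to measure length on the BHC string as given (rather than before or after cleanup) are assumptions made for illustration, not details confirmed by the dataset description.

```python
MIN_BHC_CHARS = 100  # threshold stated in the inclusion criteria


def keep_pair(bhc: str, min_chars: int = MIN_BHC_CHARS) -> bool:
    """Return True when the BHC has at least `min_chars` characters."""
    return len(bhc) >= min_chars


# Toy note-BHC pairs (placeholders, not real clinical text).
pairs = [
    ("<note text A>", "Short BHC."),
    ("<note text B>", "x" * 150),
]

# Keep only pairs whose BHC meets the minimum length.
kept = [(note, bhc) for note, bhc in pairs if keep_pair(bhc)]
```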

Preprocessing Pipeline

The preprocessing of the raw clinical notes involved several key steps to ensure the data was clean, standardized, and suitable for machine learning applications. The preprocessing pipeline was implemented using Python, with libraries such as pandas, numpy, re, and nltk. The steps are as follows:

  1. Data Loading: The raw clinical notes were loaded into a pandas DataFrame from a compressed CSV file, discharge.csv.gz (available on PhysioNet [6]).
  2. Separation and Type Conversion: The brief hospital course sections were separated from the "text" column. Because section structure is consistent across clinical notes, we used a regular expression (RegEx) to extract the entire substring under the heading "Brief Hospital Course". Clinical notes without a BHC section were excluded from the dataset to ensure consistency. All other relevant columns were converted to string type.
  3. Whitespace and Formatting Cleanup: Extraneous whitespace was removed, and consistent formatting was applied to ensure uniformity across the dataset.
  4. Section Identification: Notes that contained the section "Sex" were retained, and others were discarded. This was done to ensure consistency among clinical notes regarding the availability of crucial patient information helpful for further subgroup analyses [8].
  5. Section Headers Standardization: Section headers were standardized to uppercase within angle brackets to facilitate easier parsing and extraction.
  6. Further Cleanup: Additional formatting cleanup was performed to remove line breaks and extraneous symbols and to ensure uniform spacing.
  7. BHC Cleanup: The BHC section was cleaned by removing the header and extraneous line breaks, ensuring consistent formatting.
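The steps above can be sketched on a toy note as follows. This is an illustrative sketch only, not the authors' script: the heading inventory, the toy note layout, the exact regular expressions, and the helper names are all assumptions, and it uses the standard library `re` module alone rather than pandas.

```python
import re

# Hypothetical heading inventory for the toy note below; real notes have more sections.
HEADINGS = ["Sex", "Chief Complaint", "Brief Hospital Course", "Discharge Instructions"]
_ANY_HEADING = "|".join(re.escape(h) for h in HEADINGS)

# Capture everything under "Brief Hospital Course:" up to the next known heading or end of text.
BHC_RE = re.compile(
    rf"Brief Hospital Course:(?P<bhc>.*?)(?=(?:{_ANY_HEADING}):|\Z)",
    flags=re.DOTALL,
)
HEADING_RE = re.compile(rf"\b({_ANY_HEADING}):")


def normalize_ws(text: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_note(text: str):
    """Return (input, target) for one note, or None if the note is excluded."""
    match = BHC_RE.search(text)
    if match is None:                       # step 2: drop notes without a BHC section
        return None
    if re.search(r"\bSex:", text) is None:  # step 4: require the "Sex" section
        return None
    target = normalize_ws(match.group("bhc"))  # step 7: header removed, line breaks cleaned
    remainder = text[: match.start()] + text[match.end():]
    # Step 5: rewrite "Heading:" as "<HEADING>".
    remainder = HEADING_RE.sub(lambda m: f"<{m.group(1).upper()}>", remainder)
    return normalize_ws(remainder), target     # steps 3 and 6: uniform spacing


# Toy note (made up for illustration; not real clinical text).
note = (
    "Sex: F\n"
    "Chief Complaint:\nchest pain\n\n"
    "Brief Hospital Course:\nAdmitted for chest pain.   Ruled out for MI.\n\n"
    "Discharge Instructions:\nFollow up with PCP.\n"
)
result = split_note(note)
```

A fixed heading list keeps the section boundary unambiguous in this sketch; a generic "any capitalized line ending in a colon" pattern would risk cutting the BHC short at problem-list lines inside the section.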

The cleaned and standardized dataset was saved as a new CSV file for use in machine learning tasks. By applying this preprocessing pipeline, we produced a high-quality dataset of labeled discharge notes suitable for training and evaluating machine learning models aimed at BHC summarization. This resource aims to support the development of NLP tools that can improve the efficiency and accuracy of clinical documentation, ultimately enhancing patient care.

Data Quality Validation

We applied a uniform set of processing steps across all records and manually reviewed 100 randomly drawn samples from the preprocessed dataset to validate the effectiveness of the preprocessing pipeline. Furthermore, a team of clinicians manually reviewed 30 clinical note-BHC pairs from the dataset, without reporting any quality issues [8].
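Drawing a random review sample of this kind can be sketched as below. The seed, the function name, and the use of row positions as record identifiers are assumptions for illustration; the description does not state how the 100 samples were actually drawn.

```python
import random


def draw_review_sample(record_ids, k=100, seed=0):
    """Draw k distinct record ids uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(list(record_ids), k)


# Hypothetical identifiers: row positions 0..270,032 for the 270,033 records.
sample_ids = draw_review_sample(range(270_033), k=100)
```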


Data Description

This dataset has been meticulously preprocessed to separate the BHC from the clinical note text, creating a labeled dataset suitable for supervised learning tasks.

File Structure and Formats

The dataset consists of a single compressed CSV file, mimic-iv-bhc.csv, which contains the following columns:

  • index: This column serves as a unique identifier for each row in the dataset, ensuring that each discharge note and its associated BHC can be easily referenced. The unique identifiers were set to match the identifiers from the original MIMIC-IV-Note dataset.
  • input: This column contains the preprocessed discharge note text. The text provides a detailed narrative of the patient's hospitalization, including sections like medical history, treatments, and discharge instructions, but excluding the BHC.
  • target: This column contains the BHC text, which has been cleaned and standardized. The BHC provides a concise summary of the patient's hospital course, including major events, treatments, and outcomes during the hospital stay.
  • input_tokens: Contains the token length of the clinical note input, measured with the gpt-4 tokenizer.
  • target_tokens: Contains the token length of the BHC target, measured with the gpt-4 tokenizer.
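A minimal sketch of reading rows with the columns listed above, using only the standard library. The toy CSV content, its token counts, the helper name, and the assumption that the token columns hold integers are illustrative and not taken from the real file.

```python
import csv
import io

EXPECTED_COLUMNS = ["index", "input", "target", "input_tokens", "target_tokens"]


def read_pairs(handle):
    """Yield one dict per row, with the token counts converted to int."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        # Assumption: token counts are stored as integer strings.
        row["input_tokens"] = int(row["input_tokens"])
        row["target_tokens"] = int(row["target_tokens"])
        yield row


# Toy stand-in for the real file; the text and token counts are arbitrary.
toy_csv = io.StringIO(
    "index,input,target,input_tokens,target_tokens\n"
    '0,"<SEX> F <CHIEF COMPLAINT> chest pain","Admitted for chest pain.",12,6\n'
)
rows = list(read_pairs(toy_csv))
```

For the actual file, the same function could be given a handle from `open("mimic-iv-bhc.csv", newline="", encoding="utf-8")`, or from `gzip.open(..., "rt")` if the download turns out to be gzip-compressed; both the encoding and the compression format are assumptions here.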

Summary Statistics

  • Total Records: The dataset contains a total of 270,033 records.
  • Input Token Length: The clinical note inputs, on average, have a token length of 2,267 ± 914.
  • Target Token Length: The BHC targets, on average, have a token length of 564 ± 410.
  • Period: The original notes were collected between 2008 and 2019, covering a broad period and a diverse patient population.
  • Patient Demographics: The dataset includes notes from a wide range of patients, representative of the MIMIC-IV population, which encompasses various age groups, genders, and medical conditions.
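The token-length figures above are reported in the form mean ± standard deviation. A minimal sketch of that computation follows; the toy values are invented, and the use of the sample standard deviation (rather than the population form) is an assumption, since the description does not say which was used.

```python
from statistics import mean, stdev


def mean_std(values):
    """Return (mean, sample standard deviation) of a sequence of token counts."""
    values = list(values)
    return mean(values), stdev(values)


# Toy token counts, standing in for the input_tokens column.
toy_input_tokens = [1500, 2200, 3100]
m, s = mean_std(toy_input_tokens)
summary = f"{m:,.0f} ± {s:,.0f}"
```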

Usage Notes

Previous Uses of the Dataset

The MIMIC-IV-Ext-BHC dataset has been used in recent work [8]. That study used the dataset to train large language models (LLMs) to generate brief hospital course summaries from clinical notes, demonstrating the effectiveness of domain-adapted LLMs in clinical summarization tasks.

Reuse Potential

The dataset holds significant potential for various future research and practical applications:

  • Performance Benchmarks: Researchers can use this dataset to create benchmarks for evaluating the performance of different machine learning models in the task of brief hospital course summarization.
  • Automated Summarization: The dataset can aid in the development of tools to automatically generate BHCs from clinical notes, enhancing the efficiency of clinical documentation and potentially improving patient care.
  • Clinical Decision Support: Models trained on this dataset can be integrated into clinical decision support systems, providing healthcare providers with concise summaries of patient hospital courses to inform their decision-making processes.
  • Medical Informatics Research: The dataset can support studies investigating the relationship between clinical note content and patient outcomes, contributing to advancements in healthcare practices and policies.

Known Limitations

While the dataset offers substantial utility, users should be aware of certain limitations:

  • Data Deidentification: Despite rigorous deidentification efforts, a minimal residual risk of reidentification may remain. Users must handle the data responsibly and comply with all relevant data protection regulations.
  • Generalizability: The dataset is derived from a single institution, which may limit the generalizability of models trained on this data to other healthcare settings with different documentation practices.

Complementary Resources

To facilitate the creation of the preprocessed labeled dataset directly from the original discharge.csv.gz, we provide a Python script. This script automates the preprocessing pipeline, ensuring consistency and reproducibility. Users can access the script and additional resources via the following repository [9].

By providing this dataset and complementary resources, we aim to support the user community in advancing research and practical applications that improve clinical documentation and patient care.


Release Notes

MIMIC-IV-Ext-BHC v1.0.0

This is the initial release of the MIMIC-IV-Ext-BHC. This first version aims to provide a valuable resource for researchers and practitioners in the field of natural language processing and clinical documentation. Future updates may include additional records, further preprocessing improvements, and expanded metadata.


Ethics

The collection of patient information and creation of the research resource was reviewed by the Institutional Review Board at the Beth Israel Deaconess Medical Center, which granted a waiver of informed consent and approved the data-sharing initiative.


Acknowledgements

We would like to thank One Medical and Stanford University for providing partial computing resources for this project. We also extend our gratitude to Microsoft for supplying Azure OpenAI credits through the Accelerate Foundation Models Academic Research (AFMAR) program.

Additionally, A.C. receives research support from NIH grants R01 HL167974, R01 HL169345, R01 AR077604, R01 EB002524, R01 AR079431, and P41 EB027060, as well as NIH contracts 75N92020C00008 and 75N92020C00021. Support was also provided by the Stanford Center for Artificial Intelligence and Medicine, Stanford Institute for Human-Centered AI, Stanford Center for Digital Health, Stanford Cardiovascular Institute, Stanford Center for Precision Health and Integrated Diagnostics, and corporate partners GE Healthcare, Philips, and Amazon.


Conflicts of Interest

None to declare.


References

  1. Moy, A., Schwartz, J., Chen, R., et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. Journal of the American Medical Informatics Association. 28, 998-1008 (2021)
  2. Chaiyachati, K., Shea, J., Asch, D., et al. Assessment of inpatient time allocation among first-year internal medicine residents using time-motion observations. JAMA Internal Medicine. 179, 760-767 (2019)
  3. Mamykina, L., Vawdrey, D. & Hripcsak, G. How do residents spend their shift time? A time and motion study with a particular focus on the use of computers. Academic Medicine: Journal of the Association of American Medical Colleges. 91, 827 (2016)
  4. Clough, R., Sparkes, W., Clough, O., Sykes, J., Steventon, A. & King, K. Transforming healthcare documentation: Harnessing the potential of AI to generate discharge summaries. BJGP Open. (2023)
  5. Johnson, A., Pollard, T., & Mark, R. MIMIC-III Clinical Database. PhysioNet. (2016)
  6. Johnson, A., Pollard, T., Horng, S., Celi, L. & Mark, R. MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet. (2023)
  7. Johnson, A., Bulgarelli, L., Shen, L., et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data. 10(1) (2023)
  8. Aali, A., Van Veen, D., Arefeen, Y. I., et al. A Dataset and Benchmark for Hospital Course Summarization with Adapted Large Language Models. arXiv preprint arXiv:2403.05720. (2024)
  9. Hospital Course Summarization with Adapted Large Language Models. Available from: https://github.com/StanfordMIMI/clin-bhc-summ [Accessed July 8, 2024]

Parent Projects
MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization was derived from: Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
