Database Credentialed Access
Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries
Stefan Hegselmann , Shannon Shen , Florian Gierse , Monica Agrawal , David Sontag , Xiaoyi Jiang
Published: April 30, 2025. Version: 1.0.1
When using this resource, please cite:
Hegselmann, S., Shen, S., Gierse, F., Agrawal, M., Sontag, D., & Jiang, X. (2025). Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries (version 1.0.1). PhysioNet. https://doi.org/10.13026/gedc-j464.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Large language models in healthcare can generate informative patient summaries while reducing the documentation workload of healthcare professionals. However, these models are prone to producing hallucinations, that is, generating unsupported information, which is problematic in the sensitive healthcare domain. To better characterize unsupported facts in medical texts, we developed a rigorous labeling protocol. Following this protocol, two medical experts annotated unsupported facts in 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and hallucinations in 100 LLM-generated patient summaries. Here, we are releasing two datasets based on these annotations: Hallucinations-MIMIC-DI and Hallucinations-Generated-DI. We find that using these datasets to train on hallucination-free examples effectively reduces hallucinations for both Llama 2 (2.60 to 1.55 hallucinations per summary) and GPT-4 (0.70 to 0.40). Furthermore, we created a preprocessed version of the MIMIC-IV-Note Discharge Instructions, releasing both a full-context version (MIMIC-IV-Note-Ext-DI) and a version that only uses the Brief Hospital Course for context (MIMIC-IV-Note-Ext-DI-BHC).
Background
Many patients do not understand the events that occurred during their hospitalization and the subsequent actions they need to take [1]. For instance, [2] performed post-discharge interviews and found that only 59.6% of patients were able to accurately describe their admission diagnosis and only 43.9% could fully describe their scheduled follow-up appointments. Improved discharge communication is associated with lower hospital readmission rates and higher adherence to treatment regimens [3]. A potential intervention to improve patient informedness could be patient-oriented summaries that describe all relevant facts in layperson language [4]. However, writing high-quality patient summaries is a difficult and time-consuming task [5], and healthcare workers already face high workloads [6,7].
Large language models (LLMs) have demonstrated strong capabilities on many natural language tasks, including medical summarization [8]. However, LLMs are prone to generating unsupported or erroneous facts, also referred to as hallucinations [9]. In healthcare, this issue is further aggravated by the fragmented nature of healthcare data: datasets often do not perfectly mimic the data available at the point of care. For example, datasets for medical summarization may not include the full patient history that accompanies the written summary, leading to "hallucinations" in the human-written summary itself. Training or fine-tuning on such data replicates these artifacts. Several techniques for preventing hallucinations have been studied [10]. However, hallucinations vary greatly in complexity, can escape automatic detection, and therefore require careful human annotation [11]. This also applies to medical summaries [12].
Methods
MIMIC Datasets
First, we created a dataset of doctor-written patient summaries with different contexts that could be used to generate these summaries. We used the MIMIC-IV-Note v2.2 database which includes 331,793 deidentified free-text clinical notes from 145,915 patients admitted to the Beth Israel Deaconess Medical Center in Boston, MA, USA [13,14].
- Selecting Patient Summary: We used the Discharge Instructions section of the MIMIC-IV-Note discharge notes as patient summaries.
- Data preprocessing: Many summaries contained irrelevant artifacts that could distort the downstream analysis. Hence, we designed a preprocessing pipeline that filtered out poor summaries and removed irrelevant content (see the paper for details). As a result, we kept 100,175 of the original 331,793 discharge notes.
- Selecting Context: We considered different contexts that serve as the source information for creating a summary. For our experiments, we chose only the Brief Hospital Course (BHC) section as context, since it contains the most relevant information about the hospital course written for medical professionals. We chose this shorter context to reduce the effort for the human annotators and to better fit it into the models' context windows. The resulting dataset is named MIMIC-IV-Note-Ext-DI-BHC. We also release a version with all notes prior to the Discharge Instructions as context, named MIMIC-IV-Note-Ext-DI.
- Selecting Subset for Annotation: To facilitate human annotation, we further filtered the data for contexts of at most 4,000 characters and summaries of at least 600 characters, yielding MIMIC-IV-Note-Ext-DI-BHC-Anno with 26,178 entries. This reduces the amount of context the annotators must take into account and increases the information content of the summaries.
As a result, we release three datasets of doctor-written patient summaries: MIMIC-IV-Note-Ext-DI (full context), MIMIC-IV-Note-Ext-DI-BHC (Brief Hospital Course as context), and MIMIC-IV-Note-Ext-DI-BHC-Anno (a subset of the latter to facilitate human annotation).
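The length-based selection described above can be sketched as a simple filter over context-summary pairs. This is a minimal sketch assuming each pair is a dictionary with "text" (context) and "summary" keys, as in the released JSONL files; the function name is illustrative.

```python
def annotation_subset(pairs, max_context_chars=4000, min_summary_chars=600):
    """Keep context-summary pairs whose context fits the annotators' budget
    and whose summary carries enough information (thresholds as used for
    MIMIC-IV-Note-Ext-DI-BHC-Anno)."""
    return [
        pair for pair in pairs
        if len(pair["text"]) <= max_context_chars
        and len(pair["summary"]) >= min_summary_chars
    ]
```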
Hallucination Datasets Annotated by Medical Experts
We developed a protocol for labeling token-level errors in medical texts based on [15,16], which is available on GitHub [17]. We distinguished unsupported, contradicted, and incorrect facts. Unsupported facts were further divided into nine subcategories. We treated the context (BHC) as the only ground truth about the patient. We chose this approach to reduce the labeling burden, as annotators could not be expected to review all notes and structured information of a patient. However, since patient summaries do not only contain patient-specific information, we did allow general medical knowledge and advice even if not explicitly provided in the context (e.g., "Please take your medications as prescribed").
The labeling was carried out by two German medical students in their sixth year; both had completed their second state examination (USMLE Step 2 equivalent) and were working in the hospital. We used MedTator for annotation. For annotator training, we used twelve examples: two to familiarize the annotators with the task, followed by two rounds of five examples each that were labeled separately and discussed. For the final labeling, the annotators worked independently and reached consensus through discussion. More details can be found in [18].
- Annotating Unsupported Facts in Doctor-Written Patient Summaries: We selected 100 random examples from MIMIC-IV-Note-Ext-DI-BHC-Anno and medical experts annotated unsupported facts in the patient summaries yielding the dataset Hallucinations-MIMIC-DI. It is important to note that unsupported facts in doctor-written summaries are common in healthcare practice and usually should not be regarded as errors. Doctors may include information in the summary that was never documented, that was documented outside the considered context (in our case, only the BHC), or that was altered just prior to discharge.
- Annotating Hallucinations in Generated Patient Summaries: We chose 20 held-out contexts from MIMIC-IV-Note-Ext-DI-BHC-Anno and used the five models trained for the data-centric hallucination reduction experiments to generate summaries. Again, medical experts annotated hallucinations in these summaries with our protocol yielding Hallucinations-Generated-DI.
Derived Datasets from Hallucinations-MIMIC-DI
Based on Hallucinations-MIMIC-DI, we derived three additional datasets for our data-centric hallucination reduction experiments and qualitative evaluation. Original contains the same examples as Hallucinations-MIMIC-DI, Cleaned contains the examples with hallucinations manually removed or replaced, and Cleaned & Improved contains the examples for which further mistakes and artifacts were corrected.
Data Description
MIMIC Datasets
The datasets are provided as JSONL files with one context-summary pair per line, stored as a JSON dictionary with "text" as the key for the context and "summary" for the summary.
- MIMIC-IV-Note-Ext-DI (/mimic-iv-note-ext-di/dataset/all.json): 100,175 context-summary pairs that were filtered and preprocessed from MIMIC-IV-Note (see the paper for additional details). The context contains all text before the Discharge Instructions section, which was used as the patient summary.
- MIMIC-IV-Note-Ext-DI-BHC (/mimic-iv-note-ext-di-bhc/dataset/all.json): 100,175 context-summary pairs from MIMIC-IV-Note-Ext-DI with a shorter context (Brief Hospital Course).
- MIMIC-IV-Note-Ext-DI-BHC-Anno (/mimic-iv-note-ext-di-bhc/dataset/*_4000_600_chars.json): 26,178 context-summary pairs, which are a subset of MIMIC-IV-Note-Ext-DI-BHC with contexts ≤ 4,000 characters and summaries ≥ 600 characters to facilitate human annotation.
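Since each file stores one JSON dictionary per line, loading a dataset requires only a few lines of Python. This is a minimal sketch; the helper name is illustrative.

```python
import json

def load_dataset(path):
    """Load a JSONL dataset file into a list of
    {"text": <context>, "summary": <patient summary>} dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `load_dataset("mimic-iv-note-ext-di-bhc/dataset/all.json")` would return the list of all BHC context-summary pairs.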
Hallucination Datasets Annotated by Medical Experts
The datasets follow the same JSONL format as the MIMIC datasets (entries for "text" and "summary") with an additional "labels" entry containing the agreed-upon hallucination annotations. Each annotation contains the "start" and "end" character offsets of the span, the "length" in characters, and the annotated "text". A "label" entry gives one of the eleven labels introduced in our annotation protocol. We also provide the annotations as XML files in the BioC format. The datasets are located in /hallucination_datasets.
- Hallucinations-MIMIC-DI: 100 random context-summary pairs from MIMIC-IV-Note-Ext-DI-BHC-Anno with unsupported facts annotated and agreed upon by two medical experts.
- Hallucinations-MIMIC-DI-Valid: 10 random validation context-summary pairs from MIMIC-IV-Note-Ext-DI-BHC-Anno with hallucinations annotated and agreed upon by two medical experts.
- Hallucinations-Generated-DI: 100 context-summary pairs based on 20 random contexts from MIMIC-IV-Note-Ext-DI-BHC-Anno and summaries generated with five different models. The file hallucinations_generated_di.xml contains the raw annotations from MedTator with randomized summaries. In hallucinations_generated_di.jsonl, we derandomized the summaries so that lines 1-20 belong to llama_70b_original, 21-40 to llama_70b_cleaned, 41-60 to gpt4_zero_shot, 61-80 to gpt4_orig, and 81-100 to gpt4_cleaned.
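A record's "labels" entries can be mapped back to spans of the annotated text. This is a minimal sketch that assumes the "start"/"end" offsets index into the summary string (an assumption worth verifying against the data); the function name is illustrative.

```python
def extract_spans(record):
    """Return (label, snippet) pairs for each annotated hallucination span.

    Assumes "start"/"end" are character offsets into record["summary"],
    so the recovered snippet should match the stored "text" field.
    """
    spans = []
    for ann in record.get("labels", []):
        snippet = record["summary"][ann["start"]:ann["end"]]
        spans.append((ann["label"], snippet))
    return spans
```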
Derived Datasets from Hallucinations-MIMIC-DI
The datasets are provided as JSONL files with one context-summary pair per line, stored as a JSON dictionary with "text" as the key for the context and "summary" for the summary. The datasets are located in /derived_datasets.
- Original: 100 context-summary pairs from Hallucinations-MIMIC-DI.
- Cleaned: 100 context-summary pairs from Original with labeled unsupported facts manually removed or replaced.
- Cleaned & Improved: 100 context-summary pairs from Cleaned with mistakes and artifacts removed or corrected.
Usage Notes
Example Python usage can be found in the experiments carried out in the corresponding paper; the code is available on GitHub [17]. Common use cases of the published datasets are listed below:
- MIMIC Datasets: These datasets contain a preprocessed and cleaned version of the Discharge Instructions; hence, they can be a useful starting point for machine learning experiments working with this section of the MIMIC-IV-Note discharge notes. We provide versions with two different contexts.
- Hallucination Datasets Annotated by Medical Experts: These datasets contain labels for unsupported facts in 100 doctor-written and hallucinations in 100 generated patient summaries. They can be used to evaluate automatic hallucination detection methods and to train automatic approaches for hallucination reduction.
- Derived Datasets from Hallucinations-MIMIC-DI: These datasets contain the 100 doctor-written summaries with unsupported facts removed, and additionally with improved language. They can be used to fine-tune or prompt an LLM with high-quality examples to generate patient summaries with fewer hallucinations and higher quality. They can also support better evaluation of patient summary generation, as they contain higher-quality reference examples.
Limitations
The development of the preprocessed and annotated datasets involved several trade-offs that users should consider when assessing their applicability for research:
- Residual Artifacts: Despite extensive cleaning, some problematic examples may persist in the preprocessed versions of the discharge instructions. The filtering process aimed to remove noise and irrelevant content but may not have eliminated all errors present in the original texts.
- Selection Bias: The preprocessing pipeline introduced a selection bias. By filtering and selecting only a subset of the original discharge instructions, the resulting dataset may not fully represent the diversity and distribution of the complete hospital data, potentially limiting its generalizability.
- Subjective Annotations: For the hallucination labels, a dedicated protocol was developed and applied by two annotators with medical backgrounds. Although consensus was reached, the process is subjective. Different projects may define or interpret hallucinations differently, which could affect the reproducibility and interpretation of the annotations.
- Derived Dataset Dependence: The datasets derived by removing unsupported facts rely on the aforementioned hallucination annotations. Therefore, any inconsistencies or biases in the annotation process will directly impact these derived datasets. Additionally, the subsequent language improvement process is subjective and may not meet all project-specific requirements.
Release Notes
Version 1.0.1
Added clarifying instructions about the ordering of the summaries in hallucinations_generated_di.xml and hallucinations_generated_di.jsonl. Thanks to Hiba Ahsan.
Version 1.0.0
First publicly available version of the data that was used in the original paper.
Ethics
This project builds upon the previously published MIMIC-IV-Note v2.2 dataset [13,14]. The approval for this project is based on the original data being de-identified and approved for credentialed distribution.
Acknowledgements
The generated patient summaries were computed on the HPC cluster PALMA II of the University of Münster, subsidised by the DFG (INST 211/667-1).
Conflicts of Interest
None to declare.
References
- Kebede S, Shihab HM, Berger ZD, Shah NG, Yeh HC, Brotman DJ. Patients’ Understanding of Their Hospitalizations and Association With Satisfaction. JAMA Internal Medicine. 2014 Oct 1;174(10):1698–700.
- Horwitz LI, Moriarty JP, Chen C, Fogerty RL, Brewster UC, Kanade S, et al. Quality of Discharge Practices and Patient Understanding at an Academic Medical Center. JAMA Internal Medicine. 2013 Oct 14;173(18):1715–22.
- Becker C, Zumbrunn S, Beck K, Vincent A, Loretz N, Müller J, et al. Interventions to Improve Communication at Hospital Discharge and Rates of Readmission: A Systematic Review and Meta-analysis. JAMA Netw Open. 2021 Aug 27;4(8):e2119346.
- Federman A, Sarzynski E, Brach C, Francaviglia P, Jacques J, Jandorf L, et al. Challenges optimizing the after visit summary. Int J Med Inform. 2018 Dec;120:14–9.
- Mueller SK, Giannelli K, Boxer R, Schnipper JL. Readability of patient discharge instructions with and without the use of electronically available disease-specific templates. Journal of the American Medical Informatics Association. 2015 Jul 1;22(4):857–63.
- Phillips C. Relationships between workload perception, burnout, and intent to leave among medical–surgical nurses. JBI Evidence Implementation. 2020 Jun;18(2):265.
- Watson AG, McCoy JV, Mathew J, Gundersen DA, Eisenstein RM. Impact of physician workload on burnout in the emergency department. Psychology, Health & Medicine. 2019 Apr 21;24(4):414–28.
- Van Veen D, Van Uden C, Blankemeier L, Delbrouck JB, Aali A, Bluethgen C, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024 Feb 27;1–9.
- Maynez J, Narayan S, Bohnet B, McDonald R. On Faithfulness and Factuality in Abstractive Summarization. In: Jurafsky D, Chai J, Schluter N, Tetreault J, editors. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. p. 1906–19.
- Huang Y, Feng X, Feng X, Qin B. The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey. arXiv preprint arXiv:210414839. 2023 Apr 10; Available from: http://arxiv.org/abs/2104.14839.
- Thomson C, Reiter E, Sundararajan B. Evaluating factual accuracy in complex data-to-text. Computer Speech & Language. 2023 May 1;80:101482.
- Moramarco F, Papadopoulos Korfiatis A, Perera M, Juric D, Flann J, Reiter E, et al. Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022 May;5739–54.
- Johnson A, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes [Internet]. PhysioNet; 2023. Available from: https://physionet.org/content/mimic-iv-note/2.2/.
- Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation. 2000 Jun 13;101(23).
- Thomson C, Reiter E. A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems. Proceedings of the 13th International Conference on Natural Language Generation. 2020 Dec;158–68.
- Thomson C, Reiter E. Generation Challenges: Results of the Accuracy Evaluation Shared Task. arXiv preprint arXiv:210805644. 2021 Aug 15; Available from: http://arxiv.org/abs/2108.05644.
- Code for "A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models". Available from: https://github.com/stefanhgm/patient_summaries_with_llms [Accessed 21 Apr 2024].
- Hegselmann S, Shen Z, Gierse F, Agrawal M, Sontag D, Jiang X. A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models. In: Proceedings of the fifth Conference on Health, Inference, and Learning. PMLR; 2024. p. 339–79.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.1):
https://doi.org/10.13026/gedc-j464
DOI (latest version):
https://doi.org/10.13026/m6hf-dq94
Project Website:
https://github.com/stefanhgm/patient_summaries_with_llms
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project