Database (Credentialed Access)

MIMIC-III-Ext-VeriFact-BHC: Labeled Propositions From Brief Hospital Course Summaries for Long-form Clinical Text Evaluation

Philip Chung Akshay Swaminathan Alex Goodell Yeasul Kim Momsen Reincke Lichy Han Ben Deverett Mohammad Amin Sadeghi Abdel badih El Ariss Marc Ghanem David Seong Andrew Lee Caitlin Coombes Brad Bradshaw Mahir Sufian Hyo Jung Hong Teresa Nguyen Mohammad Rasouli Komal Kamra Mark Burbridge James McAvoy Roya Saffary Stephen Parnell Ma Dev Dash James Xie Ellen Wang Cliff Schmiesing Nigam Shah Nima Aghaeepour

Published: April 9, 2025. Version: 1.0.0


When using this resource, please cite:
Chung, P., Swaminathan, A., Goodell, A., Kim, Y., Reincke, M., Han, L., Deverett, B., Sadeghi, M. A., El Ariss, A. b., Ghanem, M., Seong, D., Lee, A., Coombes, C., Bradshaw, B., Sufian, M., Hong, H. J., Nguyen, T., Rasouli, M., Kamra, K., ... Aghaeepour, N. (2025). MIMIC-III-Ext-VeriFact-BHC: Labeled Propositions From Brief Hospital Course Summaries for Long-form Clinical Text Evaluation (version 1.0.0). PhysioNet. https://doi.org/10.13026/abat-g475.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The VeriFact-BHC dataset is designed to support verifying the factuality of long-form text written about a patient against that patient's own electronic health record. There is increasing interest in using large language models (LLMs) to generate clinical text for patient care applications, yet this text must be evaluated for factual errors and hallucinations before it is committed to a patient's permanent medical record. Text written about a patient should be internally consistent with information already known about the patient, such as the information stored in their medical records.

VeriFact-BHC contains long-form Brief Hospital Course (BHC) clinical narratives, typically found in a discharge summary, that have been decomposed into text proposition statements. For 100 patients in the MIMIC-III Clinical Database v1.4, we consider two types of BHC text: a human-written BHC and an LLM-generated BHC. The original human clinician-written BHC is extracted from the discharge summary note. The LLM-generated BHC is composed by an LLM using the patient's longitudinal clinical notes from the hospital admission. Each BHC is decomposed in two ways: into sentence propositions and into atomic claim propositions. The remaining electronic health record (EHR) notes for each patient serve as a patient-specific reference of facts that is used by clinicians and VeriFact to assign labels. A total of 13,070 propositions are annotated by multiple clinicians, with ground truth established via majority voting and manual adjudication. Also provided are labels assigned by the VeriFact artificial intelligence system and labels assessing whether propositions are valid from a first-order logic standpoint. The reference EHR for each patient is provided in both machine-readable and PDF formats.

By offering this dataset, we hope to spur further investigation and creation of computational systems for automatic chart review and patient-specific fact verification. We invite the research community to utilize this dataset to develop better methods to guardrail patient-specific LLM-generated clinical text.


Background

Clinical medicine use cases for large language models often involve long-form clinical text generation to create clinical notes, hand-off tools, patient education materials, and similar documents. These long-form use cases differ from commonly benchmarked medical question-answer (Q&A) datasets. A Q&A task requires an LLM to produce only a brief text generation, which is straightforward to verify factually. In contrast, long-form text may contain a large number of claims, and it is challenging to factually verify all of them.

The VeriFact-BHC dataset focuses on long-form factual verification of clinical text by using the patient’s own EHR as a factual reference [1]. To perform fine-grained factual verification of long-form text, Brief Hospital Course (BHC) clinical narratives in VeriFact-BHC are decomposed into smaller units of analysis we refer to as propositions. These propositions can be individually “Supported”, “Not Supported”, or “Not Addressed” by the patient’s own EHR. “Supported” propositions must be fully supported by the patient’s EHR. “Not Supported” propositions are not supported, partially supported, or contradicted by the EHR. “Not Addressed” propositions are not mentioned in the EHR.
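As a purely illustrative example (these propositions are invented and not drawn from the dataset), the three-way label taxonomy behaves as follows:

```python
# Hypothetical propositions illustrating the VeriFact-BHC label taxonomy.
# None of these examples are drawn from the actual dataset.
examples = [
    # Fully supported by the patient's EHR notes.
    ("The patient received vancomycin for MRSA bacteremia.", "Supported"),
    # Partially supported or contradicted (e.g., the EHR documents a
    # different drug or dose), so the proposition is "Not Supported".
    ("The patient was discharged on warfarin 10 mg daily.", "Not Supported"),
    # Never mentioned anywhere in the EHR.
    ("The patient's cousin visited on hospital day 2.", "Not Addressed"),
]
for proposition, label in examples:
    print(f"{label:>13}: {proposition}")
```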

The VeriFact-BHC dataset includes 13,070 sentence and atomic claim propositions derived from human-written and LLM-generated Brief Hospital Course clinical narratives for 100 patients sampled from the MIMIC-III Clinical Database v1.4 [2]. The MIMIC-III dataset is a de-identified subset of a real EHR and primarily includes intensive care unit notes, radiology reports, and cardiology reports. It is missing notes from other services that one would typically find in a real hospital's EHR. The LLM-writer can generate a BHC based only on information in MIMIC-III notes, whereas the original clinician authoring the human-written summary had access to additional information sources such as notes from other hospital units, outpatient notes, and bedside patient care discussions. This information asymmetry reflects a real-world difference between LLM-generated and human-written clinical text, and it leads to human-written BHCs containing a greater fraction of information that cannot be "Supported".


Methods

Data Source

The MIMIC-III-Ext-VeriFact-BHC dataset was derived from the publicly available MIMIC-III Clinical Database v1.4, focusing specifically on the NOTEEVENTS table to derive the Brief Hospital Course (BHC) narratives, text propositions, and reference electronic health record (EHR) in this dataset. The MIMIC-III Clinical Database is a de-identified subset of a real EHR from Beth Israel Deaconess Medical Center in Boston, MA, spanning 2001-2012 [2]. Like a real EHR, it contains longitudinal clinical notes from the intensive care unit (ICU) as well as radiology and cardiology reports. However, it does not contain all the clinical notes from the EHR; it is missing notes from non-ICU hospital ward services, specialty inpatient consulting services, the emergency department, and outpatient office visits.

Dataset and Inclusion Criteria

The dataset in this study comprises a random sample of 100 patients from MIMIC-III who meet the following inclusion criteria (a hedged code sketch of this filter follows the list):

  1. The patient’s last hospital admission must have a discharge summary with a Brief Hospital Course (BHC) section that can be extracted using regular expressions.
  2. The last hospital admission should have at least 2 physician notes other than the discharge summary.
  3. The patient’s EHR must contain at least 10 total notes prior to the discharge summary, which may be from the current admission or prior hospital encounters.
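A minimal sketch of this filter is shown below, assuming the standard MIMIC-III NOTEEVENTS schema (SUBJECT_ID, HADM_ID, CHARTDATE, CATEGORY, TEXT); the exact regular expressions and note-category logic in the VeriFact code repository may differ.

```python
import re
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv.gz", low_memory=False)

# Illustrative BHC section regex; the repository's actual pattern may differ.
BHC_PATTERN = re.compile(r"Brief Hospital Course:(.+?)(?:\n\s*\n|\Z)", re.S | re.I)

def meets_inclusion_criteria(patient_notes: pd.DataFrame) -> bool:
    """Check one patient's notes against the three inclusion criteria."""
    discharge = patient_notes[patient_notes["CATEGORY"] == "Discharge summary"]
    if discharge.empty:
        return False
    # Criterion 1: the last admission's discharge summary has an extractable BHC.
    last_hadm = discharge.sort_values("CHARTDATE")["HADM_ID"].iloc[-1]
    summary_text = discharge[discharge["HADM_ID"] == last_hadm]["TEXT"].iloc[-1]
    if not BHC_PATTERN.search(summary_text):
        return False
    # Criterion 2: at least 2 physician notes in that admission besides the
    # discharge summary (MIMIC-III stores the category as "Physician ").
    admission_notes = patient_notes[patient_notes["HADM_ID"] == last_hadm]
    if (admission_notes["CATEGORY"].str.strip() == "Physician").sum() < 2:
        return False
    # Criterion 3 (approximation): at least 10 notes besides the target
    # discharge summary, from this admission or prior encounters.
    n_discharge = ((patient_notes["HADM_ID"] == last_hadm)
                   & (patient_notes["CATEGORY"] == "Discharge summary")).sum()
    return len(patient_notes) - n_discharge >= 10

eligible = [sid for sid, grp in notes.groupby("SUBJECT_ID")
            if meets_inclusion_criteria(grp)]
```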

Long-form Text and Reference Creation

Patients may have multiple hospital admissions in their EHR, so the last hospital admission was selected as the hospital admission of interest. For each patient, the discharge summary for the target hospital admission was separated from the rest of the clinical notes. The rest of the clinical notes contains the factual knowledge known about the patient at the end of the hospital admission, and thus serves as a patient-specific reference that can be used to verify facts written in a BHC.
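The split itself is straightforward; the sketch below assumes the MIMIC-III NOTEEVENTS schema, and the released ehr_noteevents.csv.gz already reflects this separation.

```python
import pandas as pd

notes = pd.read_csv("NOTEEVENTS.csv.gz", low_memory=False)
patient = notes[notes["SUBJECT_ID"] == 12345]  # hypothetical subject_id

# Target admission = the patient's last admission with a discharge summary.
discharge = patient[patient["CATEGORY"] == "Discharge summary"]
last_hadm = discharge.sort_values("CHARTDATE")["HADM_ID"].iloc[-1]

is_target_summary = (patient["HADM_ID"] == last_hadm) & \
                    (patient["CATEGORY"] == "Discharge summary")
discharge_summary = patient[is_target_summary]  # source of the human-written BHC
reference_ehr = patient[~is_target_summary]     # patient-specific fact reference
```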

A human-written BHC is extracted from the discharge summary of the target hospital admission using regular expressions. This long-form human-written narrative represents the current status quo in clinical care, in which human clinicians obtain data from chart review and care-team discussions to generate new clinical text. This narrative was written by a clinician who likely had access to data sources beyond what is present in the reference EHR available in MIMIC-III, both because of how MIMIC-III was constructed and because not all bedside patient care discussions between clinical staff and patients are captured in EHR clinical notes.

Separately, an LLM-written BHC is generated using all of the clinical notes from the target hospital admission in the patient-specific reference corpus. The LLM-written narrative represents a use case in which LLMs generate text for clinical care applications. The LLM-writer is prompted to role-play as a simple physician agent. The LLM used is an activation-aware weight-quantized (AWQ) version of Llama 3.1 70B (hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4) with temperature=0.5 to encourage diversity in writing and default settings for all other parameters [3,4]. Unlike the human clinician authors, the LLM-writer can only source information from its own parametric knowledge and from the MIMIC-III dataset.
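A hedged sketch of the generation call is shown below, assuming the quantized model is served behind an OpenAI-compatible endpoint (e.g., vLLM); the actual prompts are released in prompts/prompts_llm_writer.tar.gz, and the prompt text below is an invented paraphrase rather than the dataset's exact prompt.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible inference server hosting the AWQ model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder: concatenated clinical notes from the target hospital admission.
admission_notes_text = "..."

response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    temperature=0.5,  # as described above, to encourage diversity in writing
    messages=[
        # Invented paraphrase of the physician-agent role-play prompt; the
        # dataset's exact prompts are in prompts_llm_writer.tar.gz.
        {"role": "system",
         "content": "You are a physician writing a Brief Hospital Course."},
        {"role": "user",
         "content": ("Write a Brief Hospital Course for this admission using "
                     "only the following notes:\n" + admission_notes_text)},
    ],
)
llm_written_bhc = response.choices[0].message.content
```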

Long-form Text Decomposition

We thus have two types of BHC summarizing the same hospital admission for the same patient, written by different authors (human vs. LLM). The long-form BHC narratives for each patient are decomposed into sets of rudimentary statements, which we call propositions, that can be independently evaluated. Two types of propositions are generated: sentence propositions and atomic claim propositions. The following procedure is used to transform BHC text into propositions (a hedged code sketch follows the list):

  1. The BHC narrative is split into text chunks of 128 tokens or fewer using recursive semantic parsing. This involves generating an embedding representation for each sliding window of 3 sentences, computing the cosine distance between sequential embedding representations, and splitting the text wherever this distance exceeds the 90th percentile of distances within the input text. The dense encoder from the BAAI/bge-m3 model was used to generate embedding representations [5,6].
  2. To obtain sentence propositions, each text chunk was split using the NLTK sentence tokenizer [7]. Each resultant sentence chunk is a proposition.
  3. To obtain atomic claim propositions, each text chunk was passed to an LLM agent tasked with re-expressing the original text as a set of simple sentences that resemble first-order predicate logic statements and encapsulate a single Subject-Object-Predicate relationship [8–11]. The LLM used is hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 with temperature=0.1 since low-temperature sampling tends to result in consistent and high-performing behavior [3,4,12]. Each resultant atomic claim is a proposition.
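A hedged sketch of steps 1 and 2 is shown below; the 128-token chunk cap is omitted for brevity, the FlagEmbedding loader for BAAI/bge-m3 is one of several possible, and the VeriFact repository's implementation may differ in detail. Step 3 reuses the same LLM call pattern sketched earlier, with temperature=0.1 and the prompts released in prompts_atomic_claim.tar.gz.

```python
import nltk
import numpy as np
from FlagEmbedding import BGEM3FlagModel  # one way to load BAAI/bge-m3

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK versions
model = BGEM3FlagModel("BAAI/bge-m3")

def semantic_chunks(text: str, percentile: float = 90.0) -> list[str]:
    """Split text where the cosine distance between consecutive 3-sentence
    sliding-window embeddings exceeds the given percentile of distances."""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) < 2:
        return [text]
    windows = [" ".join(sentences[max(0, i - 1): i + 2])
               for i in range(len(sentences))]
    emb = np.asarray(model.encode(windows)["dense_vecs"])
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dists = 1.0 - np.sum(emb[:-1] * emb[1:], axis=1)  # cosine distances
    threshold = np.percentile(dists, percentile)
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], dists):
        if dist > threshold:  # semantic boundary -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

def sentence_propositions(text: str) -> list[str]:
    """Step 2: NLTK sentence tokenization of each semantic chunk."""
    return [s for chunk in semantic_chunks(text)
            for s in nltk.sent_tokenize(chunk)]
```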

The result of the long-form text decomposition process is 13,070 propositions derived from the two types of BHC for 100 patients. Propositions are unique relative to each original source document. If the same exact text is extracted as a proposition twice within the same document, it is deduplicated prior to inclusion in this dataset. If the same exact text is extracted as a proposition from different documents, both instances are included with different proposition_id values to indicate that they originated from different documents.
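The deduplication rule can be stated compactly; the column names below are illustrative (see README.txt for the released schema).

```python
import pandas as pd

# Identical proposition text is dropped within a document but kept across
# documents, where each copy receives its own proposition_id.
props = pd.DataFrame({
    "doc_id": ["bhc_human", "bhc_human", "bhc_llm"],
    "text": ["The patient is afebrile."] * 3,
})
deduped = props.drop_duplicates(subset=["doc_id", "text"])
print(deduped)  # one row for bhc_human, one row for bhc_llm
```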

Data Annotation

Each proposition is annotated by 3 clinicians. Clinicians are provided a PDF copy of the patient’s reference EHR notes and are tasked with annotating whether a proposition is "Supported", "Not Supported", or "Not Addressed" by the patient’s EHR. A total of 25 clinicians participated in this first round of annotation. Each clinician may have a different way of searching for information, interpreting what is written, and setting a judgement threshold for label assignment. A denoised set of human clinician ground truth labels, representing the average clinician judgement, is created by taking a majority vote of these labels. When all 3 clinicians disagree, the proposition undergoes a second round of annotation by two additional independent clinicians. If these clinicians agree, their label is taken as ground truth. If they disagree, a third clinician is involved to help break the tie, and the proposition is discussed by the group and adjudicated until consensus is reached. The final human clinician ground truth labels thus represent the combined “average judgement” across 27 clinicians.
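A minimal sketch of the first-round majority vote is shown below; the column names are illustrative, and human_verdicts.csv.gz contains the released rater, adjudication, and final ground-truth labels.

```python
import pandas as pd

ratings = pd.DataFrame({
    "proposition_id": ["p1", "p1", "p1", "p2", "p2", "p2"],
    "label": ["Supported", "Supported", "Not Addressed",       # 2-of-3 majority
              "Supported", "Not Supported", "Not Addressed"],  # 3-way tie
})

def majority_or_none(labels: pd.Series):
    """Return the majority label, or None when all 3 raters disagree (the
    proposition then proceeds to the second round and group adjudication)."""
    counts = labels.value_counts()
    return counts.index[0] if counts.iloc[0] >= 2 else None

ground_truth = ratings.groupby("proposition_id")["label"].apply(majority_or_none)
print(ground_truth)  # p1 -> Supported; p2 -> None (needs adjudication)
```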

Annotations generated by the VeriFact artificial intelligence (AI) system are also provided for each proposition. The VeriFact study examines the ability of multiple VeriFact AI raters to assign labels to each proposition, and the VeriFact-BHC dataset includes each AI rater's label assignment for each proposition together with a text explanation of the assignment; the corresponding proposition and reference context are also provided. Finally, the VeriFact study examines LLM-as-a-Judge components from both the Llama 3 family of models [4] and the DeepSeek-R1-Distill-Llama family of models, which are produced by further training Llama 3 models via distillation on a dataset generated by DeepSeek-R1 [13], so that they inherit the behavior of generating an intermediate reasoning chain before the final label assignment. For the R1-Distill-Llama models, the generated intermediate reasoning chain is also provided.

Finally, an analysis of the validity of sentence and atomic claim propositions, based on first-order logic definitions [14], was conducted in the VeriFact study. Each proposition was separately annotated to determine whether or not it is an imperative, interrogative, incomplete, or vague statement. The union of these labels was used to determine whether a proposition was valid or invalid. These labels are also included in the dataset.
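The union rule reduces to a single boolean expression, sketched below.

```python
def is_valid(imperative: bool, interrogative: bool,
             incomplete: bool, vague: bool) -> bool:
    """A proposition is valid only if none of the invalidity flags apply."""
    return not (imperative or interrogative or incomplete or vague)

print(is_valid(False, False, False, False))  # True: declarative, complete claim
print(is_valid(False, False, False, True))   # False: flagged as vague
```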


Data Description

The provided dataset includes 13,070 sentence and atomic claim propositions derived from human-written and LLM-generated Brief Hospital Course clinical narratives of 100 patients sampled from MIMIC-III Clinical Database v1.4. Each proposition is provided alongside the brief hospital course (BHC) from which it was derived. Annotations from human clinicians, the VeriFact AI system, and proposition validity are provided in separate files.

File Structure and Contents

The dataset consists of two top-level files and three folders (a hedged loading sketch follows the list).

  • README.txt: A data dictionary which describes dataset structure, relationship between files, and details about variables and contents in each file.
  • Llama_3.1_LICENSE: A derivative of Llama 3.1 is used to produce the LLM-generated BHC and atomic claim propositions. In accordance with the license agreement, a copy of the original license is included with this dataset release.
  • propositions
    • propositions.csv.gz: Contains all 13,070 propositions for 100 patients. There are two types of Brief Hospital Course clinical narratives (human-written & LLM-written) and two types of proposition units used for each patient (atomic claim propositions & sentence propositions).
    • human_verdicts.csv.gz: Contains individual human rater labels, adjudication labels, and final human ground truth labels for all propositions.
    • verifact_verdicts.csv.gz: Contains all labels assigned by all VeriFact AI systems presented in the VeriFact manuscript. Each VeriFact system (a combination of hyperparameters) assigns a label and an accompanying explanation to each proposition.
    • proposition_validity.csv.gz: Contains all proposition validity labels.
  • reference_ehr
    • ehr_noteevents.csv.gz: Machine-readable file with all clinical notes forming the reference EHR for the 100 patients. This is a subset of the MIMIC-III Clinical Database NOTEEVENTS table and includes all of the clinical notes for each of the 100 patients in this dataset with the exception of the final discharge summary note.
    • ehr_noteevents_pdf.tar.gz: Human-readable PDFs of the reference EHRs for 100 patients. These PDFs were used by human clinicians to assign a label to propositions.
    • subject_ids.txt: Comma-separated list of subject_id of the 100 patients included in this dataset.
    • admissions.csv.gz: A subset of the MIMIC-III Clinical Database ADMISSIONS table for the 100 patients included in this dataset.
  • prompts
    • prompts_llm_writer.tar.gz: Prompts used by the LLM-writer to generate the LLM-written Brief Hospital Course clinical narratives.
    • prompts_atomic_claim.tar.gz: Prompts used to decompose long-form text into atomic claims.
    • prompts_proposition_validity.tar.gz: Prompts used to create draft labels for proposition validity classification.
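A hedged loading sketch is shown below; the join key (proposition_id) and column names are assumptions based on the descriptions above, and README.txt is the authoritative data dictionary.

```python
import pandas as pd

props = pd.read_csv("propositions/propositions.csv.gz")
human = pd.read_csv("propositions/human_verdicts.csv.gz")
verifact = pd.read_csv("propositions/verifact_verdicts.csv.gz")
validity = pd.read_csv("propositions/proposition_validity.csv.gz")
ehr = pd.read_csv("reference_ehr/ehr_noteevents.csv.gz")

# Attach human ground-truth labels and validity flags to each proposition
# (assumed join key: proposition_id).
labeled = (props.merge(human, on="proposition_id")
                .merge(validity, on="proposition_id"))

# Reference EHR notes for one patient, to verify propositions against
# (assumed columns: subject_id in propositions, SUBJECT_ID in the
# NOTEEVENTS-derived file).
one_subject = labeled["subject_id"].iloc[0]
patient_ehr = ehr[ehr["SUBJECT_ID"] == one_subject]
```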

Usage Notes

Previous Uses of the Dataset

This dataset was utilized in VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records. The study used the dataset to evaluate the performance of an AI evaluator composed of retrieval-augmented generation (RAG) and LLM-as-a-Judge components.

Reuse Potential

This dataset is useful for various future research and practical applications:

  • Performance Benchmarking: Researchers can use this dataset to benchmark different fact verification systems, which may include different machine learning models or information retrieval methods. In particular, this dataset is useful for benchmarking other long-form text evaluation systems.
  • Text Decomposition Research: Since each complete BHC is provided along with its corresponding propositions, this dataset can be used in future research to study how well long-form BHC text is decomposed into propositions and where potential errors can arise.
  • Long-form Text Composition: Since each complete BHC is provided along with its corresponding propositions, this dataset can be used for grounded text composition tasks in which individual propositions are combined into long-form narratives. The resultant long-form narrative can be compared against the original BHC.
  • Search and Information Retrieval: Each proposition has human annotations on whether the information in the proposition is Supported, Not Supported, or Not Addressed by the patient’s EHR. These propositions can be treated as search queries and used to study the effectiveness of different information retrieval methods over the patient’s EHR, as sketched below.
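For example, a simple lexical baseline for this retrieval use case might look like the following sketch (rank_bm25 is one convenient library; the note texts shown are placeholders to be loaded from ehr_noteevents.csv.gz).

```python
from rank_bm25 import BM25Okapi

# Placeholder corpus: in practice, load the patient's reference EHR notes
# from reference_ehr/ehr_noteevents.csv.gz.
ehr_notes = ["... text of note 1 ...", "... text of note 2 ..."]
proposition = "The patient was treated for pneumonia."  # query

bm25 = BM25Okapi([note.lower().split() for note in ehr_notes])
scores = bm25.get_scores(proposition.lower().split())
best_note = ehr_notes[scores.argmax()]  # top-ranked note for this proposition
```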

Known Limitations

Users should be aware of certain limitations:

  • Llama 3.1 License Details: The LLM-generated BHC and atomic claim text decompositions are generated with the assistance of a derivative model of Llama 3.1. A copy of the Llama 3.1 Community License Agreement is included with this dataset. If any outputs derived from Llama models, such as the LLM-generated BHC and atomic claim text decompositions, are used to train or improve an AI model, that model's name must begin with "Llama".
  • Differences Between Human-written BHC and LLM-written BHC: The MIMIC-III dataset contains clinical notes written when a patient is admitted to the intensive care unit (ICU), as well as radiology and cardiology reports, but it does not contain inpatient notes from other hospital units or services, or notes from outpatient clinic visits. The LLM-writer can generate a BHC based only on information in MIMIC-III notes, whereas the original clinician authoring the human-written BHC had access to additional information sources such as notes from other hospital units, outpatient notes, and bedside patient care discussions. Therefore, the information content of human-written and LLM-written BHCs may differ substantially due to the nature of the dataset and the kind of information clinicians actually document in clinical notes.
  • Text Decomposition: Human clinician annotations occur at the level of propositions derived from long-form text. There may be many methods or techniques to decompose long-form text into propositions for evaluation. This dataset provides atomic claim and sentence proposition decompositions of each long-form BHC in order to obtain granular human clinician ground truth annotations, but alternative text decompositions will require new ground truth annotations. 
  • Generalizability: This dataset is derived from a single institution’s electronic health record, which may limit generalizability of the findings if other healthcare institutions’ documentation practices differ substantially. Additionally, the MIMIC-III dataset contains only a subset of the clinical notes typically found in a true EHR and is missing many note types. Thus, performance may differ somewhat in real-world EHR systems where notes from all encounters and clinical services are present.

Complementary Resources

The code repository for VeriFact is available at github.com/philipchung/verifact. It contains the Python code used to extract the human-written BHC, produce the LLM-written BHC, and extract the sentence and atomic claim propositions.


Ethics

The collection of patient information and creation of the research resource was reviewed by the Institutional Review Board at the Beth Israel Deaconess Medical Center, which granted a waiver of informed consent and approved the data-sharing for MIMIC-III Clinical Database. This dataset is a derivative of the MIMIC-III Clinical Database and is shared under the same conditions.


Acknowledgements

This investigation was funded by a Foundation for Anesthesia Education and Research (FAER) Mentored Research Training Grant, an Anesthesia Research Grant from the Stanford Department of Anesthesiology, Perioperative & Pain Medicine, and National Institutes of Health grant R35GM138353.


Conflicts of Interest

The authors have no conflicts of interest to disclose.


References

  1. Augenstein I, Baldwin T, Cha M, Chakraborty T, Ciampaglia GL, Corney D, et al. Factuality challenges in the era of large language models and opportunities for fact-checking. Nat Mach Intell [Internet]. 2024 [cited 2024 Sep 16];6:852–63. Available from: https://www.nature.com/articles/s42256-024-00881-z
  2. Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data [Internet]. 2016;3:160035. Available from: http://dx.doi.org/10.1038/sdata.2016.35
  3. hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 · Hugging Face [Internet]. [cited 2024 Dec 7]. Available from: https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
  4. Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, et al. The Llama 3 herd of models [Internet]. arXiv [cs.AI]. 2024 [cited 2024 Sep 18]. Available from: http://arxiv.org/abs/2407.21783
  5. Chen J, Xiao S, Zhang P, Luo K, Lian D, Liu Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. Findings of the Association for Computational Linguistics ACL 2024 [Internet]. 2024 [cited 2024 Sep 18]. p. 2318–35. Available from: https://aclanthology.org/2024.findings-acl.137.pdf
  6. BAAI/bge-m3 · Hugging Face [Internet]. [cited 2024 Dec 7]. Available from: https://huggingface.co/BAAI/bge-m3
  7. Bird S, Klein E, Loper E. Natural language processing with python [Internet]. Sebastopol, CA: O’Reilly Media; 2009 [cited 2024 Sep 18]. Available from: https://www.oreilly.com/library/view/natural-language-processing/9780596803346/
  8. Min S, Krishna K, Lyu X, Lewis M, Yih W-T, Koh P, et al. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics; 2023 [cited 2024 Dec 4]. p. 12076–100. Available from: https://aclanthology.org/2023.emnlp-main.741.pdf
  9. Kamoi R, Goyal T, Rodriguez J, Durrett G. WiCE: Real-world entailment for claims in Wikipedia. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics; 2023 [cited 2024 Dec 4]. p. 7561–83. Available from: https://aclanthology.org/2023.emnlp-main.470.pdf
  10. Chen T, Wang H, Chen S, Yu W, Ma K, Zhao X, et al. Dense X retrieval: What retrieval granularity should we use? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics; 2024 [cited 2024 Dec 4]. p. 15159–77. Available from: https://aclanthology.org/2024.emnlp-main.845.pdf
  11. Wei J, Yang C, Song X, Lu Y, Hu N, Tran D, et al. Long-form factuality in large language models [Internet]. arXiv [cs.CL]. 2024. Available from: http://arxiv.org/abs/2403.18802
  12. Zhang E, Zhu V, Saphra N, Kleiman A, Edelman BL, Tambe M, et al. Transcendence: Generative Models Can Outperform The Experts That Train Them [Internet]. arXiv [cs.LG]. 2024. Available from: http://arxiv.org/abs/2406.11741
  13. DeepSeek-AI, Guo D, Yang D, Zhang H, Song J, Zhang R, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning [Internet]. arXiv [cs.CL]. 2025. Available from: http://arxiv.org/abs/2501.12948
  14. Russell B. The philosophy of logical atomism [Internet]. 1st Edition. London, England: Routledge; 2009 [cited 2024 Dec 3]. Available from: https://api.taylorfrancis.com/content/books/mono/download?identifierName=doi&identifierValue=10.4324/9780203864777&type=googlepdf

Parent Projects
MIMIC-III-Ext-VeriFact-BHC: Labeled Propositions From Brief Hospital Course Summaries for Long-form Clinical Text Evaluation was derived from the MIMIC-III Clinical Database. Please cite it when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

