Database Restricted Access
MIMIC-IV-Ext-DiReCT
Bowen Wang , Jiuyang Chang , Yiming Qian
Published: Jan. 21, 2025. Version: 1.0.0
When using this resource, please cite:
Wang, B., Chang, J., & Qian, Y. (2025). MIMIC-IV-Ext-DiReCT (version 1.0.0). PhysioNet. https://doi.org/10.13026/yf96-kc87.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Large language models (LLMs) have recently demonstrated remarkable capabilities across a broad spectrum of tasks and applications, including the medical field. Models like GPT-4 excel in medical question answering but encounter challenges in interpretability when managing complex tasks in real clinical settings. To address this, we introduce the Diagnostic Reasoning Dataset for Clinical Notes (DiReCT), designed to evaluate the reasoning ability and interpretability of LLMs compared to human doctors. The dataset comprises 511 clinical notes (sourced from MIMIC-IV), each meticulously annotated by physicians, detailing the diagnostic reasoning process from initial observations to the final diagnosis.
Background
Recent advancements in large language models (LLMs) have opened new possibilities and challenges for a wide range of natural language processing (NLP) tasks [1]. In the medical domain, these models have shown remarkable prowess, particularly in medical question answering (QA). Leading-edge models, such as GPT-4, exhibit profound proficiency in understanding and generating text [2].
Despite these advancements, interpretability remains crucial, especially in medical NLP tasks [3]. Real clinical settings can be more complex than typical QA tasks [4]. A diagnosis requires comprehending and combining various kinds of information, such as health records, physical examinations, and laboratory tests, to reason out possible diseases step by step following established guidelines. To evaluate LLMs' support for diagnosis in a more realistic setting, we propose a Diagnostic Reasoning dataset for Clinical noTes (DiReCT). The task is to predict the diagnosis from a patient's clinical note, which is a collection of various medical records written in natural language. Clinical notes used in DiReCT are stored in the SOAP format [5], consisting of four sections: Subjective, Objective, Assessment, and Plan. A clinical note also includes a primary discharge diagnosis (PDD).
DiReCT's clinical notes are sourced from the MIMIC-IV dataset [6]. Each note contains clinical data for a patient. To construct DiReCT, we curated a subset of 511 notes whose PDDs fall within one of 25 disease categories across 5 medical domains (one disease category may have several PDDs) [7].
Methods
DiReCT comprises 511 clinical notes, each meticulously annotated by professional physicians. These physicians identify the specific texts or observations within each note that lead to a diagnosis, providing detailed explanations. Each annotation is organized into an entailment tree, progressing from observations to the final PDD. Additionally, the dataset includes a diagnostic knowledge graph based on established diagnostic guidelines, enhancing the consistency of annotations.
We first constructed a sub-graph for each of the 25 disease categories. A sub-graph records the diagnostic procedure for a disease and is saved as a JSON file. We constructed the sub-graphs following existing guidelines and physicians' experience. The knowledge graph is distributed with the dataset, and detailed instructions for constructing it are given in our arXiv paper [7].
In our task, a note is an excerpt of six kinds of clinical data from the subjective and objective sections: chief complaint, history of present illness, past medical history, family history, physical exam, and pertinent results. We excluded data such as the review of systems and social history because they are often missing from the original clinical notes and are less relevant to the diagnosis. All clinical notes in DiReCT are related to exactly one PDD (one leaf diagnosis defined in the knowledge graph), and there is no secondary discharge diagnosis. We manually removed any descriptions that disclose the PDD in the note. Before annotation, we manually extracted the six kinds of clinical data above from the raw text into a JSON file with key names "input1"-"input6", using our annotation software developed for Windows.
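For illustration, a minimal sketch of this pre-annotation note format follows; the key names "input1"-"input6" match the convention above, while all field contents are synthetic and not from MIMIC:

import json

# Synthetic example of the pre-annotation note format; the key names
# "input1"-"input6" follow the convention described above, while the
# field contents are invented for illustration.
note = {
    "input1": "Chest pain for 2 hours.",              # Chief Complaint
    "input2": "A 65-year-old man presents with ...",  # History of Present Illness
    "input3": "Hypertension; type 2 diabetes.",       # Past Medical History
    "input4": "Father with coronary artery disease.", # Family History
    "input5": "BP 150/90 mmHg, HR 98 bpm.",           # Physical Exam
    "input6": "Troponin elevated; ECG shows ...",     # Pertinent Results
}

with open("note_example.json", "w") as f:
    json.dump(note, f, indent=2)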
The annotation was carried out by 9 clinical physicians and subsequently verified for accuracy and completeness by three senior medical experts. To ensure data security, all annotations were produced on a non-networked Windows computer (data cannot be copied or transferred). Annotators were required to strictly follow the knowledge graph. They were asked to find each observation (disease finding) in the clinical note and provide the rationale for why it supports a disease during the diagnostic procedure; we define such an operation as a deduction (annotation details are given in our arXiv paper [7]). If a note did not provide sufficient observations for the PDD (which can happen when a certain test was omitted), the annotators were asked to add plausible observations to the note. This choice compromises the fidelity of our dataset to the original clinical notes, but we made it for the completeness of the dataset. The final annotation for a note is a JSON file produced by our software. The detailed annotation process and the annotation software are described in our arXiv paper [7].
Data Description
Annotated Data
We store the annotated JSON files in folders named after the disease categories and PDDs. Each JSON file records the annotated diagnostic procedure for one PDD. After unzipping the samples.rar file, the data is organized as follows (a sketch for walking this layout follows the tree):
-samples
- Disease Category 1
- PDD Category 1
- note_1.json
- note_2.json
- note_3.json
...
- PDD Category 2
...
- Disease Category 2
- Disease Category 3
...
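As a convenience, the following sketch walks the unpacked folder and counts notes per disease category; it assumes only that the layout matches the tree above:

import os
from collections import Counter

# Count annotated notes per disease category, assuming the unpacked
# "samples" folder follows the layout shown above.
counts = Counter()
for dirpath, _, filenames in os.walk("samples"):
    for name in filenames:
        if name.endswith(".json"):
            # The disease category is the first folder level under "samples".
            category = os.path.relpath(dirpath, "samples").split(os.sep)[0]
            counts[category] += 1

for category, n in sorted(counts.items()):
    print(f"{category}: {n} notes")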
By reading a JSON file, our annotation software can display the annotation results. We also provide Python code on GitHub [8] for processing an annotated JSON file and reconstructing it as a tree structure. The following code shows the extraction for a synthetic sample (not from MIMIC):
from utils.data_analysis import cal_a_json, deduction_assemble

# Parse one annotated note into its node dictionary, the original
# clinical note contents, and the ordered diagnostic chain.
root = "samples/Stroke/sample1.json"
record_node, input_content, chain = cal_a_json(root)
record_node: A dictionary of all nodes in our annotation, keyed by node index (a traversal sketch follows this list). Each node is itself a dictionary where
"content" records the content of the node.
"type" gives the node's annotation type, e.g., "Input" for observations, "Cause" for rationales, and "Intermedia" for diagnoses.
"connection" gives the child node's key (a node with no child is the leaf diagnosis node).
"upper" gives the parent node's key (a node with no parent is an observation node).
input_content: A dictionary holding the original clinical note, keyed from "input1" (Chief Complaint) to "input6" (Pertinent Results).
chain: A list recording the diagnostic procedure in order (from the suspected diagnosis to the one PDD).
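As one way to use these fields, here is a minimal traversal sketch; it assumes record_node from cal_a_json above, and defensively handles "connection" holding either a single key or a list of keys, since the exact form may vary:

# Print each entailment path from an observation node down to the leaf
# diagnosis node by following "connection" links.
def walk(record_node, key, depth=0):
    node = record_node[key]
    print("  " * depth + "[" + node["type"] + "] " + node["content"])
    children = node.get("connection")
    if not children:  # no child: this is the leaf diagnosis node
        return
    if not isinstance(children, list):
        children = [children]
    for child in children:
        walk(record_node, child, depth + 1)

# Start from every node without a parent, i.e. the observation nodes.
for key, node in record_node.items():
    if not node.get("upper"):
        walk(record_node, key)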
We also provide deduction_assemble(), which organizes all nodes and returns all deductions as {o: [z, r, d], ...} (a usage sketch follows the field list below):
GT = deduction_assemble(record_node)
o: the observation extracted from the raw text.
d: the name of the diagnosis.
z: the rationale explaining why observation o can be related to diagnosis d.
r: the part of the clinical note (one of input1-input6) from which o was extracted.
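Putting these together, a short sketch that prints each annotated deduction in readable form; it assumes GT from deduction_assemble above and the {o: [z, r, d]} convention:

# Print every deduction: observation, where it came from, rationale, diagnosis.
for o, (z, r, d) in GT.items():
    print(f"Observation ({r}): {o}")
    print(f"  Rationale: {z}")
    print(f"  Diagnosis: {d}")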
Knowledge Graph
After unzipping the diagnostic_kg.rar file, the data is organized as follows:
-diagnostic_kg
- Disease Category 1.json
- Disease Category 2.json
- Disease Category 3.json
...
The knowledge graph for each disease category is saved as a JSON file in the "diagnostic_kg" folder. The "diagnostic" key represents the diagnostic procedure (as a tree structure), from a suspected diagnosis of a disease to the final diagnosis. The "knowledge" key records the premises for each diagnosis; note that premises are separated by ";".
A sub-graph sample for Heart Failure is shown below, followed by a short loading sketch:
{"diagnostic":
{"Suspected Heart Failure":
{"Strongly Suspected Heart Failure":
{"Heart Failure":
{"HFrEF": [],
"HFmrEF": [],
"HFpEF": []}}}},
"knowledge":
{"Suspected Heart Failure":
{"Risk Factors": "CAD; Hypertension; Valve disease; Arrhythmias; CMPs; Congenital heart disease, Infective; Drug-induced; Infiltrative; Storage disorders; Endomyocardial disease; Pericardial disease; Metabolic; Neuromuscular disease; etc.",
"Symptoms": "Breathlessness; Orthopnoea; Paroxysmal nocturnal dyspnoea; Reduced exercise tolerance; Fatigue; tiredness; increased time to recover after exercise; Ankle swelling; Nocturnal cough; Wheezing; Bloated feeling; Loss of appetite; Confusion (especially in the elderly); Depression; Palpitation; Dizziness; Syncope.; etc.",
"Signs": "Elevated jugular venous pressure; Hepatojugular reflux; Third heart sound (gallop rhythm); Laterally displaced apical impulse; Weight gain (>2 kg/week); Weight loss (in advanced HF); Tissue wasting (cachexia); Cardiac murmur; Peripheral edema (ankle, sacral, scrotal); Pulmonary crepitations; Pleural effusion; Tachycardia; Irregular pulse; Tachypnoea; Cheyne-Stokes respiration; Hepatomegaly; Ascites; Cold extremities; Oliguria; Narrow pulse pressure."},
"Strongly Suspected Heart Failure": "NT-proBNP > 125 pg/mLor BNP > 35 pg/mL\n",
"Heart Failure": "Abnormal findings from echocardiography:LV mass index ≥ 95 g/m2 (Female), ≥ 115 g/m2 (Male); Relative wall thickness >0.42, LA volume index>34 mL/m2, E/e' ratio at rest >9, PA systolic pressure >35 mmHg; TR velocity at rest >2.8 m/s, etc.",
"HFrEF": "LVEF<40%",
"HFmrEF": "LVEF41~49%",
"HFpEF": "LVEF>50%"}}
Usage Notes
We introduced DiReCT as the pioneering benchmark for assessing the diagnostic reasoning capabilities of LLMs with interpretability, by incorporating external knowledge in the form of a graph. Our evaluations [7] highlight a significant gap between the performance of current advanced LLMs and human experts, emphasizing the urgent need for AI models capable of reliable and interpretable reasoning in clinical settings.
DiReCT includes only a subset of disease categories and considers a single PDD, excluding inter-diagnostic relationships due to their inherent complexity—a challenge even for human doctors. Our dataset is designed solely for model evaluation and not for clinical use.
Ethics
Utilizing real-world EHRs, even in de-identified form, poses inherent risks to patient privacy. Therefore, it is essential to implement rigorous data protection and privacy measures to safeguard sensitive information, in accordance with regulations such as HIPAA. We strictly adhere to the Data Use Agreement of the MIMIC dataset, ensuring that the data is not shared with any third parties. All experiments were implemented on a private server, and GPT was accessed through a private deployment.
AI models are susceptible to replicating and even intensifying the biases inherent in their training data. These biases, if not addressed, can have profound implications, particularly in sensitive domains such as healthcare. Unconscious biases in healthcare systems can result in significant disparities in the quality of care and health outcomes among different demographic groups. Therefore, it is imperative to rigorously examine AI models for potential biases and implement robust mechanisms for ongoing monitoring and evaluation. This involves analyzing the model's performance across various demographic groups, identifying any disparities, and making necessary adjustments to ensure equitable treatment for all. Continual vigilance and proactive measures are essential to mitigate the risk of biased decision-making and to uphold the principles of fairness and justice in AI-driven healthcare solutions.
Acknowledgements
This work was supported by World Premier International Research Center Initiative (WPI), MEXT, Japan. This work is also supported by JSPS KAKENHI 24K20795, JST ACT-X Grant Number JPMJAX24C8, and Dalian Haichuang Project for Advanced Talents.
Conflicts of Interest
The author(s) have no conflicts of interest to declare.
References
- Min B, Ross H, Sulem E, Ben Veyseh AP, Nguyen TH, Sainz O, et al. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput Surv. 2023;56(2):1–40.
- Han T, Adams LC, Papaioannou JM, Grundmann P, Oberhauser T, Löser A, et al. Medalpaca—an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247. 2023.
- Liévin V, Hother CE, Motzfeldt AG, Winther O. Can large language models reason about medical questions? Patterns. 2024;5(3).
- Gao Y, Li R, Caskey J, Dligach D, Miller T, Churpek MM, et al. Leveraging a medical knowledge graph into large language models for diagnosis prediction. arXiv preprint arXiv:2308.14321. 2023.
- Weed LL. Medical records, medical education, and patient care: the problem-oriented record as a basic tool. Cleveland (OH): Press of Case Western Reserve University; 1970. ISBN: 9780815191889.
- Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.
- Wang B, Chang J, Qian Y, Chen G, Chen J, Jiang Z, et al. DiReCT: Diagnostic reasoning for clinical notes via large language models. arXiv preprint arXiv:2408.01933. 2024.
- DiReCT: Diagnostic reasoning for clinical notes via large language models [Internet]. Available from: https://github.com/wbw520/DiReCT
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/yf96-kc87
DOI (latest version):
https://doi.org/10.13026/2kfz-4798
Project Website:
https://sites.google.com/view/wbw520-direct