
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

Seongsu Bae Daeun Kyung Jaehee Ryu Eunbyeol Cho Gyubok Lee Sunjun Kweon Jungwoo Oh Lei JI Eric Chang Tackeun Kim Edward Choi

Published: July 23, 2024. Version: 1.0.0


When using this resource, please cite:
Bae, S., Kyung, D., Ryu, J., Cho, E., Lee, G., Kweon, S., Oh, J., JI, L., Chang, E., Kim, T., & Choi, E. (2024). EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images (version 1.0.0). PhysioNet. https://doi.org/10.13026/bp5y-qt70.

Additionally, please cite the original publication:

Bae, S., Kyung, D., Ryu, J., Cho, E., Lee, G., Kweon, S., ... & Choi, E. (2024). EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems, 36.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Electronic Health Records (EHRs) contain patients' medical histories in various multi-modal formats, yet current EHR Question Answering (QA) systems largely overlook the potential for joint reasoning across the imaging and table modalities. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop our dataset, we first construct two uni-modal resources: 1) the MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we construct a multi-modal EHR QA dataset that necessitates both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. This pioneering endeavor enhances engagement with multi-modal EHR sources, and we believe that our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research.


Background

Electronic Health Records (EHRs) are large-scale databases that store the entire medical history of patients, including but not limited to structured medical records, medical images, and clinical text. This wealth of patient information offers tremendous clinical knowledge about individual patients and cohorts, making EHRs an indispensable resource for healthcare professionals in routine clinical practice.

Recent years have seen an upsurge in research [1-12] into question answering (QA) systems for EHRs. These systems are designed to retrieve information from EHRs effectively, each specializing in a different modality within the records. For instance, table-based EHR QA systems [2-6] can retrieve specific information from structured databases and answer questions like “Did patient 42 undergo a left heart cardiac catheterization procedure in the last hospital visit?” by executing an SQL query on the relational database. On the other hand, image-based EHR QA (i.e., medical visual question answering) models [8-11] are designed to handle questions about individual medical images. For instance, given a question such as “List all detected anatomical findings.” along with a patient’s chest radiograph, these models generate a response, thereby serving as an effective aid for radiologists. Despite these benefits, a key limitation of current EHR QA systems is their focus on a single information modality, overlooking EHRs’ inherently multi-modal nature. To fully exploit EHRs’ potential, it is crucial to develop QA systems capable of seamlessly navigating across these multiple modalities and answering questions such as “Did patient 42 undergo the left heart cardiac catheterization procedure during the last hospital visit, after the chest X-ray revealed any abnormality in the cardiac silhouette within the same period?”. This capability significantly enhances our ability to build a comprehensive model of a patient’s status, thereby improving the quality of the clinical decision-making process.

However, currently, only one multi-modal EHR QA dataset [12] integrates structured EHRs with clinical text, while the integration of table modalities with imaging modalities, such as chest X-rays (CXR), remains unexplored [13]. Our research seeks to bridge this gap. We introduce EHRXQA, the first multi-modal EHR QA dataset covering both table and image modalities. By leveraging uni-modal resources (i.e., data sources and question templates), we integrate patients’ structured databases with their aligned chest X-ray images, thereby creating a comprehensive set of QA pairs covering Image-related, Table-related, and Image+Table-related questions. This has the potential to unlock significant clinical benefits, enhance cross-modal analysis, and catalyze advances in medical research.


Methods

First, we integrate CXR images from MIMIC-CXR [14] and tables from MIMIC-IV [15] into our EHRXQA database. Next, we describe the creation of question templates and the corresponding SQL/NeuralSQL annotations. Finally, we detail the systematic data generation process employed to build the EHRXQA dataset.

Source Database

To construct a comprehensive EHR database that integrates both table and image modalities, we need uni-modal resources that meet our criteria: (i) public accessibility; (ii) presence of common patients across datasets; and (iii) availability of high-quality image annotations. After careful consideration, we strategically select three datasets: MIMIC-IV [15] for the table modality, MIMIC-CXR [14] for the image modality, and Chest ImaGenome [16] as a high-quality annotated version of MIMIC-CXR. Note that the three datasets share a significant number of patient IDs (19,264); the remaining patient IDs are incompatible due to the datasets' differing data collection periods.

Database Construction

CXR Integration into MIMIC-IV: To cross-reference CXR images with structured EHRs (e.g., to find the CXR images of patients who have been prescribed a specific drug), an integrated database system is crucial. To achieve this, we developed an image reference table named TB_CXR. This table comprises six columns: subject_id, hadm_id, study_id, image_id, studydatetime, and viewposition, linking patient-related identifiers to the CXR images in MIMIC-CXR. Through this table, a patient's CXR images can be retrieved alongside other table data (e.g., diagnoses, procedures, and prescriptions) from MIMIC-IV using the subject_id or hadm_id, as sketched below.
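As a minimal, illustrative sketch of such cross-referencing (assuming the CSV files have been loaded into a SQLite database at a hypothetical path mimic_iv_cxr.db, and that the prescriptions table follows the usual MIMIC-IV column names such as drug; these details may differ in the released schema):

import sqlite3

# Hypothetical path; assumes the silver CSVs were loaded into a SQLite database
# whose schema follows database/silver/mimic_iv_cxr.sql.
conn = sqlite3.connect("mimic_iv_cxr.db")

# Find CXR studies of patients who were prescribed hydralazine by joining the
# image reference table TB_CXR with the prescriptions table on subject_id.
query = """
SELECT DISTINCT t.subject_id, t.study_id, t.image_id, t.studydatetime
FROM tb_cxr AS t
JOIN prescriptions AS p
  ON t.subject_id = p.subject_id
WHERE LOWER(p.drug) LIKE '%hydralazine%'
"""
for row in conn.execute(query):
    print(row)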

Timeframe Adjustment: We condensed the event times in each patient’s records, which originally spanned from 2100 to 2200 due to the de-identification process in MIMIC-IV, to a more realistic timeframe (2100-2105). This adjustment preserves the integrity of CXR images and individual medical event timelines. To enable relative time expressions like ‘last year’, we set ‘2105-12-31 23:59:00’ as the current time and excluded any records beyond this point. Patients whose hospital discharge times were removed by this exclusion are treated as currently admitted.

Building Silver/Gold Databases: The Chest ImaGenome dataset includes two types of cohorts based on image information: silver (i.e., machine-generated) and gold (i.e., human-labeled). We selected subsets of patients from each cohort to create two distinct databases: the silver database, comprising 800 patients, and the gold database, comprising 400 patients. These databases are utilized for different purposes: the silver database is used for training and validating the QA dataset, while the gold database is used for testing the QA dataset.

Question Template Construction

We define the scope of our question templates using two key criteria: modality-based and patient-based scopes. The modality-based scope classifies templates into three categories, Image-related, Table-related, and Image+Table-related, depending on the type of data modality they require. The patient-based scope classifies templates according to whether they relate to a single patient, a group of patients, or none (i.e., do not relate to specific patients). To accommodate these scopes with diverse and comprehensive question templates, we employ existing uni-modal question resources, MIMIC-CXR-VQA [17] for image modality and EHRSQL [6] for table modality. Examples of our modality- and patient-based question templates, which illustrate the diversity and complexity of EHRXQA dataset, can be found in Table 1.

Table 1: Sample questions in EHRXQA, categorized by modality-based (Image, Table, Image+Table) and patient-based scope (none, single, group), illustrating our dataset’s diversity and complexity.

Modality-based  Patient-based      Sample Question
Image           single (1 image)   Given the last study of patient 15439, which anatomical finding is associated with the right lower lung zone, pneumothorax or vascular redistribution?
Image           single (2 images)  Enumerate all diseases that are newly detected based on the last study of patient 19290 in 2103 compared to the previous study.
Image           single (N images)  How many times has the chest X-ray of patient 18489 shown linear/patchy atelectasis in the left lung on the current hospital visit?
Image           group              Count the number of patients whose chest X-ray studies this year showed any abnormalities in the mediastinum.
Table           none               What’s the cost of a drug named lopinavir-ritonavir?
Table           single             Did patient 16164 receive any magnesium lab tests last year?
Table           group              What was the top three diagnosis that had the highest two year mortality rate?
Image+Table     single             Did a chest X-ray study for patient 15110 reveal any anatomical findings within 2 month after the prescription of hydralazine since 2102?
Image+Table     group              Provide the ids of patients in the 20s whose chest X-ray showed low lung volumes in the right lung this month.

Recognizing the critical role of time expressions in real-world questions in the hospital workplace [6], we further refined our question templates. We adopted the time filter concept from EHRSQL and applied it to all question templates, allowing them to better meet the specific needs of clinical practice. These time filters fall into three types: 1) [time_filter_global] restricts the time range of interest, such as ‘last year’ or ‘in 2022’; 2) [time_filter_within], incorporating the keyword ‘within’, pinpoints events happening within specific temporal boundaries, such as ‘within the same hospital visit’ or ‘within the same day’; 3) [time_filter_exact] refers to a precise temporal point, such as the ‘last CXR study’ or a specific date and time like ‘2105-12-26 15:00:00’.
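To make the three filter types concrete, here is a minimal, hypothetical sketch (not the dataset's actual annotation code) of how each type could be rendered as a SQLite predicate on tb_cxr.studydatetime, given the fixed current time of '2105-12-31 23:59:00' described above; the helper names are illustrative only:

# Hypothetical mapping from time-filter types to SQLite predicates on a
# datetime column; the actual EHRXQA annotations may differ in detail.
CURRENT_TIME = "2105-12-31 23:59:00"

def time_filter_global(col: str, year: str) -> str:
    # e.g., 'in 2105' -> restrict the event to a given year
    return f"strftime('%Y', {col}) = '{year}'"

def time_filter_within(col_a: str, col_b: str) -> str:
    # e.g., 'within the same day' -> both events share a calendar date
    return f"date({col_a}) = date({col_b})"

def time_filter_exact(col: str) -> str:
    # e.g., 'the last CXR study' -> keep only the most recent study
    return f"{col} = (SELECT MAX({col}) FROM tb_cxr WHERE {col} <= '{CURRENT_TIME}')"

print(time_filter_global("tb_cxr.studydatetime", "2105"))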

Our template construction process involved 1) identifying clinical needs across both image and table modalities by consulting a medical expert, 2) grounding our templates in these needs for both CXR images and EHR tables, and 3) ensuring clinical relevance. Note that the entire process of designing templates was validated by a board-certified medical expert from the department of neurosurgery to ensure clinical utility. The following details how we tailored question templates for each modality.

  • Image-related: Questions related to the image modality are inquiries requiring pixel-level information from CXR images retrieved from the EHR, which can aid in analyzing visual diagnoses for individual or cohort patient conditions in real-world medical scenarios. To cater to these queries, we used the 48 MIMIC-CXR-VQA templates (e.g., “List all diseases.”) and integrated them with expressions that specify the target images (e.g., “the last study of patient 42”). This integration (e.g., “Given the last study of patient 42, list all diseases.”) enables retrieval of CXR images from the EHR and subsequent analysis based on natural language requests. We further extended the single-patient templates to include queries that compare two consecutive CXR studies (e.g., “Given the last study of patient 42, are there any newly detected diseases compared to the previous study?”) or multiple studies (e.g., “Has patient 42 had any chest X-ray study indicating any anatomical findings in 2023?”) from the same patient. This process resulted in 168 templates for the image modality.
  • Table-related: The table modality, a significant part of EHRs, covers questions primarily requiring structured information from EHR tables. These questions relate to patient demographics, diagnoses, procedures, medications, and other clinical details typically recorded in structured EHR formats. EHRSQL, which offers a wealth of questions seeking information from EHR tables, proves to be an invaluable resource in this context. Considering the substantial overlap between the MIMIC-III and MIMIC-IV schemas, we leveraged EHRSQL’s MIMIC-III question templates, adapting them to the MIMIC-IV schema with minimal modifications. This process resulted in 174 templates for the table modality.
  • Image+Table-related: In the image+table modality, all templates are designed to require multi-modal information from both CXR images and structured data from EHRs. We leveraged both MIMIC-CXR-VQA and EHRSQL templates to build multi-modal question templates. Recognizing the essential role of temporal analysis in multi-modal medical events, we designed templates to capture three primary scenarios: 1) co-occurring table and CXR events (e.g., “On the same visit, did patient 42 receive nitroglycerin and have a CXR showing any abnormality in the cardiac silhouette?”); 2) a CXR event following a table event (e.g., “After being prescribed nitroglycerin, did patient 42 have a CXR during the same visit revealing any abnormality in the cardiac silhouette?”); and 3) a table event following a CXR event (e.g., “Was patient 42 prescribed nitroglycerin during the same visit after a CXR showed cardiac silhouette abnormalities?”). These templates allow for comprehensive analysis of combined events, cause-and-effect relationships in CXR diagnosis, and relevant follow-up measures related to the CXR diagnosis. To eliminate confusion arising from overlapping information between the CXR and diagnoses/procedures tables, we ensure that questions explicitly specify when a ‘CXR study’ is necessary. This led to 75 templates for the image+table modality, enabling simulations across diverse scenarios.

SQL/NeuralSQL Annotation

Standard SQL queries are effective for retrieving structured data from EHRs [2,6], such as demographic information or lab results stored in tables. However, they are not designed to handle unstructured data, such as CXR images, which also contain valuable patient information. This limitation prevents us from using SQL alone to retrieve answers for complex, multi-modal questions that span both structured and unstructured data. To overcome this limitation, we adopt NeuralSQL, which is inspired by the Binder approach [18]. NeuralSQL acts as an executable representation that extends SQL’s capabilities to unstructured image data, utilizing a pretrained neural model to extract features from medical images and turn them into a structured format suitable for SQL queries.
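Concretely, NeuralSQL embeds an image-level function call, func_vqa(question, study_id), inside an otherwise standard SQL query (see the annotated example instance in the Data Description section). The following is a minimal, hypothetical sketch of how such a query could be executed with SQLite by registering a user-defined function that forwards each referenced study to an external VQA model; it is not the authors' actual execution engine:

import sqlite3

def vqa_model(question, study_id):
    # Placeholder for the external VQA API: in practice this would load the
    # CXR image(s) referenced by study_id from MIMIC-CXR-JPG and run a
    # pretrained medical VQA model, returning a boolean or a list of labels.
    raise NotImplementedError

conn = sqlite3.connect("mimic_iv_cxr.db")  # hypothetical database path
conn.create_function("func_vqa", 2, vqa_model)

# A NeuralSQL-style query: the relational part filters one patient's studies,
# while the image-level check is delegated to func_vqa.
neural_sql = """
SELECT t.study_id
FROM tb_cxr AS t
WHERE t.subject_id = 18679317
  AND func_vqa('is the chest x-ray depicting any anatomical findings?', t.study_id) = 1
"""
# rows = conn.execute(neural_sql).fetchall()  # would invoke the VQA model per study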

For Table-related question templates, we utilize the SQL annotations provided by EHRSQL and modify them to be compatible with the MIMIC-IV schema. For question templates related to Image or Image+Table, we annotate them using NeuralSQL representation. The entire SQL/NeuralSQL annotation process was manually undertaken by four graduate students over a span of two months, involving iterative revisions. During this process, the students transformed question templates into their corresponding SQL or NeuralSQL formats.

Data Generation

The question generation process begins with choosing a template at Stage 0, followed by a four-step systematic process (Stages 1-4) that specifies the semantics of the template. These steps involve sampling visual values (Stage 1), operation values (Stage 2), time templates (Stage 3), and condition values (Stage 4). In Stage 1, we augment the question with visual values by filling in object, attribute, and category slots, tailored specifically for CXR images. In Stage 2, we sample operation values (e.g., 20s) from a predefined set of options such as [age_group] = (20s, 30s, 40s, 50s, 60 or above), which are independent of the database schema or records. Stage 3 incorporates time templates, translated into natural language expressions to establish a temporal context within the questions. Lastly, Stage 4 performs condition value sampling, filling placeholders such as {gender} and {year} to provide context-specific conditions for the question.

The corresponding SQL/NeuralSQL query also contains these slots, filled with the same values during the question creation process, thereby completing the (Question, SQL/NeuralSQL) pair. These (Question, SQL/NeuralSQL) pairs are only added to the data pool if the sampled SQL/NeuralSQL query yields a valid answer when executed. To enhance linguistic diversity, we use GPT-4 [19,20] to paraphrase each question. These paraphrases are then manually reviewed by our team to ensure quality.
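As an illustrative sketch of this slot-filling (not the authors' actual generation code), the following fills the q_tag of the example instance shown in the Data Description section with sampled values, reproducing that instance's template field; the helper function is hypothetical:

# Hypothetical slot filler: replaces {placeholder}, ${placeholder}, and
# [placeholder] slots in a template with sampled values.
def fill_template(template: str, values: dict) -> str:
    out = template
    for key, val in values.items():
        for pattern in (f"{{{key}}}", f"${{{key}}}", f"[{key}]"):
            out = out.replace(pattern, str(val))
    return out

q_tag = ("how many [unit_count] have passed since the [time_filter_exact1] time "
         "patient {patient_id} had a chest x-ray study indicating any ${category} "
         "[time_filter_global1]?")

sampled = {
    "unit_count": "days",             # Stage 2: operation value
    "time_filter_exact1": "last",     # Stage 3: time template (natural-language form)
    "time_filter_global1": "in 2105", # Stage 3: time template (natural-language form)
    "category": "anatomicalfinding",  # Stage 1: visual value
    "patient_id": 18679317,           # Stage 4: condition value
}

print(fill_template(q_tag, sampled))
# -> how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?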


Data Description

The EHRXQA dataset consists of a total of 46,152 samples, utilizing 417 templates. These templates are organized into three categories based on the type of data modality they require: Image-related, Table-related, and Image+Table-related. Each category is further divided based on the patient scope, according to whether they pertain to a single patient, a group of patients, or are not related to specific patients at all. The dataset includes 16,366 image-related samples, 16,529 table-related samples, and 13,257 samples that involve both images and tables. Detailed descriptions of each category can be found in the 'Question Template Construction' section.

Data statistics

Table 2: Overall statistics of EHRXQA including the number of samples for each modality.

                          train    valid   test
Image-related QA          12,860   1,838   1,668
Table-related QA          12,961   1,852   1,716
Image+Table-related QA    10,353   1,480   1,424
Total # of samples        36,174   5,170   4,808

Table 3: Detailed statistics of EHRXQA

Modality-based  Patient-based       train           valid           test
Image           single (1 image)    6,615 (18.3%)   945 (18.3%)     840 (17.5%)
Image           single (2 images)   3,410 (9.4%)    488 (9.4%)      468 (9.7%)
Image           single (N images)   1,890 (5.2%)    279 (5.2%)      240 (5.0%)
Image           group               945 (2.6%)      135 (2.6%)      120 (2.5%)
Table           none                396 (1.1%)      54 (1.0%)       50 (1.0%)
Table           single              8,219 (22.7%)   1,151 (22.3%)   1,080 (22.5%)
Table           group               4,346 (12.0%)   647 (12.5%)     586 (12.2%)
Image+Table     single              9,517 (26.3%)   1,362 (26.3%)   1,210 (25.2%)
Image+Table     group               836 (2.3%)      118 (2.3%)      214 (4.5%)

 

Files and Structure

We provide two databases derived from the Chest ImaGenome dataset: silver (machine-generated) and gold (human-labeled), comprising 800 and 400 patients respectively. Each database is composed of multiple CSV files representing different tables. These databases serve distinct roles: the silver database is used for training and validation, and the gold database for testing the QA dataset. Our final dataset, constructed from these databases, is divided into training, validation, and test QA sets, each provided in JSON format (train.json, valid.json, test.json).

Directory Structure:

ehrxqa
├── database
│   ├── silver
│   │   ├── admissions.csv
│   │   ├── chartevents.csv
│   │   ├── cost.csv
│   │   ├── d_icd_diagnoses.csv
│   │   ├── d_icd_procedures.csv
│   │   ├── d_items.csv
│   │   ├── d_labitems.csv
│   │   ├── diagnoses_icd.csv
│   │   ├── icustays.csv
│   │   ├── inputevents.csv
│   │   ├── labevents.csv
│   │   ├── microbiologyevents.csv
│   │   ├── mimic_iv_cxr.sql  # SQL schema file
│   │   ├── outputevents.csv
│   │   ├── patients.csv
│   │   ├── prescriptions.csv
│   │   ├── procedures_icd.csv
│   │   ├── tb_cxr.csv
│   │   └── transfers.csv
│   └── gold
│       ├── admissions.csv
│       ├── chartevents.csv
│       ├── cost.csv
│       ├── d_icd_diagnoses.csv
│       ├── d_icd_procedures.csv
│       ├── d_items.csv
│       ├── d_labitems.csv
│       ├── diagnoses_icd.csv
│       ├── icustays.csv
│       ├── inputevents.csv
│       ├── labevents.csv
│       ├── microbiologyevents.csv
│       ├── mimic_iv_cxr.sql  # SQL schema file
│       ├── outputevents.csv
│       ├── patients.csv
│       ├── prescriptions.csv
│       ├── procedures_icd.csv
│       ├── tb_cxr.csv
│       └── transfers.csv
└── dataset
    ├── train.json
    ├── valid.json
    └── test.json

File Format and Contents:

Database: The schema definitions for the EHRXQA dataset are stored in SQL files (mimic_iv_cxr.sql) within the silver and gold directories. These files contain SQL commands to create and structure the database tables but do not store patient records. The electronic health records (EHRs) are stored as individual CSV files in the respective silver and gold directories. Each directory contains 18 CSV files, corresponding to 18 tables: 17 from MIMIC-IV (admissions, chartevents, cost, d_icd_diagnoses, d_icd_procedures, d_items, d_labitems, diagnoses_icd, icustays, inputevents, labevents, microbiologyevents, outputevents, patients, prescriptions, procedures_icd, transfers) and one from MIMIC-CXR (tb_cxr). The tb_cxr table serves as a bridge between the patient records and the corresponding chest X-ray images stored in the MIMIC-CXR-JPG image database.
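As a minimal sketch of one way to assemble a queryable database from these files (assuming pandas and SQLite, that the CSV headers match the schema's column names, and an illustrative output path; the official repository [23] may provide its own loader):

import glob
import os
import sqlite3

import pandas as pd

SILVER_DIR = "ehrxqa/database/silver"      # path as in the directory structure above
conn = sqlite3.connect("mimic_iv_cxr.db")  # hypothetical output path

# Create the tables from the provided schema file.
with open(os.path.join(SILVER_DIR, "mimic_iv_cxr.sql")) as f:
    conn.executescript(f.read())

# Load each CSV into the table of the same name.
for csv_path in glob.glob(os.path.join(SILVER_DIR, "*.csv")):
    table = os.path.splitext(os.path.basename(csv_path))[0]
    pd.read_csv(csv_path).to_sql(table, conn, if_exists="append", index=False)

conn.commit()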

Dataset: The QA samples in the EHRXQA dataset are stored in individual JSON files (train.json, valid.json, test.json). Each file contains a list of QA samples, each represented as a dictionary (a JSON object) with the following keys:

  • db_id: A string representing the corresponding database ID. (i.e., “mimic_iv_cxr”)
  • split: A string indicating which subset of the dataset (train, valid, or test) the sample belongs to.
  • id: A unique identifier for each instance in the dataset.
  • question: A paraphrased version of the question.
  • template: The final question template created by injecting real database values into the tag. This template represents the fully specified and contextualized form of the question.
  • query: The corresponding NeuralSQL/SQL query for the question.
  • value: Specific key-value pairs relevant to the question, sampled from the database.
  • q_tag: The initial sampled question template. This serves as the foundational structure for the question.
  • t_tag: Sampled time templates used to provide temporal context and specificity to the question.
  • o_tag: Sampled operational values for the query, often encompassing numerical or computational aspects required for forming the question.
  • v_tag: Sampled visual values, which include elements like objects, categories, attributes, and comparisons, adding further details to the question.
  • tag: A comprehensive tag that synthesizes the enhanced q_tag with additional elements (t_tag, o_tag, v_tag). This represents an intermediate, more specified version of the question template before the final template is formed.
  • para_type: The source of the paraphrase, either from a general machine-generated approach or specifically from GPT-4.
  • is_impossible: A boolean indicating whether the question is answerable based on the dataset.
  • answer: A list of strings containing the answers. If a question yields no answer, the list is empty. (Note: in the example below, this field is shown as [...] to protect patient privacy.)

For concreteness, here is an example instance:

{
    "db_id": "mimic_iv_cxr",
    "split": "train",
    "id": 0,
    "question": "how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?",
    "template": "how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?",
    "query": "select 1 * ( strftime('%j',current_time) - strftime('%j',t1.studydatetime) ) from ( select tb_cxr.study_id, tb_cxr.studydatetime from tb_cxr where tb_cxr.study_id in ( select distinct tb_cxr.study_id from tb_cxr where tb_cxr.subject_id = 18679317 and strftime('%y',tb_cxr.studydatetime) = '2105' ) ) as t1 where func_vqa(\"is the chest x-ray depicting any anatomical findings?\", t1.study_id) = true",
    "value": {"patient_id": 18679317},
    "q_tag": "how many [unit_count] have passed since the [time_filter_exact1] time patient {patient_id} had a chest x-ray study indicating any ${category} [time_filter_global1]?",
    "t_tag": ["abs-year-in", "", "", "exact-last", ""],
    "o_tag": {"unit_count": {"nlq": "days", "sql": "1 * ", "type": "days", "sql_pattern": "[unit_count]"}},
    "v_tag": {"object": [], "category": ["anatomicalfinding"], "attribute": []},
    "tag": "how many [unit_count:days] have passed since the [time_filter_exact1:exact-last] time patient {patient_id} had a chest x-ray study indicating any anatomicalfinding [time_filter_global1:abs-year-in]?",
    "para_type": "machine",
    "is_impossible": False,
    "answer": [...]
}
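As a brief usage sketch (assuming the json standard library and a SQLite database built from the CSVs as sketched above; paths are illustrative):

import json
import sqlite3

with open("ehrxqa/dataset/train.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["question"])
print(sample["query"])

conn = sqlite3.connect("mimic_iv_cxr.db")  # hypothetical path, see above
if "func_vqa" not in sample["query"]:
    # Plain SQL (Table-related) queries can be executed as-is.
    print(conn.execute(sample["query"]).fetchall())
else:
    # NeuralSQL queries additionally need the func_vqa user-defined function
    # backed by a VQA model, as sketched in the Methods section.
    pass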

Usage Notes

Dataset Utility

The EHRXQA dataset stands out as a pioneering multi-modal EHR QA dataset that incorporates both tabular and imaging data. It is designed to address the complexities and demands of various clinical scenarios in healthcare, thereby supporting a wide range of uses in healthcare and artificial intelligence, such as:

  • Multi-Modal EHR QA Benchmark [17,22]: The EHRXQA dataset establishes a comprehensive benchmark for developing and testing multi-modal EHR QA methods. It plays a crucial role in creating an AI framework dedicated to analyzing multi-modal electronic health records. By providing a robust benchmark, it significantly enhances the ability to analyze data across different modalities, thereby driving substantial progress in medical research.
  • Instruction Dataset for Medical Vision-and-Language Models (VLMs) [21,22]: The EHRXQA dataset serves as an invaluable resource for instruction tuning of VLMs in the healthcare domain. It aids in the development of advanced AI systems that are proficient in interpreting medical imaging and utilizing EHR data more effectively. These systems facilitate the automation of data analysis within EHR databases, marking a critical step toward integrating AI into clinical decision-making processes, thus improving efficiency and accuracy in medical settings.
  • Facilitating Multi-Modal EHR Research: The EHRXQA dataset enables researchers to explore and develop novel techniques for multi-modal EHR analysis. By providing a diverse and comprehensive dataset, it encourages the development of innovative approaches to EHR data processing, such as cross-modal information retrieval, multi-modal representation learning, and multi-modal fusion. This, in turn, can lead to improved patient care and clinical outcomes.

Known Limitations

Though we have carefully designed the dataset, several limitations exist: 1) Since our dataset is built solely on the MIMIC database, its generalizability to other institutions may be limited. 2) Due to the constrained label scope of Chest ImaGenome, our dataset cannot address more detailed visual questions, such as identifying specific tumor sizes from chest X-rays. 3) Unlike EHRSQL, our dataset does not include unanswerable questions, an aspect that, if addressed, could enhance its comprehensiveness and applicability.

GitHub Repository for this Project

The dataset's creation code is accessible on the GitHub repository [23].


Release Notes

This is version 1.0.0 of the EHRXQA dataset. For any questions or concerns regarding this dataset, please feel free to reach out to us (seongsu@kaist.ac.kr or kyungdaeun@kaist.ac.kr). We appreciate your interest and are eager to assist.


Ethics

The authors have no ethical concerns to declare.


Acknowledgements

This work was (partially) supported by Microsoft Research Asia, Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No.2019-0-00075, RS-2022-00155958), National Research Foundation of Korea (NRF) grant (NRF-2020H1D3A2A03100945), and the Korea Health Industry Development Institute (KHIDI) grant (No.HR21C0198), funded by the Korea government (MSIT, MOHW).


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018). emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.
  2. Wang, P., Shi, T., & Reddy, C. K. (2020, April). Text-to-SQL generation for question answering on electronic medical records. In Proceedings of The Web Conference 2020 (pp. 350-361).
  3. Bae, S., Kim, D., Kim, J., & Choi, E. (2021, November). Question answering for complex electronic health records database using unified encoder-decoder architecture. In Machine Learning for Health (pp. 13-25). PMLR.
  4. Raghavan, P., Liang, J. J., Mahajan, D., Chandra, R., & Szolovits, P. (2021). emrKBQA: A clinical knowledge-base question answering dataset. Association for Computational Linguistics.
  5. Lehman, E., Lialin, V., Legaspi, K. Y., Sy, A. J. R., Pile, P. T. S., Alberto, N. R. I., ... & Szolovits, P. (2022). Learning to ask like a physician. arXiv preprint arXiv:2206.02696.
  6. Lee, G., Hwang, H., Bae, S., Kwon, Y., Shin, W., Yang, S., ... & Choi, E. (2022). EHRSQL: A practical text-to-SQL benchmark for electronic health records. Advances in Neural Information Processing Systems, 35, 15589-15601.
  7. Bardhan, J., Roberts, K., & Wang, D. Z. (2023). Question Answering for Electronic Health Records: A Scoping Review of datasets and models. arXiv preprint arXiv:2310.08759.
  8. Hu, X., Gu, L., Kobayashi, K., An, Q., Chen, Q., Lu, Z., ... & Zhu, Y. (2023). Interpretable medical image visual question answering via multi-modal relationship graph learning. arXiv preprint arXiv:2302.09636.
  9. Huang, J., Chen, Y., Li, Y., Yang, Z., Gong, X., Wang, F. L., ... & Liu, W. (2023). Medical knowledge-based network for patient-oriented visual question answering. Information Processing & Management, 60(2), 103241.
  10. Huang, Y., Wang, X., Liu, F., & Huang, G. (2022, July). OVQA: A clinically generated visual question answering dataset. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 2924-2938).
  11. Kovaleva, O., Shivade, C., Kashyap, S., Kanjaria, K., Wu, J., Ballah, D., ... & Mukherjee, V. M. (2020, July). Towards visual dialog for radiology. In Proceedings of the 19th SIGBioMed workshop on biomedical language processing (pp. 60-69).
  12. Bardhan, J., Colas, A., Roberts, K., & Wang, D. Z. (2022). DrugEHRQA: A question answering dataset on structured and unstructured electronic health records for medicine related queries. arXiv preprint arXiv:2205.01290.
  13. Lin, Z., Zhang, D., Tao, Q., Shi, D., Haffari, G., Wu, Q., ... & Ge, Z. (2023). Medical visual question answering: A survey. Artificial Intelligence in Medicine, 102611.
  14. Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., ... & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1), 317.
  15. Johnson, A. E., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., ... & Mark, R. G. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific data, 10(1), 1.
  16. Wu, J. T., Agu, N. N., Lourentzou, I., Sharma, A., Paguio, J. A., Yao, J. S., ... & Moradi, M. (2021). Chest ImaGenome dataset for clinical reasoning. arXiv preprint arXiv:2108.00316.
  17. Bae, S., Kyung, D., Ryu, J., Cho, E., Lee, G., Kweon, S., ... & Choi, E. (2024). EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems, 36.
  18. Cheng, Z., Xie, T., Shi, P., Li, C., Nadkarni, R., Hu, Y., ... & Yu, T. (2022). Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875.
  19. Nori, H., King, N., McKinney, S. M., Carignan, D., & Horvitz, E. (2023). Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
  20. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  21. Chen, Z., Varma, M., Delbrouck, J. B., Paschali, M., Blankemeier, L., Van Veen, D., ... & Langlotz, C. (2024). CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation. arXiv preprint arXiv:2401.12208.
  22. Kang, S., Kim, D., Kim, J., Lee, H. K., & Hwang, S. J. (2024). WoLF: Large Language Model Framework for CXR Understanding. arXiv preprint arXiv:2403.15456.
  23. EHRXQA. Available from: https://github.com/baeseongsu/ehrxqa [03.14.2024]

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
