Database Credentialed Access
Medical-CXR-VQA dataset: A Large-Scale LLM-Enhanced Medical Dataset for Visual Question Answering on Chest X-Ray Images
Xinyue Hu, Lin Gu, Kazuma Kobayashi, Liangchen Liu, Mengliang Zhang, Tatsuya Harada, Ronald Summers, Yingying Zhu
Published: Jan. 21, 2025. Version: 1.0.0
When using this resource, please cite:
Hu, X., Gu, L., Kobayashi, K., Liu, L., Zhang, M., Harada, T., Summers, R., & Zhu, Y. (2025). Medical-CXR-VQA dataset: A Large-Scale LLM-Enhanced Medical Dataset for Visual Question Answering on Chest X-Ray Images (version 1.0.0). PhysioNet. https://doi.org/10.13026/1pm5-hy02.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Medical Visual Question Answering (VQA) is an important task in medical multi-modal Large Language Models (LLMs), aiming to answer clinically relevant questions regarding input medical images. This technique has the potential to improve the efficiency of medical professionals while relieving the burden on the public health system, particularly in resource-poor countries. However, existing medical VQA datasets are small and contain only simple questions (equivalent to classification tasks), which lack semantic reasoning and clinical knowledge. Our previous work proposed a clinical knowledge-driven image difference VQA benchmark using a rule-based approach. However, at the same large-scale breadth of information coverage, the rule-based approach shows an 85% error rate on extracted labels. We therefore trained an LLM-based method to extract labels, increasing accuracy by 62%. We also comprehensively evaluated our labels with two clinical experts on 100 samples to help us fine-tune the LLM. Based on the trained LLM, we propose a large-scale medical VQA dataset, Medical-CXR-VQA, which is derived from the MIMIC-CXR dataset and comprises 780,014 question-answer pairs, categorized into six types: abnormality (190,525 pairs), location (104,680 pairs), type (69,486 pairs), level (111,715 pairs), view (92,048 pairs), and presence (211,560 pairs).
Background
Medical Visual Question Answering (VQA) is an important task in medical multi-modal Large Language Models (LLMs), aiming to answer clinically relevant questions related to medical images. This is a challenging task that requires both medical image diagnosis and natural language understanding. Medical VQA can provide clinicians with a "second opinion" in interpreting medical images and decrease the risk of misdiagnosis [1]. It also has the potential to relieve the burden on radiologists by partially taking over their expert consultant role to answer questions from physicians and patients, preventing the disruption of their workflow and improving efficiency [2].
Multi-modal LLMs can be utilized to perform these tasks, which can help reduce global health inequalities in low- and middle-income countries. For example, when interpreting complex cases, the second opinion provided by a medical VQA system may significantly enhance junior clinicians' confidence when specialized experts are not available. Deploying such a system would also alleviate the shortage of healthcare services in resource-poor regions, e.g., Africa, which is home to only 3% of the world's healthcare labor force yet bears 24% of the global disease burden [3]. Medical VQA can contribute to the sustainable development goals (SDGs) by reducing the cost of healthcare in resource-poor countries and promoting healthy living and well-being.
Most current medical VQA methods adopt a joint embedding framework [4] that relies on pre-trained convolutional neural networks (CNNs), such as VGGNet [5], as backbones to capture visual structures. These black-box models tend to exploit dataset bias by capturing superficial correlations among visual appearances, questions, and answers [6, 7]. In fact, some state-of-the-art medical VQA algorithms do not even utilize the question feature and generate the answer using only the image feature [2]. The disadvantage of over-reliance on training data is particularly pronounced in the medical domain because of the limited and diverse training data. The authors in [8] proposed Multiple Meta-model Quantifying (MMQ), which leverages meta-learning to enhance performance on small datasets. VQAMix [11] linearly combines question-answer pairs to produce additional labeled training samples. Nonetheless, these advances tend to have less impact on larger datasets.
More critically, current medical VQA datasets have several limitations: 1) They mostly focus on very simple questions [9], such as "What is the abnormality in this image?" or "Is there something wrong in the image?". 2) They cover a wide range of modalities (MRI, CT, and X-ray) and various body sites (neuroimaging, chest X-rays, and abdominal CT/MRI scans). Because the pathology of diseases in different body parts is complicated and heterogeneous, medical images and their associated questions differ markedly across modalities, specialties, and diseases. Therefore, a universal medical VQA model is not a panacea and cannot be expected to generalize across modalities and body locations.
As a disease progresses, multiple conditions may become interconnected. For example, cardiomegaly (enlargement of the heart) can increase pressure on the lungs, leading to initial signs of pulmonary edema (fluid in the lungs). This fluid can then accumulate in the pleural spaces, causing pleural effusion. Therefore, during the diagnostic process, doctors typically follow a "coarse-to-fine" routine: they first locate the relevant anatomical structure (such as the heart), then determine the local abnormality (such as cardiomegaly), find relationships with other abnormalities (such as pleural effusion and edema), and finally make a diagnostic summary.
To address these limitations, we previously proposed a rule-based method [10] to extract important clinical information and constructed a large-scale Difference VQA dataset aimed at clinically important questions such as disease locations, severities, and progressions after treatment. However, the rule-based approach led to an error rate of about 85% in the constructed KeyInfo dataset when given large keyword coverage, owing to the limitations of conventional rule-based approaches in handling uncertainty, negation, and other semantic information.
To further improve the quality of the medical VQA dataset, we propose training an LLM to extract clinical questions and answers focusing on abnormalities, body locations, disease levels, abnormality types, and clinical evidence, mimicking the process of practical diagnosis. We evaluated the proposed LLM on VQA dataset construction, achieving a 62% improvement over the rule-based method on the intermediate KeyInfo dataset at the same keyword coverage. This evaluation was performed by two clinical experts on a randomly sampled test set of 100 samples.
Methods
We constructed the Medical-CXR-VQA dataset from the free-text reports in the source database MIMIC-CXR by following these steps: (1) LLM fine-tuning; (2) building an intermediate KeyInfo dataset using the fine-tuned LLM, which includes report findings and their corresponding attributes (e.g., locations, levels, and types); (3) extracting question-answer pairs from the KeyInfo dataset to construct the Medical-CXR-VQA dataset.
LLM Fine-tuning
There are two ways to instruct LLMs to perform a specific task. The first is through prompts. However, prompts can be lengthy for complex tasks like ours, and due to LLM context-length limits, a long prompt may leave insufficient space for the generated output. Moreover, employing high-performance LLMs such as GPT-4 is too expensive for large-scale applications. Thus, the second approach, LLM fine-tuning, becomes necessary: by adapting the LLM to the specific task, significant performance gains can be achieved. This approach also optimizes output length and reduces processing time. However, LLM fine-tuning relies on high-quality training data. Consequently, we combined both methods.
Specifically, we first employed GPT-4 [13], the best-performing large language model available during the development of this project, to generate structured key information (including abnormality, location, type, and level) from 100 examples. To ensure that GPT-4 generates the necessary key information in JSON format, we designed a detailed prompt. Next, we fine-tuned Llama 2 on the examples generated by GPT-4. During fine-tuning, we utilized a simplified prompt to reduce the input length. Both the detailed prompt and the simplified prompt can be found in [9].
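As an illustration only, the sketch below shows how a report could be sent to GPT-4 to obtain JSON key information. It assumes the OpenAI Python client (v1.x); the prompt shown is a hypothetical simplification, not the detailed prompt used for the dataset, which is available in [9].

```python
# Illustration only: request structured key information for one report with
# the OpenAI Python client (v1.x). SIMPLIFIED_PROMPT is a hypothetical
# placeholder; the actual detailed prompt is provided in [9].
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SIMPLIFIED_PROMPT = (
    "Extract every abnormality in the chest X-ray report below, with its "
    "location, type, and level, and return a JSON object with 'entity' and "
    "'no_entity' keys."
)

def extract_key_info(report_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SIMPLIFIED_PROMPT},
            {"role": "user", "content": report_text},
        ],
        temperature=0,
    )
    # The reply is expected to be a JSON object; parse it for downstream use.
    return json.loads(response.choices[0].message.content)
```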
Intermediate KeyInfo Dataset
We then applied the fine-tuned Llama 2 to the entire MIMIC-CXR dataset to extract structured key information from all reports.
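A minimal inference sketch is given below, assuming a Hugging Face Transformers checkpoint of the fine-tuned model; the path "./llama2-keyinfo" and the prompt assembly are hypothetical placeholders rather than the released pipeline.

```python
# Illustrative inference sketch with Hugging Face Transformers. The checkpoint
# path "./llama2-keyinfo" and the prompt assembly are hypothetical placeholders.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-keyinfo")
model = AutoModelForCausalLM.from_pretrained(
    "./llama2-keyinfo", torch_dtype=torch.float16, device_map="auto"
)

def keyinfo_from_report(report_text: str, simplified_prompt: str) -> dict:
    """Generate structured key information (JSON) for one free-text report."""
    prompt = f"{simplified_prompt}\n\nReport:\n{report_text}\n\nJSON:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Keep only the newly generated tokens and parse them as JSON.
    generated = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return json.loads(generated)
```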
Moreover, to address the hallucination problem, we employ follow-up questions to refine the original output of the LLM. We use a rule-based key information dataset on MIMIC-CXR [10] as a reference and compare the generated output against it. If disparities are found, we prompt the LLM with an additional question such as "Is there [abnormality name] in this report? If so, please insert [abnormality name] into the 'entity' key array in the original JSON format. Otherwise, no action is required." This approach effectively reduces the occurrence of errors.
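The following sketch outlines this reconciliation step under the assumption that both the LLM output and the rule-based reference are dictionaries with an "entity" key; the ask_llm callable is a hypothetical stand-in for a call to the fine-tuned model.

```python
# Hedged sketch of the follow-up-question check. Both keyinfo arguments are
# assumed to be dicts with an "entity" mapping keyed by abnormality name;
# ask_llm is a hypothetical callable wrapping the fine-tuned model.
FOLLOW_UP = (
    "Is there {name} in this report? If so, please insert {name} into the "
    "'entity' key array in the original JSON format. "
    "Otherwise, no action is required."
)

def reconcile(llm_keyinfo: dict, rule_keyinfo: dict, report_text: str, ask_llm) -> dict:
    """Re-query the LLM for abnormalities found by the rule-based reference but missed by the LLM."""
    missing = set(rule_keyinfo.get("entity", {})) - set(llm_keyinfo.get("entity", {}))
    for name in missing:
        llm_keyinfo = ask_llm(report_text, llm_keyinfo, FOLLOW_UP.format(name=name))
    return llm_keyinfo
```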
Next, we apply post-processing code to further standardize the format. This includes unifying duplicate names, separating attribute words from abnormality names, removing unwanted findings, and reassigning attributes.
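As a sketch of one such post-processing step, and assuming the libs files described under Data Description, keyword standardization could be applied with the position_change.csv mapping as follows.

```python
# Hypothetical standardization helper built on position_change.csv, whose
# "from"/"to" columns map original keywords to standardized keywords.
import pandas as pd

position_change = pd.read_csv("libs/position_change.csv")
keyword_map = dict(zip(position_change["from"], position_change["to"]))

def standardize(word: str) -> str:
    # Return the standardized form if the keyword is in the table, else unchanged.
    return keyword_map.get(word, word)
```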
To evaluate the usability and superiority of the LLM-extracted KeyInfo dataset, we compared the correctness rates of 100 random samples extracted by the LLM method and by the rule-based method. The rule-based method utilized all keywords extracted by the LLM and underwent a specific post-processing procedure, which involved removing invalid and duplicate abnormalities, filtering low-frequency abnormalities, standardizing names, and re-assigning attributes. The evaluation of the 100 LLM-extracted KeyInfo samples was performed by two professional radiologists; the results are presented below. The attribute-level error covers errors in location, type, and level. The LLM-based method outperformed the rule-based method by 62 percentage points in the combined correct rate.
| Method | LLM-based | Rule-based |
|---|---|---|
| Disease-level error | 14 | 70 |
| Attribute-level error | 14 | 54 |
| Correct rate (disease-level) | 86% | 30% |
| Correct rate (attribute-level) | 86% | 46% |
| Correct rate (combined) | 77% | 15% |
Extracting Question-Answer Pairs
After constructing the KeyInfo dataset, we were able to obtain all the information necessary to generate questions based on clinicians' interests. Please note that not all images from MIMIC-CXR are included in the Medical-CXR-VQA dataset: we only consider antero-posterior (AP) and postero-anterior (PA) views, and we filter out studies that have only lateral-view images or whose images lack clearly identified view names. We generated six types of questions, covering abnormality, location, type, level, view, and presence, as shown in the table below:
| Question type | Example |
|---|---|
| Abnormality | what abnormalities are seen in the image? |
| | what abnormalities are seen in the <location>? |
| | is there any evidence of any abnormalities? |
| | is this image normal? |
| Presence | any evidence of <abnormality>? |
| | is there <abnormality>? |
| | is there <abnormality> in the <location>? |
| View | which view is this image taken? |
| | is this PA view? |
| | is this AP view? |
| Location | where is the <abnormality> located? |
| | where is the <abnormality>? |
| | is the <abnormality> located on the left/right? |
| | is the <abnormality> in the <location>? |
| Level | what level is the <abnormality>? |
| Type | what type is the <abnormality>? |
For each study, we executed the extraction process once for each question type, thereby generating a dataset that is theoretically six times as large as the MIMIC-CXR database.
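For illustration, the sketch below builds question-answer pairs of the "location" type from one KeyInfo entry; the field names follow the KeyInfo format described under all_diseases.json, and the helper itself is a simplified stand-in for the full template logic in the repository [9].

```python
# Illustration of template-based question generation for the "location" type.
# `disease` and `attrs` follow the KeyInfo fields described under
# all_diseases.json; this helper is a simplified stand-in for [9].
def location_questions(disease: str, attrs: dict) -> list[tuple[str, str]]:
    pairs = []
    location = attrs.get("Location", "")
    if location:
        pairs.append((f"where is the {disease} located?", location))
        pairs.append((f"where is the {disease}?", location))
        pairs.append((f"is the {disease} in the {location}?", "yes"))
    return pairs

# Example: location_questions("pleural effusion", {"Location": "left lung base"})
```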
To ensure the correctness of the final Medical-CXR-VQA dataset, we validated it with two human verifiers. The validation results are shown in the table below:
| Verifier | Examples | Correct | Accuracy |
|---|---|---|---|
| Verifier 1 | 250 | 241 | 96.4% |
| Verifier 2 | 250 | 233 | 93.2% |
| Total | 500 | 474 | 94.8% |
Data Description
The Medical-CXR-VQA dataset comprises 780,014 question-answer pairs, categorized into six types: 190,525 for abnormality, 104,680 for location, 69,486 for type, 111,715 for level, 92,048 for view, and 211,560 for presence. These pairs were extracted from a total of 215,547 studies.
Libs
The "libs" directory contains six CSV files and four JSON files, which are used for post-processing and question-answer generation. These files are listed below:
- disease_lib_llm.csv: Library of disease keywords. The columns include the following:
- id: Integer assigned in sequence.
- report_name: Possible disease names that appear in the reports. Variants with the same meanings are separated by a ";".
- official_name: The standardized names that were assigned to the variants.
- location: The anatomical structure where the disease could appear.
- location_lib.csv: Library of location keywords.
- postlocation_lib.csv: Library of location keywords that appear after the anchor word.
- type_lib.csv: Library of type keywords.
- level_lib.csv: Library of level keywords.
- position_change.csv: A table used to standardize keywords. It includes two columns: "from" and "to". The "from" column lists the original word, and the "to" column lists the standardized word.
- entity_dict.json: Disease names with appearance frequencies in the KeyInfo set.
- type_dict.json: Disease types with appearance frequencies in the KeyInfo set.
- level_dict.json: Disease levels with appearance frequencies in the KeyInfo set.
- location_dict.json: Disease locations with appearance frequencies in the KeyInfo set.
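As an example of working with these libraries, the sketch below reads the disease library with pandas, assuming the column layout documented above (variants in report_name are separated by ";").

```python
# Example of reading the disease library; report_name holds ";"-separated
# variants that all map to one official_name.
import pandas as pd

disease_lib = pd.read_csv("libs/disease_lib_llm.csv")
variant_to_official = {}
for _, row in disease_lib.iterrows():
    for variant in str(row["report_name"]).split(";"):
        variant_to_official[variant.strip()] = row["official_name"]

print(len(variant_to_official), "disease name variants loaded")
```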
Data
We provide four data files: medical_cxr_vqa_questions.csv, mimic_all.csv, all_diseases.json, and all_diseases_gpt4_100.json.
medical_cxr_vqa_questions.csv
This file contains the final extracted question-answer pairs. The columns include the following (a loading sketch follows the list):
- study_id: study_id in MIMIC-CXR.
- subject_id: subject id in MIMIC-CXR.
- question_type: abnormality/location/level/view/type/presence.
- question: the question text.
- answer: a list of answers; there may be more than one answer candidate.
- split: The recommended train/val/test split.
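A minimal loading sketch with pandas, assuming the column names listed above:

```python
# Minimal sketch: load the question-answer pairs and select the recommended
# training split (column names as documented above).
import pandas as pd

questions = pd.read_csv("medical_cxr_vqa_questions.csv")
train = questions[questions["split"] == "train"]
abnormality_q = train[train["question_type"] == "abnormality"]
print(len(questions), len(train), len(abnormality_q))

# If answers are stored as stringified lists, they can be parsed, e.g.:
# import ast
# questions["answer"] = questions["answer"].apply(ast.literal_eval)
```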
mimic_all.csv
This file includes metadata from MIMIC-CXR and labels from MIMIC-CXR-JPG [14]. The labels are for reference purposes only. The columns are as follows (a metadata-join sketch follows the list):
- subject_id: subject id in MIMIC-CXR
- study_id: study_id in MIMIC-CXR
- labels: the labels extracted from MIMIC-CXR-JPG, provided for reference only:
- Atelectasis
- Cardiomegaly
- Consolidation
- Edema
- Enlarged Cardiomediastinum
- Fracture
- Lung Lesion
- Lung Opacity
- Pleural Effusion
- Pneumonia
- Pneumothorax
- Pleural Other
- Support Devices
- No Finding
- dicom_id: dicom_id in MIMIC-CXR
- view: "PA" or "AP" view of the image
- split: train/val/test split used in MIMIC-CXR-JPG. It is only for reference purposes.
- study_date: StudyDate in MIMIC-CXR-JPG
- study_order: an integer indicating the order of this study within the patient's entire visit history.
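The sketch below links question-answer pairs to this image-level metadata, assuming both CSVs are read with pandas; dicom_id and view come from mimic_all.csv.

```python
# Sketch of joining questions with image-level metadata; assumes both files
# are in the working directory and share subject_id/study_id keys.
import pandas as pd

questions = pd.read_csv("medical_cxr_vqa_questions.csv")
mimic_all = pd.read_csv("mimic_all.csv")

merged = questions.merge(
    mimic_all[["subject_id", "study_id", "dicom_id", "view"]],
    on=["subject_id", "study_id"],
    how="left",
)
# Each row now carries the dicom_id needed to look up the image in MIMIC-CXR-JPG.
```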
all_diseases.json
The intermediate KeyInfo dataset. The fields include the following (a parsing sketch follows the list):
- study_id: study id in MIMIC-CXR
- subject_id: subject id in MIMIC-CXR
- entity: the identified diseases, keyed by disease name. Each disease entry contains:
  - entity_name: same as the disease name.
  - Location: the word indicating the location of the disease.
  - Type: the word describing the type or category of the disease.
  - Level: the word indicating the level or severity of the disease.
  - Location2: same as "Location"; provided as a backup in case there are multiple locations.
  - Type2: same as "Type"; provided as a backup in case there are multiple types or categories.
  - Level2: same as "Level"; provided as a backup in case there are multiple levels or severities.
- no_entity: a list containing the names of all diseases identified as not present.
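A hypothetical parsing sketch follows, assuming the file is a list of per-study records with the nested layout above; the exact top-level structure should be verified against the released file.

```python
# Hypothetical walk over the KeyInfo file; the top-level structure is assumed
# to be a list of per-study records and should be verified against the file.
import json

with open("all_diseases.json") as f:
    all_diseases = json.load(f)

for study in all_diseases:
    for disease, attrs in study.get("entity", {}).items():
        print(study["study_id"], disease, attrs.get("Location"), attrs.get("Level"))
```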
all_diseases_gpt4_100.json
This file contains the key information for the 100 examples generated by GPT-4, which were used to fine-tune Llama 2. It has the same format as all_diseases.json.
Supplementary
Radiologist_evaluations.pdf: This document contains the raw evaluations by radiologists for 100 examples, detailing the data that was reviewed, including study ID, report, LLM-extracted key information, and radiologists' comments.
Folder structure
In this section, we present the structure of all the files that we are uploading.
.
├── libs/
│ ├── disease_lib_llm.csv
│ ├── level_lib.csv
│ ├── location_lib.csv
│ ├── type_lib.csv
│ ├── postlocation_lib.csv
│ ├── position_change.csv
│ ├── entity_dict.json
│ ├── type_dict.json
│ ├── level_dict.json
│ └── location_dict.json
├── supplementary/
│ └── Radiologist_evaluations.pdf
├── medical_cxr_vqa_questions.csv
├── mimic_all.csv
├── all_diseases.json
└── all_diseases_gpt4_100.json
Usage Notes
Please refer to [9] for the code used to generate the Medical-CXR-VQA dataset; a step-by-step guide for regenerating the dataset is provided there. Please be aware that a regenerated dataset may not be exactly the same as the one we provide, due to randomness.
Our dataset does not include images. For images, please refer to the MIMIC-CXR-JPG [14] or MIMIC-CXR database [12].
We believe our dataset holds significant potential for advancing healthcare and medical research, such as assisting in diagnosing medical images, serving as educational tools for training healthcare professionals, and enhancing patient understanding by providing accessible answers to their imaging-related questions.
However, please be aware that despite our efforts to ensure the accuracy of the dataset, there may still be a small number of errors caused by hallucinations, such as extra attribute words preceding the abnormality keywords or misassignment of type, level, or location words.
Ethics
The dataset is derived from the MIMIC-CXR database, a de-identified dataset to which we were granted access under the PhysioNet Credentialed Health Data Use Agreement (v1.5.0). The data was processed both locally and via the Azure OpenAI service, with human review opted out, in compliance with MIMIC's data use agreement.
Acknowledgements
This work was supported by JST Moonshot R&D Grant Number JPMJMS2011, Japan. This research was funded in part by the Intramural Research Programs of the National Institutes of Health Clinical Center. Author RMS receives royalties for patents or software licenses from iCAD, Philips, ScanMed, PingAn, MGB, and Translation Holdings. His lab received research support through a CRADA with PingAn.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, Janda M, Lallas A, Longo C, Malvehy J, Paoli J. Human–computer collaboration for skin cancer recognition. Nature medicine. 2020 Aug;26(8):1229-34.
- Lin Z, Zhang D, Tao Q, Shi D, Haffari G, Wu Q, He M, Ge Z. Medical visual question answering: A survey. Artificial Intelligence in Medicine. 2023 Sep 1;143:102611.
- World Health Organization. The state of the health workforce in the WHO African Region, 2021.
- Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 2425-2433).
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
- Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017 (pp. 6904-6913).
- Cao Q, Wan W, Wang K, Liang X, Lin L. Linguistically routing capsule network for out-of-distribution visual question answering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision 2021 (pp. 1614-1623).
- Do T, Nguyen BX, Tjiputra E, Tran M, Tran QD, Nguyen A. Multiple meta-model quantifying for medical visual question answering. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V (pp. 64-74). Springer International Publishing.
- Medical-CXR-VQA code repository: https://github.com/Holipori/Medical-CXR-VQA
- Hu X, Gu L, An Q, Zhang M, Liu L, Kobayashi K, Harada T, Summers RM, Zhu Y. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2023 (pp. 4156-4165).
- Gong H, Chen G, Mao M, Li Z, Li G. Vqamix: Conditional triplet mixup for medical visual question answering. IEEE Transactions on Medical Imaging. 2022 Jun 21;41(11):3332-43.
- Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data. 2019 Dec 12;6(1):317.
- Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.
- Johnson AE, Pollard TJ, Greenbaum NR, Lungren MP, Deng CY, Peng Y, Lu Z, Mark RG, Berkowitz SJ, Horng S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019 Jan 21.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/1pm5-hy02
DOI (latest version):
https://doi.org/10.13026/y7r6-4j38
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project