Database Credentialed Access
Medical-CXR-VQA dataset: A Large-Scale LLM-Enhanced Medical Dataset for Visual Question Answering on Chest X-Ray Images
Xinyue Hu, Lin Gu, Kazuma Kobayashi, Liangchen Liu, Mengliang Zhang, Tatsuya Harada, Ronald Summers, Yingying Zhu
Published: Jan. 21, 2025. Version: 1.0.0
When using this resource, please cite:
Hu, X., Gu, L., Kobayashi, K., Liu, L., Zhang, M., Harada, T., Summers, R., & Zhu, Y. (2025). Medical-CXR-VQA dataset: A Large-Scale LLM-Enhanced Medical Dataset for Visual Question Answering on Chest X-Ray Images (version 1.0.0). PhysioNet. https://doi.org/10.13026/1pm5-hy02.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Medical Visual Question Answering (VQA) is an important task in medical multi-modal Large Language Models (LLMs), aiming to answer clinically relevant questions regarding input medical images. This technique has the potential to improve the efficiency of medical professionals while relieving the burden on the public health system, particularly in resource-poor countries. However, existing medical VQA datasets are small and contain only simple questions (equivalent to classification tasks), which lack semantic reasoning and clinical knowledge. Our previous work proposed a clinical knowledge-driven image difference VQA benchmark using a rule-based approach. However, at the same large-scale breadth of information coverage, the rule-based approach shows an 85% error rate on extracted labels. We therefore trained an LLM-based method to extract labels, increasing accuracy by 62%. We also comprehensively evaluated our labels with two clinical experts on 100 samples to help us fine-tune the LLM. Based on the trained LLM, we propose a large-scale medical VQA dataset, Medical-CXR-VQA, which is derived from the MIMIC-CXR dataset and comprises 780,014 question-answer pairs, categorized into six types: abnormality (190,525 pairs), location (104,680 pairs), type (69,486 pairs), level (111,715 pairs), view (92,048 pairs), and presence (211,560 pairs).
Background
Medical Visual Question Answering (VQA) is an important task in medical multi-modal Large Language Models (LLMs), aiming to answer clinically relevant questions related to medical images. This is a challenging task that requires both medical image diagnosis and natural language understanding. Medical VQA can provide clinicians with a "second opinion" in interpreting medical images and decrease the risk of misdiagnosis [1]. It also has the potential to relieve the burden on radiologists by partially taking over their expert consultant role to answer questions from physicians and patients, preventing the disruption of their workflow and improving efficiency [2].
Multi-modal LLMs can be utilized to perform these tasks, which can help reduce global health inequalities in low- and middle-income countries. For example, when interpreting complex cases, the second opinion provided by a medical VQA system may significantly enhance junior clinicians' confidence when specialized experts are not available. Deploying such a system would also alleviate the shortage of healthcare services in resource-poor regions, e.g., Africa, which is home to only 3% of the world's healthcare labor force yet bears 24% of the global disease burden [3]. Medical VQA can contribute to the sustainable development goals (SDGs) by reducing the cost of healthcare in resource-poor countries and promoting healthy living and well-being.
Most current medical VQA methods adopt a joint embedding framework [4] that relies on pre-trained convolutional neural networks (CNNs), such as VGGNet [5], as backbones to capture visual structures. These black-box models tend to exploit dataset bias by capturing superficial correlations among visual appearances, questions, and answers [6, 7]. In fact, some state-of-the-art medical VQA algorithms do not even utilize the question feature and generate the answer using only the image feature [2]. The disadvantage of over-reliance on training data is particularly pronounced in the medical domain because of the limited and diverse training data. The authors in [8] proposed Multiple Meta-model Quantifying (MMQ), which leverages meta-learning to enhance performance on small datasets. VQAMix [11] linearly combines question-answer pairs to produce additional labeled training samples. Nonetheless, these advances tend to have less impact on larger datasets.
More critically, current medical VQA datasets have several limitations: 1) They mostly focus on very simple questions [9], such as "What is the abnormality in this image?" or "Is there something wrong in the image?". 2) They cover a wide range of modalities (MRI, CT, and X-ray) and various body sites (neuroimaging, chest X-rays, and abdominal CT/MRI scans). Because the pathology of diseases in different body parts is complicated and heterogeneous, medical images and their associated questions differ markedly across modalities, specialties, and diseases. Therefore, a universal medical VQA model is not a panacea and cannot be expected to generalize across modalities and body locations.
As a disease progresses, multiple conditions may become interconnected. For example, cardiomegaly (enlargement of the heart) can increase pressure on the lungs, leading to initial signs of pulmonary edema (fluid in the lungs). This fluid can then accumulate in the pleural spaces, causing pleural effusion. Therefore, during the diagnostic process, doctors typically follow a "coarse-to-fine" routine: they first locate the relevant anatomical structure (such as the heart), then determine the local abnormality (such as cardiomegaly), find relationships with other abnormalities (such as pleural effusion and edema), and finally make a diagnostic summary.
To address these limitations, we previously proposed a rule-based method [10] to extract important clinical information and constructed a large-scale Difference VQA dataset aimed at clinically important questions such as disease locations, severities, and progressions after treatment. However, the rule-based approach led to an error rate of about 85% in the constructed KeyInfo dataset when given large keyword coverage, owing to the limitations of conventional rule-based approaches in handling uncertainty, negation, and other semantic information.
To further improve the quality of the medical VQA dataset, we propose training an LLM to extract clinical questions and answers focusing on abnormalities, body locations, disease levels, abnormality types, and clinical evidence, mimicking the process of practical diagnosis. We evaluated the proposed LLM on VQA dataset construction, achieving a 62% improvement over the rule-based method on the intermediate KeyInfo dataset at the same keyword coverage. This evaluation was performed by two clinical experts on a randomly sampled test set of 100 samples.
Methods
We constructed the Medical-CXR-VQA dataset from the free-text reports in the source database MIMIC-CXR by following these steps: (1) LLM fine-tuning; (2) building an intermediate KeyInfo dataset using the fine-tuned LLM, which includes report findings and their corresponding attributes (e.g., locations, levels, and types); (3) extracting question-answer pairs from the KeyInfo dataset to construct the Medical-CXR-VQA dataset.
LLM Fine-tuning
There are two ways to instruct LLMs to perform a specific task. The first is through prompts. However, prompts can be lengthy for complex tasks like ours, and due to LLM context-length limits, a long prompt may leave insufficient space for the generated output. Moreover, employing high-performance LLMs such as GPT-4 is too expensive for large-scale applications. Thus, the second approach, LLM fine-tuning, becomes necessary: by adapting the LLM to the specific task, significant performance gains can be achieved. This approach also optimizes output length and reduces processing time. However, LLM fine-tuning relies on high-quality training data. Consequently, we combined both methods.
Specifically, we first employed GPT-4 [13], the best-performing large language model available during the development of this project, to generate structured key information (including abnormality, location, type, and level) from 100 examples. To ensure that GPT-4 generates the necessary key information in JSON format, we designed a detailed prompt. Next, we fine-tuned Llama 2 on the examples generated by GPT-4. During fine-tuning, we utilized a simplified prompt to reduce the input length. Both the detailed prompt and the simplified prompt can be found in [9].
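As an illustration only, the sketch below shows how a report could be sent to GPT-4 to obtain JSON key information. It assumes the OpenAI Python client (v1.x); the prompt shown is a hypothetical simplification, not the detailed prompt used for the dataset, which is available in [9].

```python
# Illustration only: request structured key information for one report with
# the OpenAI Python client (v1.x). SIMPLIFIED_PROMPT is a hypothetical
# placeholder; the actual detailed prompt is provided in [9].
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SIMPLIFIED_PROMPT = (
    "Extract every abnormality in the chest X-ray report below, with its "
    "location, type, and level, and return a JSON object with 'entity' and "
    "'no_entity' keys."
)

def extract_key_info(report_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SIMPLIFIED_PROMPT},
            {"role": "user", "content": report_text},
        ],
        temperature=0,
    )
    # The reply is expected to be a JSON object; parse it for downstream use.
    return json.loads(response.choices[0].message.content)
```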
Intermediate KeyInfo Dataset
We then applied the fine-tuned Llama 2 to the entire MIMIC-CXR dataset to extract structured key information from all reports.
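A minimal inference sketch is given below, assuming a Hugging Face Transformers checkpoint of the fine-tuned model; the path "./llama2-keyinfo" and the prompt assembly are hypothetical placeholders rather than the released pipeline.

```python
# Illustrative inference sketch with Hugging Face Transformers. The checkpoint
# path "./llama2-keyinfo" and the prompt assembly are hypothetical placeholders.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-keyinfo")
model = AutoModelForCausalLM.from_pretrained(
    "./llama2-keyinfo", torch_dtype=torch.float16, device_map="auto"
)

def keyinfo_from_report(report_text: str, simplified_prompt: str) -> dict:
    """Generate structured key information (JSON) for one free-text report."""
    prompt = f"{simplified_prompt}\n\nReport:\n{report_text}\n\nJSON:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Keep only the newly generated tokens and parse them as JSON.
    generated = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return json.loads(generated)
```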
Moreover, to address the hallucination problem, we employ follow-up questions to refine the original output of the LLM. We use a rule-based key information dataset on MIMIC-CXR [10] as a reference and compare the generated output against it. If disparities are found, we prompt the LLM with an additional question such as "Is there [abnormality name] in this report? If so, please insert [abnormality name] into the 'entity' key array in the original JSON format. Otherwise, no action is required." This approach effectively reduces the occurrence of errors.
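The following sketch outlines this reconciliation step under the assumption that both the LLM output and the rule-based reference are dictionaries with an "entity" key; the ask_llm callable is a hypothetical stand-in for a call to the fine-tuned model.

```python
# Hedged sketch of the follow-up-question check. Both keyinfo arguments are
# assumed to be dicts with an "entity" mapping keyed by abnormality name;
# ask_llm is a hypothetical callable wrapping the fine-tuned model.
FOLLOW_UP = (
    "Is there {name} in this report? If so, please insert {name} into the "
    "'entity' key array in the original JSON format. "
    "Otherwise, no action is required."
)

def reconcile(llm_keyinfo: dict, rule_keyinfo: dict, report_text: str, ask_llm) -> dict:
    """Re-query the LLM for abnormalities found by the rule-based reference but missed by the LLM."""
    missing = set(rule_keyinfo.get("entity", {})) - set(llm_keyinfo.get("entity", {}))
    for name in missing:
        llm_keyinfo = ask_llm(report_text, llm_keyinfo, FOLLOW_UP.format(name=name))
    return llm_keyinfo
```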
Next, we apply post-processing code to further standardize the format. This includes unifying duplicate names, separating attribute words from abnormality names, removing unwanted findings, and reassigning attributes.
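As a sketch of one such post-processing step, and assuming the libs files described under Data Description, keyword standardization could be applied with the position_change.csv mapping as follows.

```python
# Hypothetical standardization helper built on position_change.csv, whose
# "from"/"to" columns map original keywords to standardized keywords.
import pandas as pd

position_change = pd.read_csv("libs/position_change.csv")
keyword_map = dict(zip(position_change["from"], position_change["to"]))

def standardize(word: str) -> str:
    # Return the standardized form if the keyword is in the table, else unchanged.
    return keyword_map.get(word, word)
```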
To evaluate the usability and superiority of the LLM-extracted KeyInfo dataset, we compared the correctness rates of 100 random samples extracted by the LLM method and by the rule-based method. The rule-based method utilized all keywords extracted by the LLM and underwent a specific post-processing procedure, which involved removing invalid and duplicate abnormalities, filtering low-frequency abnormalities, standardizing names, and re-assigning attributes. The evaluation of the 100 LLM-extracted KeyInfo samples was performed by two professional radiologists; the results are presented below. The attribute-level error covers errors in location, type, and level. The LLM-based method outperformed the rule-based method by 62 percentage points in the combined correct rate.
| Method | LLM-based | Rule-based |
|---|---|---|
| Disease-level error | 14 | 70 |
| Attribute-level error | 14 | 54 |
| Correct rate (disease-level) | 86% | 30% |
| Correct rate (attribute-level) | 86% | 46% |
| Correct rate (combined) | 77% | 15% |
Extracting Question-Answer Pairs
After constructing the KeyInfo dataset, we were able to obtain all the information necessary to generate questions based on clinicians' interests. Please note that not all images from MIMIC-CXR are included in the Medical-CXR-VQA dataset: we only consider antero-posterior (AP) and postero-anterior (PA) views, and we filter out studies that have only lateral-view images or whose images lack clearly identified view names. We generated six types of questions, covering abnormality, location, type, level, view, and presence, as shown in the table below:
| Question type | Example |
|---|---|
| Abnormality | what abnormalities are seen in the image? |
| | what abnormalities are seen in the <location>? |
| | is there any evidence of any abnormalities? |
| | is this image normal? |
| Presence | any evidence of <abnormality>? |
| | is there <abnormality>? |
| | is there <abnormality> in the <location>? |
| View | which view is this image taken? |
| | is this PA view? |
| | is this AP view? |
| Location | where is the <abnormality> located? |
| | where is the <abnormality>? |
| | is the <abnormality> located on the left/right? |
| | is the <abnormality> in the <location>? |
| Level | what level is the <abnormality>? |
| Type | what type is the <abnormality>? |
For each study, we executed the extraction process once for each question type, thereby generating a dataset that is theoretically six times as large as the MIMIC-CXR database.
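For illustration, the sketch below builds question-answer pairs of the "location" type from one KeyInfo entry; the field names follow the KeyInfo format described under all_diseases.json, and the helper itself is a simplified stand-in for the full template logic in the repository [9].

```python
# Illustration of template-based question generation for the "location" type.
# `disease` and `attrs` follow the KeyInfo fields described under
# all_diseases.json; this helper is a simplified stand-in for [9].
def location_questions(disease: str, attrs: dict) -> list[tuple[str, str]]:
    pairs = []
    location = attrs.get("Location", "")
    if location:
        pairs.append((f"where is the {disease} located?", location))
        pairs.append((f"where is the {disease}?", location))
        pairs.append((f"is the {disease} in the {location}?", "yes"))
    return pairs

# Example: location_questions("pleural effusion", {"Location": "left lung base"})
```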
To ensure the correctness of the final Medical-CXR-VQA dataset, we validated it with two human verifiers. The validation results are shown in the table below:
| Verifier | Examples | Correct | Accuracy |
|---|---|---|---|
| Verifier 1 | 250 | 241 | 96.4% |
| Verifier 2 | 250 | 233 | 93.2% |
| Total | 500 | 474 | 94.8% |
Data Description
The Medical-CXR-VQA dataset comprises 780,014 question-answer pairs, categorized into six types: 190,525 for abnormality, 104,680 for location, 69,486 for type, 111,715 for level, 92,048 for view, and 211,560 for presence. These pairs were extracted from a total of 215,547 studies.
Libs
The "libs" directory contains six CSV files and four JSON files, which are used for post-processing and question-answer generation. These files are listed below:
- disease_lib_llm.csv: Library of disease keywords. The columns include the following:
- id: Integer assigned in sequence.
- report_name: Possible disease names that appear in the reports. Variants with the same meanings are separated by a ";".
- official_name: The standardized names that were assigned to the variants.
- location: The anatomical structure where the disease could appear.
- location_lib.csv: Library of location keywords.
- postlocation_lib.csv: Library of location keywords that appear after the anchor word.
- type_lib.csv: Library of type keywords.
- level_lib.csv: Library of level keywords.
- position_change.csv: A table used to standardize keywords. It includes two columns: "from" and "to". The "from" column lists the original word, and the "to" column lists the standardized word.
- entity_dict.json: Disease names with appearance frequencies in the KeyInfo set.
- type_dict.json: Disease types with appearance frequencies in the KeyInfo set.
- level_dict.json: Disease levels with appearance frequencies in the KeyInfo set.
- location_dict.json: Disease locations with appearance frequencies in the KeyInfo set.
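As an example of working with these libraries, the sketch below reads the disease library with pandas, assuming the column layout documented above (variants in report_name are separated by ";").

```python
# Example of reading the disease library; report_name holds ";"-separated
# variants that all map to one official_name.
import pandas as pd

disease_lib = pd.read_csv("libs/disease_lib_llm.csv")
variant_to_official = {}
for _, row in disease_lib.iterrows():
    for variant in str(row["report_name"]).split(";"):
        variant_to_official[variant.strip()] = row["official_name"]

print(len(variant_to_official), "disease name variants loaded")
```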
Data
We provide four data files: medical_cxr_vqa_questions.csv, mimic_all.csv, all_diseases.json, and all_diseases_gpt4_100.json.
medical_cxr_vqa_questions.csv
This file contains the final extracted question-answer pairs. The columns include the following (a loading sketch follows the list):
- study_id: study_id in MIMIC-CXR.
- subject_id: subject id in MIMIC-CXR.
- question_type: abnormality/location/level/view/type/presence.
- question: the question text.
- answer: a list of answers; there may be more than one answer candidate.
- split: The recommended train/val/test split.
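A minimal loading sketch with pandas, assuming the column names listed above:

```python
# Minimal sketch: load the question-answer pairs and select the recommended
# training split (column names as documented above).
import pandas as pd

questions = pd.read_csv("medical_cxr_vqa_questions.csv")
train = questions[questions["split"] == "train"]
abnormality_q = train[train["question_type"] == "abnormality"]
print(len(questions), len(train), len(abnormality_q))

# If answers are stored as stringified lists, they can be parsed, e.g.:
# import ast
# questions["answer"] = questions["answer"].apply(ast.literal_eval)
```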
mimic_all.csv
This file includes metadata from MIMIC-CXR and labels from MIMIC-CXR-JPG [14]. The labels are for reference purposes only. The columns are as follows (a metadata-join sketch follows the list):
- subject_id: subject id in MIMIC-CXR
- study_id: study_id in MIMIC-CXR
- labels: the labels extracted from MIMIC-CXR-JPG, provided for reference only:
- Atelectasis
- Cardiomegaly
- Consolidation
- Edema
- Enlarged Cardiomediastinum
- Fracture
- Lung Lesion
- Lung Opacity
- Pleural Effusion
- Pneumonia
- Pneumothorax
- Pleural Other
- Support Devices
- No Finding
- dicom_id: dicom_id in MIMIC-CXR
- view: "PA" or "AP" view of the image
- split: train/val/test split used in MIMIC-CXR-JPG. It is only for reference purposes.
- study_date: StudyDate in MIMIC-CXR-JPG
- study_order: an integer indicating the order of this study within the patient's entire visit history.
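The sketch below links question-answer pairs to this image-level metadata, assuming both CSVs are read with pandas; dicom_id and view come from mimic_all.csv.

```python
# Sketch of joining questions with image-level metadata; assumes both files
# are in the working directory and share subject_id/study_id keys.
import pandas as pd

questions = pd.read_csv("medical_cxr_vqa_questions.csv")
mimic_all = pd.read_csv("mimic_all.csv")

merged = questions.merge(
    mimic_all[["subject_id", "study_id", "dicom_id", "view"]],
    on=["subject_id", "study_id"],
    how="left",
)
# Each row now carries the dicom_id needed to look up the image in MIMIC-CXR-JPG.
```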
all_diseases.json
The intermediate KeyInfo dataset. The fields include the following (a parsing sketch follows the list):
- study_id: study id in MIMIC-CXR
- subject_id: subject id in MIMIC-CXR
- entity: the identified diseases, keyed by disease name. Each disease entry contains:
  - entity_name: same as the disease name.
  - Location: the word indicating the location of the disease.
  - Type: the word describing the type or category of the disease.
  - Level: the word indicating the level or severity of the disease.
  - Location2: same as "Location"; provided as a backup in case there are multiple locations.
  - Type2: same as "Type"; provided as a backup in case there are multiple types or categories.
  - Level2: same as "Level"; provided as a backup in case there are multiple levels or severities.
- no_entity: a list containing the names of all diseases identified as not present.
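A hypothetical parsing sketch follows, assuming the file is a list of per-study records with the nested layout above; the exact top-level structure should be verified against the released file.

```python
# Hypothetical walk over the KeyInfo file; the top-level structure is assumed
# to be a list of per-study records and should be verified against the file.
import json

with open("all_diseases.json") as f:
    all_diseases = json.load(f)

for study in all_diseases:
    for disease, attrs in study.get("entity", {}).items():
        print(study["study_id"], disease, attrs.get("Location"), attrs.get("Level"))
```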
all_diseases_gpt4_100.json
This file contains the key information for the 100 examples generated by GPT-4, which were used to fine-tune Llama 2. It has the same format as all_diseases.json.
Supplementary
Radiologist_evaluations.pdf: This document contains the raw evaluations by radiologists for 100 examples, detailing the data that was reviewed, including study ID, report, LLM-extracted key information, and radiologists' comments.
Folder structure
In this section, we present the structure of all the files that we are uploading.
.
├── libs/
│ ├── disease_lib_llm.csv
│ ├── level_lib.csv
│ ├── location_lib.csv
│ ├── type_lib.csv
│ ├── postlocation_lib.csv
│ ├── position_change.csv
│ ├── entity_dict.json
│ ├── type_dict.json
│ ├── level_dict.json
│ └── location_dict.json
├── supplementary/
│ └── Radiologist_evaluations.pdf
├── medical_cxr_vqa_questions.csv
├── mimic_all.csv
├── all_diseases.json
└── all_diseases_gpt4_100.json
Usage Notes
Please refer to [9] for the code used to generate the Medical-CXR-VQA dataset; a step-by-step guide for regenerating the dataset is provided there. Please be aware that a regenerated dataset may not be exactly the same as the one we provide, due to randomness.
Our dataset does not include images. For images, please refer to the MIMIC-CXR-JPG [14] or MIMIC-CXR database [12].
We believe our dataset holds significant potential for advancing healthcare and medical research, such as assisting in diagnosing medical images, serving as educational tools for training healthcare professionals, and enhancing patient understanding by providing accessible answers to their imaging-related questions.
However, please be aware that despite our efforts to ensure the accuracy of the dataset, there may still be a small number of errors caused by hallucinations, such as extra attribute words preceding the abnormality keywords or misassignment of type, level, or location words.
Ethics
The dataset is derived from the MIMIC-CXR database, a de-identified dataset to which we were granted access under the PhysioNet Credentialed Health Data Use Agreement (v1.5.0). The data was processed both locally and via the Azure OpenAI service, with human review opted out, in compliance with MIMIC's data use agreement.
Acknowledgements
This work was supported by JST Moonshot R&D Grant Number JPMJMS2011, Japan. This research was funded in part by the Intramural Research Programs of the National Institutes of Health Clinical Center. Author RMS receives royalties for patents or software licenses from iCAD, Philips, ScanMed, PingAn, MGB, and Translation Holdings. His lab received research support through a CRADA with PingAn.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, Janda M, Lallas A, Longo C, Malvehy J, Paoli J. Human–computer collaboration for skin cancer recognition. Nature medicine. 2020 Aug;26(8):1229-34.
- Lin Z, Zhang D, Tao Q, Shi D, Haffari G, Wu Q, He M, Ge Z. Medical visual question answering: A survey. Artificial Intelligence in Medicine. 2023 Sep 1;143:102611.
- World Health Organization. The state of the health workforce in the WHO African Region, 2021.
- Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 2425-2433).
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
- Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017 (pp. 6904-6913).
- Cao Q, Wan W, Wang K, Liang X, Lin L. Linguistically routing capsule network for out-of-distribution visual question answering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision 2021 (pp. 1614-1623).
- Do T, Nguyen BX, Tjiputra E, Tran M, Tran QD, Nguyen A. Multiple meta-model quantifying for medical visual question answering. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V (pp. 64-74). Springer International Publishing.
- Medical-CXR-VQA code repository: https://github.com/Holipori/Medical-CXR-VQA
- Hu X, Gu L, An Q, Zhang M, Liu L, Kobayashi K, Harada T, Summers RM, Zhu Y. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2023 (pp. 4156-4165).
- Gong H, Chen G, Mao M, Li Z, Li G. Vqamix: Conditional triplet mixup for medical visual question answering. IEEE Transactions on Medical Imaging. 2022 Jun 21;41(11):3332-43.
- Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data. 2019 Dec 12;6(1):317.
- Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.
- Johnson AE, Pollard TJ, Greenbaum NR, Lungren MP, Deng CY, Peng Y, Lu Z, Mark RG, Berkowitz SJ, Horng S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019 Jan 21.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/1pm5-hy02
DOI (latest version):
https://doi.org/10.13026/y7r6-4j38
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project