Visual Question Answering evaluation dataset for MIMIC CXR
Timo Kohlberger, Charles Lau, Tom Pollard, Andrew Sellergren, Atilla Kiraly, Fayaz Jamil
Published: Jan. 28, 2025. Version: 1.0.0
When using this resource, please cite:
Kohlberger, T., Lau, C., Pollard, T., Sellergren, A., Kiraly, A., & Jamil, F. (2025). Visual Question Answering evaluation dataset for MIMIC CXR (version 1.0.0). PhysioNet. https://doi.org/10.13026/cvsk-ny21.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
MIMIC CXR [1] is a large, publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. In addition, labels for the presence of 12 different chest-related pathologies, for the presence of any support devices, and for overall normal/abnormal status were made available via MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) [2]; these labels were generated using the CheXpert and NegBio algorithms.
Based on these labels, we created a visual question answering dataset comprising 224 questions for 48 cases from the official test set, and 111 questions for 23 validation cases. The majority (68%) of the questions are close-ended (answerable with yes or no) and ask about the presence of one of 15 chest pathologies, of any support device, or, generically, of any abnormality. The remaining open-ended questions inquire about the location, size, severity, or type of a pathology or device, if present in the specific case as indicated by the MIMIC-CXR-JPG labels.
For each question and case, we also provide a reference answer, which was authored by a board-certified radiologist (with 17 years of post-residency experience) based on the chest X-ray and the original radiology report.
Background
Generating correct answers to visual questions (VQA) about a medical image, such as a chest radiograph (chest X-ray), is an important capability of recent visual language models, and one that is likely to enable many AI-supported applications in the medical field.
To that end, it is important to evaluate the accuracy of multi-modal generative AI models on this kind of task, as we did, for example, for the models published in the preprints [4] and [5].
Methods
We will first describe the selection of validation and test set cases, followed by a heuristic for assigning pre-formulated questions from Table 1 to each case, and finally the curation of ground truth answers by a board-certified radiologist.
For the test set, we randomly selected 8 cases for each of the following conditions from the respective subset of the official test set in which that condition was present (as indicated by the corresponding MIMIC-CXR-JPG label being equal to 1.0):
- Pneumothorax
- Pleural Effusion
- Edema
- Consolidation OR Pneumonia
- Lung Lesion
For some of these cases, conditions other than the one filtered for also happened to be present, e.g. edema in a case that was selected for having the pneumothorax label equal to 1.0.
In addition, we sampled 8 cases where the No Finding label was 1.0, yielding a set of 48 test cases in total.
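As a concrete illustration of this sampling step, the sketch below selects 8 studies per condition from the official test split using the MIMIC-CXR-JPG label files. It is not the code used to build the released dataset: the file names and column names are those of MIMIC-CXR-JPG v2.0.0 as we understand them, the random seed is arbitrary, and no attempt is made to reconcile overlaps between the per-condition samples.

```python
import pandas as pd

# Assumed MIMIC-CXR-JPG files (names/columns as in MIMIC-CXR-JPG v2.0.0); adjust paths as needed.
labels = pd.read_csv("mimic-cxr-2.0.0-chexpert.csv")   # study-level CheXpert labels
splits = pd.read_csv("mimic-cxr-2.0.0-split.csv")      # official train/validate/test split

# Restrict to studies in the official test split.
test_study_ids = set(splits.loc[splits["split"] == "test", "study_id"])
test_labels = labels[labels["study_id"].isin(test_study_ids)]

SEED = 0  # illustrative seed, not the one used for the released dataset

def sample_positive(df: pd.DataFrame, mask: pd.Series, n: int = 8) -> pd.DataFrame:
    """Randomly sample n studies for which the given label mask is positive (== 1.0)."""
    return df.loc[mask].sample(n=n, random_state=SEED)

selected = pd.concat([
    sample_positive(test_labels, test_labels["Pneumothorax"] == 1.0),
    sample_positive(test_labels, test_labels["Pleural Effusion"] == 1.0),
    sample_positive(test_labels, test_labels["Edema"] == 1.0),
    sample_positive(test_labels, (test_labels["Consolidation"] == 1.0)
                                 | (test_labels["Pneumonia"] == 1.0)),
    sample_positive(test_labels, test_labels["Lung Lesion"] == 1.0),
    sample_positive(test_labels, test_labels["No Finding"] == 1.0),
])
print(selected["study_id"].nunique(), "distinct test studies selected")
```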
We used the same case sampling method for the VQA validation set, except that we picked only 7 pleural effusion cases, and only 4 cases each for lung lesion, consolidation or pneumonia, and edema.
Next, for each of these sampled cases, we picked questions by employing the following heuristic:
- Depending on the condition the case was selected for, one of the questions from the second column of Table 1 was assigned for each of the question types listed there (e.g. "presence" or "location").
- If the case has other conditions present (i.e. labeled with 1.0), randomly pick one among those and also assign the corresponding questions from Table 1 as described in step 1. If no other conditions are present, randomly pick a non-positively labeled disease and assign only the corresponding presence question from Table 1.
- Assign a randomly chosen generic, "condition-independent" question from Table 1.
Under this heuristic, there are at least two condition-dependent questions and one generic question per selected case, with additional questions for positive conditions inquiring about the location, type, or severity, in addition to the presence alone.
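To make the three steps of the heuristic concrete, here is a minimal sketch of the assignment logic. The question bank below is a hypothetical stand-in for Table 1 (which is not reproduced here), and the question strings are placeholders, not the released questions.

```python
import random

# Hypothetical stand-in for Table 1: question type -> {condition -> question text}.
QUESTION_BANK = {
    "presence": {
        "Pneumothorax": "Is there a pneumothorax?",
        "Edema": "Is there pulmonary edema?",
    },
    "location": {
        "Pneumothorax": "Where is the pneumothorax located?",
    },
}
GENERIC_QUESTIONS = ["Is this chest X-ray normal or abnormal?"]  # placeholder


def assign_questions(primary, positive_conditions, all_conditions, rng=random):
    """Assign questions to one case, following the three-step heuristic above."""
    questions = []
    # Step 1: every question type available for the condition the case was selected for.
    for by_condition in QUESTION_BANK.values():
        if primary in by_condition:
            questions.append(by_condition[primary])
    # Step 2: one other positive condition, if any, gets its questions too ...
    others = [c for c in positive_conditions if c != primary]
    if others:
        extra = rng.choice(others)
        for by_condition in QUESTION_BANK.values():
            if extra in by_condition:
                questions.append(by_condition[extra])
    else:
        # ... otherwise only the presence question for a randomly chosen negative condition.
        negatives = [c for c in all_conditions if c not in positive_conditions]
        chosen = rng.choice(negatives)
        if chosen in QUESTION_BANK["presence"]:
            questions.append(QUESTION_BANK["presence"][chosen])
    # Step 3: a randomly chosen generic, condition-independent question.
    questions.append(rng.choice(GENERIC_QUESTIONS))
    return questions


# Example: a case selected for pneumothorax that also has edema.
print(assign_questions("Pneumothorax", ["Pneumothorax", "Edema"],
                       ["Pneumothorax", "Edema", "Pleural Effusion"]))
```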
In total, there are 226 questions (154 of which are close-ended) and reference answers for the 48 test cases (with the same number of unique study IDs and DICOM IDs), and 111 questions and answers (73 close-ended) for the 23 validation cases (likewise with unique study IDs and DICOM IDs).
After the cases had been selected and the relevant visual questions generated, a board-certified radiologist determined the ground truth answers based on the CXR image and the original radiology report. They relied predominantly on the CXR image and their own professional judgment, since for some cases and questions the information needed to generate an answer was not contained in the report.
Data Description
The file mimic_cxr_vqa.tsv contains the following columns:
| Column in TSV file | Description |
| --- | --- |
| study_id | Study ID of the case |
| dicom_id | DICOM ID of the specific CXR image |
| split_vqa | VQA split the question/image/answer belongs to (either "test" or "validate") |
| question_id | Unique ID of the question/image/answer triplet |
| question | The question text |
| expected_answer | Expert-curated reference answer |
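Assuming mimic_cxr_vqa.tsv is in the working directory, a minimal loading snippet (not part of the released files) that separates the two VQA splits could look like this:

```python
import pandas as pd

# Load the question/answer table and split it by the split_vqa column.
vqa = pd.read_csv("mimic_cxr_vqa.tsv", sep="\t")

test_qs = vqa[vqa["split_vqa"] == "test"]
val_qs = vqa[vqa["split_vqa"] == "validate"]

print(len(test_qs), "test questions for", test_qs["study_id"].nunique(), "cases")
print(len(val_qs), "validation questions for", val_qs["study_id"].nunique(), "cases")
```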
Usage Notes
Between the completion of this VQA dataset and our submission to PhysioNet, another VQA dataset for MIMIC CXR, Medical-Diff-VQA [3], was published; it also incorporates consecutive exams (for the same patient).
This VQA dataset has been used in the evaluation of the ELIXR model in [4] and the Med-Gemini 2D model [5].
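The cited papers use their own evaluation protocols, which are not reproduced here. Purely as an illustration of how the reference answers could be scored, the sketch below computes exact-match accuracy on close-ended questions; it assumes that close-ended reference answers begin with "yes" or "no", which may not hold for every entry, and the function and its argument names are hypothetical.

```python
import pandas as pd


def _first_token(text: str) -> str:
    """Lower-cased first word of an answer, stripped of trailing punctuation."""
    words = str(text).strip().lower().split()
    return words[0].strip(".,!:;") if words else ""


def yes_no_accuracy(vqa_df: pd.DataFrame, model_answers: dict) -> float:
    """Exact-match accuracy on close-ended questions.

    vqa_df: rows of mimic_cxr_vqa.tsv; model_answers maps question_id to the
    model's answer text.
    """
    # Treat questions whose reference answer starts with "yes" or "no" as close-ended.
    closed = vqa_df[vqa_df["expected_answer"].map(_first_token).isin(["yes", "no"])]
    correct = sum(
        _first_token(model_answers.get(row.question_id, ""))
        == _first_token(row.expected_answer)
        for row in closed.itertuples()
    )
    return correct / len(closed) if len(closed) else float("nan")
```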
Ethics
Data used consisted of de-identified radiology report text and labels from the existing MIMIC-CXR and MIMIC-CXR-JPG databases, respectively.
Acknowledgements
We would like to thank Jonathan Krause, Yun Liu and Dale Webster for their feedback in reviewing the manuscript. We would also like to thank the radiologist, Chuck Lau, for curating the reference answers. All work was funded by Google LLC.
Conflicts of Interest
The authors of this report are employees of Google LLC and own Alphabet stock.
References
- Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data. 2019 Dec 12;6(1):1–8.
- Johnson A, et al. MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0) [Internet]. Available from: http://dx.doi.org/10.13026/8360-t248
- Hu X, Gu L, An Q, Zhang M, Liu L, Kobayashi K, Harada T, Summers R, Zhu Y. Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images (version 1.0.0). PhysioNet. 2023. https://doi.org/10.13026/5jes-bx23
- Xu S, Yang L, Kelly C, Sieniek M, Kohlberger T, Ma M, Weng WH, Kiraly A, Kazemzadeh S, Melamed Z, et al. ELIXR: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317; 2023.
- Yang L, Xu S, Sellergren A, Kohlberger T, Zhou Y, Ktena I, et al. Advancing Multimodal Medical Capabilities of Gemini [Internet]. 2024 [cited 2024 May 7]. Available from: http://arxiv.org/abs/2405.03162
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/cvsk-ny21
DOI (latest version):
https://doi.org/10.13026/tz5h-1w39