Visual Question Answering evaluation dataset for MIMIC CXR
Timo Kohlberger, Charles Lau, Tom Pollard, Andrew Sellergren, Atilla Kiraly, Fayaz Jamil
Published: Jan. 28, 2025. Version: 1.0.0
When using this resource, please cite:
Kohlberger, T., Lau, C., Pollard, T., Sellergren, A., Kiraly, A., & Jamil, F. (2025). Visual Question Answering evaluation dataset for MIMIC CXR (version 1.0.0). PhysioNet. https://doi.org/10.13026/cvsk-ny21.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
MIMIC CXR [1] is a large, publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. In addition, labels for the presence of 12 different chest-related pathologies, for the presence of any support devices, and for overall normal/abnormal status were made available via MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) [2]; these labels were generated using the CheXpert and NegBio algorithms.
Based on these labels, we created a visual question answering dataset comprising 224 questions for 48 cases from the official test set, and 111 questions for 23 validation cases. The majority (68%) of the questions are close-ended (answerable with yes or no) and ask about the presence of one of 15 chest pathologies, of any support device, or, generically, of any abnormality. The remaining open-ended questions inquire about the location, size, severity, or type of a pathology or device, if present in the specific case as indicated by the MIMIC-CXR-JPG labels.
For each question and case, we also provide a reference answer, which was authored by a board-certified radiologist (with 17 years of post-residency experience) based on the chest X-ray and the original radiology report.
Background
Generating correct answers to visual questions (VQA) about a medical image, such as a chest radiograph (chest X-ray), is an important capability of recent visual language models, and one that is likely to enable many AI-supported applications in the medical field.
To that end, it is important to evaluate the accuracy of multi-modal generative AI models on this kind of task, as we did, for example, for the models published in the preprints [4] and [5].
Methods
We will first describe the selection of validation and test set cases, followed by a heuristic for assigning pre-formulated questions from Table 1 to each case, and finally the curation of ground truth answers by a board-certified radiologist.
For the test set, we randomly selected 8 cases for each of the following conditions from the respective subset of the official test set in which that condition was present (as indicated by the corresponding MIMIC-CXR-JPG label being equal to 1.0):
- Pneumothorax
- Pleural Effusion
- Edema
- Consolidation OR Pneumonia
- Lung Lesion
For some of these cases, conditions other than the one filtered for also happened to be present, e.g. edema in a case that was selected for having the pneumothorax label equal to 1.0.
In addition, we sampled 8 cases where the No Finding label was 1.0, yielding a set of 48 test cases in total.
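As a concrete illustration of this sampling step, the sketch below selects 8 studies per condition from the official test split using the MIMIC-CXR-JPG label files. It is not the code used to build the released dataset: the file names and column names are those of MIMIC-CXR-JPG v2.0.0 as we understand them, the random seed is arbitrary, and no attempt is made to reconcile overlaps between the per-condition samples.

```python
import pandas as pd

# Assumed MIMIC-CXR-JPG files (names/columns as in MIMIC-CXR-JPG v2.0.0); adjust paths as needed.
labels = pd.read_csv("mimic-cxr-2.0.0-chexpert.csv")   # study-level CheXpert labels
splits = pd.read_csv("mimic-cxr-2.0.0-split.csv")      # official train/validate/test split

# Restrict to studies in the official test split.
test_study_ids = set(splits.loc[splits["split"] == "test", "study_id"])
test_labels = labels[labels["study_id"].isin(test_study_ids)]

SEED = 0  # illustrative seed, not the one used for the released dataset

def sample_positive(df: pd.DataFrame, mask: pd.Series, n: int = 8) -> pd.DataFrame:
    """Randomly sample n studies for which the given label mask is positive (== 1.0)."""
    return df.loc[mask].sample(n=n, random_state=SEED)

selected = pd.concat([
    sample_positive(test_labels, test_labels["Pneumothorax"] == 1.0),
    sample_positive(test_labels, test_labels["Pleural Effusion"] == 1.0),
    sample_positive(test_labels, test_labels["Edema"] == 1.0),
    sample_positive(test_labels, (test_labels["Consolidation"] == 1.0)
                                 | (test_labels["Pneumonia"] == 1.0)),
    sample_positive(test_labels, test_labels["Lung Lesion"] == 1.0),
    sample_positive(test_labels, test_labels["No Finding"] == 1.0),
])
print(selected["study_id"].nunique(), "distinct test studies selected")
```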
We used the same case sampling method for the VQA validation set, except that we picked only 7 pleural effusion cases, and only 4 cases each for lung lesion, consolidation or pneumonia, and edema.
Next, for each of these sampled cases, we picked questions by employing the following heuristic:
- Depending on the condition the case was selected for, one of the questions from the second column of Table 1 was assigned for each of the question types listed there (e.g. "presence" or "location").
- If the case has other conditions present (i.e. labeled with 1.0), randomly pick one among those and also assign the corresponding questions from Table 1 as described in step 1. If no other conditions are present, randomly pick a non-positively labeled disease and assign only the corresponding presence question from Table 1.
- Assign a randomly chosen generic, "condition-independent" question from Table 1.
Under this heuristic, there are at least two condition-dependent questions and one generic question per selected case, with additional questions for positive conditions inquiring about the location, type, or severity, in addition to the presence alone.
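To make the three steps of the heuristic concrete, here is a minimal sketch of the assignment logic. The question bank below is a hypothetical stand-in for Table 1 (which is not reproduced here), and the question strings are placeholders, not the released questions.

```python
import random

# Hypothetical stand-in for Table 1: question type -> {condition -> question text}.
QUESTION_BANK = {
    "presence": {
        "Pneumothorax": "Is there a pneumothorax?",
        "Edema": "Is there pulmonary edema?",
    },
    "location": {
        "Pneumothorax": "Where is the pneumothorax located?",
    },
}
GENERIC_QUESTIONS = ["Is this chest X-ray normal or abnormal?"]  # placeholder


def assign_questions(primary, positive_conditions, all_conditions, rng=random):
    """Assign questions to one case, following the three-step heuristic above."""
    questions = []
    # Step 1: every question type available for the condition the case was selected for.
    for by_condition in QUESTION_BANK.values():
        if primary in by_condition:
            questions.append(by_condition[primary])
    # Step 2: one other positive condition, if any, gets its questions too ...
    others = [c for c in positive_conditions if c != primary]
    if others:
        extra = rng.choice(others)
        for by_condition in QUESTION_BANK.values():
            if extra in by_condition:
                questions.append(by_condition[extra])
    else:
        # ... otherwise only the presence question for a randomly chosen negative condition.
        negatives = [c for c in all_conditions if c not in positive_conditions]
        chosen = rng.choice(negatives)
        if chosen in QUESTION_BANK["presence"]:
            questions.append(QUESTION_BANK["presence"][chosen])
    # Step 3: a randomly chosen generic, condition-independent question.
    questions.append(rng.choice(GENERIC_QUESTIONS))
    return questions


# Example: a case selected for pneumothorax that also has edema.
print(assign_questions("Pneumothorax", ["Pneumothorax", "Edema"],
                       ["Pneumothorax", "Edema", "Pleural Effusion"]))
```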
In total, there are 226 questions (154 of which are close-ended) and reference answers for the 48 test cases (with the same number of unique study IDs and DICOM IDs), and 111 questions and answers (73 close-ended) for the 23 validation cases (likewise with unique study IDs and DICOM IDs).
After the cases had been selected and the relevant visual questions generated, a board-certified radiologist determined the ground truth answers based on the CXR image and the original radiology report. They relied predominantly on the CXR image and their own professional judgment, since for some cases and questions the information needed to generate an answer was not contained in the report.
Data Description
The file mimic_cxr_vqa.tsv contains the following columns:
| Column in TSV file | Description |
| --- | --- |
| study_id | Study ID of the case |
| dicom_id | DICOM ID of the specific CXR image |
| split_vqa | VQA split the question/image/answer belongs to (either "test" or "validate") |
| question_id | Unique ID of the question/image/answer triplet |
| question | The question text |
| expected_answer | Expert-curated reference answer |
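Assuming mimic_cxr_vqa.tsv is in the working directory, a minimal loading snippet (not part of the released files) that separates the two VQA splits could look like this:

```python
import pandas as pd

# Load the question/answer table and split it by the split_vqa column.
vqa = pd.read_csv("mimic_cxr_vqa.tsv", sep="\t")

test_qs = vqa[vqa["split_vqa"] == "test"]
val_qs = vqa[vqa["split_vqa"] == "validate"]

print(len(test_qs), "test questions for", test_qs["study_id"].nunique(), "cases")
print(len(val_qs), "validation questions for", val_qs["study_id"].nunique(), "cases")
```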
Usage Notes
Between the completion of this VQA dataset and our submission to PhysioNet, another VQA dataset for MIMIC CXR, Medical-Diff-VQA [3], was published; it also incorporates consecutive exams (for the same patient).
This VQA dataset has been used in the evaluation of the ELIXR model in [4] and the Med-Gemini 2D model [5].
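The cited papers use their own evaluation protocols, which are not reproduced here. Purely as an illustration of how the reference answers could be scored, the sketch below computes exact-match accuracy on close-ended questions; it assumes that close-ended reference answers begin with "yes" or "no", which may not hold for every entry, and the function and its argument names are hypothetical.

```python
import pandas as pd


def _first_token(text: str) -> str:
    """Lower-cased first word of an answer, stripped of trailing punctuation."""
    words = str(text).strip().lower().split()
    return words[0].strip(".,!:;") if words else ""


def yes_no_accuracy(vqa_df: pd.DataFrame, model_answers: dict) -> float:
    """Exact-match accuracy on close-ended questions.

    vqa_df: rows of mimic_cxr_vqa.tsv; model_answers maps question_id to the
    model's answer text.
    """
    # Treat questions whose reference answer starts with "yes" or "no" as close-ended.
    closed = vqa_df[vqa_df["expected_answer"].map(_first_token).isin(["yes", "no"])]
    correct = sum(
        _first_token(model_answers.get(row.question_id, ""))
        == _first_token(row.expected_answer)
        for row in closed.itertuples()
    )
    return correct / len(closed) if len(closed) else float("nan")
```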
Ethics
Data used consisted of de-identified radiology report text and labels from the existing MIMIC-CXR and MIMIC-CXR-JPG databases, respectively.
Acknowledgements
We would like to thank Jonathan Krause, Yun Liu and Dale Webster for their feedback in reviewing the manuscript. We would also like to thank the radiologist, Chuck Lau, for curating the reference answers. All work was funded by Google LLC.
Conflicts of Interest
The authors of this report are employees of Google LLC and own Alphabet stock.
References
- Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data. 2019 Dec 12;6(1):1–8.
- Johnson A, et al. MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0) [Internet]. Available from: http://dx.doi.org/10.13026/8360-t248
- Hu X, Gu L, An Q, Zhang M, Liu L, Kobayashi K, Harada T, Summers R, Zhu Y. Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images (version 1.0.0). PhysioNet. 2023. https://doi.org/10.13026/5jes-bx23
- Xu S, Yang L, Kelly C, Sieniek M, Kohlberger T, Ma M, Weng WH, Kiraly A, Kazemzadeh S, Melamed Z, et al. ELIXR: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317; 2023.
- Yang L, Xu S, Sellergren A, Kohlberger T, Zhou Y, Ktena I, et al. Advancing Multimodal Medical Capabilities of Gemini [Internet]. 2024 [cited 2024 May 7]. Available from: http://arxiv.org/abs/2405.03162
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/cvsk-ny21
DOI (latest version):
https://doi.org/10.13026/tz5h-1w39