Database Credentialed Access

MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing

Benedikt Boecking Naoto Usuyama Shruthi Bannur Daniel Coelho de Castro Anton Schwaighofer Stephanie Hyland Harshita Sharma Maria Teodora Wetscherek Tristan Naumann Aditya Nori Javier Alvarez Valle Hoifung Poon Ozan Oktay

Published: Nov. 15, 2024. Version: 1.1.0


When using this resource, please cite:
Boecking, B., Usuyama, N., Bannur, S., Coelho de Castro, D., Schwaighofer, A., Hyland, S., Sharma, H., Wetscherek, M. T., Naumann, T., Nori, A., Alvarez Valle, J., Poon, H., & Oktay, O. (2024). MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing (version 1.1.0). PhysioNet. https://doi.org/10.13026/9g2z-jg61.

Additionally, please cite the original publication:

Boecking B, Usuyama N, Bannur S, Castro D.C., Schwaighofer A, Hyland S, Wetscherek M, Naumann T, Nori A, Alvarez-Valle J, Poon H, and Oktay O. 2022. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, Oct 23–27, 2022, Proceedings, Part XXXVI. Springer-Verlag 1–21. https://doi.org/10.1007/978-3-031-20059-5_1

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

We release a new dataset, MS-CXR, with locally-aligned phrase grounding annotations by board-certified radiologists to facilitate the study of complex semantic modelling in biomedical vision–language processing. The MS-CXR dataset provides 1162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. This dataset complements the existing MIMIC-CXR v2 dataset and comprises: (1) reviewed and edited bounding boxes and phrases (1026 bounding box/sentence pairs); and (2) bounding boxes manually annotated from scratch (136 bounding box/sentence pairs).

This large, well-balanced phrase grounding benchmark contains carefully curated image regions annotated with descriptions of eight radiology findings, as verified by radiologists. Unlike existing chest X-ray benchmarks, this challenging phrase grounding task evaluates joint, local image–text reasoning while requiring real-world language understanding, e.g. to parse domain-specific location references, complex negations, and bias in reporting style. These data accompany work showing that principled textual semantic modelling can improve contrastive learning in self-supervised vision–language processing.


Background

Presently, no datasets exist that allow for phrase grounding of radiology findings, but some enable other forms of local image evaluation. The VinDr [1,2], RSNA Pneumonia [3], and NIH Chest X-ray [4] datasets provide bounding-box annotations but lack free-text descriptions. REFLACX [5,6] provides gaze locations (ellipses) captured with an eye tracker and dictated reports, but no full matches of phrases to image regions. The phrase annotations for MIMIC-CXR [7,8] data released in [9] are small in scale (350 studies), cover only two abnormalities, and for some samples use shortened phrases that were adapted to simplify the task. The ground-truth set of ImaGenome [10] contains only 500 studies, its bounding boxes annotate anatomical regions rather than radiological findings, and its sentence annotations are not curated for grounding evaluation.


Methods

We first parse the original MIMIC-CXR reports and REFLACX radiology transcripts, extracting sentences to form a large pool of text descriptions of pathologies. These candidates are then filtered using the CheXbert text classifier [9], keeping only phrases associated with the target pathologies whilst ensuring two criteria: (I) for a given study, only one sentence describes the target pathology, and (II) the sentence does not mention multiple findings that are unrelated to each other. The extracted text descriptions are then paired with image annotations at the study level. In a final stage, two board-certified radiologists review the candidates, mainly to verify the match between the text and the bounding boxes. In this review, they also assess the suitability of the annotation pairs for the phrase grounding task whilst ensuring clinical accuracy.

In detail, phrase-image samples are filtered out if at least one of the following conditions is met (a code sketch illustrating the automatable checks follows the list):

  1. The finding described in the text is not visible in the image.
  2. The phrase/sentence does not describe a clinical finding, or it describes multiple unrelated abnormalities that appear in different lung regions.
  3. There is a mismatch between the bounding box and the phrase, e.g. the image annotations are placed incorrectly or do not capture the true extent of the abnormality.
  4. High uncertainty is expressed regarding the reported findings, e.g. “there is questionable right lower lobe opacity”.
  5. The chest X-ray is not suitable for assessment of the finding or has poor image quality.
  6. The text contains differential diagnoses or longitudinal information that prohibits correct grounding via the single paired image.
  7. The sentence is long (>30 tokens); such sentences often contain patient meta-information that is not shared between the two modalities (e.g. de-identified tokens).

Note that we only filter out phrases containing multiple findings, not images with multiple findings. For instance, if an image contains both pneumonia and atelectasis, with separate descriptions for each in the report, then we create two instances of phrase-bounding box pairs. 
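
To make these criteria concrete, below is a minimal sketch of how the automatable checks could look. It is illustrative only: toy_labeler is a keyword stand-in for the CheXbert classifier used in the actual pipeline, whose interface is not part of this release.

# Minimal sketch of two automatic filtering criteria (sentence length and
# single-finding check). toy_labeler is a hypothetical keyword stand-in for
# the CheXbert text classifier; this is NOT the released curation pipeline.
MAX_TOKENS = 30  # criterion 7: discard long sentences

KEYWORDS = {
    "pneumonia": "Pneumonia",
    "pneumothorax": "Pneumothorax",
    "effusion": "Pleural effusion",
}

def toy_labeler(sentence: str) -> set:
    """Return the set of findings mentioned in a sentence (toy labeller)."""
    text = sentence.lower()
    return {label for keyword, label in KEYWORDS.items() if keyword in text}

def keep_candidate(sentence: str, target_finding: str) -> bool:
    # Criterion 7: long sentences often carry meta-information (e.g.
    # de-identified tokens) that is not shared between the two modalities.
    if len(sentence.split()) > MAX_TOKENS:
        return False
    # Criteria (I)/(II) and criterion 2: keep the sentence only if it
    # describes the target finding and no unrelated findings.
    return toy_labeler(sentence) == {target_finding}

print(keep_candidate("Small right apical pneumothorax.", "Pneumothorax"))     # True
print(keep_candidate("Pneumonia and a small pneumothorax.", "Pneumothorax"))  # False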

To further increase the size of our dataset, and to balance samples across classes, additional CXR studies are sampled at random, conditioned on the underrepresented pathologies. For these selected studies, pairs of image and text annotations are created as follows: text descriptions are extracted using the same methodology outlined above, from the MIMIC-CXR and ImaGenome datasets [7,8,10], where the latter provides sentence extracts for clinical findings from a subset of the MIMIC-CXR dataset. Differently from the initial step, however, the corresponding bounding-box annotations (one or more per sentence) are created from scratch by radiologists for the finding described in the text, and the same filtering as above is applied by the annotator to discard candidates whose image and/or sentence is unsuitable for the grounding task.

Split

To facilitate comparison, we provide a recommended split. This 70:15:15 train-validation-test split is defined on the patient level and is stratified according to pathology (finding category) and gender. We used MIMIC-IV v1.0.0 to obtain demographic information. Since each patient can have multiple samples with different pathologies, we used each patient's least-represented pathology for stratification. For example, if a patient had findings for both cardiomegaly (common) and edema (rare), we labelled that patient with edema. Statistics on the resulting splits are provided below. We note that this is the split used by [11].
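
The released files already contain this split; the sketch below merely illustrates the described procedure (rarest-pathology labelling per patient, joint stratification by finding and gender, 70:15:15 patient-level partition). The column names subject_id, category_name, and gender are assumptions and may differ from the released CSV.

# Illustrative sketch of the split procedure, not the code used to create
# the released split. Column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("MS_CXR_Local_Alignment_v1.1.0.csv")

# Label each patient with their rarest pathology (least represented overall).
counts = df["category_name"].value_counts()
per_patient = (
    df.assign(rarity=df["category_name"].map(counts))
      .sort_values("rarity")
      .groupby("subject_id")
      .first()[["category_name", "gender"]]
)

# Stratify jointly by finding category and gender; very small strata may
# need to be merged for this to work in practice.
strata = per_patient["category_name"] + "_" + per_patient["gender"]

# 70:15:15 patient-level split: first peel off 30%, then halve it.
train_ids, holdout_ids = train_test_split(
    per_patient.index, test_size=0.30, stratify=strata, random_state=0)
val_ids, test_ids = train_test_split(
    holdout_ids, test_size=0.50, stratify=strata.loc[holdout_ids], random_state=0)

# Map the patient-level assignment back to annotation rows.
df["split"] = "train"
df.loc[df["subject_id"].isin(val_ids), "split"] = "val"
df.loc[df["subject_id"].isin(test_ids), "split"] = "test"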


Data Description

We provide bounding box and sentence pair annotations describing clinical findings visible in a given chest X-ray image. Each sentence describes a single pathology present in the image, and there may be multiple manually annotated bounding boxes corresponding to the description of that single radiological finding. Additionally, an image may have more than one pathology present, in which case we provide separate sets of bounding boxes for each phrase describing a unique pathology associated with the image. The annotations were collected on a subset of MIMIC-CXR images and cover eight different pathologies: atelectasis, cardiomegaly, consolidation, edema, lung opacity, pleural effusion, pneumonia, and pneumothorax. These pathologies were chosen based on the overlap between the pathology classes present in the existing datasets and those covered by the CheXbert classifier [9].

Folder structure

This project contains 3 files:

  • MS_CXR_Local_Alignment_v1.1.0.json: Phrase grounding annotations in MS-COCO JSON format.
  • MS_CXR_Local_Alignment_v1.1.0.csv: The same annotations in a tabular format.
  • convert_coco_json_to_csv.py: Python script used to read and convert the COCO annotations.

Annotation schema

The dataset annotations are provided in MS-COCO JSON format. We also provide the annotations in CSV format for convenience. The files contain the following fields:

  • Categories: List of conditions/pathologies
  • Images: Metadata of the original chest X-ray images. The images need to be separately downloaded from MIMIC-CXR / MIMIC-CXR-JPG projects.
  • Annotations: Each entry in the annotations field represents a bounding box with an associated sentence describing a condition/pathology. Images may have multiple associated annotations.

An example annotation in MS-COCO JSON format is shown below:

{
    "info": {
        "year": "2024,
        "version": "1.1.0"
        "description": "MS-CXR Locally Aligned Phrase Grounding Annotations",
        "contributor": "Microsoft",
        "date_created": "2024-07-01",
        "url": "https://arxiv.org/abs/2204.09817"
    },
    "licenses": [
        {
            "url": "https://www.physionet.org/about/licenses/physionet-credentialed-health-data-license-150/",
            "id": 1,
            "name": "PhysioNet Credentialed Health Data License 1.5.0"
        }
    ],
    "categories": [
        {
            "id": 0,
            "name": "Pneumothorax",
            "supercategory": "disease"
        },
        ...
    ],
    "images": [
        {
            "id": 16,
            "file_name": "c436cddb-4126f15e-59c0733c-34b5a4b5-bbda7ffd.jpg",
            "width": 2539,
            "height": 3050,
            "num_annotations": 3,
            "path": "/datasetdrive/MIMIC-CXR-V2/mimic-cxr-jpg-2.0.0.physionet.org/files/p15/p15928453/c436cddb-4126f15e-59c0733c-34b5a4b5-bbda7ffd.jpg",
        },
        ...
    ],
    "annotations": [
        {
            "id": 18,
            "image_id": 16,
            ...
            "split": "val"
        }
    ]
}
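
For illustration, the annotations can be read with the Python standard library alone; the sketch below groups them per image. Fields elided above (e.g. bbox, category_id, and the sentence field) are assumed to follow MS-COCO convention; consult the released files for the exact schema.

# Minimal loader using only the standard library. Field names not shown in
# the excerpt above (bbox, category_id) are assumptions based on MS-COCO
# convention.
import json
from collections import defaultdict

with open("MS_CXR_Local_Alignment_v1.1.0.json") as f:
    coco = json.load(f)

images = {img["id"]: img for img in coco["images"]}
category_names = {cat["id"]: cat["name"] for cat in coco["categories"]}

# Group annotations by image; an image may have several phrase/box pairs.
anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

for image_id, anns in anns_by_image.items():
    print(images[image_id]["file_name"], f"({len(anns)} annotations)")
    for ann in anns:
        # COCO bbox convention is [x, y, width, height] in pixel units.
        print("  ", category_names.get(ann.get("category_id")),
              ann.get("bbox"), ann.get("split"))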

Patient Demographics

The average age of subjects in MS-CXR is higher than the average for all subjects in MIMIC-CXR. This finding is concordant with prior work [12], and we attribute it to the fact that we do not sample studies from healthy subjects, who display no anomalous findings and are statistically likely to be younger. Similarly, we do not expect gender bias to arise from our sampling, as none of the pathologies we sample are gender-specific. Overall, MS-CXR does not deviate far from the MIMIC-CXR distribution.

Distribution of the annotation pairs (image bounding box and sentence) across different clinical findings. The demographic statistics (e.g. gender, age) of the subjects are collected from the MIMIC-IV v1.0.0 dataset for MS-CXR and all of MIMIC-CXR.
Findings                    # annotation pairs  # subjects  gender - F (%)   avg age (std)
Atelectasis                 61                  61          28 (45.90%)      64.52 (15.95)
Cardiomegaly                333                 282         135 (47.87%)     68.10 (14.81)
Consolidation               117                 109         40 (36.70%)      60.08 (17.67)
Edema                       46                  42          18 (42.86%)      68.79 (14.04)
Lung opacity                82                  82          33 (40.24%)      62.07 (17.20)
Pleural effusion            96                  95          41 (43.16%)      66.36 (15.29)
Pneumonia                   182                 146         65 (44.52%)      64.32 (17.17)
Pneumothorax                245                 151         66 (43.71%)      60.71 (18.04)
Total                       1162                851         382 (44.89%)     64.37 (16.61)
Background (all MIMIC-CXR)  -                   65379       34134 (52.39%)   56.85 (19.47)

Split

Below we provide sample-level and patient-level statistics for each finding category across the splits.

Split-level statistics. Each subject can have multiple annotations. Splits are on a subject level in 70:15:15 ratio, stratified by finding category and gender. Gender was obtained from MIMIC-IV v1.0.0.
Findings # image-annotation pairs # subjects
  Train Val Test Train Val Test
Atelectasis 44 9 8 44 9 8
Cardiomegaly 232 48 53 195 42 45
Consolidation 87 15 15 79 15 15
Edema 32 6 8 30 5 7
Lung Opacity 57 13 12 57 13 12
Pleural Effusion 67 15 14 66 15 14
Pneumonia 127 25 30 102 22 22
Pneumothorax 171 38 36 106 21 24
Total 817 169 176 595 128 128

Usage Notes

We are releasing the MS-CXR dataset to encourage reproducible evaluation of joint latent semantics learnt by biomedical image-text models. Accurate local alignment between these two modalities is an essential characteristic of successful joint image-text training in healthcare since image and report samples often contain multiple clinical findings. In an associated paper [13], we provide comprehensive evaluations of current state-of-the-art multi-modal models and a promising approach to improve the models further. 

The dataset annotations are provided in MS-COCO format. Any library/API supporting the MS-COCO format (e.g. cocoapi) can be used to load the annotations. The annotations are also provided in CSV format for convenience.
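
For instance, a brief sketch using pycocotools (installable via pip install pycocotools):

# Querying the annotations with cocoapi/pycocotools.
from pycocotools.coco import COCO

coco = COCO("MS_CXR_Local_Alignment_v1.1.0.json")

# The eight finding categories.
for cat in coco.loadCats(coco.getCatIds()):
    print(cat["id"], cat["name"])

# All bounding-box annotations for one example image ("bbox" is assumed to
# follow the standard COCO [x, y, width, height] convention).
first_image_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=[first_image_id])):
    print(ann["image_id"], ann.get("bbox"))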

We have provided a recommended patient-level split of the data. We encourage users of the dataset to follow this split where applicable (e.g. if some of the data is required for training), to facilitate comparison in the literature.


Release Notes

v1.1.0

Split metadata added. This adds a column "split" to the CSV, and a new key "split" at the annotation level in the JSON. All other information is unchanged.

We note that the previous version of MS-CXR was denoted in PhysioNet as v0.1 whereas the files themselves used v1.0.0 for naming. The files are now v1.1.0 and the PhysioNet project is now also referred to as v1.1.0.


Ethics

MS-CXR is a research artifact of the corresponding work, Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. In this capacity, MS-CXR facilitates reproducibility and serves as an addition to the benchmarking landscape. The dataset is released with instances chosen from the public MIMIC-CXR v2 image-text dataset. As such, the ethical considerations of that project should be taken into consideration in addition to those provided below.

MS-CXR contains a large number of samples covering 8 findings, balanced to ensure coverage of all findings and curated to enable gold-standard evaluation of phrase grounding. To ensure a high-quality, consistent benchmark, phrase-image samples that do not adhere to the guidelines (detailed in the corresponding work) are filtered out, including phrases containing multiple abnormalities in distinct lung regions.

In concordance with existing research [12], the application of filters results in a dataset that is both slightly older (average age 64.37 vs 56.85 in all MIMIC-CXR v2) and slightly less female (percentage female 44.89% vs 52.39% in all MIMIC-CXR). While these are relatively small shifts and the primary intention of this dataset is to facilitate reproducibility as a benchmark, we have disclosed this both alongside the dataset and in the corresponding work. 


Acknowledgements

The authors would like to thank Hannah Murfet for the guidance offered as part of the compliance review of the datasets used in this study, and Dr Maria Wetscherek and Dr Matthew Lungren for their clinical input and the data annotations provided to this study.

Lastly, the released MS-CXR dataset has been built upon the following public data and benchmarks, and the authors would like to thank their contributors: MIMIC-CXR [7,8], REFLACX [5,6], and Chest ImaGenome [10].


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. arXiv preprint arXiv:2012.15029, 2020
  2. Nguyen, H. Q., Pham, H. H., Tuan Linh, L., Dao, M., & Khanh, L. (2021). VinDr-CXR: An open dataset of chest X-rays with radiologist annotations (version 1.0.0). PhysioNet. https://doi.org/10.13026/3akn-b287.
  3. George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 1(1):e180041, 2019.
  4. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2097–2106. IEEE Computer Society, 2017.
  5. Bigolin Lanfredi, R., Zhang, M., Auffermann, W., Chan, J., Duong, P., Srikumar, V., Drew, T., Schroeder, J., & Tasdizen, T. (2021). REFLACX: Reports and eye-tracking data for localization of abnormalities in chest x-rays (version 1.0.0). PhysioNet. https://doi.org/10.13026/e0dj-8498.
  6. Bigolin Lanfredi, R., Zhang, M., Auffermann, W., Chan, J., Duong, P., Srikumar, V., Drew, T., Schroeder, J., & Tasdizen, T. (2022). REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. Sci Data 9, 350. https://doi.org/10.1038/s41597-022-01441-z
  7. Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. https://doi.org/10.13026/4jqj-jw95.
  8. Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317 (2019). https://doi.org/10.1038/s41597-019-0322-0
  9. Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew Lungren. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1500–1519. Association for Computational Linguistics, 2020.
  10. Joy T Wu, Nkechinyere Nneka Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, William G Mitchell, Satyananda Kashyap, Andrea Giovannini, et al. Chest imagenome dataset for clinical reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  11. Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Fabian Falck, Ozan Oktay, Anja Thieme, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland (2024). MAIRA-2: Grounded Radiology Report Generation. arXiv:2406.04449
  12. Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, Raghavan VA, Turchin A, Zhou X, Murphy SN, Mandl KD. Biases introduced by filtering electronic health records for patients with "complete data". J Am Med Inform Assoc. 2017 Nov 1;24(6):1134-1141. doi: 10.1093/jamia/ocx071. PMID: 29016972; PMCID: PMC6080680.
  13. Boecking B, Usuyama N, Bannur S, Castro D.C., Schwaighofer A, Hyland S, Wetscherek M, Naumann T, Nori A, Alvarez-Valle J, Poon H, and Oktay O. 2022. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, Oct 23–27, 2022, Proceedings, Part XXXVI. Springer-Verlag 1–21. https://doi.org/10.1007/978-3-031-20059-5_1
  14. L.K. Tam, X. Wang, E. Turkbey, K. Lu, Y. Wen, and D. Xu. Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2020, March 2020.

Parent Projects
MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing was derived from the MIMIC-CXR Database. Please cite it when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Versions
  • 0.1 - May 16, 2022
  • 1.1.0 - Nov. 15, 2024
