
Application of Med-PaLM 2 in the refinement of MIMIC-CXR labels

Kendall Park, Rory Sayres, Andrew Sellergren, Tom Pollard, Fayaz Jamil, Timo Kohlberger, Charles Lau, Atilla Kiraly

Published: Feb. 4, 2025. Version: 1.0.0


When using this resource, please cite:
Park, K., Sayres, R., Sellergren, A., Pollard, T., Jamil, F., Kohlberger, T., Lau, C., & Kiraly, A. (2025). Application of Med-PaLM 2 in the refinement of MIMIC-CXR labels (version 1.0.0). PhysioNet. https://doi.org/10.13026/7wmp-jx90.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

MIMIC-CXR is a large, open-source dataset that is widely used in medical AI research. One limitation of this dataset is the lack of ground truth labels for the chest X-ray studies. Prior work has extracted structured labels from the MIMIC-CXR radiology report text using CheXpert, a natural language processing (NLP) model. As comprehensive expert validation of these labels is cost-prohibitive, there is a need for scalable methods of identifying NLP-derived labels that would benefit from manual review. We developed prompts for the extraction of clinically relevant labels using a clinically trained large language model, Med-PaLM 2, which we selectively applied to MIMIC-CXR radiology reports. A subset of cases where the Med-PaLM 2 results differed from the previously published CheXpert labels was reviewed by three US board-certified radiologists to establish a ground truth. On these differing labels, Med-PaLM 2 achieved an accuracy of 66%, compared to 19% for CheXpert. Our results demonstrate the potential of medically oriented large language models such as Med-PaLM 2 both for label extraction and for identifying cases for manual review. This dataset adds 1,378 radiologist-verified ground truth labels to the MIMIC-CXR project.


Background

Generative AI, with its ability to learn from vast amounts of healthcare data and create new, realistic outputs, holds immense potential to transform the healthcare landscape. However, data availability remains a significant bottleneck for healthcare research efforts. To the best of our knowledge, MIMIC-CXR is the largest open-source radiology dataset that features images and associated report text [1].

One of the major limitations of the MIMIC-CXR dataset is the limited number of radiologist-verified ground-truth labels for the chest X-ray (CXR) findings. While understandable (annotating 227,827 reports across more than a dozen conditions would be a substantial undertaking), this lack of labels hinders the dataset's utility for tasks requiring distinct diagnostic categories. MIMIC-CXR-JPG 2.0.0 provides labels based on two different Natural Language Processing (NLP) models, CheXpert and NegBio [2], with proposed train, validation, and test splits for future work. Radiologist-verified ground truth for a small subset of the test set was released in the MIMIC-CXR-JPG 2.1.0 update [3]. Collecting ground-truth labels for the test set alone could require well over 100,000 annotations (the product of 3,269 reports, 12+ conditions, and at least three radiologists for adjudication).

Our contributions offer an AI-driven alternative to a brute-force approach to label refinement and a step towards establishing ground truth in the test set. First, we demonstrate how a large language model such as Med-PaLM 2 can be used to corroborate prior NLP labels [2] and to identify cases for manual review. Second, we offer ground-truth labels for selected cases and conditions following expert review by three US board-certified radiologists.


Methods

Figure 1 (provided with the data files). Inclusions and exclusions of report-finding pairs for expert review.

Our data was sourced from the radiology reports contained in MIMIC-CXR and the CheXpert labels provided by MIMIC-CXR-JPG 2.0.0. Reports were matched to candidate findings using keyword search. Reports and potential findings were then sent to Med-PaLM 2 to determine whether each finding was present in the report. Med-PaLM 2 labels that differed from the CheXpert labels [2] were identified, and a subset of these were sent for manual review to establish ground truth. As a final step, we compared the differing Med-PaLM 2 and CheXpert labels with this ground truth.

Med-PaLM 2 Queries

Given the sparsity of a wide range of findings in CXR [4], we performed an initial keyword search to match reports with potential findings of interest. Med-PaLM 2, a large language model trained on medical data, was then queried twice for each report and associated key term, once with each of two prompt templates. In total, 23,824 queries were sent to Med-PaLM 2. The vast majority of the responses were "Yes" or "No"; the remaining, wordier responses were resolved manually.
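
As an illustration of this response handling, a minimal normalization sketch in Python; the logic and function name are assumptions, not the actual pipeline code:

import re

def normalize_response(text):
    """Map a free-text Med-PaLM 2 response to 'yes', 'no', or None (needs manual resolution)."""
    answer = text.strip().lower()
    if re.match(r"^yes\b", answer):
        return "yes"
    if re.match(r"^no\b", answer):
        return "no"
    return None  # wordier response; flag for manual review

print(normalize_response("Yes, the report mentions a rib fracture."))  # -> "yes"
print(normalize_response("The report does not address this."))         # -> None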

Med-PaLM 2 labels were aggregated and compared with the associated CheXpert labels. The Med-PaLM 2 label for a particular finding was considered positive if at least one of the queries returned an affirmative response for at least one of the associated key terms (Figure 2). Of the total report-finding pairs that were compared, roughly half agreed with the CheXpert labels, with atelectasis and consolidation providing the largest numbers of agreements. Cases where Med-PaLM 2 and CheXpert disagreed on the finding label were identified. A subset of these disagreements (Figure 1) was selected for manual review, with a focus on identifying positive labels while reducing the total number of labels for radiologists to review.
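
A sketch of this aggregation and disagreement-selection logic; the record structure, identifiers, and labels below are illustrative, not the actual pipeline code:

# Illustrative per-query records: (study_id, finding, key_term, normalized_answer).
query_results = [
    ("study_a", "Fracture", "fracture", "no"),
    ("study_a", "Fracture", "fracture", "yes"),   # second prompt template
    ("study_b", "Edema", "pulmonary edema", "no"),
]

# A (study, finding) pair is labeled positive if ANY query over ANY of its
# associated key terms returned an affirmative answer.
medpalm_labels = {}
for study_id, finding, _term, answer in query_results:
    key = (study_id, finding)
    if answer == "yes":
        medpalm_labels[key] = 1.0
    else:
        medpalm_labels.setdefault(key, 0.0)

# Compare with the CheXpert labels to find disagreements for possible manual review.
chexpert_labels = {("study_a", "Fracture"): 0.0, ("study_b", "Edema"): 0.0}
disagreements = [
    pair for pair, chexpert in chexpert_labels.items()
    if medpalm_labels.get(pair, 0.0) != chexpert
]
print(disagreements)  # [('study_a', 'Fracture')]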

[Bot] I'm a helpful radiology assistant, who provides concise answers to questions about information in a chest x-ray report.

[User] Determine the answer to the following question: [Does the patient have a fracture?], 

given the context of the follow chest x-ray report:

<REPORT TEXT>

Do not mention conditions or parts of the report not relevant to the question.

Make sure to only answer: [Does the patient have a fracture?]

[Bot]

An example query for Med-PaLM 2 using Prompt Template 1. The term fracture was replaced with terms for other findings.

You are a helpful medical knowledge assistant. Provide useful, complete, concise, and scientifically-grounded queries to radiology reports.

Does this report mention that the patient has a fracture? Report:
<REPORT TEXT>

An example query for Med-PaLM 2 using Prompt Template 2. The term fracture was replaced with associated key terms (Figure 2) for other findings.
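
For illustration, a minimal sketch of how the two templates could be instantiated in code; the template strings reproduce the examples above, while the function and variable names are illustrative and not part of the released methodology:

# The two prompt templates shown above, with {term} and {report} placeholders.
PROMPT_TEMPLATE_1 = (
    "[Bot] I'm a helpful radiology assistant, who provides concise answers to "
    "questions about information in a chest x-ray report.\n\n"
    "[User] Determine the answer to the following question: "
    "[Does the patient have a {term}?], \n\n"
    "given the context of the follow chest x-ray report:\n\n"
    "{report}\n\n"
    "Do not mention conditions or parts of the report not relevant to the question.\n\n"
    "Make sure to only answer: [Does the patient have a {term}?]\n\n"
    "[Bot]"
)

PROMPT_TEMPLATE_2 = (
    "You are a helpful medical knowledge assistant. Provide useful, complete, "
    "concise, and scientifically-grounded queries to radiology reports.\n\n"
    "Does this report mention that the patient has a {term}? Report:\n{report}"
)

def build_prompts(report_text, key_term):
    """Instantiate both templates for one report / key-term pair (two queries per pair)."""
    return [
        PROMPT_TEMPLATE_1.format(term=key_term, report=report_text),
        PROMPT_TEMPLATE_2.format(term=key_term, report=report_text),
    ]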

Atelectasis: 'atelectasis'

Cardiomegaly: 'cardiac silhouette', 'cardiomegaly'

Consolidation: 'consolidation'

Edema: 'pulmonary vascular congestion', 'pulmonary edema'

Fracture: 'fracture'

Lung Lesion: 'lung nodule', 'nodule', 'nodular opacity'

Lung Opacity: 'airspace opacity', 'airspace opacities', 'lung opacities', 'lung opacity'

Pleural Effusion: 'pleural effusion'

Pneumonia: 'pneumonia'

Pneumothorax: 'pneumothorax'

Support Devices: 'enteric tube', 'endotracheal tube', 'feeding tube', 'et tube', ' og ', 'picc', 'central line', 'central venous catheter', ' ng '

Figure 2. Findings and associated key terms.
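
A sketch of the keyword-matching step using the Figure 2 terms; the dictionary below reproduces the figure, while the case-insensitive substring matching is an assumption rather than the exact implementation used:

# Findings and associated key terms from Figure 2.
KEY_TERMS = {
    "Atelectasis": ["atelectasis"],
    "Cardiomegaly": ["cardiac silhouette", "cardiomegaly"],
    "Consolidation": ["consolidation"],
    "Edema": ["pulmonary vascular congestion", "pulmonary edema"],
    "Fracture": ["fracture"],
    "Lung Lesion": ["lung nodule", "nodule", "nodular opacity"],
    "Lung Opacity": ["airspace opacity", "airspace opacities", "lung opacities", "lung opacity"],
    "Pleural Effusion": ["pleural effusion"],
    "Pneumonia": ["pneumonia"],
    "Pneumothorax": ["pneumothorax"],
    "Support Devices": ["enteric tube", "endotracheal tube", "feeding tube", "et tube",
                        " og ", "picc", "central line", "central venous catheter", " ng "],
}

def candidate_findings(report_text):
    """Return findings whose key terms appear in the report (case-insensitive substring match)."""
    text = report_text.lower()
    return {
        finding: [t for t in terms if t in text]
        for finding, terms in KEY_TERMS.items()
        if any(t in text for t in terms)
    }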

Reader Study

A total of 1,378 differing labels between Med-PaLM 2 and CheXpert were sent for manual review by three US board-certified radiologists. For each finding and associated report, radiologists selected one of the four possible labels defined by MIMIC-CXR-JPG: positive, negative, uncertain, and not mentioned [2]. There was strong interrater agreement (Fleiss' κ = 0.71); reviewers were unanimous for 77% of the labels. In cases of disagreement between reviewers, a majority vote was used (21%), and when all three reviewers disagreed (2%), a senior academic thoracic radiologist provided the final determination.
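
A sketch of this adjudication rule; label values follow the MIMIC-CXR-JPG schema, and the function itself is illustrative:

from collections import Counter

def adjudicate(reader_labels):
    """Combine three reader labels into (final_label, reader_agreement).

    reader_agreement: 1 = unanimous, 2 = majority of two, 3 = full disagreement
    (the final label then comes from a fourth, senior reviewer).
    """
    label, count = Counter(reader_labels).most_common(1)[0]
    if count == 3:
        return label, 1
    if count == 2:
        return label, 2
    return None, 3  # escalate to the fourth reviewer

print(adjudicate([1.0, 1.0, 1.0]))   # (1.0, 1)
print(adjudicate([1.0, 0.0, 1.0]))   # (1.0, 2)
print(adjudicate([1.0, 0.0, -1.0]))  # (None, 3)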

Analysis

When Med-PaLM 2 and CheXpert disagreed on a label, the CheXpert label matched ground truth 19% of the time. In comparison, Med-PaLM 2's label matched ground truth 66% of the time. Most of Med-PaLM 2's incorrect labels corresponded to ground-truth values outside its binary classification task: uncertain (80% of its errors) and missing (11%).

 

condition         | Med-PaLM 2 Correct | CheXpert Correct | Both Incorrect | Total
atelectasis       | 166 (75.1%)        | 6 (2.7%)         | 49 (22.2%)     | 221
cardiomegaly      | 130 (88.4%)        | 0 (0.0%)         | 17 (11.6%)     | 147
consolidation     | 14 (26.4%)         | 36 (67.9%)       | 3 (5.7%)       | 53
edema             | 40 (64.5%)         | 9 (14.5%)        | 13 (21.0%)     | 62
fracture          | 79 (78.2%)         | 19 (18.8%)       | 3 (3.0%)       | 101
lung lesion       | 36 (69.2%)         | 8 (15.4%)        | 8 (15.4%)      | 52
lung opacity      | 30 (96.8%)         | 0 (0.0%)         | 1 (3.2%)       | 31
pleural effusion  | 109 (52.2%)        | 64 (30.6%)       | 36 (17.2%)     | 209
pneumonia         | 130 (41.9%)        | 110 (35.5%)      | 70 (22.6%)     | 310
pneumothorax      | 33 (66.0%)         | 8 (16.0%)        | 9 (18.0%)      | 50
support devices   | 139 (97.9%)        | 2 (1.4%)         | 1 (0.7%)       | 142
total             | 906 (65.7%)        | 262 (19.0%)      | 210 (15.2%)    | 1,378

Table 1. Comparison of the differing labels between Med-PaLM 2 and CheXpert, evaluated against the radiologist ground truth.
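
A sketch of how a per-condition comparison of this form can be computed, assuming a frame that joins the ground-truth reader labels with the two model labels; the model-label columns are hypothetical and not part of the released file:

import pandas as pd

# Hypothetical merged frame: one row per disagreeing report-finding pair, holding the
# adjudicated reader_label plus the Med-PaLM 2 and CheXpert labels (column names assumed).
df = pd.DataFrame({
    "finding": ["Pneumonia", "Pneumonia", "Fracture"],
    "reader_label": [1.0, 0.0, 1.0],
    "medpalm_label": [1.0, 1.0, 1.0],
    "chexpert_label": [0.0, 0.0, 0.0],
})

df["medpalm_correct"] = df["medpalm_label"] == df["reader_label"]
df["chexpert_correct"] = df["chexpert_label"] == df["reader_label"]
df["both_incorrect"] = ~df["medpalm_correct"] & ~df["chexpert_correct"]

summary = df.groupby("finding")[["medpalm_correct", "chexpert_correct", "both_incorrect"]].sum()
summary["total"] = summary.sum(axis=1)
print(summary)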


Data Description

Ground Truth Labels

Ground truth labels for the differing study-finding pairs can be found in the compressed CSV file mimic-cxr-2.0.0-validated.csv.gz. We have also provided the level of agreement between the expert reviewers for a more granular picture of label uncertainty. A minimal loading sketch follows the field list below.

  • subject_id - MIMIC-CXR identifier generated for each patient [1].
  • study_id - MIMIC-CXR identifier generated for each study [1].
  • finding - One of the following previously defined findings [5]:
    • Atelectasis
    • Cardiomegaly
    • Consolidation
    • Edema
    • Fracture
    • Lung Lesion
    • Lung Opacity
    • Pleural Effusion
    • Pneumonia
    • Pneumothorax
    • Support Devices
  • reader_label - The adjudicated reader label, following the schema set by MIMIC-CXR-JPG [2]:
    • 1.0 - The label was positively mentioned in the associated study, and is present in one or more of the corresponding images.
    • 0.0 - The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images.
    • -1.0 - The label was either: (1) mentioned with uncertainty in the report, and therefore may or may not be present to some degree in the corresponding image, or (2) mentioned with ambiguous language in the report and it is unclear if the pathology exists or not.
    • Missing (empty element) - No mention of the label was made in the report.
  • reader_agreement - The level of agreement on each label by the expert reviewers:
    • 1 - all three readers agreed on the label.
    • 2 - two of the three readers agreed on the label; the majority label was selected.
    • 3 - all three readers disagreed on the label; final decision made by a fourth reviewer.
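
As noted above, the labels ship as a compressed CSV; a minimal loading sketch using pandas, where only the file and column names documented above come from the dataset:

import pandas as pd

# pandas infers the gzip compression from the .gz extension.
labels = pd.read_csv("mimic-cxr-2.0.0-validated.csv.gz")

# Empty reader_label elements ("not mentioned") are read as NaN.
print(labels["finding"].value_counts())
print(labels["reader_agreement"].value_counts())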

Usage Notes

Our work is based on MIMIC-CXR-JPG 2.0.0 [2]. Between the completion of this project and our submission to PhysioNet, MIMIC-CXR-JPG published a highly relevant update to its database. The 2.1.0 version of MIMIC-CXR-JPG features 687 reports from MIMIC-CXR with findings annotated by a single radiologist (two of the study_ids do not match any study_id in the MIMIC-CXR dataset, leaving 685) [3]. Between these two labeled datasets, there is an overlap of only 263 report-finding pairs. Users of MIMIC-CXR may combine the two sets to extend the number of human-verified labels, for example as sketched below. Of the 263 overlapping report-finding pairs, fewer than half of the labels agree; 87% of the disagreements involved cases where the radiologist annotator of MIMIC-CXR-JPG 2.1.0 marked the finding as "missing". Of the 263 overlapping cases, 98% had a majority consensus among our three radiologist reviewers.
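
A sketch of such a combination, assuming the MIMIC-CXR-JPG 2.1.0 annotations have first been reshaped into one row per (study_id, finding) pair; the 2.1.0 file name and its label column are placeholders rather than released files:

import pandas as pd

validated = pd.read_csv("mimic-cxr-2.0.0-validated.csv.gz")

# Placeholder file: the MIMIC-CXR-JPG 2.1.0 single-radiologist test annotations,
# reshaped to one row per (study_id, finding) pair with a label_210 column.
jpg_210 = pd.read_csv("mimic-cxr-jpg-2.1.0-annotations-long.csv")

merged = validated.merge(jpg_210, on=["study_id", "finding"], how="outer", indicator=True)
overlap = merged[merged["_merge"] == "both"]
print(len(overlap), "overlapping report-finding pairs")
# Note: NaN ("missing") labels never compare equal, so missing-vs-missing agreement
# needs explicit handling in a real analysis.
print((overlap["reader_label"] == overlap["label_210"]).mean(), "raw agreement rate")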

Our comparison of these two ground truth datasets demonstrates the importance of labeling methodology in the establishment of ground truth data. We prompted our radiologists to evaluate the report-finding pairs by explicitly choosing one of the four options; no default values were assumed. Using three readers per label gave us the ability to identify and adjudicate reader error. While this replication tripled our labeling cost, we were able to leverage Med-PaLM 2 and prior NLP work to focus effort on harder cases. The CheXpert and NegBio NLP labels show high agreement when evaluated against the MIMIC-CXR-JPG 2.1.0 ground truth test set (> 90% agreement for most findings). In other words, well over 90% of the ground truth labels reviewed by the radiologist in the 2.1.0 update were redundant with the labels previously provided by the NLP models. This makes a more compelling case for the use of clinical LLMs like Med-PaLM 2 in label refinement: by identifying which NLP-derived labels should be sent for manual review, finite expert labeling resources can cover a greater portion of the dataset.

The dataset offers additional uses beyond the refined labels provided. The methodology described here can be applied to other LLMs as a way of evaluating their ability to process radiology reports, using the same prompts and reports. Updating the prompts to explicitly state the expected outputs, including an uncertain category, may improve the yield of relabeling candidates. In addition, the agreement among the three radiologists, as well as the label discrepancies with MIMIC-CXR-JPG 2.1.0, can be used to identify particularly difficult labeling cases for future research.

An internal sandboxed version of Med-PaLM 2 was used for this work. Similar capabilities are available via the MedLM API on Google Cloud [6], which does not require the preamble text in the prompts used. These methods should also extend to other suitable large language model APIs.

These labels have been used in the evaluation of the Med-Gemini 2D model [7].


Release Notes

This is the initial release.


Ethics

Data consisted of radiology report text and NLP labels from the existing MIMIC-CXR and MIMIC-CXR-JPG 2.0.0 databases, respectively. All users of the data were credentialed by PhysioNet and had signed the Data Use Agreement. All data processing through LLMs and subsequent human labeling were conducted on secure internal servers in a sandboxed environment to ensure data privacy and compliance.


Acknowledgements

We would like to thank Jonas Kemp, Peter Clardy, Jonathan Krause, and Dale Webster for their feedback in reviewing the manuscript. Many thanks to Fayaz Jamil for championing the multiple data release processes. We would also like to thank the radiologists involved in reading the reports to establish the final ground truth label. All work was funded by Google LLC.


Conflicts of Interest

Some authors of this report are employees of Google LLC and own Alphabet stock.


References

  1. Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data. 2019 Dec 12;6(1):1–8.
  2. Johnson A, et al. MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0) [Internet]. PhysioNet. Available from: http://dx.doi.org/10.13026/8360-t248
  3. Johnson A, Lungren M, Peng Y, Lu Z, Mark R, Berkowitz S, et al. MIMIC-CXR-JPG - chest radiographs with structured labels [Internet]. 2024 [cited 2024 Dec 26]. Available from: https://physionet.org/content/mimic-cxr-jpg/2.1.0/
  4. Joarder R, Crundwell N. Chest X-Ray in Clinical Practice. Springer Science & Business Media; 2009. 195 p.
  5. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc Conf AAAI Artif Intell. 2019 Jul 17;33(01):590–7.
  6. MedLM models overview [Internet]. Google Cloud. [cited 2024 Dec 16]. Available from: https://cloud.google.com/vertex-ai/generative-ai/docs/medlm/overview
  7. Yang L, Xu S, Sellergren A, Kohlberger T, Zhou Y, Ktena I, et al. Advancing Multimodal Medical Capabilities of Gemini [Internet]. 2024 [cited 2024 May 7]. Available from: http://arxiv.org/abs/2405.03162

Access

Access Policy:
Only registered users who sign the specified data use agreement can access the files.

License (for files):
PhysioNet Restricted Health Data License 1.5.0

Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
