Database Credentialed Access

ReXPref-Prior: A MIMIC-CXR Preference Dataset for Reducing Hallucinated Prior Exams in Radiology Report Generation

Oishi Banerjee, Hong-Yu Zhou, Subathra Adithan, Stephen Kwak, Kay Wu, Pranav Rajpurkar

Published: Aug. 14, 2024. Version: 1.0.0


When using this resource, please cite:
Banerjee, O., Zhou, H., Adithan, S., Kwak, S., Wu, K., & Rajpurkar, P. (2024). ReXPref-Prior: A MIMIC-CXR Preference Dataset for Reducing Hallucinated Prior Exams in Radiology Report Generation (version 1.0.0). PhysioNet. https://doi.org/10.13026/t13x-4r94.

Additionally, please cite the original publication:

Banerjee, O., Zhou, H.-Y., Adithan, S., Kwak, S., Wu, K., & Rajpurkar, P. (2024). Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2406.06496

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Generative vision-language models have exciting potential implications for radiology report generation, but unfortunately such models are also known to produce hallucinations and other nonsensical statements. For example, radiology report generation models regularly hallucinate prior exams, making statements such as “The lungs are hyperinflated with emphysematous changes as seen on prior CT” despite not having access to any prior exam. To address this shortcoming, we propose ReXPref-Prior, an adapted version of MIMIC-CXR where GPT-4 has removed references to prior exams from both findings and impression sections of chest X-ray reports. We expect ReXPref-Prior will be useful for training models that hallucinate prior exams less frequently, through techniques such as direct preference optimization. Additionally, ReXPref-Prior’s validation and test sets can be used as a new benchmark for evaluating report generation models.


Background

Generative vision-language models (VLMs) have exciting potential implications for radiology report generation, but unfortunately VLMs are also known to produce hallucinations and other nonsensical statements that reduce their utility. For example, radiology report generation models regularly hallucinate prior exams, making statements such as “The lungs are hyperinflated with emphysematous changes as seen on prior CT” or “There has been interval increase in right-sided opacity [i.e. compared to a previous image],” despite not having access to any prior exam [1,2,3,4]. In clinical practice, these hallucinations would force clinicians to spend extra effort checking and editing these references; if left uncorrected, these hallucinations could mislead readers about a patient’s progress, with the potential for harm.

To address this shortcoming, we propose ReXPref-Prior, an adapted version of MIMIC-CXR where GPT-4 has removed references to prior exams from both findings and impression sections of chest X-ray reports. We expect ReXPref-Prior will be useful for training models that hallucinate prior exams less frequently, through techniques such as direct preference optimization [5]. Additionally, ReXPref-Prior’s validation and test sets can be used as a new benchmark for evaluating report generation models; compared to MIMIC-CXR, evaluation on ReXPref-Prior more effectively penalizes models that frequently hallucinate prior exams.



Methods

MIMIC-CXR combines three types of data: electronic health record data, chest X-rays (CXR), and free-text radiology reports [6]. Our dataset specifically involves the radiology reports, which can be linked back to their corresponding chest X-rays through anonymized study identifiers. There is one report per study. Please see the MIMIC-CXR PhysioNet page for additional details.

Data Selection

We follow MIMIC-CXR-JPG's established train/validation/test splits, which contain 222,758, 1,808, and 3,269 study IDs, respectively [7]. The three splits contain no overlapping patients. We first exclude reports without clear headers marking a findings or impression section (“FINDINGS:”, “FINDINGS AND IMPRESSION:”, “IMPRESSION”, “CONCLUSION”, or “SUMMARY”). For the training set, we then randomly sample 20,000 reports with the substring “compar” in their findings or impression section, as these reports are especially likely to reference prior exams. We include all reports with clear headings in the validation and test sets.
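The selection step above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`has_clear_header`, `sample_training_reports`) and the seed are our own, and the header list is taken directly from the criteria above.

```python
import random

# Section headers accepted as clear markers (from the selection criteria above).
VALID_HEADERS = ("FINDINGS:", "FINDINGS AND IMPRESSION:", "IMPRESSION", "CONCLUSION", "SUMMARY")

def has_clear_header(report: str) -> bool:
    """A report qualifies only if it contains a recognized section header."""
    return any(header in report for header in VALID_HEADERS)

def sample_training_reports(reports, n=20000, seed=0):
    """Keep reports with a clear header that contain the substring 'compar'
    (e.g. 'compared', 'comparison'), then randomly sample up to n of them."""
    candidates = [r for r in reports if has_clear_header(r) and "compar" in r.lower()]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))
```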

Next, we process these reports with GPT-4 to remove references to prior exams. For the training set, we remove reports whose GPT-4 outputs have invalid formats (e.g. claiming that a line requires no changes but also rewriting it). For the validation and test sets, a research assistant manually fixes GPT-4 outputs with invalid formats. After these steps, our training set contains 19,838 studies, while the validation set contains 915 studies and the test set contains 1,383 studies.

Technical Implementation

We use GPT-4 to remove references to prior exams from the findings and impression sections of MIMIC-CXR reports. We first divide reports into sentences by splitting on the substring ". ". We then pass each report’s lines into GPT-4 and prompt it to classify each line into one of three categories, based on how much the line depends on a prior exam.
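The sentence-splitting step can be sketched as below. The function name is ours, and restoring the trailing period on each fragment (which `str.split` discards) is our assumption; the source only specifies splitting on ". ".

```python
def split_into_lines(section_text: str) -> list[str]:
    """Split a findings/impression section into lines on the '. ' delimiter,
    restoring the trailing period on every fragment but the last
    (an assumption; str.split drops the delimiter)."""
    parts = section_text.split(". ")
    return [p + "." if i < len(parts) - 1 else p for i, p in enumerate(parts)]
```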

1.) Not dependent on a prior exam. Example: “Cardiac silhouette is enlarged.” 

2.) Partially dependent on a prior exam. Example: “Cardiac silhouette is again enlarged.”

3.) Entirely dependent on a prior exam, with no usable information about the current image. Example: “Cardiac silhouette is unchanged.”

We then produce an edited version of each line, based on how it is classified. If a line is not dependent on a prior exam, we make no changes. If a line is partially dependent on a prior exam, we prompt GPT-4 to rewrite it without any reference to a prior exam (“Cardiac silhouette is again enlarged.” → “Cardiac silhouette is enlarged.”). If a line is entirely dependent on a prior exam, we essentially delete it by replacing it with an empty string “”.

We find that GPT-4 substantially reduces the frequency of references to prior exams, with an estimated 2.5x reduction in lines mentioning prior exams across all splits. To achieve this estimate, we use an automatic metric that classifies each line, depending on whether it contains one of 43 phrases that are commonly associated with prior exams ("more", "regress", etc.). Detailed results from this analysis are given in our accompanying paper [8].
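The automatic metric can be sketched as a substring check per line. Only "more" and "regress" are from the source; the other phrases below are illustrative stand-ins for the full list of 43, which is detailed in the accompanying paper [8].

```python
# Illustrative subset of the 43 prior-exam phrases; only "more" and
# "regress" are given in the text, the rest are assumed examples.
PRIOR_PHRASES = ["more", "regress", "again", "unchanged", "interval", "compar", "prior"]

def mentions_prior(line: str) -> bool:
    """Flag a line if it contains any phrase commonly tied to prior exams."""
    lowered = line.lower()
    return any(phrase in lowered for phrase in PRIOR_PHRASES)

def prior_mention_rate(lines) -> float:
    """Fraction of lines flagged as referencing a prior exam."""
    lines = list(lines)
    return sum(mentions_prior(l) for l in lines) / len(lines) if lines else 0.0
```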


Data Description

Each .csv file contains the fields orig_sentence and new_sentence, which give the original and edited versions of sentences. The section field shows whether the sentence came from the findings or impression section of a report. The sentence_id field gives the zero-indexed position of orig_sentence within the original report. study_id matches the original MIMIC-CXR study identifiers.

  • train_rexprefprior.csv: Contains 107,305 sentences from 19,838 reports.
  • val_rexprefprior.csv: Contains 10,224 sentences from 1,705 reports.
  • test_rexprefprior.csv: Contains 18,823 sentences from 2,919 reports.
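Given the schema above, edited reports can be reconstructed by grouping rows on study_id, ordering by sentence_id, and joining the non-empty sentences. This is a sketch under our own assumptions; the sample rows and the `rebuild_reports` helper are hypothetical, not part of the dataset.

```python
from collections import defaultdict

# Hypothetical rows mirroring the .csv schema
# (study_id, section, sentence_id, orig_sentence, new_sentence).
rows = [
    {"study_id": "50000001", "section": "findings", "sentence_id": 0,
     "orig_sentence": "Cardiac silhouette is again enlarged.",
     "new_sentence": "Cardiac silhouette is enlarged."},
    {"study_id": "50000001", "section": "findings", "sentence_id": 1,
     "orig_sentence": "Pleural effusions are unchanged.",
     "new_sentence": ""},  # entirely prior-dependent lines become empty strings
]

def rebuild_reports(rows, field="new_sentence"):
    """Group rows by study, order by sentence_id, and join the non-empty
    sentences back into report text."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["study_id"]].append(row)
    reports = {}
    for study_id, study_rows in grouped.items():
        study_rows.sort(key=lambda r: r["sentence_id"])
        reports[study_id] = " ".join(r[field] for r in study_rows if r[field])
    return reports
```

Passing field="orig_sentence" recovers the unedited report for the same study, which is useful for side-by-side comparison.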



Usage Notes

We introduced this dataset in our paper “Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation” [8]. In this paper, we used this dataset both to fine-tune radiology report generation models so they avoid hallucinating prior exams and to evaluate those models. This data can be reused to fine-tune models so they hallucinate fewer prior exams and to evaluate models from any training procedure. Users should be aware that some references to prior exams remain even after GPT-4’s processing and that GPT-4 occasionally introduces unrealistic wording or deletes extra information.
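For preference-based fine-tuning such as DPO [5], the paired original and edited reports naturally form preference examples: the edited report is preferred over the original. The function and field names below are our own illustrative choices, not a format the dataset prescribes.

```python
def make_preference_pair(study_id: str, orig_report: str, edited_report: str) -> dict:
    """Form one DPO-style preference example: the GPT-4-edited report is
    preferred ('chosen') over the original report, which may reference
    prior exams ('rejected')."""
    return {
        "study_id": study_id,      # links back to the MIMIC-CXR images
        "chosen": edited_report,   # target without hallucinated priors
        "rejected": orig_report,   # original report mentioning priors
    }
```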



Release Notes

This version of ReXPref-Prior is almost identical to the one used in our paper [8]. The only difference is that we have fixed a minor formatting error in how sentences were split, which had previously affected 49 rows.


Ethics

The benefits of our work include providing model-agnostic methods and data to reduce hallucinated prior exams. As far as we can tell, our project has no significant risks.

The study involved only secondary analysis of previously collected data available on PhysioNet.



Acknowledgements

Many thanks to Julian Acosta for valuable discussions on this dataset. We also thank the Microsoft Accelerating Foundation Models Research Grant for providing Azure compute credits.



Conflicts of Interest

The authors have no conflicts of interest to declare.



References

  1. Hyland, S. L., Bannur, S., Bouzid, K., Castro, D. C., Ranjit, M., Schwaighofer, A., Pérez-García, F., Salvatelli, V., Srivastav, S., Thieme, A., Codella, N., Lungren, M. P., Wetscherek, M. T., Oktay, O., & Alvarez-Valle, J. (2023). MAIRA-1: A specialised large multimodal model for radiology report generation. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2311.13668
  2. Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P., Fonseca, E. K. U., Lee, H. M. H., Abad, Z. S. H., Ng, A. Y., Langlotz, C. P., Venugopal, V. K., & Rajpurkar, P. (2023). Evaluating progress in automatic chest X-ray radiology report generation. Patterns, 0(0). https://doi.org/10.1016/j.patter.2023.100802
  3. Ramesh, V., Chi, N. A., & Rajpurkar, P. (2022). Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors. In A. Parziale, M. Agrawal, S. Joshi, I. Y. Chen, S. Tang, L. Oala, & A. Subbaswamy (Eds.), Proceedings of the 2nd Machine Learning for Health symposium (Vol. 193, pp. 456–473). PMLR.
  4. Miura, Y., Zhang, Y., Tsai, E. B., Langlotz, C. P., & Jurafsky, D. (2020). Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2010.10042
  5. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36. https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html
  6. Johnson, A. E. W., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C.-Y., Mark, R. G., & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1), 317.
  7. Johnson, A., Lungren, M., Peng, Y., Lu, Z., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.1.0). PhysioNet.
  8. Banerjee, O., Zhou, H.-Y., Adithan, S., Kwak, S., Wu, K., & Rajpurkar, P. (2024). Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2406.06496

Parent Projects
ReXPref-Prior was derived from MIMIC-CXR. Please cite the parent project when using this dataset.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

