Database Credentialed Access
FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark
Mingjie Li, Wenjia Cai, Rui Liu, Yuetian Weng, Tengfei Liu, Cong Wang, Xin Chen, Zhong Liu, Caineng Pan, Mengke Li, Yingfeng Zheng, Yizhi Liu, Flora Salim, Karin Verspoor, Xiaodan Liang, Xiaojun Chang
Published: Jan. 21, 2025. Version: 1.1.0
When using this resource, please cite:
Li, M., Cai, W., Liu, R., Weng, Y., Liu, T., Wang, C., Chen, X., Liu, Z., Pan, C., Li, M., Zheng, Y., Liu, Y., Salim, F., Verspoor, K., Liang, X., & Chang, X. (2025). FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark (version 1.1.0). PhysioNet. https://doi.org/10.13026/f7w3-gm74.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Automatic medical report generation (MRG), which aims to describe life-threatening lesions in medical images such as chest X-rays and Fundus Fluorescein Angiography (FFA), has been a long-standing research topic in machine learning and automatic medical diagnosis. However, existing MRG benchmarks provide only medical images and free-text reports, without explainable annotations or reliable evaluation tools, which hinders current research in two ways. First, existing methods can only predict reports without accurate explanations, undermining the trustworthiness of the diagnostic methods. Second, comparisons between reports predicted by MRG methods are unreliable when based on natural language generation (NLG) metrics alone.
To address these issues, we propose an explainable and reliable MRG benchmark based on FFA Images and Reports (FFA-IR). The FFA-IR dataset has the following features: 1) Large-scale medical data: FFA-IR collects 766 reports along with 47,247 FFA images from clinical practice. 2) Explainable annotation: FFA-IR annotates 46 categories of lesions, with a total of 12,166 regions. 3) Bilingual reports: FFA-IR provides both English and Chinese reports for each case. We hope that FFA-IR can significantly advance research in both the vision-and-language and medical fields and improve the conventional retinal disease diagnosis procedure.
Background
The World Health Organization (WHO) estimates that 2.2 billion people suffer from visual impairment, and about 500 million of these cases are caused by specific retinal diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR) [1]. Fundus Fluorescein Angiography (FFA) and Color Fundus Photography (CFP) are among the most common and essential examinations for the differentiation, diagnosis, treatment, and prognosis of fundus diseases. Compared with CFP, FFA is a costly, invasive, and complex procedure, but it has a high confirmation rate. Because some patients may be allergic to fluorescein, FFA is not suitable for large-scale screening, and collecting large-scale FFA images and reports is therefore arduous and expensive. Furthermore, FFA is a dynamic imaging procedure: as sodium fluorescein flows through the blood into the fundus vessels, the examination passes through five phases: prearterial, arterial, arteriovenous, venous, and late.
At each phase, ophthalmologists identify different diseases based on the morphology of different lesions, for example, the nature of new blood vessels in different areas as revealed by the fluorescein leakage pattern, or the scope and size of the retinal non-perfusion area. After browsing all the FFA images, ophthalmologists select a set of typical images based on their observations and write the diagnostic report. Reading dozens of FFA images and making a diagnosis is therefore laborious. A practical, interpretable, and reliable MRG model can assist ophthalmologists in understanding these images and improve the conventional retinal disease diagnosis procedure.
Methods
Data were collected from retrospective cohorts of Bright Eye Hospital in Wuhan, China, from patients admitted between November 2016 and December 2021. Institutional review board (MR-42-23-023506) and ethics committee approval were obtained from Bright Eye Hospital. The study followed the tenets of the Declaration of Helsinki [2]. All angiography images and reports were anonymized and de-identified before analysis.
Over the study period, the hospital system recorded reports, containing findings, impressions, and clinical information, along with DICOM files in which clinical information and the pixel values of the FFA images are stored. Reports and FFA images were excluded from the final dataset where:
- Reports and FFA images could not be matched by the case ID;
- Pixel values were missing upon conversion of the DICOM files to JPG format;
- Reports were incomplete, with key information such as findings or impressions missing.
The final FFA-IR dataset comprises 766 reports and 47,247 FFA images. Note that the clinical reports were originally written in Chinese. To facilitate reuse of the dataset by a broader, non-Chinese-speaking audience, we used the DeepL Translator to automatically translate the reports into English [3]. We then invited seven bilingual ophthalmologists to proofread the automatically translated reports. These ophthalmologists were also invited to label lesion regions on the FFA images, linked to the reports, to explain the diagnostic procedure. They were divided into an annotation team and a validation team: the annotation team first annotated all the data, and the validation team then checked the annotations to reduce human error.
Data Description
Explainable annotation: The generated medical reports aim to describe the size, location, and period of detected lesions instead of predicting the disease category.
We invited ophthalmologists to annotate lesion information on the FFA images to explain the diagnostic procedure; the accuracy of a model's explanations can then be evaluated against these annotations. FFA-IR contains 46 kinds of retinal lesions, such as cystoid macular edema (CME) and diabetic macular edema (DME), covering most typical retinal lesions. The ophthalmologists annotated each lesion with its minimum enclosing rectangle and its lesion category. All lesions in one FFA image are recorded in a dictionary: each key combines the case ID and the image name, and each value is a list in which every element contains a lesion's category and positional information. These lesions are also described in the accompanying reports, so the annotations can serve as prior medical knowledge connecting visual groundings and linguistic information. Thanks to these annotations, the explainability of models can also be evaluated.
Bilingual reports: To serve a broader research community, we translated the reports into English and provide both languages simultaneously. As manual translation of all reports would be laborious, we first employed the DeepL Translator to translate them automatically and then invited bilingual ophthalmologists to proofread them. Because Chinese text is not whitespace-delimited, we also provide a vocabulary of medical nomenclature to help researchers tokenize the Chinese reports. With its bilingual reports, FFA-IR is the first benchmark on which the influence of different languages can be evaluated both qualitatively and quantitatively.
Moreover, FFA-IR can also facilitate the development of multi-modal machine translation models.
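The provided medical vocabulary can be plugged into a standard dictionary-based segmenter. The sketch below shows a minimal greedy longest-match tokenizer; the in-line vocabulary is a toy example for illustration only, and in practice one would load the vocabulary file shipped with FFA-IR.

```python
# Greedy longest-match (maximum matching) tokenization sketch for Chinese
# reports. The vocabulary here is a tiny illustrative stand-in for the
# medical-nomenclature vocabulary distributed with FFA-IR.
vocab = {"黄斑", "黄斑水肿", "荧光素", "渗漏", "可见"}
max_len = max(len(w) for w in vocab)

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary match first, then shorter spans;
        # fall back to a single character if nothing matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("可见黄斑水肿荧光素渗漏"))
# → ['可见', '黄斑水肿', '荧光素', '渗漏']
```

Note that longest-match segmentation correctly prefers the compound term "黄斑水肿" (macular edema) over the shorter entry "黄斑" (macula), which is exactly why a domain vocabulary helps here.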
File Overview
The dataset consists of the following:
- FFAIR: A directory containing all the images.
- report.json: This file contains all the annotations, including case_id, Chinese reports, and English reports.
- lesion_info.json: This file contains all the lesion information. For each key-value pair, the key is the patient_id/image_id and the value lists all lesions in that image, with category and position information.
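The lesion records can be iterated with a few lines of Python. The snippet below works on a small in-line sample whose field names ("category", "position") and bounding-box layout are assumptions based on the description above; check lesion_info.json itself for the exact keys used in this release.

```python
import json

# Hypothetical sample mirroring the described structure of lesion_info.json:
# keys are "patient_id/image_id", values are lists of lesion entries.
# Field names and box format are assumptions, not taken from the release.
sample = json.loads("""
{
  "case_0001/img_01.jpg": [
    {"category": 3, "position": [120, 88, 310, 240]},
    {"category": 17, "position": [40, 52, 95, 130]}
  ]
}
""")

def iter_lesions(lesion_info):
    """Yield (patient_id, image_id, category, box) tuples."""
    for key, entries in lesion_info.items():
        patient_id, image_id = key.split("/", 1)
        for e in entries:
            yield patient_id, image_id, e["category"], e["position"]

rows = list(iter_lesions(sample))
print(rows[0])  # → ('case_0001', 'img_01.jpg', 3, [120, 88, 310, 240])
```

For the real dataset, one would replace the in-line sample with `json.load(open("lesion_info.json", encoding="utf-8"))`.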
Usage Notes
All data were collected from retrospective cohorts of Bright Eye Hospital in Wuhan, China. Institutional Review Board approval was obtained, the study followed the tenets of the Declaration of Helsinki [2], and all angiography images and reports were anonymized and de-identified before analysis.
FFA-IR, with its lesion annotations and bilingual reports, can be used across various medical image analysis domains. We highlight three use cases. The first is to develop an explainable and reliable MRG model that describes lesions in Chinese or English reports and diagnoses retinal diseases. Second, since FFA is a dynamic imaging procedure, we encourage exploring temporal information or interactions to improve lesion detection or disease classification. The last is to develop a multi-modal machine translation model; researchers are welcome to investigate whether medical images can help align source and target sentences in the latent space. Biases exist due to unbalanced distributions across attributes such as gender and age. As recommended by Jain et al. [4], researchers should audit performance disparities across these attributes when developing clinical models.
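Such an audit can be as simple as comparing a metric per demographic group. The sketch below uses synthetic placeholder scores (not FFA-IR data) grouped by a hypothetical attribute, and reports the largest gap between group means.

```python
# Minimal subgroup-audit sketch: per-group mean of some evaluation score
# (e.g. per-case BLEU or detection AP) and the worst-case gap between
# groups. All values below are synthetic placeholders.
from collections import defaultdict

records = [
    ("female", 0.41), ("female", 0.39),
    ("male", 0.35), ("male", 0.31),
]

by_group = defaultdict(list)
for group, score in records:
    by_group[group].append(score)

means = {g: sum(s) / len(s) for g, s in by_group.items()}
gap = max(means.values()) - min(means.values())
print(means, round(gap, 3))
```

A large gap flags a subgroup on which the model should be inspected or retrained before clinical use.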
Limitations
The FFA-IR dataset has a number of limitations. First, data were collected only from Bright Eye Hospital in Wuhan, China. Second, as the original reports come from clinical practice, the varied writing styles of different ophthalmologists can be observed in FFA-IR, which affects automatic metrics. Third, several rare lesions are still not captured in FFA-IR. Fourth, FFA-IR suffers from data bias due to the naturally unbalanced distribution of pathologies. Fifth, training models on FFA-IR may require considerable GPU resources, as models may have to read dozens of images per case (about 60 on average, given 47,247 images across 766 cases).
Release Notes
We are excited to announce a comprehensive update to our dataset. The entire dataset has been refreshed with new data sourced from the Bright Eye Hospital in Wuhan, China. This update enhances the quality and relevance of the data, ensuring it meets the highest standards for research and development.
Key Updates:
- Complete Data Refresh: The dataset now exclusively contains updated data from Bright Eye Hospital, providing more accurate and recent information.
- Enhanced Image Descriptions: Detailed descriptions of the images have been included, offering improved context and clarity for analysis.
- New Institutional Review Board (IRB) Approval: The dataset is now accompanied by a newly approved IRB protocol, ensuring ethical and compliant use of the data.
We believe these enhancements will significantly contribute to the quality and reliability of retinal vision and language research.
Ethics
Data were collected from retrospective cohorts of Bright Eye Hospital in Wuhan, China, from patients admitted between November 2016 and December 2021. Institutional review board (MR-42-23-023506) and ethics committee approval were obtained from Bright Eye Hospital.
Acknowledgements
This work is partially supported by the Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) under DE190100626. We would like to acknowledge Bright Eye Hospital in Wuhan, China, for their data collection support. We would like to acknowledge VoTT, Labelme, and RectLabel for providing labeling platforms, and DeepL for their automatic translation support.
Conflicts of Interest
No conflicts of interest to declare.
References
- Pizzarello L, Abiose A, Ffytche T, Duerksen R, Thulasiraj R, Taylor H, et al. Vision 2020: The right to sight: a global initiative to eliminate avoidable blindness. Arch Ophthalmol. 2004;122(4):615–20.
- World Medical Association. World Medical Association Declaration of Helsinki: Ethical principles for medical research involving human subjects. Bull World Health Organ. 2001;79(4):373.
- DeepL. AI Assistance for Language [Internet]. Available from: https://www.deepl.com/ [cited 2021 Aug 27].
- Jain S, Agrawal A, Saporta A, Truong SQ, Duong DN, Bui T, et al. RadGraph: Extracting clinical entities and relations from radiology reports. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Round 1; 2021.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.1.0):
https://doi.org/10.13026/f7w3-gm74
DOI (latest version):
https://doi.org/10.13026/k5rp-9h43
Topics:
fundus fluorescein angiography
medical report generation
vision and language
explainable and reliable evaluation
Corresponding Author
Files
- be a credentialed user
- complete required training:
- CITI Data or Specimens Only Research
- sign the data use agreement for the project