Database Credentialed Access
FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark
Mingjie Li , Wenjia Cai , Rui Liu , Yuetian Weng , Xiaoyun Zhao , Cong Wang , Xin Chen , Zhong Liu , Caineng Pan , Mengke Li , Yingfeng Zheng , Yizhi Liu , Flora Salim , Karin Verspoor , Xiaodan Liang , Xiaojun Chang
Published: Sept. 21, 2021. Version: 1.0.0
FFA-IR dataset is unavailable until further notice (Sept. 6, 2023, 3:48 p.m.)
The authors of the FFA-IR dataset have asked for downloads to be disabled until further notice to comply with local policy changes. We apologize for the inconvenience and hope to make the files available again in the future.
When using this resource, please cite:
Li, M., Cai, W., Liu, R., Weng, Y., Zhao, X., Wang, C., Chen, X., Liu, Z., Pan, C., Li, M., Zheng, Y., Liu, Y., Salim, F., Verspoor, K., Liang, X., & Chang, X. (2021). FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark (version 1.0.0). PhysioNet. https://doi.org/10.13026/ccbh-z832.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Automatic medical report generation (MRG), which aims to describe life-threatening lesions in medical images such as chest X-rays and Fundus Fluorescein Angiography (FFA), has been a long-standing research topic in machine learning and automatic medical diagnosis. However, existing MRG benchmarks only provide medical images and free-text reports, without explainable annotations or reliable evaluation tools, which hinders research progress in two ways: First, existing methods can only predict reports without accurate explanations, undermining the trustworthiness of the diagnostic methods; second, comparisons between reports predicted by MRG methods are unreliable when based on natural language generation (NLG) metrics.
To address these issues, we propose an explainable and reliable MRG benchmark based on FFA Images and Reports (FFA-IR). Our FFA-IR dataset is distinguished by the following features: 1) Large-scale medical data. FFA-IR collects 10,790 reports along with 1,048,584 FFA images from clinical practice. 2) Explainable annotation. FFA-IR annotates 46 categories of lesions with a total of 12,166 regions. 3) Bilingual reports. FFA-IR provides both English and Chinese reports for each case. We hope that FFA-IR can significantly advance research in both the vision-and-language and medical fields and improve the conventional retinal disease diagnosis procedure.
Background
The World Health Organization (WHO) estimates that 2.2 billion people suffer from visual impairment, and 500 million of these cases are caused by specific retinal diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR) [1]. Fundus Fluorescein Angiography (FFA) and Color Fundus Photography (CFP) are among the most common and essential examination methods for the differentiation, diagnosis, treatment, and prognosis of fundus ophthalmic diseases. Compared with CFP, FFA is a high-cost, invasive, and complex procedure, but it offers a high confirmation rate. Because some patients may be allergic to fluorescein, FFA is not suitable for large-scale screening, which makes collecting large-scale FFA images and reports arduous and expensive. Furthermore, FFA is a dynamic imaging procedure: as sodium fluorescein flows through the blood into the fundus vessels, the whole examination can be divided into five phases: Prearterial, Arterial, Arteriovenous, Venous, and Late.
In each phase, ophthalmologists identify diseases based on the morphology of different lesions; for example, they infer the nature of new blood vessels in different areas from the fluorescein leakage pattern, along with the scope and size of retinal non-perfusion areas. After browsing all the FFA images, ophthalmologists select a set of representative images according to their observations and write the diagnosis report. Reading dozens of FFA images and making a diagnosis from them is therefore laborious. A practical, interpretable, and reliable MRG model could assist ophthalmologists in understanding these images and improve the conventional retinal disease diagnosis procedure.
Methods
Data were collected from retrospective cohorts of Zhongshan Ophthalmic Center of Sun Yat-sen University in Guangzhou, China, from patients admitted between November 2016 and December 2019. Institutional review board (No. 2021KYPJ039) and ethics committee approval were obtained at Zhongshan Ophthalmic Center, Sun Yat-sen University. The study followed the tenets of the Declaration of Helsinki [2]. All angiography images and reports were anonymized and de-identified before the analysis.
Over the study period, the hospital system recorded a total of 15,232 reports, containing findings, impressions, and clinical information, along with 1,716,825 DICOM files in which the clinical information and pixel values of an FFA image are stored. Reports and FFA images were excluded from the final dataset where:
- Reports and FFA images could not be matched by the case ID;
- Pixel values were missing upon conversion of the DICOM files to JPG format;
- Reports were incomplete, with key information such as findings or impressions missing.
The final FFA-IR dataset comprised 10,790 reports and 1,048,584 FFA images. It should be noted that the clinical reports were written in Chinese. To facilitate reuse of the dataset by a broader, non-Chinese-speaking audience, we employed the "DeepL Translator" software to automatically translate the reports into English [3]. We then invited seven bilingual ophthalmologists to proofread the automatically translated reports. These ophthalmologists were also invited to label lesion regions along with the reports and FFA images to explain the diagnosis procedure. They were divided into two groups, an annotation team and a validation team: the annotation team first annotated all the data, and the validation team then checked the annotations to reduce human error.
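The translation step could, in principle, be reproduced programmatically. Below is a minimal sketch using DeepL's official Python client (pip install deepl); note that the authors used the DeepL Translator software, so this client, the input file name, and its structure are illustrative assumptions.

    import json

    import deepl

    translator = deepl.Translator("YOUR_AUTH_KEY")  # placeholder credential

    # Hypothetical input: a mapping from case ID to the Chinese report text.
    with open("reports_zh.json", encoding="utf-8") as f:
        reports = json.load(f)

    drafts = {}
    for case_id, report_zh in reports.items():
        result = translator.translate_text(report_zh, source_lang="ZH", target_lang="EN-US")
        drafts[case_id] = result.text  # drafts were then proofread by bilingual ophthalmologists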
Data Description
Explainable annotation: The generated medical reports aim to describe the size, location, and period of detected lesions instead of predicting the disease category.
We invited ophthalmologists to annotate lesion information on the FFA images to explain the diagnosis procedure, which makes it possible to evaluate the accuracy of a model's explanations. FFA-IR contains 46 kinds of retinal lesions, such as Cystoid Macular Edema (CME) and Diabetic Macular Edema (DME), covering most typical retinal lesions. The ophthalmologists annotated each lesion with its minimum enclosing rectangle and its lesion category. All the lesions in one FFA image are recorded in a dict: the key is the combination of the case ID and the image name, while the value is a list whose elements contain the category and positional information of one lesion. Furthermore, these lesions are described in the given reports, which can be regarded as prior medical knowledge connecting the visual groundings and the linguistic information. Thanks to these annotations, the explainability of models can also be evaluated, as sketched below.
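For instance, a model's visual explanation could be scored against the annotated minimum enclosing rectangles with intersection-over-union (IoU). The [x_min, y_min, x_max, y_max] box format in this sketch is an assumption; the dataset does not prescribe a specific evaluation script here.

    def iou(box_a, box_b):
        """IoU of two boxes given as [x_min, y_min, x_max, y_max]."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter) if inter else 0.0

    # e.g. a predicted lesion region vs. an annotated one
    print(iou([10, 10, 50, 50], [30, 30, 70, 70]))  # ~0.143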
Bilingual reports: To make the dataset accessible to more researchers, we translated the reports into English and provide bilingual reports for each case. As manually translating more than ten thousand reports is laborious, we first employed DeepL Translator to translate all the reports automatically and then invited bilingual ophthalmologists to proofread them. Because of the particularities of the Chinese language, we also provide a vocabulary of medical nomenclature to help researchers tokenize the Chinese reports; a brief tokenization sketch appears below. With its bilingual reports, FFA-IR is the first benchmark on which the influence of different languages can be evaluated both qualitatively and quantitatively.
Moreover, FFA-IR can also facilitate the development of multi-modal machine translation models.
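A sketch of how the provided vocabulary could be plugged into a Chinese segmenter follows, here using jieba (pip install jieba); jieba is our choice for illustration, not the authors', and "medical_vocab.txt" is a placeholder for the vocabulary file shipped with the dataset.

    import jieba

    # jieba.load_userdict expects one term per line (frequency and tag optional),
    # so the dataset's medical nomenclature may need light reformatting first.
    jieba.load_userdict("medical_vocab.txt")

    report_zh = "双眼视网膜可见毛细血管无灌注区"  # example sentence, not taken from the dataset
    tokens = jieba.lcut(report_zh)
    print(tokens)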
File Overview
The dataset consists of the following:
- FFAIR: A directory containing all the compressed image files, named fair.tar.gz.*. To extract the files, first download everything in FFAIR, then run "cat FAIR.tar.gz.* | tar -zxv". Each extracted directory is named after a case ID and contains all the FFA images for that case.
- ffair_annotation.json: This file contains all the annotations, including the case_id, Chinese report, English report, impressions, findings, split, the names of the FFA images, and the gender and age of the patient for each case.
- lesion_info.json: This file contains all the lesion information. For each key-value pair, the key is case_id/image_name, and the value lists all the lesions in that image, each with its category and position information.
- lesion_dict.json: This file contains a dict explaining the lesion indices used in lesion_info.json. For each key-value pair, the key is the index of a lesion and the value provides both the Chinese and English names of that lesion. A minimal loading sketch for these files follows.
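The sketch below shows one way these files could be loaded and linked, assuming the structures described above; the position of the category index inside each lesion entry is an assumption, so inspect the files before relying on it.

    import json

    with open("ffair_annotation.json", encoding="utf-8") as f:
        annotations = json.load(f)

    with open("lesion_info.json", encoding="utf-8") as f:
        lesion_info = json.load(f)

    with open("lesion_dict.json", encoding="utf-8") as f:
        lesion_dict = json.load(f)  # lesion index -> Chinese and English names

    # Link one image's lesions back to their human-readable names.
    key = next(iter(lesion_info))  # "case_id/image_name"
    case_id, image_name = key.split("/", 1)
    for lesion in lesion_info[key]:
        category = lesion[0]  # assumed: the first element is the lesion index
        print(case_id, image_name, lesion_dict[str(category)])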
Usage Notes
All of these data were collected from retrospective cohorts of Zhongshan Ophthalmic Center of Sun Yat-sen University in Guangzhou, China. Institutional Review Board approvals were obtained. The study followed the tenets of the Declaration of Helsinki [2]. All angiography images and reports were anonymized and de-identified before the analysis.
FFA-IR, with its lesion annotations and bilingual reports, can be used in various medical image analysis domains. We highly recommend three use cases. The first is to develop an explainable and reliable MRG model that describes lesions in Chinese or English reports and diagnoses retinal diseases. Second, since FFA is a dynamic imaging procedure, we encourage exploring temporal information or interactions to improve lesion detection or disease classification. The last is to develop multi-modal machine translation models; it is worth investigating whether medical images can help align the source and target sentences in the latent space. Note that prior errors exist due to unbalanced distributions across attributes such as gender and age. As recommended by Jain et al. [4], researchers should audit performance disparities across these attributes when developing clinical models; a minimal sketch of such an audit follows.
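One simple form this audit could take is stratifying a per-case evaluation metric by a patient attribute stored in ffair_annotation.json. The "gender" field name and the score dictionary here are illustrative assumptions.

    from collections import defaultdict

    def audit_by_attribute(cases, scores, attribute="gender"):
        """Group per-case scores by a patient attribute and report each group's mean."""
        groups = defaultdict(list)
        for case_id, score in scores.items():
            groups[cases[case_id][attribute]].append(score)
        return {group: sum(vals) / len(vals) for group, vals in groups.items()}

    # e.g. scores = {"case_001": 0.41, ...} from an MRG evaluation run
    # print(audit_by_attribute(annotations, scores, attribute="gender"))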
Limitations
The FFA-IR dataset has a number of limitations. First, the data were collected only from Zhongshan Ophthalmic Center of Sun Yat-sen University in Guangzhou, China. Second, because the original reports were collected from clinical practice, the writing patterns of different ophthalmologists vary across FFA-IR, which affects the automatic metrics. Third, several rare lesions are still not captured in FFA-IR. Fourth, FFA-IR suffers from data bias due to the naturally unbalanced distribution of pathologies. Fifth, training models on FFA-IR may require considerable GPU resources, as models may have to read more than 80 images per case on average; one simple mitigation is sketched below.
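One way to bound the per-case cost is to read at most k images per case, sampled at random but kept in sorted order (FFA is a dynamic procedure, so order carries temporal information). A plain-Python sketch under the file layout from the File Overview section:

    import os
    import random

    def sample_case_images(case_dir, k=32, seed=0):
        """Return at most k image paths from one case directory, in sorted order."""
        images = sorted(os.listdir(case_dir))
        if len(images) > k:
            rng = random.Random(seed)
            images = sorted(rng.sample(images, k))
        return [os.path.join(case_dir, name) for name in images]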
Acknowledgements
This work is partially supported by the Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) under DE190100626, and the National Natural Science Foundation of China (82171034). We would like to acknowledge Prof. Feng Wen from the Fundus Department, Zhongshan Ophthalmic Center, Sun Yat-sen University for his support with data collection. We would like to acknowledge VoTT, Labelme, and RectLabel for providing the labeling platforms, and DeepL for their automatic translation support.
Conflicts of Interest
No conflicts of interest to declare.
References
- Pizzarello, L., Abiose, A., Ffytche, T., Duerksen, R., Thulasiraj, R., Taylor, H., Faal, H., Rao, G., Kocur, I., Resnikoff, S.: VISION 2020: The Right to Sight: a global initiative to eliminate avoidable blindness. Archives of Ophthalmology 122(4), 615–620 (2004)
- World Medical Association: World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. Bulletin of the World Health Organization 79(4), 373 (2001)
- DeepL: AI Assistance for Language. https://www.deepl.com/ [Accessed: 27 August 2021]
- Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., Langlotz, C., Rajpurkar, P.: RadGraph: Extracting clinical entities and relations from radiology reports. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Round 1 (2021)
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/ccbh-z832
DOI (latest version):
https://doi.org/10.13026/k5rp-9h43
Topics:
fundus fluorescein angiography
explainable and reliable evaluation
vision and language
medical report generation