Database Restricted Access
CXRGraph: Using Information Extraction to Normalize the Training Data for Automatic Radiology Report Generation
Yuxiang Liao, Hoisang Heung, Hantao Liu, Irena Spasic
Published: Feb. 3, 2025. Version: 1.0.0
When using this resource, please cite:
Liao, Y., Heung, H., Liu, H., & Spasic, I. (2025). CXRGraph: Using Information Extraction to Normalize the Training Data for Automatic Radiology Report Generation (version 1.0.0). PhysioNet. https://doi.org/10.13026/p7kf-t860.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
CXRGraph is a dataset of structured radiology reports following the RadGraph format, tailored for the Automatic Radiology Report Generation (ARRG) task. CXRGraph organizes clinical information from full-text radiology reports into five entity types and four relation types, similar to RadGraph. CXRGraph introduces three entity attributes, which can optionally be associated with an entity to provide additional information (e.g. abnormality) and to handle hallucinated prior references in the ARRG task. We manually annotated the reports originally formatted in RadGraph, including 550 MIMIC-CXR reports for model training and evaluation and 50 CheXpert reports for evaluating model generalization. Using the ground-truth data, we developed a joint entity and relation model, achieving micro-F1 scores of 96.6% and 96.1% on named entity recognition, 94.0% and 89.8% on entity attribute recognition, and 89.5% and 86.6% on relation extraction, on the MIMIC-CXR and CheXpert test sets, respectively. Using the trained model, we automatically annotated 227,835 MIMIC-CXR reports. Both the ground-truth and inference data are available in CXRGraph. Given that MIMIC-CXR and RadGraph have already been de-identified, no protected health information (PHI) is included.
Background
Narrative radiology reports vary excessively in their language, length, and style, thereby limiting their utility in clinical research and other downstream applications. This issue has given rise to the idea of automatically structuring radiology reports, which focuses on extracting key medical information from free text, typically through named entity recognition (NER) and relation extraction (RE). Given a text document, NER identifies text sequences that correspond to predefined entity types, while RE assigns predefined relation types to pairs of such entities. Completing this task involves efforts in three aspects: data, scheme, and model. The data defines the scope for model training and application, the scheme comprises predefined entity and relation types to guide human annotators in data annotation, and the model is trained using annotated data to make predictions.
Automatic Radiology Report Generation (ARRG) focuses on utilizing deep learning methods to generate reports from radiology images, with an emphasis on recognizing normal and abnormal appearances and describing them accurately and comprehensively. Our recent review of ARRG suggested that structured reports may alleviate the inherent diversity of natural language, thus contributing to more accurate results in model training and evaluation [1]. Although numerous studies have investigated extracting information from radiology reports [2-7], they may require specific adaptations for ARRG. For instance, radiology reports commonly include comparisons of observations between the current and previous examinations in the findings and impression sections. However, the de-identification processes applied to existing open-source large-scale datasets such as MIMIC-CXR [8] make it impossible to identify the corresponding prior references, leading to inconsistencies between the reported observations and those observed in the current image. Using such inconsistent data to train ARRG models will inevitably produce hallucinations [9].
To better adapt to the ARRG task, we propose an extension of the RadGraph scheme [2], which was originally designed to capture the most clinically relevant information within the report in a consistent manner. Our extended scheme, referred to as CXRGraph, introduces three entity attributes: one indicates the normality of observations, while the other two indicate the interval change of observations that refer to priors and the action needed to convert these relative observations into direct observations. We define an entity attribute as a value that is chosen from a set of predefined values and bound to an entity. Additionally, we introduced a new entity type and a new relation type to address observed inconsistencies and disagreements in the annotated results within RadGraph. More details are discussed in the Methods section.
Methods
Annotation Scheme
Our scheme adopted the definitions of entity and relation from RadGraph, including four entity types: Anatomy, Observation-Present, Observation-Absent, and Observation-Uncertain, and three relation types: Suggestive-Of, Located-At, and Modify. Additionally, we introduced a new entity type: Location-Attribute, a new relation type: Part-Of, and three entity attributes: Normality, Action, and Change.
An entity is a continuous span of text.
- Anatomy (ANAT) refers to an anatomical body part corresponding to a specific observation, such as "lung".
- Observation (OBS) refers to an identifiable pathophysiologic process, a diagnostic disease, or an observed feature. An Observation is considered Observation-Present by default unless a certainty or negation modifier is seen in its context. For example, the pneumothorax in "there is no pneumothorax" is regarded as Observation-Absent; the pneumonia in "possible left lower lobe pneumonia" is regarded as Observation-Uncertain.
- Location-Attribute (LOC-ATT) serves as a complement to the Located-At relation, indicating a special spatial relation between Observation and Anatomy, such as "terminates 2.3 cm above". We form it as (OBS, Located-At, LOC-ATT, Located-At, ANAT).
A relation is a directed link pointing from a subject entity to an object entity. The scope of permitted use of a relation is denoted in the format: (subject entity type, relation type, object entity type).
- Suggestive-Of (OBS, Suggestive-Of, OBS) indicates that the presence of the subject Observation can infer, or is secondary to, that of the object Observation.
- Located-At (OBS, Located-At, ANAT) indicates spatial or other relations between Observation and Anatomy.
- Modify (OBS, Modify, OBS) or (ANAT, Modify, ANAT) indicates that the subject entity modifies the object entity regarding size, change, degree, scope, etc. Accordingly, the modifier shares the same entity type as the entity it modifies.
- Part-Of (OBS, Part-Of, OBS) or (ANAT, Part-Of, ANAT), which extends from Modify, indicates that the subject entity is part of the object entity (e.g. vascular, Part-Of, pulmonary) or is the property of the object entity (e.g. silhouette, Part-Of, hilar). Consequently, the object entity of the Located-At relation in CXRGraph may differ from RadGraph when the Part-Of relation appears.
An entity attribute is a value chosen from a set of predefined values and can optionally be bound to an entity.
- Normality has two values: Normal and Abnormal, which are optional on Observation-Present and Observation-Absent, indicating whether a group of related entities describes a normal or an abnormal finding. By default, Observation-Present is considered to describe an abnormal observation while Observation-Absent describes a normal observation. Hence, no Normality value is assigned in these default cases due to their disproportionately large occurrence. When a non-default condition arises, a Normality value is assigned to the corresponding modifier. For example, the clear in "lungs are clear" is annotated as Observation-Present with Normal.
- Action is optional on an Observation entity. It indicates how we would convert a group of Observation entities into direct observation. It contains two values: Removable and Essential.
- Removable means that the entity with this attribute, along with relevant entities pointing to it, can be directly removed without altering the original meaning of this group of entities.
- Essential indicates that the entity cannot be directly removed; otherwise, the meaning of this Observation group would change.
- Change is optional on an Observation entity. It indicates the progression of an observation over the interval of time since a specific prior examination, having three values: Positive, Negative, and Unchanged. It is typically assigned to a change modifier along with an Action value. For example, the stable in "cardiomediastinal contours are stable" is annotated as Observation-Present with Essential and Unchanged, whereas in "the heart size is enlarged, but stable", stable is annotated as Observation-Present with Removable and Unchanged.
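As an illustrative sketch of how the Action attribute supports normalization (the dict-and-tuple representation below is hypothetical, not the released file format), dropping entities marked Removable, together with any relations touching them, converts a relative observation into a direct one. The example encodes "the heart size is enlarged, but stable" from the scheme description above:

```python
# Hypothetical in-memory representation: entities as dicts, relations as
# (subject_index, relation_type, object_index) triples.
entities = [
    {"text": "heart", "type": "Anatomy"},
    {"text": "enlarged", "type": "Observation-Present"},
    {"text": "stable", "type": "Observation-Present",
     "action": "Removable", "change": "Unchanged"},
]
relations = [(1, "Located-At", 0), (2, "Modify", 1)]

# Drop Removable entities and any relation that touches them, leaving a
# direct observation ("heart enlarged") without the prior reference.
keep = {i for i, e in enumerate(entities) if e.get("action") != "Removable"}
direct = [entities[i]["text"] for i in sorted(keep)]
kept_relations = [r for r in relations if r[0] in keep and r[2] in keep]
print(direct)  # ['heart', 'enlarged']
```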
Data Annotation
We manually annotated 600 radiology reports taken from RadGraph, including 425, 75, and 50 MIMIC-CXR reports used for training, validation, and testing, respectively. The test data also included 50 CheXpert reports. The patients associated with the reports do not overlap across the training/validation/test splits. Annotation was performed using the Brat Rapid Annotation Tool [10]. Two annotators first independently annotated the 50 MIMIC-CXR test reports by adapting the original RadGraph annotations. Inter-annotator agreement (IAA) was measured using the F1-score as suggested by Grouin et al. [11], resulting in 97.2% for entity types, 87.3% for entity attributes, and 91.0% for relation types. However, we still noticed a nontrivial inconsistency in the original RadGraph annotations. Given that only 45 pilot reports were used to train the annotators and refine the RadGraph scheme, we attribute this inconsistency primarily to the difficulty of designing an annotation scheme that adequately covers the diversity of natural language expressions; different annotators may interpret differently how the scheme should be applied to uncommon samples. We therefore asked two annotators to jointly annotate the whole dataset, resolve any disagreements through discussion, and subsequently improve the scheme. The annotated data were revised whenever unseen patterns of disagreement emerged. After converting CXRGraph into a RadGraph-compatible form, the overall IAA between CXRGraph and RadGraph is 92.7% on entity types and 77.3% on relation types.
Similar to RadGraph, the annotation scope of reports in CXRGraph is confined to the sections related to findings and impression, as these sections are of primary interest in ARRG [1]. In addition, more granular entities are annotated preferentially; for example, "left lower lobe" is annotated as three Anatomy entities rather than one. Both the MIMIC-CXR and CheXpert reports in this dataset were already de-identified. The protected health information (PHI) in MIMIC-CXR reports was replaced with three consecutive underscores [8], while that in CheXpert reports was replaced with fake PHI [12].
Model Training and Data Inference
With the newly annotated data, we trained a joint entity and relation model, a pipeline approach that utilizes two models to extract entities and relations separately. Our model is designed based on the PURE entity model [13] and the PL-Marker relation model [14]. We initialized our model using pretrained weights from PubMedBERT [15]. Our model achieved adequate performance on MIMIC-CXR reports, with micro-F1 scores of 96.62%, 94.04%, and 89.50% on entity types, entity attributes, and relation types, respectively. Our approach also demonstrated appropriate generalization performance on CheXpert reports, achieving F1 scores of 96.06%, 89.79%, and 86.64% on entity types, entity attributes, and relation types, respectively. Using the joint model, we automatically annotated 227,835 reports from MIMIC-CXR. Both the ground-truth and inference data are included in the released dataset.
We used the micro-F1 score to evaluate the model. For NER, a predicted entity span is considered correct if its boundaries and entity type are correct. Entity attribute recognition was evaluated using the same criteria. For RE, we used strict evaluation, which considers a predicted relation correct only if the boundaries and entity types of both entity spans and the relation type are correct. For the RadGraph dataset, we calculated the average F1 score over its two subsets, MIMIC-CXR and CheXpert.
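For concreteness, the strict matching criteria above can be sketched in a few lines of Python (a minimal illustration, not the evaluation code released with this work), treating each entity as a (start, end, type) tuple and each relation as a (subj_start, subj_end, obj_start, obj_end, type) tuple:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over exact-match annotation tuples.

    Under strict evaluation, a prediction counts as a true positive
    only if every element of the tuple (boundaries and types) matches
    a gold annotation exactly.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model finds one entity exactly and one with wrong boundaries,
# so precision = recall = 0.5 and micro-F1 = 0.5.
gold_ents = [(0, 0, "Anatomy"), (3, 4, "Observation-Present")]
pred_ents = [(0, 0, "Anatomy"), (3, 5, "Observation-Present")]
print(micro_f1(gold_ents, pred_ents))  # 0.5
```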
Data Description
File Tree
cxrgraph
├── manual_data
│ ├── train.json
│ ├── dev.json
│ ├── test.json
│ ├── test_mimic.json
│ └── test_chexpert.json
├── inference.json
├── examples.pdf
├── data_dictionary.md
└── README.md
File Description
- manual_data: This folder contains the manually annotated data, including three JSON files corresponding to the train/dev/test splits and two extra JSON files that split the test set according to the data source (MIMIC-CXR or CheXpert). The dataset splits are identical to those of RadGraph. All data were annotated cooperatively by two annotators to ensure consistency.
- train.json: This file contains 425 MIMIC-CXR reports.
- dev.json: This file contains 75 MIMIC-CXR reports.
- test.json: This file contains 50 MIMIC-CXR reports and 50 CheXpert reports.
- test_mimic.json: This file contains the 50 MIMIC-CXR reports in test.json.
- test_chexpert.json: This file contains the 50 CheXpert reports in test.json.
- Within these JSON files, each line represents a report formatted in JSON with five keys:
- doc_key: A unique identifier for a report. For MIMIC-CXR reports, it is the relative path of the report in the MIMIC-CXR dataset, such as "p10/p10001884/s58196907.txt". For CheXpert reports, it is a number, such as "17".
- sentences: A nested list in which the first dimension denotes the sentences of a report and the second dimension contains the tokens within a sentence. We obtained the report tokens from RadGraph and split the report sentences by the period symbol.
- ner: A nested list in which the first dimension denotes the sentences of a report and the second dimension indicates the entities within a sentence. Each entity is represented as a list containing three elements: the starting and ending indices of the entity span within the report, and the entity type.
- relations: A nested list in which the first dimension denotes the sentences of a report and the second dimension indicates the relations of the subject entities within a sentence. Each relation is represented as a list containing five elements: the starting and ending indices of the subject entity span, the starting and ending indices of the object entity span, and the relation type.
- entity_attributes: A nested list in which the first dimension denotes the sentences of a report and the second dimension indicates the attribute sets of entities within a sentence. Each attribute set is represented as a list containing five elements: the starting and ending indices of the entity span, followed by three attribute values for Normality, Action, and Change. "NA" indicates that the attribute at that position has not been assigned a value. If none of the three attributes is assigned a value, the entity is omitted from the list rather than listed with three "NA" values.
- inference.json: This file contains 227,835 MIMIC-CXR reports annotated by our model. In this file, each line represents a report formatted in JSON with five keys:
- doc_key: A unique identifier for a report. It is the relative path of the report in the MIMIC-CXR dataset where path separators are replaced by underscore symbols, such as "p10_p10001884_s58196907.txt".
- sentences: A nested list in which the first dimension denotes the sentences of a report and the second dimension contains the tokens within a sentence. We obtained the free-text reports from the MIMIC-CXR dataset and tokenized them using the regular expression provided by RadGraph. We split the report sentences by the period symbol.
- pred_ner: See the definition of the key "ner" in manual_data.
- pred_rel: See the definition of the key "relations" in manual_data.
- pred_attr: See the definition of the key "entity_attributes" in manual_data.
- examples.pdf: This file illustrates some example sentences in CXRGraph compared to RadGraph.
- README.md: This file provides an overview of the dataset structure and instructions for accessing the data and model.
- data_dictionary.md: This file provides more detailed explanations of variables in each JSON file.
Usage Notes
The dataset is free to use provided that researchers adhere to the data use agreement. The data have been used to train a joint entity and relation model. Researchers may (1) use our manually annotated data to train models that extract information from radiology reports, or (2) use our trained model or inference data to enhance the quality of textual data in the automatic radiology report generation task. The code and models for this work are available on GitHub [16].
There are several limitations to this work. First, the data were annotated by one cross-disciplinary expert with experience in clinical data processing and one expert with a medical background. Radiology images were not consulted as a reference during the annotation process. When unsure of the label for an entity or relation, we adopted the label from RadGraph. Second, the model's predictions on the entity type Observation-Uncertain (F1=78.3%) and the relation type Suggestive-Of (F1=68.6%) perform worse than those on the overall entity types (F1=96.6%) and relation types (F1=89.5%). We assume this can be attributed to the expression bias of natural language in the original reports; moreover, the annotation inconsistencies in these data are more perceptually noticeable than those elsewhere in RadGraph. However, the long-range context between the subject and object entities of this relation type and the insufficient number of training samples cannot be excluded as contributing factors. Third, as stated in RadGraph, the annotations do not capture the clinical context in a radiology report, such as information included in the Comparison or History section, and the annotations are limited to chest X-ray radiology reports from MIMIC-CXR for the training set and MIMIC-CXR / CheXpert for the test set.
Release Notes
Version 1.0.0
Ethics
Our dataset is constructed using previously publicly available de-identified datasets: MIMIC-CXR and CheXpert. Both projects are IRB-approved. Throughout the dataset development process, we treated all radiology reports as sensitive information and adhered to the usage guidelines of the source datasets. To safeguard data privacy and security, all model training and data generation processes were conducted in a secure, local environment.
Acknowledgements
This work is part of a PhD project funded by the China Scholarship Council-Cardiff University Scholarship (CSC202108140022). The scholarship was awarded to Y.L. The project is supervised by I.S. and H.L.
Conflicts of Interest
No conflicts of interest to declare.
References
- Liao Y, Liu H, Spasić I. Deep learning approaches to automatic radiology report generation: A systematic review. Informatics in Medicine Unlocked. 2023;39:101273.
- Jain S, Agrawal A, Saporta A, Truong SQ, Duong DN, Bui T, et al., editors. RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks; 2021.
- Spasic I, Zhao B, Jones CB, Button K. KneeTex: an ontology–driven system for information extraction from MRI reports. Journal of Biomedical Semantics. 2015;6(1):34.
- Sugimoto K, Takeda T, Oh J-H, Wada S, Konishi S, Yamahata A, et al. Extracting clinical terms from radiology reports with deep learning. Journal of Biomedical Informatics. 2021;116:103729.
- Hassanpour S, Langlotz CP. Information extraction from multi-institutional radiology reports. Artificial Intelligence in Medicine. 2016;66:29-39.
- Datta S, Ulinski M, Godfrey-Stovall J, Khanpara S, Riascos-Castaneda RF, Roberts K, editors. Rad-SpatialNet: A Frame-based Resource for Fine-Grained Spatial Relations in Radiology Reports. 2020 May; Marseille, France: European Language Resources Association.
- Khanna S, Dejl A, Yoon K, Truong SQ, Duong H, Saenz A, Rajpurkar P, editors. RadGraph2: Modeling Disease Progression in Radiology Reports via Hierarchical Information Extraction. Proceedings of the 8th Machine Learning for Healthcare Conference; 2023: PMLR.
- Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C-y, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data. 2019;6(1):317.
- Endo M, Krishnan R, Krishna V, Ng AY, Rajpurkar P. Retrieval-Based Chest X-Ray Report Generation Using a Pre-trained Contrastive Language-Image Model. In: Roy S, Pfohl S, Rocheteau E, Tadesse GA, Oala L, Falck F, et al., editors. Proceedings of Machine Learning for Health; Proceedings of Machine Learning Research: PMLR; 2021. pp. 209-219.
- Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J, editors. brat: a Web-based Tool for NLP-Assisted Text Annotation. 2012 April; Avignon, France: Association for Computational Linguistics.
- Grouin C, Rosset S, Zweigenbaum P, Fort K, Galibert O, Quintard L, editors. Proposal for an Extension of Traditional Named Entities: From Guidelines to Evaluation, an Overview. 2011 June; Portland, Oregon, USA: Association for Computational Linguistics.
- Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence; Honolulu, Hawaii, USA: AAAI Press; 2019. p. Article 73.
- Zhong Z, Chen D, editors. A Frustratingly Easy Approach for Entity and Relation Extraction. 2021 June; Online: Association for Computational Linguistics.
- Ye D, Lin Y, Li P, Sun M, editors. Packed Levitated Marker for Entity and Relation Extraction. 2022 May; Dublin, Ireland: Association for Computational Linguistics.
- Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans Comput Healthcare. 2021;3(1):Article 2.
- Liao Y. Code for CXRGraph. GitHub; 2024. Available from: https://github.com/yxliao95/cxrgraph.
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/p7kf-t860
DOI (latest version):
https://doi.org/10.13026/f2sv-v647
Topics:
relation extraction
information extraction
natural language processing
named entity recognition
structured radiology report
Project Website:
https://github.com/yxliao95/cxrgraph