
LATTE-CXR: Locally Aligned TexT and imagE, Explainable dataset for Chest X-Rays

Elham Ghelichkhan, Tolga Tasdizen

Published: Feb. 4, 2025. Version: 1.0.0


When using this resource, please cite:
Ghelichkhan, E., & Tasdizen, T. (2025). LATTE-CXR: Locally Aligned TexT and imagE, Explainable dataset for Chest X-Rays (version 1.0.0). PhysioNet. https://doi.org/10.13026/0pw2-je90.

Additionally, please cite the original publication:

Elham Ghelichkhan and Tolga Tasdizen. "A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization Using Eye-tracking Data." 2025.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Local annotation of medical data is both expensive and time-consuming due to the high cost of expert annotators, the precision required for accurate annotation, and the inherent challenges of medical diagnosis. To address these problems, we developed LATTE-CXR, a chest X-ray dataset with locally aligned image-text pairs derived from the REFLACX dataset. LATTE-CXR supports tasks requiring local image-text annotations, such as phrase grounding, caption-guided object detection, and image captioning with region-level descriptions. By extracting statements from radiology reports corresponding to REFLACX annotated abnormalities, this dataset includes 3,926 bounding box-statement pairs (with repeating statements) from 1,668 MIMIC-CXR image readings in the REFLACX dataset. Additionally, we automatically generated 13,751 bounding box-sentence pairs from 2,742 chest X-ray readings, utilizing timestamped eye-tracking data and transcribed reports from REFLACX. The eye-tracking bounding boxes are linked to corresponding annotated bounding boxes if they share a sentence, providing a comprehensive framework for assessing model explainability.


Background

AI models can enhance diagnostic accuracy, reduce human error, and streamline the analysis of medical images and text, leading to faster and more reliable healthcare outcomes. However, locally annotated chest X-ray datasets are rare and small due to the costly and time-intensive annotation process. For instance, the MS-CXR dataset [1, 2] for phrase grounding contains 1,153 bounding box-statement annotations. To address this limitation, we leveraged the REFLACX dataset [3, 4], which includes annotations of labels and abnormality locations, to create locally aligned image region-text pairs for tasks such as phrase grounding. Although LATTE-CXR may be less accurate than MS-CXR due to its automated generation without radiologist supervision, it offers a larger, scalable dataset of locally aligned bounding box-text pairs. The annotated bounding boxes in LATTE-CXR were drawn by experienced REFLACX radiologists, and we extracted their corresponding text from the same radiologist's transcribed report. To ensure that a model's predictions align with medical reasoning and clinical practice, researchers analyze model explainability. We proposed using radiologists' eye-tracking data as a benchmark for assessing model explainability. Specifically, we automatically generated bounding boxes for 13,751 sentences from REFLACX Phase 2 and Phase 3 reports, leveraging timestamped eye-tracking data and transcriptions. Although the eye-tracking data is inherently noisy, the evaluation on the REFLACX test set (mIoU = 36.07) across five different radiologists demonstrated that these bounding boxes are learnable by an AI model [5].


Methods

The dataset includes all readings from REFLACX [3, 4] phases 2 and 3 that have recorded eye-tracking data. Since the REFLACX phase 3 validation set originally contains only 6 images, we randomly reassigned some training images to the validation set, resulting in a split of 915/221 chest X-rays for training and validation, respectively.

We repurposed REFLACX annotations for tasks requiring locally aligned image regions and text. Each annotated abnormality corresponds to a bounding box (originally an ellipse in REFLACX) and is paired with a statement. A statement refers to the sentence(s) in the report that describe the abnormality or abnormalities enclosed by the bounding box. To extract these statements, we first filtered out sentences indicating the absence of pathologies: sentences containing the term 'normal' (but not 'abnormal') and sentences containing 'no ', such as 'No acute fracture'. Then, for each bounding box, we concatenated the sentences that referenced at least one of its annotated labels, according to our predefined_dictionary. As a result, a statement may include multiple sentences and can be paired with multiple bounding boxes that share the same labels. We discarded bounding boxes for which we could not extract a statement.

For each bounding box, given its chest X-ray report, we used our predefined_dictionary to extract the sentences indicating the bounding box labels. The keys of the predefined_dictionary are the abnormality labels, and the values are the substrings we used to extract sentences from the report. For instance, for a bounding box labeled with the abnormalities Atelectasis and Consolidation, the keys are Atelectasis and Consolidation. Based on the predefined_dictionary below, the values are ['atelectasis', ['lung', 'loss', 'volume'], ['lob', 'loss', 'volume']] for Atelectasis and ['consolidation', 'airspace opaci'] for Consolidation. We extracted the report sentences that include at least one of the substrings in ['atelectasis', ['lung', 'loss', 'volume'], ['lob', 'loss', 'volume']] or ['consolidation', 'airspace opaci'], and concatenated them. In this example, a report sentence is chosen for the key Atelectasis if 'atelectasis' is a substring of the sentence, or if all three of ['lung', 'loss', 'volume'] are substrings of the sentence, or if all three of ['lob', 'loss', 'volume'] are. The extracted statement for this bounding box is "there is a left lower lobe opacity likely representing atelectasis or consolidation.".

predefined_dictionary = {
    'Abnormal mediastinal contour': ['mediastin'],
    'Acute fracture': ['fracture', ['bone', 'discontinu']],
    'Consolidation': ['consolidation', 'airspace opaci'],
    'Atelectasis': ['atelectasis', ['lung', 'loss', 'volume'], ['lob', 'loss', 'volume']],
    'Hiatal hernia': ['hiatal hernia'],
    'Pneumothorax': ['pneumothorax', ['punctured', 'lung'], ['lob', 'collapse'], 'collapased lung'],
    'Pulmonary edema': ['pulmonary edema', ['edema', 'lung'], ['airway', 'wall', 'thick']],
    'High lung volume / emphysema': [['high', 'lung', 'volume'], 'emphysema', ['high', 'respiratory', 'volume']],
    'Groundglass opacity': ['glass opacit'],
    'Interstitial lung disease': [['interstitial', 'lung'], ['diffuse', 'parenchymal', 'lung']],
    'Lung nodule or mass': ['nodule', 'mass'],
    'Pleural abnormality': ['pleural effusion', 'pleural thickening', 'pleural abnormalit', 'pleural effusion', 'pleural Other', 'pleural discorder'],
    'Enlarged cardiac silhouette': [['cardi', 'enlarge'], 'cardiomegaly', 'megacardio', 'megalocardio', ['heart', 'enlarge']],
    'Enlarged hilum': [['enlarge', 'hil']]
}
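As a minimal illustration of the matching rule described above, the following Python sketch pairs a bounding box's labels with report sentences. The function names, the naive sentence split on periods, and the details of the negation filter are our own illustrative choices, not the released implementation:

def sentence_matches_label(sentence, patterns):
    """A sentence matches a label if it contains a plain-string pattern
    as a substring, or every element of a list-valued pattern."""
    s = sentence.lower()
    for p in patterns:
        if isinstance(p, str):
            if p.lower() in s:
                return True
        elif all(sub.lower() in s for sub in p):
            return True
    return False

def extract_statement(report, box_labels, dictionary):
    """Concatenate report sentences referencing at least one bounding box
    label; return None if no sentence matches."""
    sentences = [s.strip() for s in report.split('.') if s.strip()]
    kept = []
    for sent in sentences:
        low = sent.lower()
        # drop negative findings: 'normal' (but not 'abnormal'), or 'no '
        if ('normal' in low and 'abnormal' not in low) or 'no ' in low:
            continue
        if any(sentence_matches_label(sent, dictionary[label]) for label in box_labels):
            kept.append(sent)
    return ('. '.join(kept) + '.') if kept else None

For the example above, extract_statement(report, ['Atelectasis', 'Consolidation'], predefined_dictionary) would return the concatenated statement.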

Further, when multiple bounding boxes corresponded to the same statement, we assigned a single bounding box to each statement in the training and validation sets. REFLACX bounding boxes are annotated with a certainty level, ranging from 1 to 5, reflecting the radiologists' confidence in their abnormality annotations. For statements linked to multiple bounding boxes, we randomly selected one bounding box with the highest certainty level to pair with the statement. The chosen flag indicates whether a bounding box is the only one assigned to its statement. This flag is set to False for phase 3 test samples and all phase 2 samples, as we used the entirety of phase 2 as the test set in our study [5]. An mIoU of 50.52% and an accuracy of 76.89% (IoU > 30%) validate the alignment between the bounding boxes and statements.
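A minimal pandas sketch of this selection rule, assuming a DataFrame with the statement and certainty columns described in the Data Description below; the function name and the tie-breaking seed are illustrative:

import pandas as pd

def mark_chosen(boxes: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """For each statement, flag one bounding box with the highest
    certainty as chosen, breaking ties randomly."""
    boxes = boxes.copy()
    boxes['chosen'] = False
    for _, group in boxes.groupby('statement'):
        # restrict to the boxes with the maximum certainty for this statement
        best = group[group['certainty'] == group['certainty'].max()]
        pick = best.sample(n=1, random_state=seed).index[0]
        boxes.loc[pick, 'chosen'] = True
    return boxes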

Using eye-tracking data, we automatically assigned a bounding box to every sentence of the reports. To generate an eye-tracking bounding box for a sentence, we utilized images, timestamped eye-tracking fixations, and timestamped medical reports through the following process (a code sketch follows the list):

  • For each sentence, starting with the first sentence in a report, we collected fixations from a Pre-Sentence Interval (PSI) before the sentence's start time until its end time. Based on the analysis in [4], we set PSI = 1.5 s.
  • We assigned each fixation to the first sentence that covered its duration. For each fixation, we generated a Gaussian heatmap proportional to the fixation duration, following the approach in [6].
  • We summed the heatmaps of all fixations corresponding to a sentence and normalized the resulting heatmap to a range of [0, 255].
  • We filtered out pixels with an intensity smaller than 40% of the heatmap's maximum intensity and removed objects smaller than 1/400 of the image size. These thresholds were determined through visual experiments.
  • Finally, we extracted the bounding box by enclosing the remaining heatmap region in a rectangle, resulting in the eye-tracking bounding box for the sentence.
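The following Python sketch implements the steps above, assuming fixations are given as (x, y, duration) tuples in image coordinates. The Gaussian width is an illustrative parameter; the exact heatmap construction follows [6] and may differ:

import numpy as np
from scipy import ndimage

def etbox_for_sentence(fixations, img_h, img_w,
                       intensity_frac=0.40, min_obj_frac=1 / 400):
    """Generate an eye-tracking bounding box from the fixations
    assigned to one sentence; returns None if nothing survives."""
    yy, xx = np.mgrid[0:img_h, 0:img_w]
    heat = np.zeros((img_h, img_w))
    for x, y, duration in fixations:
        sigma = 0.05 * min(img_h, img_w)  # assumed width, not from [6]
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        heat += duration * g              # heatmap scales with fixation duration
    if heat.max() == 0:
        return None
    heat = 255 * heat / heat.max()        # normalize to [0, 255]
    mask = heat >= intensity_frac * 255   # drop pixels below 40% of the maximum
    labeled, n = ndimage.label(mask)      # remove small connected components
    for i in range(1, n + 1):
        if np.sum(labeled == i) < min_obj_frac * img_h * img_w:
            mask[labeled == i] = False
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)             # enclosing rectangle
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())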

These methods are scalable since they do not require human annotation. From datasets that already include bounding boxes, labels, and global textual descriptions, the repurposing approach can automatically generate locally aligned image-text pairs. Additionally, our second method can automatically create explainable bounding boxes for datasets containing timestamped eye-tracking data and textual descriptions, providing a valuable tool for model explainability studies.


Data Description

Global information about the chest X-rays is provided in the metadata_phaseX.csv tables in the root directory, where X equals 2 or 3, referring to REFLACX [3, 4] phase 2 or 3. These tables contain paths for reading data from the MIMIC [7, 8, 9, 10] and REFLACX datasets. After discarding readings without eye-tracking data, phase 2 contains 240 readings and phase 3 contains 2,507 readings. In phase 2, we extracted at least one bounding box-statement pair for 147 readings and discarded 47 bounding boxes. In phase 3, we extracted at least one bounding box-statement pair for 1,074 readings, discarding 369 bounding boxes. The metadata tables contain the following information (a loading example follows the field list):

./metadata_phaseX.csv

  • reading_id (string): the id of the reading to link with the REFLACX dataset.
  • split (string): indicates whether a chest X-ray is part of the train/validation/test set. Some training samples from REFLACX were reassigned to the validation set.
  • REFLACX_split (string): original chest X-ray split in REFLACX and MIMIC-CXR datasets.
  • image_path (string): the chest X-ray image location in MIMIC-CXR dataset.
  • dicom_id (string): the id of the chest X-ray to link with the MIMIC-CXR dataset.
  • subject_id (string): the patient id from MIMIC-CXR and MIMIC-IV [11] datasets.
  • image_size_x (int): the horizontal chest X-ray image size in pixels, from REFLACX dataset.
  • image_size_y (int): the vertical chest X-ray image size in pixels, from REFLACX dataset.
  • has_annotated_abnormality (Boolean): True if this chest X-ray reading has at least one annotated ellipse in REFLACX dataset, False otherwise.
  • has_paired_bbox_statement (Boolean): True if this chest X-ray reading has at least one bounding box-statement pair in LATTE-CXR, False otherwise.
  • fixations (string): path to the REFLACX eye-tracking fixations csv file.
  • anomaly_location_ellipses (string): path to the REFLACX csv table containing the coordinates and abnormality labels of the annotated ellipses.
  • chest_bounding_box (string): path to the REFLACX csv file containing the coordinates of the chest bounding box drawn around the lungs and heart.
  • timestamps_transcription (string): path to the REFLACX csv file containing the timestamped transcribed words stated by the radiologists.
  • transcription (string): path to the report collected by REFLACX for a reading.
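A minimal loading sketch using pandas, assuming the dataset root as the working directory; the column names follow the field list above:

import pandas as pd

meta = pd.read_csv('metadata_phase3.csv')
paired = meta[meta['has_paired_bbox_statement'] == True]  # readings with pairs

pairs = []
for _, row in paired.iterrows():
    bbox = pd.read_csv(f"data/{row['reading_id']}/bbox_statement.csv")
    bbox['reading_id'] = row['reading_id']
    bbox['image_path'] = row['image_path']  # the image itself lives in MIMIC-CXR
    pairs.append(bbox)

all_pairs = pd.concat(pairs, ignore_index=True)
print(len(all_pairs), 'bounding box-statement pairs in phase 3')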

For each reading, there is a subdirectory in the data directory named after its reading_id, containing a comma-separated table bbox_statement.csv (if has_annotated_abnormality is True for the reading in metadata_phase2.csv or metadata_phase3.csv) and one etbox_sentence.csv file. A total of 1,668 chest X-ray readings have bbox_statement.csv tables. Each row in this table corresponds to one annotated bounding box for which we found a corresponding statement in the report, as below:

./data/reading_id/bbox_statement.csv

  • box_index (int): the bounding box index in the table, starting from 1 and ending at the total number of annotated bounding boxes paired with a statement.
  • xmin, ymin, xmax, ymax (int): bounding box coordinates corresponding to REFLACX annotated ellipses.
  • certainty (int): the radiologists' certainty about this bounding box annotation in the REFLACX dataset.
  • statement (string): concatenation of one or more report sentences referring to the bounding box abnormality labels.
  • chosen (string): True if the bounding box is the only bounding box paired with its statement. When a statement refers to multiple bounding boxes, this flag is True for the one with the highest certainty.
  • Abnormal mediastinal contour, Acute fracture, Atelectasis, Consolidation, Enlarged cardiac silhouette, Enlarged hilum, Groundglass opacity, Hiatal hernia, High lung volume / emphysema, Interstitial lung disease, Lung nodule or mass, Other, Pleural abnormality, Pneumothorax, Pulmonary edema, Support devices (Boolean): True for the abnormalities annotated for the bounding box by REFLACX radiologists.

A total of 2,742 readings contain generated eye-tracking box-sentence pairs. Each row of the etbox_sentence.csv table contains data for one report sentence, as below (a linking example follows the field list):

./data/reading_id/etbox_sentence.csv

  • Ann_box_index (int): 0 if the sentence in this row is not a substring of any annotated bounding box statement. Otherwise, the box_index of the corresponding annotated bounding box in the bbox_statement.csv file in the same directory.
  • xmin, ymin, xmax, ymax (int): coordinates of the eye-tracking bounding box in the image space.
  • sentence (string): a sentence in the radiology report.

Usage Notes

LATTE-CXR includes chest X-rays from the MIMIC-CXR dataset [7, 8, 9, 10] along with eye-tracking fixations and annotated abnormalities from the REFLACX dataset [3, 4]; therefore, access to both datasets is essential to utilize LATTE-CXR.

The data contain training, validation, and test samples for objectives such as caption-guided object detection, image captioning with region-level descriptions, and phrase grounding. We utilized these data for phrase grounding in our study [5]. Phrase grounding requires a single bounding box associated with a textual description for training and validation; in this case, the bounding boxes with chosen=True can be used as the bounding box aligned with the statement for training and validation. For evaluation, for each statement aligned with multiple bounding boxes, the annotated bounding box that overlaps best with the model prediction can be chosen as the ground truth, as in the sketch below.
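A sketch of this best-overlap evaluation rule; the helper names are ours, and boxes are (xmin, ymin, xmax, ymax) tuples:

def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def best_overlap_iou(prediction, gt_boxes):
    """Score a prediction against the ground-truth box it overlaps best."""
    return max(iou(prediction, gt) for gt in gt_boxes)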

The eye-tracking bounding boxes represent the areas a radiologist scanned to diagnose diseases and can be used to assess a model's explainability. We trained and evaluated on the eye-tracking bounding boxes to validate their learnability [5]. When a model's predictions fall within the eye-tracking bounding boxes, it suggests that the model's decision-making process aligns with human diagnostic reasoning and is therefore explainable. We propose the containment ratio as a metric for evaluating model explainability [5]. The containment ratio can be calculated as:

CR = \frac{area\_size(ET \cap Prediction)}{area\_size(Prediction)}

The rationale for excluding the size of the eye-tracking bounding box from the denominator is that these bounding boxes are typically large, representing the entire region a radiologist scans to identify potential abnormalities. In contrast, models are designed to predict the precise location of the abnormalities. Therefore, if a model's prediction falls within the eye-tracking bounding box (the radiologist's region of interest), it is still explainable even if it does not cover the entire eye-tracking bounding box.
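A direct translation of the formula into Python, with boxes as (xmin, ymin, xmax, ymax) tuples; the function name is ours:

def containment_ratio(prediction, et_box):
    """CR = area_size(ET intersect Prediction) / area_size(Prediction)."""
    ix = max(0, min(prediction[2], et_box[2]) - max(prediction[0], et_box[0]))
    iy = max(0, min(prediction[3], et_box[3]) - max(prediction[1], et_box[1]))
    pred_area = (prediction[2] - prediction[0]) * (prediction[3] - prediction[1])
    return (ix * iy) / pred_area if pred_area else 0.0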

One of the limitations of this dataset is the absence of radiologist supervision. Although we extracted sentences by searching the reports for the labeled diseases, collaboration with a radiologist could help validate these extractions, potentially allowing us to pair more bounding boxes with statements rather than discarding them. Another limitation is the need for more robust ways to optimize the hyperparameters used in generating the eye-tracking bounding boxes.


Release Notes

Initial release of the dataset


Ethics

We used the publicly available REFLACX dataset [3, 4], which is deidentified, so no further approval was required.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Boecking B, Usuyama N, Bannur S, Coelho de Castro D, Schwaighofer A, Hyland S, Sharma H, Wetscherek MT, Naumann T, Nori A, Alvarez-Valle J, Poon H, & Oktay O (2024). MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing (version 1.1.0). PhysioNet. https://doi.org/10.13026/9g2z-jg61.
  2. Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, Wetscherek M, Naumann T, Nori A, Alvarez-Valle J, Poon H, & Oktay O. (2022). Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, 2022, Proceedings, Part XXXVI. Springer-Verlag 1–21. https://doi.org/10.1007/978-3-031-20059-5_1
  3. Bigolin Lanfredi R, Zhang M, Auffermann W, Chan J, Duong P, Srikumar V, Drew T, Schroeder J, & Tasdizen T (2021). REFLACX: Reports and eye-tracking data for localization of abnormalities in chest x-rays (version 1.0.0). PhysioNet. https://doi.org/10.13026/e0dj-8498.
  4. Bigolin Lanfredi R, Zhang M, Auffermann W, Chan J, Duong P, Srikumar V, Drew T, Schroeder J, & Tasdizen T (2022). REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. Sci Data 9, 350. https://doi.org/10.1038/s41597-022-01441-z
  5. Ghelichkhan E & Tasdizen T (2025). A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization Using Eye-tracking Data. In: Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI). IEEE.
  6. Bigolin Lanfredi R, Schroeder JD, & Tasdizen T (2023). Localization supervision of chest x-ray classifiers using label-specific eye-tracking annotation. Frontiers in Radiology, 3, 1088068.
  7. Johnson A, Pollard T, Mark R, Berkowitz S, & Horng S (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet. https://doi.org/10.13026/C2JT1Q.
  8. Johnson AEW, Pollard TJ, Berkowitz SJ et al (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317. https://doi.org/10.1038/s41597-019-0322-0
  9. Johnson A, Lungren M, Peng Y, Lu Z, Mark R, Berkowitz S, & Horng S (2019). MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet. https://doi.org/10.13026/8360-t248.
  10. Johnson AE, Pollard TJ, Berkowitz S, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S (2019). MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042
  11. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi LA, & Mark R (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98.

Access

Access Policy:
Only registered users who sign the specified data use agreement can access the files.

License (for files):
PhysioNet Restricted Health Data License 1.5.0

Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
