Database Credentialed Access
Symile-MIMIC: a multimodal clinical dataset of chest X-rays, electrocardiograms, and blood labs from MIMIC-IV
Adriel Saporta, Aahlad Manas Puli, Mark Goldstein, Rajesh Ranganath
Published: Jan. 28, 2025. Version: 1.0.0
When using this resource, please cite:
Saporta, A., Puli, A. M., Goldstein, M., & Ranganath, R. (2025). Symile-MIMIC: a multimodal clinical dataset of chest X-rays, electrocardiograms, and blood labs from MIMIC-IV (version 1.0.0). PhysioNet. https://doi.org/10.13026/3vvj-s428.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Symile-MIMIC is a multimodal clinical dataset derived from MIMIC-IV and MIMIC-CXR, consisting of chest X-rays (CXRs), electrocardiograms (ECGs), and blood laboratory tests. It was developed to evaluate Symile, a contrastive learning objective designed to handle multiple modalities and enable any model to generate representations for each modality. The dataset explores whether ECG and blood work collected at admission are predictive of a CXR taken shortly thereafter. Symile-MIMIC includes 11,622 admissions, split into training, validation, and test sets with no patient overlap between splits. Each sample contains an ECG, a CXR, and up to 50 common blood lab results. This module provides: (1) the dataset in CSV format, (2) pre-processed tensors of the dataset, (3) the code to generate the dataset from MIMIC-IV and MIMIC-CXR, and (4) the best model checkpoint trained on the Symile-MIMIC dataset using the Symile objective.
Background
While contrastive learning was originally designed to maximize the mutual information between two modalities, domains such as robotics, healthcare, and video require handling multiple modalities simultaneously. To address this issue, we developed Symile, a simple contrastive learning objective that accommodates any number of modalities and allows any model to produce representations for each modality.
In healthcare, zero-shot retrieval is a common approach for evaluating representation learning methods. Our work applies the Symile objective to Symile-MIMIC, a multimodal clinical dataset consisting of chest X-rays (CXRs), electrocardiograms (ECGs), and blood laboratory tests from the MIMIC-IV [1, 2, 3] and MIMIC-CXR [4, 5] datasets. Symile-MIMIC was developed to investigate whether an ECG and blood labs collected at admission are predictive of a CXR taken shortly thereafter.
Methods
Symile-MIMIC consists of chest X-rays (CXRs), electrocardiograms (ECGs), and blood laboratory results from MIMIC-IV [2], MIMIC-IV-ECG [3], and MIMIC-CXR-JPG [5]. Each data sample includes an ECG reading and blood labs taken within 24 hours of the patient's admission, along with a CXR performed 24-72 hours post-admission. The dataset includes only admissions from the original datasets that contain all three modalities: CXR, ECG, and at least one lab value. For each admission, the earliest available CXR, ECG, and lab results are selected.
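The time windows described above can be sketched as a simple check. This is a minimal illustration rather than the authors' pipeline: the helper name `in_window` is hypothetical, and treating both window boundaries as inclusive is an assumption, since the text only states the 24-hour and 24-72-hour windows.

```python
from datetime import datetime, timedelta

def in_window(admittime, event_time, start_hours, end_hours):
    """Return True if event_time falls start_hours..end_hours after admittime.

    Inclusive boundaries are an assumption made for this sketch.
    """
    delta = event_time - admittime
    return timedelta(hours=start_hours) <= delta <= timedelta(hours=end_hours)

admit = datetime(2153, 1, 1, 8, 0)
ecg_time = datetime(2153, 1, 1, 10, 30)   # 2.5 hours after admission
cxr_time = datetime(2153, 1, 2, 20, 0)    # 36 hours after admission

ecg_ok = in_window(admit, ecg_time, 0, 24)    # ECG within 24 hours of admission
cxr_ok = in_window(admit, cxr_time, 24, 72)   # CXR 24-72 hours post-admission
print(ecg_ok, cxr_ok)
```

When several studies fall inside a window, the earliest one would be kept, for example with `min()` over the qualifying timestamps.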
CXR: Following CheXpert [6], each CXR is scaled such that the smaller edge is set to cxr_scale = 320, followed by a square crop (random for training or center for validation and testing). Images are then normalized using the ImageNet mean and standard deviation. Only CXRs with a posteroanterior (PA) or anteroposterior (AP) view are included.
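As a rough sketch of the CXR steps for validation and test images, the snippet below computes the resize target, applies a center crop, and normalizes with the commonly used ImageNet channel statistics. The function names are hypothetical, the actual resizing (interpolation) is left out, and the input is assumed to be a float array in [0, 1] with shape (3, H, W); the released code may differ in these details.

```python
import numpy as np

CXR_SCALE = 320
# Standard ImageNet channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def resized_shape(height, width, scale=CXR_SCALE):
    """Target (height, width) with the smaller edge equal to `scale`, keeping aspect ratio."""
    if height <= width:
        return scale, max(scale, round(width * scale / height))
    return max(scale, round(height * scale / width)), scale

def center_crop(img, size=CXR_SCALE):
    """Center-crop a (C, H, W) array to (C, size, size)."""
    _, h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[:, top:top + size, left:left + size]

def normalize_imagenet(img):
    """Normalize a (3, H, W) float array channel-wise."""
    return (img - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

# Synthetic stand-in for an image that has already been resized.
h, w = resized_shape(2544, 3056)
img = np.full((3, h, w), 0.5, dtype=np.float32)
out = normalize_imagenet(center_crop(img))
print((h, w), out.shape)
```

For training images, the center crop would be replaced by a crop at a random offset.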
ECG: ECGs with missing values or zero signals are excluded, and each ECG signal is normalized to lie within the range [-1, 1].
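One way to express the ECG filtering and scaling is sketched below. The text does not specify how the normalization is computed, so min-max scaling over the whole recording is an assumption, as is interpreting "zero signals" as recordings that are zero everywhere; both function names are hypothetical.

```python
import numpy as np

def is_valid_ecg(signal):
    """Reject recordings that contain NaNs or consist only of zeros."""
    return not np.isnan(signal).any() and bool(np.any(signal != 0))

def normalize_ecg(signal):
    """Min-max scale a (length, leads) recording into [-1, 1].

    Scaling over the whole recording (rather than per lead) is an assumption.
    """
    lo, hi = signal.min(), signal.max()
    if hi == lo:  # constant recording: nothing to scale
        return np.zeros_like(signal, dtype=float)
    return 2.0 * (signal - lo) / (hi - lo) - 1.0

rng = np.random.default_rng(0)
ecg = rng.normal(size=(5000, 12))  # synthetic 12-lead recording
norm = normalize_ecg(ecg)
print(is_valid_ecg(ecg), norm.min(), norm.max())
```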
Labs: Lab data is limited to the 50 most common blood tests (see Data Description). We use a 100-dimensional vector as input to the labs encoder, where the first 50 coordinates are lab values standardized to percentiles based on the training set's empirical CDF, and the remaining 50 coordinates are binary indicators that denote whether each lab value is missing. When a lab value is unobserved, the mean percentile for that lab is substituted.
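The construction of the labs input can be sketched as follows, using two labs instead of 50 for readability. The function names are hypothetical, and the indicator polarity (1 for an observed value, 0 for a missing one) follows the description of the labs_missingness tensors later in this document; the exact empirical CDF convention (here, the fraction of training values less than or equal to the value) is an assumption.

```python
import numpy as np

def ecdf_percentile(train_values, x):
    """Fraction of observed training values that are <= x (empirical CDF)."""
    observed = np.sort(train_values[~np.isnan(train_values)])
    return np.searchsorted(observed, x, side="right") / observed.size

def build_labs_input(values, train_columns, mean_percentiles):
    """Concatenate percentiles and observed/missing indicators for one admission.

    values: (n_labs,) raw lab values, NaN where a lab is unobserved.
    train_columns: list of 1-D arrays of training-set values, one per lab.
    mean_percentiles: (n_labs,) mean percentile per lab, used to fill gaps.
    """
    missing = np.isnan(values)
    pct = np.array([
        mean_percentiles[i] if missing[i]
        else ecdf_percentile(train_columns[i], values[i])
        for i in range(values.size)
    ])
    indicators = (~missing).astype(float)  # 1 = observed, 0 = missing
    return np.concatenate([pct, indicators])

train_columns = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, np.nan, 30.0])]
mean_pct = np.array([
    np.mean([ecdf_percentile(col, v) for v in col[~np.isnan(col)]])
    for col in train_columns
])
vec = build_labs_input(np.array([3.0, np.nan]), train_columns, mean_pct)
print(vec)
```

With all 50 labs, the same construction yields the 100-dimensional input described above.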
The dataset is split into training, validation, and test sets, with no patient overlap between splits. The dataset was developed for the zero-shot retrieval task of identifying the most probable candidate CXR for a given query ECG and labs pair, based on the Symile similarity score. Therefore, we build the val_retrieval and test sets specifically for this task, treating each data sample in the val and test sets as a query for CXR retrieval. For each query, we sample 9 negative candidates from the remaining data in the respective split, ensuring that each query has a total of 10 candidates: 1 positive (the query itself) and 9 negatives.
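The candidate construction can be sketched in a few lines. The function name is hypothetical, and sampling negatives uniformly without replacement with a fixed seed is an assumption; the text only states that 9 negatives are drawn from the rest of the split.

```python
import random

def build_retrieval_rows(hadm_ids, n_negatives=9, seed=0):
    """For each query admission, emit 1 positive and n_negatives negative candidates.

    Each row records the candidate (hadm_id), the query it belongs to
    (label_hadm_id), and whether it is the positive candidate (label).
    """
    rng = random.Random(seed)
    rows = []
    for query in hadm_ids:
        others = [h for h in hadm_ids if h != query]
        negatives = rng.sample(others, n_negatives)  # without replacement
        rows.append({"hadm_id": query, "label_hadm_id": query, "label": 1})
        rows.extend(
            {"hadm_id": neg, "label_hadm_id": query, "label": 0}
            for neg in negatives
        )
    return rows

rows = build_retrieval_rows(list(range(100, 120)))
print(len(rows))  # 20 queries * 10 candidates per query
```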
Data Description
symile_mimic_data.csv
The Symile-MIMIC dataset contains 11,622 unique admissions for 9,573 patients. Each row represents a unique hospital admission and includes patient data, admission details, data on an ECG reading and blood labs taken within 24 hours of admission, and data on a CXR performed 24-72 hours post-admission.
The dataset contains the following columns:
Patient and admission data
- subject_id: Unique patient identifier.
- hadm_id: Unique hospital admission identifier.
- admittime: Time of admission.
- dischtime: Time of discharge.
- deathtime: Time of death (if applicable).
- admission_type: Type of admission (e.g., emergency, elective).
- admission_location: Location of the patient prior to arriving at the hospital.
- discharge_location: Disposition of the patient after they are discharged from the hospital.
- race: Patient's race.
- hospital_expire_flag: Indicates whether the patient died within the given hospitalization.
- gender: Patient's gender.
- anchor_age: Patient's age in the anchor_year.
- anchor_year: Shifted year for the patient. For example, if a patient has an anchor_year of 2153 and an anchor_age of 60, then the patient was 60 in the shifted year of 2153.
- age: Patient's age at the time of admission.
- dod: Date of death (if applicable).
- admittime_year: Year of admission (extracted from admittime).
Further details on the columns can be found in the MIMIC documentation for the admissions table in the MIMIC-IV Hosp module [7].
ECG data
- ecg_study_id (same as ecg_adm and ecg_file_name): Unique identifier for the ECG study.
- ecg_time: Time the ECG was collected.
- ecg_path: Path to the ECG file (e.g., files/p1000/p10001176/s40925049/40925049).
Further details on the columns can be found in the MIMIC documentation for the MIMIC-IV ECG module [8].
CXR data
- cxr_dicom_id (same as cxr_24_72_hr): Unique identifier for the CXR DICOM file.
- cxr_study_id (same as study_id): Unique identifier for the CXR study.
- cxr_ViewPosition: The orientation in which the CXR was taken (e.g., PA, AP).
- cxr_ViewCodeSequence_CodeMeaning: The human-readable description of the coded view orientation for the image (e.g., postero-anterior, antero-posterior, lateral).
- cxr_StudyDateTime: Date and time the CXR study was performed.
- cxr_path: Path to the CXR image file (e.g., files/p10/p10001176/s54684191/3b8b1b7d-054490d5-385641e7-ff43d2c8-9505f058.jpg).
- CheXpert label columns: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax, Support Devices.
Further details on the columns can be found on the MIMIC-CXR-JPG PhysioNet page [5].
Labs data
- 50 columns representing the itemid for the 50 labs listed below and containing the earliest lab value recorded within 24 hours of admission.
- labs_all_nan: A binary indicator (0 or 1) that shows whether all lab values in the specified columns are missing (1 if all lab values are missing, 0 otherwise).
Further details on the columns can be found in the MIMIC documentation for the labevents table in the MIMIC-IV Hosp module [9].
# keys are itemids, values are lab names
LABS = {
"51221": "Hematocrit",
"51265": "Platelet Count",
"50912": "Creatinine",
"50971": "Potassium",
"51222": "Hemoglobin",
"51301": "White Blood Cells",
"51249": "MCHC",
"51279": "Red Blood Cells",
"51250": "MCV",
"51248": "MCH",
"51277": "RDW",
"51006": "Urea Nitrogen",
"50983": "Sodium",
"50902": "Chloride",
"50882": "Bicarbonate",
"50868": "Anion Gap",
"50931": "Glucose",
"50960": "Magnesium",
"50893": "Calcium, Total",
"50970": "Phosphate",
"51237": "INR(PT)",
"51274": "PT",
"51275": "PTT",
"51146": "Basophils",
"51256": "Neutrophils",
"51254": "Monocytes",
"51200": "Eosinophils",
"51244": "Lymphocytes",
"52172": "RDW-SD",
"50934": "H",
"51678": "L",
"50947": "I",
"50861": "Alanine Aminotransferase (ALT)",
"50878": "Asparate Aminotransferase (AST)",
"50813": "Lactate",
"50863": "Alkaline Phosphatase",
"50885": "Bilirubin, Total",
"50820": "pH",
"50862": "Albumin",
"50802": "Base Excess",
"50821": "pO2",
"50804": "Calculated Total CO2",
"50818": "pCO2",
"52075": "Absolute Neutrophil Count",
"52073": "Absolute Eosinophil Count",
"52074": "Absolute Monocyte Count",
"52069": "Absolute Basophil Count",
"51133": "Absolute Lymphocyte Count",
"50910": "Creatine Kinase (CK)",
"52135": "Immature Granulocytes"
}
train.csv
The train split is derived from the full dataset in symile_mimic_data.csv and contains 10,000 unique admissions for 8,374 patients. In addition to the columns from the full dataset, the train split includes 50 additional columns representing lab values as standardized percentiles (e.g., 50878_percentile), calculated using the empirical CDF of the training set.
val.csv
The val split is derived from the full dataset in symile_mimic_data.csv and includes 750 unique admissions for 738 patients. It contains the same columns as the training set, including 50 lab value percentile columns, which were calculated using the empirical CDF of the training set.
val_retrieval.csv
To create the val_retrieval data split, each sample from the validation set is treated as a query for the CXR retrieval task. For each query, 9 negative candidates are sampled from the remaining data in the validation set, ensuring that each query has 10 total candidates: 1 positive (the query itself) and 9 negatives. As a result, val_retrieval contains 7,500 rows (750 queries * 10 candidates per query), with the same columns as the validation set and two additional columns:
- label_hadm_id: The hadm_id of the query for which this row serves as a candidate.
- label: A binary indicator where 1 denotes a positive candidate and 0 denotes a negative candidate for the corresponding label_hadm_id. A row is always a positive candidate for itself, meaning whenever label == 1, it follows that label_hadm_id == hadm_id.
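Given one similarity score per row, the label and label_hadm_id columns are enough to evaluate retrieval. The sketch below computes top-1 accuracy, which is one natural metric for this setup and is an assumption here; the function name and the toy scores are hypothetical.

```python
def retrieval_accuracy(rows, scores):
    """Fraction of queries whose highest-scoring candidate has label == 1."""
    best = {}  # label_hadm_id -> (score, label) of the best candidate so far
    for row, score in zip(rows, scores):
        query = row["label_hadm_id"]
        if query not in best or score > best[query][0]:
            best[query] = (score, row["label"])
    return sum(label for _, label in best.values()) / len(best)

# Two queries with two candidates each (the real splits use 10 per query).
rows = [
    {"hadm_id": 1, "label_hadm_id": 1, "label": 1},
    {"hadm_id": 2, "label_hadm_id": 1, "label": 0},
    {"hadm_id": 2, "label_hadm_id": 2, "label": 1},
    {"hadm_id": 1, "label_hadm_id": 2, "label": 0},
]
scores = [0.9, 0.1, 0.2, 0.7]
acc = retrieval_accuracy(rows, scores)
print(acc)  # query 1 is retrieved correctly, query 2 is not
```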
test.csv
The test split is derived from the full dataset in symile_mimic_data.csv and contains 464 unique admissions for 461 patients. It shares the same columns and structure as the val_retrieval split, and was constructed in the same way. As a result, the test split consists of 4,640 rows (464 queries * 10 candidates per query). The 50 lab value percentile columns were calculated using the empirical CDF of the training set.
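After loading the split files, two properties stated above can be checked directly: that no patient appears in more than one split, and that every query in a retrieval split has the expected number of candidates. The sketch below works on rows represented as dictionaries (for example, as produced by `csv.DictReader`); the function name and the toy rows are hypothetical.

```python
from collections import Counter

def check_splits(train_rows, val_rows, test_rows, candidates_per_query=10):
    """Return True if splits share no subject_id and each test query has the expected candidates."""
    subjects = [
        {row["subject_id"] for row in rows}
        for rows in (train_rows, val_rows, test_rows)
    ]
    disjoint = (
        subjects[0].isdisjoint(subjects[1])
        and subjects[0].isdisjoint(subjects[2])
        and subjects[1].isdisjoint(subjects[2])
    )
    per_query = Counter(row["label_hadm_id"] for row in test_rows)
    return disjoint and all(n == candidates_per_query for n in per_query.values())

train = [{"subject_id": 1}, {"subject_id": 2}]
val = [{"subject_id": 3}]
test = [
    {"subject_id": 4, "label_hadm_id": 40},
    {"subject_id": 5, "label_hadm_id": 40},
]
ok = check_splits(train, val, test, candidates_per_query=2)
print(ok)
```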
labs_means.json
This file contains the mean values for the lab percentile columns (those ending with _percentile) from the dataset. Each entry represents the average percentile for a specific lab, calculated across all rows in the dataset. These means are stored because, during training, an unobserved lab value is replaced with the mean percentile for that lab.
symile_mimic_model.ckpt
The best model checkpoint trained on the Symile-MIMIC dataset using the Symile objective. We use the ResNet-50 and ResNet-18 architectures for the CXR and ECG encoders, respectively, and a three-layer neural network to encode the blood labs. All encoders were trained from scratch, and three linear projections map each encoder's representation to the same 8192-dimensional space. Further details can be found in the Symile GitHub repository [10].
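To make the shared representation space concrete, the sketch below scores candidate CXRs against an ECG and labs query using a multilinear inner product, the scoring function described in the Symile paper. This is an assumption relative to this page, which refers only to the "Symile similarity score". The dimension is reduced from 8192 to 4 for readability, any learned temperature or scaling is omitted, and the function name `mip_scores` is hypothetical.

```python
import numpy as np

def mip_scores(r_ecg, r_labs, r_cxr_candidates):
    """Multilinear inner product between one query pair and each candidate CXR.

    r_ecg, r_labs: (d,) query representations.
    r_cxr_candidates: (n_candidates, d) candidate representations.
    Returns (n_candidates,) scores: sum_k ecg[k] * labs[k] * cxr[j, k].
    """
    return r_cxr_candidates @ (r_ecg * r_labs)

r_ecg = np.array([1.0, 0.0, 2.0, 1.0])
r_labs = np.array([0.5, 1.0, 1.0, 0.0])
candidates = np.array([
    [1.0, 1.0, 1.0, 1.0],    # 0.5 + 0 + 2 + 0 = 2.5
    [0.0, 3.0, 0.0, 5.0],    # 0
    [2.0, 0.0, -1.0, 0.0],   # 1 - 2 = -1
])
scores = mip_scores(r_ecg, r_labs, candidates)
best = int(np.argmax(scores))  # index of the retrieved candidate
print(scores, best)
```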
data_npy/
This directory contains preprocessed tensor files in .npy format derived from the Symile-MIMIC dataset. The tensors are saved in split-specific directories: train, val, val_retrieval, test. The files follow a specific naming convention based on the data type and split (e.g., cxr_train.npy, ecg_val_retrieval.npy). Below is a description of the tensors contained in each split directory:
- cxr_<split>.npy: A tensor containing preprocessed CXR images. Following CheXpert [6], each CXR is scaled such that the smaller edge is set to cxr_scale = 320, followed by a square crop (random for training or center for validation and testing). Images are then normalized using the ImageNet mean and standard deviation. The tensor has dimensions (n, 3, 320, 320).
- ecg_<split>.npy: A tensor containing normalized ECG signals. The signals are normalized between -1 and 1 and stored with dimensions (n, 1, 5000, 12), representing the signal length and leads.
- labs_percentiles_<split>.npy: A tensor representing the lab values as standardized percentiles. This tensor has dimensions (n, 50), with one percentile value for each of the 50 lab features.
- labs_missingness_<split>.npy: A tensor indicating the missingness of lab values, with dimensions (n, 50). A value of 0 indicates a missing lab value, while 1 indicates the presence of a lab value.
- hadm_id_<split>.npy: A tensor containing the hospital admission IDs for each row in the split.
- label_hadm_id_<split>.npy (only in val_retrieval and test splits): A tensor containing the query hadm_id for which each row is a candidate.
- label_<split>.npy (only in val_retrieval and test splits): A binary tensor indicating whether each row is a positive (1) or negative (0) candidate for the corresponding label_hadm_id.
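A minimal sketch of reading these tensors with NumPy is shown below. It assumes a layout of `<data_dir>/<split>/<name>_<split>.npy`, inferred from the description above, and uses small synthetic files in a temporary directory so that it runs without the real data; the helper name `load_labs` is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_labs(data_dir, split):
    """Load admission IDs and build the (n, 100) labs input for one split."""
    split_dir = Path(data_dir) / split
    pct = np.load(split_dir / f"labs_percentiles_{split}.npy")    # (n, 50)
    miss = np.load(split_dir / f"labs_missingness_{split}.npy")   # (n, 50)
    hadm = np.load(split_dir / f"hadm_id_{split}.npy")            # (n,)
    labs_input = np.concatenate([pct, miss], axis=1)  # percentiles first, then indicators
    return hadm, labs_input

# Synthetic stand-ins for the real files.
with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp) / "val"
    split_dir.mkdir()
    np.save(split_dir / "labs_percentiles_val.npy", np.full((3, 50), 0.5))
    np.save(split_dir / "labs_missingness_val.npy", np.ones((3, 50)))
    np.save(split_dir / "hadm_id_val.npy", np.array([11, 12, 13]))
    hadm, labs_input = load_labs(tmp, "val")

print(hadm, labs_input.shape)
```

For the much larger CXR and ECG tensors, `np.load(path, mmap_mode="r")` can be used to avoid reading the whole file into memory at once.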
code/
This directory contains scripts to generate, process, and prepare the Symile-MIMIC dataset. The directory includes a detailed README.md with instructions, along with environment.yml and requirements.txt files for setting up the required environment.
Usage Notes
The code/ directory in this module contains scripts to generate, process, and prepare the Symile-MIMIC dataset. The directory includes a detailed README.md with instructions, along with environment.yml and requirements.txt files for setting up the required environment. These same scripts can be found in the Symile GitHub repository [10]. This GitHub repository also includes the code used to reproduce the experiments from the paper, including the experiment that evaluated the Symile objective on the Symile-MIMIC dataset. The best model checkpoint from this experiment is included in this module as symile_mimic_model.ckpt.
Release Notes
Version 1.0.0: Initial upload of dataset, code, and model checkpoint.
Ethics
This project utilizes data derived from the MIMIC-IV [2], MIMIC-IV-ECG [3], and MIMIC-CXR-JPG [5] databases. All three databases are fully de-identified. Furthermore, all authors have obtained the necessary permissions to access and use these datasets under a PhysioNet Credentialed Health Data Use Agreement.
Conflicts of Interest
No conflicts of interest to declare.
References
1. Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, Lehman LH, Celi LA, Mark RG. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.
2. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi L A, Mark R. MIMIC-IV (version 2.2). PhysioNet. 2023. Available from: https://doi.org/10.13026/6mm1-ek67
3. Gow B, Pollard T, Nathanson L A, Johnson A, Moody B, Fernandes C, Greenbaum N, Waks J W, Eslami P, Carbonati T, Chaudhari A, Herbst E, Moukheiber D, Berkowitz S, Mark R, Horng S. MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset (version 1.0). PhysioNet. 2023. Available from: https://doi.org/10.13026/4nqg-sb35
4. Johnson AE, Pollard TJ, Berkowitz S, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019 Jan 21.
5. Johnson A, Lungren M, Peng Y, Lu Z, Mark R, Berkowitz S, Horng S. MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet. 2019. Available from: https://doi.org/10.13026/8360-t248
6. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H, Haghgoo B, Ball R, Shpanskaya K, Seekins J. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence 2019 Jul 17 (Vol. 33, No. 01, pp. 590-597).
7. MIMIC-IV admissions table [Internet]. Boston: MIT Laboratory for Computational Physiology; 2024 [updated 2023 Feb 3; cited 2025 Jan 12]. Available from: https://mimic.mit.edu/docs/iv/modules/hosp/admissions/
8. MIMIC-IV ECG Module [Internet]. Boston: MIT Laboratory for Computational Physiology; 2024 [updated 2023 Dec 21; cited 2025 Jan 12]. Available from: https://mimic.mit.edu/docs/iv/modules/ecg/
9. MIMIC-IV labevents table [Internet]. Boston: MIT Laboratory for Computational Physiology; 2024 [updated 2023 Feb 3; cited 2025 Jan 12]. Available from: https://mimic.mit.edu/docs/iv/modules/hosp/labevents/
10. Symile [Internet]. GitHub: Rajesh Lab; 2023 [updated 2024 Nov 14; cited 2025 Jan 12]. Available from: https://github.com/rajesh-lab/symile
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/3vvj-s428
DOI (latest version):
https://doi.org/10.13026/762h-qf80
Topics:
database
cxr
ecg
chest x-ray
contrastive learning
model
multimodal
mimic
electrocardiogram
Project Website:
https://github.com/rajesh-lab/symile