Database Credentialed Access
MIMIC-IV-ECG-Ext-ICD: Diagnostic labels for MIMIC-IV-ECG
Nils Strodthoff , Juan Miguel Lopez Alcaraz , Wilhelm Haverkamp
Published: Aug. 30, 2024. Version: 1.0.1
When using this resource, please cite:
Strodthoff, N., Lopez Alcaraz, J. M., & Haverkamp, W. (2024). MIMIC-IV-ECG-Ext-ICD: Diagnostic labels for MIMIC-IV-ECG (version 1.0.1). PhysioNet. https://doi.org/10.13026/ypt5-9d58.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
The number of publicly available ECG datasets has increased tremendously in the past few years and several of these datasets have developed into widely used benchmarking datasets. However, most of them exhibit a common limitation, namely the reliance on retrospective annotation and a lack of clinical ground truth. This represents a serious limitation compared to closed in-hospital datasets. To circumvent this issue, we propose the MIMIC-IV-ECG-Ext-ICD dataset by linking the samples from the MIMIC-IV-ECG dataset to clinical ground truth from the MIMIC-IV dataset, in the form of ED and hospital discharge diagnoses. We release this derived dataset to foster further research on ECG-based prediction models with clinical ground truth and build a resource for benchmarking clinical ECG prediction models.
Background
The dataset was proposed in a recent publication [1] that explored the feasibility of training unified prediction models for general diagnostic conditions using a single ECG as input. This allowed us to train prediction models for a large range of cardiac and, most notably, also non-cardiac conditions.
The dataset combines ECG traces from MIMIC-IV-ECG [2] with clinical ground truth from MIMIC-IV [3,4]. At its core, we provide a table that allows the retrieval of different kinds of discharge diagnoses (as ICD-10-CM codes) for a given sample in MIMIC-IV-ECG. For the purpose of benchmarking general ECG prediction models along the lines of [1], we provide recommendations for benchmarking scenarios as well as specific instructions for performing train-test splits; see also [5] for the corresponding code repository. Beyond this, we hope that the dataset will be of general interest to researchers developing ECG prediction models for specific clinical conditions.
Methods
Constructing the MIMIC-IV-ECG-Ext-ICD dataset involved aligning ECG recording times with patient admission and discharge times, retrieving the associated diagnostic codes, and standardizing them (ICD-9-CM to ICD-10-CM).
To link samples from the MIMIC-IV-ECG dataset [2] to clinical ground truth from the clinical MIMIC-IV dataset [3,4], for a given patient ID we identified ECGs taken in the ED or hospital by comparing recording times to patient admission, discharge, and potential death times. This process yielded ED stay IDs (ed_stay_id) for ED-captured ECGs and hospital admission IDs (ed_hadm_id/hosp_hadm_id) for ECGs taken in the hospital or for ED ECGs of patients subsequently admitted.
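The time-based linkage described above can be illustrated with a minimal sketch. The stay records, the field order (stay id, `intime`, `outtime`), and the helper `find_stay` are assumptions made for illustration only; they are not the authors' preprocessing code, which is available in [5].

```python
from datetime import datetime

# Hypothetical ED stays for one patient: (stay_id, intime, outtime).
stays = [
    (101, datetime(2150, 1, 1, 8, 0), datetime(2150, 1, 1, 14, 0)),
    (102, datetime(2150, 3, 5, 22, 0), datetime(2150, 3, 6, 3, 30)),
]


def find_stay(ecg_time, stays):
    """Return the id of the stay whose [intime, outtime] contains ecg_time, else None."""
    for stay_id, intime, outtime in stays:
        if intime <= ecg_time <= outtime:
            return stay_id
    return None


# An ECG recorded during the second stay is linked to it.
matched = find_stay(datetime(2150, 3, 5, 23, 15), stays)
# An ECG recorded between stays (e.g. an outpatient ECG) is not linked.
unmatched = find_stay(datetime(2150, 2, 1, 12, 0), stays)
```

The same interval check would apply to hospital admissions, using admission and discharge (or death) times in place of the ED times.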
ED stay IDs enabled the retrieval of ED discharge diagnoses (max. 9 ICD-9-CM or ICD-10-CM codes) or hospital discharge diagnoses (max. 39 ICD-9-CM or ICD-10-CM codes) sourced from ED records, while hospital admission IDs link samples to hospital discharge diagnoses sourced from hospital records.
We used the Python package icd-mappings to convert ICD-9-CM to ICD-10-CM codes, establishing a common vocabulary. The preprocessing is fully reproducible based on the preprocessing scripts in the associated code repository [5].
Data Description
Below, we describe the columns provided in the main table 'records_w_diag_icd10.csv'. Note that 'ed_stay_id', 'ed_hadm_id', and 'hosp_hadm_id' allow the retrieval of additional clinical metadata from MIMIC-IV beyond the basic demographic data included in the table for the user's convenience.
column | description |
---|---|
file_name | path to the waveform |
study_id | study id within MIMIC-IV-ECG |
subject_id | patient id within MIMIC-IV-ECG |
ecg_time | time of the waveform collection |
column | description |
---|---|
ed_stay_id | ED stay identifier |
ed_hadm_id | hospital admission identifier sourced from the ED system |
hosp_hadm_id | hospital admission identifier |
ed_diag_ed | ICD-10-CM ED discharge diagnoses sourced from the ED system |
ed_diag_hosp | ICD-10-CM hospital discharge diagnoses sourced from MIMIC-IV via 'ed_hadm_id' |
hosp_diag_hosp | ICD-10-CM hospital discharge diagnoses sourced from MIMIC-IV via 'hosp_hadm_id' |
all_diag_hosp | unique hospital discharge diagnoses after concatenating 'ed_diag_hosp' and 'hosp_diag_hosp' |
all_diag_all | 'all_diag_hosp' if available otherwise 'ed_diag_ed' |
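The relationship between the derived label columns can be sketched as follows. This is an illustrative reading of the column descriptions above, not the authors' code; the function name `derive_label_columns` and the example codes are hypothetical.

```python
def derive_label_columns(ed_diag_ed, ed_diag_hosp, hosp_diag_hosp):
    """Derive 'all_diag_hosp' and 'all_diag_all' from the three source columns."""
    # Concatenate both hospital-diagnosis lists and keep each code once
    # (dict.fromkeys preserves the order of first occurrence).
    all_diag_hosp = list(dict.fromkeys(ed_diag_hosp + hosp_diag_hosp))
    # Prefer hospital discharge diagnoses; fall back to ED diagnoses if none exist.
    all_diag_all = all_diag_hosp if all_diag_hosp else ed_diag_ed
    return all_diag_hosp, all_diag_all


# Patient admitted after an ED visit: hospital diagnoses take precedence.
admitted = derive_label_columns(["R079"], ["I10", "I2510"], ["I10"])
# Patient discharged from the ED: only ED diagnoses are available.
ed_only = derive_label_columns(["R079"], [], [])
```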
column | description |
---|---|
gender | patient's gender |
age | patient age at the time of ECG recording calculated from 'anchor_age', 'anchor_year', and 'ecg_time' |
anchor_age | age at 'anchor_year' |
anchor_year | specified reference year |
dod | date of death (if applicable) |
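One plausible way to compute 'age' from the listed inputs is shown below. The formula (anchor age plus the difference between the ECG year and the anchor year) is an assumption based on the column description; the exact computation is defined by the preprocessing scripts in [5].

```python
from datetime import datetime


def age_at_ecg(anchor_age, anchor_year, ecg_time):
    """Assumed formula: age at the anchor year plus the years elapsed until the ECG."""
    return anchor_age + (ecg_time.year - anchor_year)


# A patient aged 60 in anchor year 2140 with an ECG recorded in 2143.
example_age = age_at_ecg(60, 2140, datetime(2143, 5, 1, 10, 30))
```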
column | description |
---|---|
ecg_no_within_stay | enumerates ECGs within a given ED/hospital stay |
ecg_taken_in_ed | boolean variable indicating if the ECG was taken in the ED |
ecg_taken_in_hosp | boolean variable indicating if the ECG was taken in the hospital |
ecg_taken_in_ed_or_hosp | boolean variable indicating if the ECG was taken either in the ED or in the hospital (i.e. no outpatient ECG) |
column | description |
---|---|
fold | random fold assignments (without patient overlap) to reproduce benchmarking results from [1] |
strat_fold | alternative stratified folds using multi-label stratification (applied to 'all_diag_all' truncated to 5 digits with up-propagation along the label tree, gender, binned age, and outpatient status) |
Usage Notes
The above table allows the retrieval of diagnostic codes and basic demographic data for samples in MIMIC-IV-ECG (linked through filename). This information can be used to train general-purpose ECG prediction models. For comparability, we invite people to follow the benchmarking recommendations below. We refer to the original publication [1] for the first benchmarking results and [5] for the corresponding code repository. The code repository contains a demo notebook "src/Demo.ipynb" that demonstrates how to load the dataframe in Python and provides utility functions to truncate ICD codes at a certain level and convert them into appropriate label sets that can be used to train supervised classification models.
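As a rough illustration of loading the table without third-party dependencies, the sketch below parses diagnosis columns into Python lists. It assumes that these columns are stored as string representations of Python lists (e.g. `['I10', 'E119']`); this storage format, the helper `load_records`, and the sample rows are assumptions, and the demo notebook in [5] remains the reference.

```python
import ast
import csv
import io

# Columns assumed to hold lists of ICD-10-CM codes serialized as strings.
LIST_COLUMNS = ["ed_diag_ed", "ed_diag_hosp", "hosp_diag_hosp",
                "all_diag_hosp", "all_diag_all"]


def load_records(handle):
    """Read rows from an open CSV file and parse the diagnosis columns into lists."""
    rows = []
    for row in csv.DictReader(handle):
        for col in LIST_COLUMNS:
            if col in row:
                # literal_eval safely parses "['I10', 'E119']" into a list.
                row[col] = ast.literal_eval(row[col]) if row[col] else []
        rows.append(row)
    return rows


# Tiny in-memory stand-in for 'records_w_diag_icd10.csv' (made-up values).
sample = io.StringIO(
    "study_id,all_diag_all\n"
    "40000001,\"['I10', 'E119']\"\n"
    "40000002,[]\n"
)
records = load_records(sample)
```

With access to the real file, the same function could be called as `load_records(open("records_w_diag_icd10.csv", newline=""))`.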
Benchmarking recommendations
For benchmarking, we distinguish different scenarios based on the selection of samples and labels. Following [1], we introduce the notation T(A2B)-E(C2D). Here A and C, taken from {ALL,ED,HOSP} (corresponding to 'ecg_taken_in_ed_or_hosp', 'ecg_taken_in_ed' or 'ecg_taken_in_hosp' set to true, respectively), refer to the subsets of ECGs used for training and evaluation, while B and D, taken from {ALL,ED,HOSP} (corresponding to 'all_diag_all', 'ed_diag_ed' or 'all_diag_hosp', respectively), refer to the label sets used for training and evaluation. For example, T(ED2ALL)-E(ED2ALL) trains and evaluates on ED ECGs using all available discharge diagnoses. The ECG subsets and label sets used for training do not have to coincide with those used for evaluation, so different combinations allow different clinical settings to be investigated.
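The scenario notation can be decoded mechanically, as in the sketch below. The function `parse_scenario` is hypothetical, and the mapping of the HOSP label set to 'all_diag_hosp' is an assumption based on the column descriptions in the Data Description section.

```python
import re

# ECG subset name -> boolean column that must be true for a sample to be selected.
SUBSET_FLAG = {
    "ALL": "ecg_taken_in_ed_or_hosp",
    "ED": "ecg_taken_in_ed",
    "HOSP": "ecg_taken_in_hosp",
}
# Label set name -> label column (HOSP -> 'all_diag_hosp' is an assumption).
LABEL_COLUMN = {
    "ALL": "all_diag_all",
    "ED": "ed_diag_ed",
    "HOSP": "all_diag_hosp",
}

_NAME = "(ALL|ED|HOSP)"
_PATTERN = re.compile(rf"T\({_NAME}2{_NAME}\)-E\({_NAME}2{_NAME}\)")


def parse_scenario(name):
    """Map a scenario such as 'T(ED2ALL)-E(ED2ALL)' to (flag column, label column) pairs."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid scenario name: {name!r}")
    a, b, c, d = match.groups()
    return {
        "train": (SUBSET_FLAG[a], LABEL_COLUMN[b]),
        "eval": (SUBSET_FLAG[c], LABEL_COLUMN[d]),
    }


scenario = parse_scenario("T(ALL2ALL)-E(ED2ED)")
```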
Patients in MIMIC-IV-ECG were randomly assigned to twenty folds: folds 0 to 17 are used for training, fold 18 for validation and model selection, and fold 19 for testing. For benchmarking purposes, and in contrast to model training, we select only the first ECG of each stay ('ecg_no_within_stay'==0) in the validation and test folds to prevent bias in model evaluation due to patients with a large number of ECGs per stay.
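A minimal sketch of this split, assuming each record is a dictionary with 'fold' and 'ecg_no_within_stay' entries; the helper `split_records` and the sample records are hypothetical.

```python
TRAIN_FOLDS = set(range(18))  # folds 0..17
VAL_FOLD = 18
TEST_FOLD = 19


def split_records(records, fold_key="fold"):
    """Split records into train/val/test; val and test keep only the first ECG per stay."""
    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        fold = int(rec[fold_key])
        if fold in TRAIN_FOLDS:
            # Training uses all ECGs of a stay.
            splits["train"].append(rec)
        elif int(rec["ecg_no_within_stay"]) == 0:
            if fold == VAL_FOLD:
                splits["val"].append(rec)
            elif fold == TEST_FOLD:
                splits["test"].append(rec)
    return splits


# Made-up records for illustration.
demo_records = [
    {"study_id": 1, "fold": 3, "ecg_no_within_stay": 0},
    {"study_id": 2, "fold": 3, "ecg_no_within_stay": 1},
    {"study_id": 3, "fold": 18, "ecg_no_within_stay": 0},
    {"study_id": 4, "fold": 18, "ecg_no_within_stay": 2},
    {"study_id": 5, "fold": 19, "ecg_no_within_stay": 0},
]
demo_splits = split_records(demo_records)
```

Passing `fold_key="strat_fold"` would apply the same logic to the alternative stratified folds.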
All codes were truncated to a consistent five-digit format (e.g., I48.92 for unspecified atrial flutter), retaining entries with fewer digits and removing trailing 'X' placeholder characters. Superclasses down to the three-digit level were included to ensure consistent mapping (e.g., I48.9 and I48 for the example of I48.92 from above). We discard samples with an empty set of discharge diagnoses. The final label sets are selected by discarding codes occurring fewer than 2000 times in the dataset (selection based on ALL2ALL to ensure consistency across all scenarios), resulting in a label set comprising 1076 3- to 5-digit ICD-10-CM codes for training. For a more comprehensive description and first benchmarking results in different scenarios, see [1] and the corresponding code repository [5].
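The truncation, up-propagation, and frequency filtering steps can be sketched as follows. This is an illustrative reimplementation under stated assumptions (codes are handled without the dot, and each code is counted at most once per sample); the function names are hypothetical, and the utility functions in [5] are the reference.

```python
from collections import Counter


def expand_code(code, max_len=5, min_len=3):
    """Truncate a code to max_len characters and add its superclasses down to min_len."""
    code = code.replace(".", "").upper()[:max_len]
    # Remove trailing 'X' placeholder characters left after truncation.
    while len(code) > min_len and code.endswith("X"):
        code = code[:-1]
    # Up-propagation: I4892 -> {I48, I489, I4892}.
    return {code[:n] for n in range(min_len, len(code) + 1)}


def build_label_set(samples, min_count=2000):
    """Return (kept codes, expanded label sets), dropping samples without diagnoses."""
    counts = Counter()
    expanded = []
    for codes in samples:
        labels = set().union(*(expand_code(c) for c in codes))
        if labels:  # discard samples with an empty set of diagnoses
            expanded.append(labels)
            counts.update(labels)
    kept = {code for code, n in counts.items() if n >= min_count}
    return kept, expanded


# Toy example with a threshold of 2 instead of 2000.
kept, expanded = build_label_set([["I4892"], ["I480"], []], min_count=2)
```

In this toy example only the shared superclass I48 reaches the threshold, which shows how up-propagation lets rare leaf codes still contribute to a frequent parent label.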
For the benchmark results from [1] (Table A.1 in the supplementary material), ECG samples were resampled to 100Hz, with missing signal values linearly interpolated and infrequent missing values at sequence boundaries replaced with zero. Signals were clipped to a maximum amplitude of 3 mV. Apart from resampling, handling missing values, and clipping, no further preprocessing was applied to the raw ECG signals.
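The missing-value handling and clipping can be sketched in plain Python for a single lead. This sketch assumes that missing samples are encoded as NaN, that amplitudes are in mV, and that clipping is symmetric (to the range from -3 mV to 3 mV); resampling to 100 Hz is not covered here. The helper names are hypothetical.

```python
import math


def fill_missing(signal):
    """Linearly interpolate interior NaNs; replace NaNs at the boundaries with zero."""
    out = list(signal)
    valid = [i for i, v in enumerate(out) if not math.isnan(v)]
    if not valid:
        return [0.0] * len(out)
    first, last = valid[0], valid[-1]
    # Boundary gaps (before the first or after the last valid sample) become zero.
    for i in range(len(out)):
        if i < first or i > last:
            out[i] = 0.0
    # Interior gaps are filled by linear interpolation between valid neighbours.
    for left, right in zip(valid, valid[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (out[right] - out[left])
    return out


def clip(signal, limit=3.0):
    """Clip amplitudes to [-limit, limit] (symmetric clipping is an assumption)."""
    return [max(-limit, min(limit, v)) for v in signal]


nan = float("nan")
filled = fill_missing([nan, 1.0, nan, 3.0, nan])
clipped = clip([-4.5, 0.2, 3.7])
```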
Release Notes
1.0.0 Initial release of the dataset.
1.0.1 Fixed issues with stratified folds (not used in the original benchmark)
Ethics
This study utilized data from the publicly available Medical Information Mart for Intensive Care (MIMIC) database. The use of MIMIC data for research purposes is governed by the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and researchers are required to adhere to strict ethical guidelines when accessing and using this data.
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Strodthoff, N., Alcaraz, J.M.L., & Haverkamp, W. (2024). Prospects for Artificial Intelligence-Enhanced ECG as a Unified Screening Tool for Cardiac and Non-Cardiac Conditions – An Explorative Study in Emergency Care. European Heart Journal - Digital Health, ztae039. https://doi.org/10.1093/ehjdh/ztae039
2. Gow, B., Pollard, T., Nathanson, L. A., Johnson, A., Moody, B., Fernandes, C., Greenbaum, N., Waks, J. W., Eslami, P., Carbonati, T., Chaudhari, A., Herbst, E., Moukheiber, D., Berkowitz, S., Mark, R., & Horng, S. (2023). MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset (version 1.0). PhysioNet. https://doi.org/10.13026/4nqg-sb35
3. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV (version 2.2). PhysioNet. https://doi.org/10.13026/6mm1-ek67
4. Johnson, A.E.W., Bulgarelli, L., Shen, L. et al. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10, 1. https://doi.org/10.1038/s41597-022-01899-x
5. Strodthoff, N., Alcaraz, J.M.L., & Haverkamp, W. (2024). Code repository. https://github.com/AI4HealthUOL/ECG-MIMIC https://zenodo.org/doi/10.5281/zenodo.11403028
6. Wagner, P., Strodthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze, F.I., Samek, W., & Schaeffter, T. (2020). PTB-XL: A Large Publicly Available ECG Dataset. Scientific Data. https://doi.org/10.1038/s41597-020-0495-6
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.1):
https://doi.org/10.13026/ypt5-9d58
DOI (latest version):
https://doi.org/10.13026/bd2x-mf68
Topics:
electrocardiography
machine learning
mimic
Files
To access the files, you must:
- be a credentialed user
- complete required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project