Database Credentialed Access
MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing
Shengpu Tang, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael Sjoding, Jenna Wiens
Published: April 28, 2021. Version: 1.0.0
When using this resource, please cite:
Tang, S., Davarmanesh, P., Song, Y., Koutra, D., Sjoding, M., & Wiens, J. (2021). MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing (version 1.0.0). PhysioNet. https://doi.org/10.13026/2qtg-k467.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
This is a preprocessed dataset derived from patient records in MIMIC-III and eICU, two large-scale electronic health record (EHR) databases. It contains features and labels for 5 prediction tasks involving 3 adverse outcomes (prediction times listed in parentheses): in-hospital mortality (48h), acute respiratory failure (4h and 12h), and shock (4h and 12h). We extracted comprehensive, high-dimensional feature representations (up to ~8,000 features) using FIDDLE (FlexIble Data-Driven pipeLinE), an open-source preprocessing pipeline for structured clinical data. These 5 prediction tasks were designed in consultation with a critical care physician for their clinical importance, and were used as part of the proof-of-concept experiments in the original paper to demonstrate FIDDLE's utility in aiding the feature engineering step of machine learning model development. The intent of this release is to share preprocessed MIMIC-III and eICU datasets used in the experiments to support and enable reproducible machine learning research on EHR data.
Background
To date, researchers have successfully leveraged electronic health record (EHR) data and machine learning (ML) tools to build patient risk stratification models for many adverse outcomes. However, prior to applying ML, substantial effort must be devoted to preprocessing. EHR data are messy, often consisting of high-dimensional, irregularly sampled time series with multiple data types and missing values. Transforming EHR data into feature vectors suitable for ML techniques requires many decisions, such as what input variables to include, how to resample longitudinal data, and how to handle missing data, among many others. Currently, EHR data preprocessing is largely ad hoc and can vary widely between studies. In an effort to speed up and standardize the preprocessing of EHR data, we proposed FIDDLE [1], a tool that systematically transforms structured EHR data into representations that can be used as inputs to ML algorithms. We evaluated FIDDLE through a proof-of-concept experiment in the context of MIMIC-III and eICU [2-5]. This version of the data was used to produce the results published by Tang et al. [1].
Methods
From MIMIC-III, we focused on 17,710 patients (23,620 ICU visits) monitored using the iMDSoft MetaVision system (2008–2012). We chose MetaVision over the Philips CareVue system (2001–2008) because it is more recent and therefore represents more up-to-date clinical practices [2,3]. Each ICU visit is identified by a unique ICUSTAY_ID.
The eICU Collaborative Research Database consists of data from 139,367 patients (200,859 ICU visits) who were admitted to 200 different ICUs located throughout the United States in 2014 and 2015 [4,5]. Each ICU visit is identified by a unique patientunitstayid.
For both databases, we extracted data from structured tables that pertain to patient health:
- MIMIC-III (10 tables): PATIENTS, ADMISSIONS, ICUSTAYS, CHARTEVENTS, LABEVENTS, INPUTEVENTS_MV, OUTPUTEVENTS, PROCEDUREEVENTS_MV, MICROBIOLOGYEVENTS, DATETIMEEVENTS
- eICU (18 tables): patient, vitalPeriodic, vitalAperiodic, lab, customLab, medication, infusionDrug, intakeOutput, microLab, note, nurseAssessment, nurseCare, nurseCharting, pastHistory, physicalExam, respiratoryCare, respiratoryCharting, treatment
We formatted the data into a table with 4 columns, [ID, t, variable_name, variable_value], and then applied FIDDLE (using the default settings) to the processed data tables for each of the 5 prediction tasks to convert them into feature matrices. A snapshot of the code used for data extraction and preprocessing is included in the code/ folder, but please refer to the project GitHub repository for the latest version [6].
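The 4-column layout can be pictured with a small, made-up example. The sketch below is not the authors' extraction code: the variable names, IDs, and values are hypothetical, and treating t as hours is an assumption made only for illustration. Only the column names come from the text above.

```python
import csv
import io

# Column names are taken from the text; everything else below is made up.
COLUMNS = ["ID", "t", "variable_name", "variable_value"]

# Hypothetical rows: one row per recorded observation of one variable.
# t is assumed here to be a time offset in hours (an assumption).
rows = [
    {"ID": 1001, "t": 0.5, "variable_name": "heart_rate", "variable_value": 82},
    {"ID": 1001, "t": 1.0, "variable_name": "heart_rate", "variable_value": 85},
    {"ID": 1001, "t": 1.0, "variable_name": "glucose", "variable_value": 110},
    {"ID": 1002, "t": 0.25, "variable_name": "heart_rate", "variable_value": 97},
]


def to_csv(records):
    """Serialize the long-format records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


table_csv = to_csv(rows)
print(table_csv)
```

In this long format, each measurement occupies its own row, so variables recorded at different times and rates can share a single table.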
Data Description
The root folder contains 2 subfolders, FIDDLE_mimic3/ and FIDDLE_eicu/, each of which contains 2 subfolders: population/ and features/.
population/ contains 5 files, one per prediction task, each specifying the population of ICU stays for that task. Each file also contains the onset hour (for ARF and shock) as well as the binary label for the adverse outcome.
- mortality_48h.csv
- ARF_4h.csv
- ARF_12h.csv
- Shock_4h.csv
- Shock_12h.csv
features/ includes subfolders (named identically to the population files) that contain the time-invariant features s and time-dependent features X for the corresponding prediction task. These features can be used to replicate the main experiments of the paper. Within the subfolder for each task there are:
- Time-invariant features
  - s.npz: sparse matrix containing time-invariant features.
  - s.feature_names.json: the string names of the time-invariant features.
  - s.feature_aliases.json: the alias mapping of time-invariant features.
- Time-dependent features
  - X.npz: sparse tensor containing time-dependent features.
  - X.feature_names.json: the string names of the time-dependent features.
  - X.feature_aliases.json: the alias mapping of time-dependent features.
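The folder layout described above can be written down as a small path helper, which is handy for checking that a download is complete. This is a minimal sketch that assumes the feature subfolders are named after the population files without the .csv extension; the function names expected_files and missing_files are hypothetical and not part of the release.

```python
from pathlib import Path

# Task names follow the population file names listed in the text.
TASKS = ["mortality_48h", "ARF_4h", "ARF_12h", "Shock_4h", "Shock_12h"]

# Files described for each task's feature subfolder.
FEATURE_FILES = [
    "s.npz",
    "s.feature_names.json",
    "s.feature_aliases.json",
    "X.npz",
    "X.feature_names.json",
    "X.feature_aliases.json",
]


def expected_files(dataset_root, task):
    """Return the paths one task is expected to have under a dataset root
    such as FIDDLE_mimic3/ or FIDDLE_eicu/."""
    root = Path(dataset_root)
    paths = [root / "population" / f"{task}.csv"]
    paths += [root / "features" / task / name for name in FEATURE_FILES]
    return paths


def missing_files(dataset_root, task):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(dataset_root, task) if not p.exists()]


for path in expected_files("FIDDLE_mimic3", "ARF_4h"):
    print(path)
```

Calling missing_files for each entry of TASKS gives a quick way to spot an incomplete download before attempting to load anything.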
The cohort numbers and dimensionalities of extracted features are summarized below.
Usage Notes
See the included Jupyter notebook for an example of loading the features/labels. Please also refer to the project GitHub repository, which implements the experiments in the original paper and contains code to load the features/labels and train various machine learning models [6].
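To load the features, you need Python and the sparse package [7]. Set task to the name of one of the 5 prediction tasks; the paths are relative to FIDDLE_mimic3/ or FIDDLE_eicu/.
import json
import sparse
# One of: mortality_48h, ARF_4h, ARF_12h, Shock_4h, Shock_12h
task = 'mortality_48h'
# Load the sparse arrays and convert them to dense arrays
s = sparse.load_npz(f'features/{task}/s.npz').todense()
X = sparse.load_npz(f'features/{task}/X.npz').todense()
# Load the feature names
with open(f'features/{task}/s.feature_names.json', 'r') as f:
    s_feature_names = json.load(f)
with open(f'features/{task}/X.feature_names.json', 'r') as f:
    X_feature_names = json.load(f)
To load the labels, use pandas or an alternative CSV reader [8]:
import pandas as pd
df_pop = pd.read_csv(f'population/{task}.csv')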
For each task, df_pop, s, and X all have the same length, corresponding to the number of ICU stays in the study population of that task; each row corresponds to information pertaining to an ICU stay.
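Because the three objects are matched only by row position, it can be worth verifying that their lengths agree right after loading. The following is a minimal sketch: check_alignment is a hypothetical helper, not part of the release, and the shapes in the example are made up.

```python
def check_alignment(n_population_rows, s_shape, X_shape):
    """Raise ValueError unless the first dimension of s and of X equals
    the number of rows in the population table."""
    n_s, n_X = s_shape[0], X_shape[0]
    if not (n_population_rows == n_s == n_X):
        raise ValueError(
            f"Row counts differ: population={n_population_rows}, s={n_s}, X={n_X}"
        )
    return n_population_rows


# Made-up shapes for illustration only; real shapes come from the loaded data.
n_stays = check_alignment(3, (3, 10), (3, 4, 20))
print(n_stays)
```

With the loaded data, the call would be check_alignment(len(df_pop), s.shape, X.shape).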
Release Notes
Current Version
The current version of this dataset release is v1.0.0.
v1.0.0
Initial release.
Acknowledgements
This work was supported by the Michigan Institute for Data Science (MIDAS); the National Science Foundation award number IIS-1553146; the National Heart, Lung, and Blood Institute grant number R25HL147207; and the National Library of Medicine grant number R01LM013325. The views and conclusions in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Michigan Institute for Data Science; the National Science Foundation; the National Heart, Lung, and Blood Institute; or the National Library of Medicine.
The authors would also like to thank the members of the MLD3 group at the University of Michigan for helpful discussion regarding this work.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
1. Tang, S., Davarmanesh, P., Song, Y., Koutra, D., Sjoding, M. W., & Wiens, J. (2020). Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. Journal of the American Medical Informatics Association, 27(12), 1921-1934.
2. Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35
3. Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.
4. Pollard, T., Johnson, A., Raffa, J., Celi, L. A., Badawi, O., & Mark, R. (2019). eICU Collaborative Research Database (version 2.0). PhysioNet. https://doi.org/10.13026/C2WM1R.
5. Pollard, T., Johnson, A., Raffa, J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data 5, 180178 (2018). https://doi.org/10.1038/sdata.2018.178
6. FIDDLE code repository. https://github.com/MLD3/FIDDLE-experiments [Accessed: 20 April 2021]
7. Sparse package for Python. https://sparse.pydata.org/en/stable/ [Accessed: 20 April 2021]
8. The Pandas Development Team. pandas-dev/pandas: Pandas. Zenodo. Feb 2020. https://doi.org/10.5281/zenodo.3509134
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/2qtg-k467
DOI (latest version):
https://doi.org/10.13026/zwds-nn76
Topics:
preprocessing
electronic health record
machine learning
Project Website:
https://github.com/MLD3/FIDDLE
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project