Database Credentialed Access
MIMIC-III-Ext-tPatchGNN
Published: April 9, 2025. Version: 1.0.0
When using this resource, please cite:
Yin, C., & Zhang, W. (2025). MIMIC-III-Ext-tPatchGNN (version 1.0.0). PhysioNet. https://doi.org/10.13026/ckn0-3868.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
This dataset is a curated subset of MIMIC-III (v1.4), formatted to facilitate reproduction of the experiments in the t-PatchGNN paper. It is part of a benchmark for forecasting irregular multivariate clinical time series: given a set of historical Irregular Multivariate Time Series (IMTS) observations and a set of forecasting queries, the task is to accurately predict the values corresponding to those queries. This requires addressing key challenges such as missing data, variable sampling rates, and complex temporal dependencies. The dataset includes patient records with diverse physiological measurements, each sampled at irregular intervals, reflecting real-world clinical scenarios. It is structured to capture both short-term and long-term temporal patterns, making it well-suited for evaluating machine learning models in medical time series forecasting. By providing a standardized benchmark, this dataset aims to advance research in predictive modeling for healthcare, enabling the development of robust algorithms that can handle irregular and sparse clinical data. The dataset’s applications extend to areas such as early disease detection, patient risk stratification, and treatment outcome prediction, making it a valuable resource for the medical AI and machine learning communities.
Background
Irregular multivariate time series data are prevalent in real-world applications, particularly in healthcare [1], where data collection often occurs at inconsistent intervals and involves multiple variables. Such data pose significant modeling challenges due to irregular sampling, missing values, and the need to capture both temporal and inter-variable relationships [2].
This research addresses these challenges by proposing a novel methodology, Transformable Patching Graph Neural Networks (t-PatchGNN). The approach leverages graph-based modeling techniques, introducing a patching mechanism that transforms irregular data into structured representations. This enables effective forecasting while preserving the temporal and multivariate characteristics inherent in the data.
To support the evaluation of the proposed methodology, a dataset processed from the MIMIC-III clinical database is used. MIMIC-III serves as a general-purpose resource for benchmarking methods that handle irregular time series, due to its diverse and complex clinical data. The dataset processing follows the methodology described in "Neural Flows: Efficient Alternative to Neural ODEs" [3], which handles irregularities in time series data and ensures compatibility with a wide range of model architectures.
This research contributes to the broader field of time series forecasting and predictive analytics by addressing the complexities of irregular sampling. The proposed methods are particularly relevant in domains like healthcare and finance, where such challenges frequently arise, highlighting the importance of innovative tools for accurate and interpretable predictions.
Methods
The dataset was derived from the publicly available MIMIC-III database, which contains comprehensive clinical records of intensive care unit (ICU) patients, including physiological signals, laboratory test results, medication records, and other medical observations. MIMIC-III inherently presents challenges for time-series analysis due to its irregular sampling, missing values, and heterogeneous variable distributions.
Data Selection and Inclusion Criteria
To create MIMIC-III-Ext-tPatchGNN, we apply the following criteria to select a subset of admissions for the dataset:
- Include only patients in the Metavision system.
- Retain only patients with a single admission.
- Select patients whose admission duration is between 48 hours and 30 days.
- Exclude patients younger than 15 years at the time of admission.
- Remove patients without chart events data.
- Exclude patients with fewer than 50 measurements within the first 48 hours (equivalent to recording only half of the retained variables once in 48 hours).
After applying these criteria, the dataset is restricted to 23,457 patients.
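The criteria above can be read as a sequence of row filters. The following is a minimal sketch using pandas, not the authors' preprocessing code: it assumes a per-patient summary table has already been built, and every column name (`source_system`, `n_admissions`, `stay_hours`, `age_years`, `n_chart_events`, `n_measurements_48h`) is a hypothetical placeholder rather than a MIMIC-III field. Whether the 48-hour and 30-day bounds are inclusive in the original pipeline is also an assumption.

```python
import pandas as pd


def apply_inclusion_criteria(adm: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that pass all six inclusion criteria.

    Every column name here is a hypothetical placeholder for a value that
    would first have to be derived from the raw MIMIC-III tables.
    """
    keep = (
        (adm["source_system"] == "metavision")      # Metavision patients only
        & (adm["n_admissions"] == 1)                # single admission
        & adm["stay_hours"].between(48, 30 * 24)    # 48 hours to 30 days (inclusive here)
        & (adm["age_years"] >= 15)                  # drop patients younger than 15
        & (adm["n_chart_events"] > 0)               # must have chart events
        & (adm["n_measurements_48h"] >= 50)         # at least 50 measurements in first 48 h
    )
    return adm.loc[keep].reset_index(drop=True)


# Toy table: only patients 1 and 7 satisfy every criterion.
toy = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4, 5, 6, 7],
        "source_system": ["metavision", "carevue", "metavision", "metavision",
                          "metavision", "metavision", "metavision"],
        "n_admissions": [1, 1, 2, 1, 1, 1, 1],
        "stay_hours": [72, 72, 72, 24, 100, 100, 720],
        "age_years": [60, 60, 60, 60, 10, 40, 40],
        "n_chart_events": [500, 500, 500, 500, 500, 300, 300],
        "n_measurements_48h": [120, 120, 120, 120, 120, 30, 50],
    }
)
selected = apply_inclusion_criteria(toy)
print(selected["patient_id"].tolist())
```

Combining the conditions into one boolean mask keeps each criterion on its own line, so a single rule can be relaxed or tightened without touching the others.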
Preprocessing and Feature Engineering
The raw MIMIC-III data underwent a multi-step preprocessing pipeline to standardize and format it for time-series forecasting tasks. The key preprocessing steps included:
- Data Extraction and Alignment
  - Time-stamped physiological measurements, laboratory results, and vital signs were extracted from relevant tables within MIMIC-III (e.g., CHARTEVENTS, LABEVENTS, INPUTEVENTS, and OUTPUTEVENTS).
  - Data entries were aligned to a unified time axis, handling variable sampling rates and irregular time intervals between measurements.
- Data Masking
  - A mask matrix was introduced to indicate missing values explicitly, preserving information about irregularity while allowing models to differentiate observed versus missing data points.
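To make the alignment and masking steps concrete, here is a small illustrative sketch in NumPy. It is not taken from the Neural Flows preprocessing scripts: the function name `to_masked_arrays`, the triplet input layout (time, variable index, value), and the choice of 0 as the placeholder for unobserved entries are all assumptions made for this example. It also assumes each (time, variable) pair occurs at most once.

```python
import numpy as np


def to_masked_arrays(times, var_idx, values, n_vars):
    """Turn (time, variable, value) records into a value matrix and a mask.

    Returns:
        tt:   sorted unique timestamps, shape (T,)
        vals: values on the unified time axis, shape (T, n_vars); 0 where unobserved
        mask: 1.0 where a value was observed, 0.0 otherwise, shape (T, n_vars)
    """
    times = np.asarray(times, dtype=float)
    var_idx = np.asarray(var_idx, dtype=int)
    values = np.asarray(values, dtype=float)

    # Unified time axis: `row` maps every record to its row in the output.
    tt, row = np.unique(times, return_inverse=True)

    vals = np.zeros((tt.size, n_vars))
    mask = np.zeros((tt.size, n_vars))
    vals[row, var_idx] = values
    mask[row, var_idx] = 1.0
    return tt, vals, mask


# Three variables observed at irregular times; two share the timestamp 0.0.
tt, vals, mask = to_masked_arrays(
    times=[0.0, 0.0, 1.5, 4.0],
    var_idx=[0, 2, 1, 0],
    values=[80.0, 98.0, 7.4, 76.0],
    n_vars=3,
)
print(tt)    # unified time axis
print(mask)  # which entries of `vals` are real observations
```

Because unobserved entries are stored as 0 in `vals`, the mask is what lets a model tell a true zero measurement from a missing one.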
The entire dataset processing pipeline was implemented in Python, leveraging libraries such as pandas, NumPy, and PyTorch. The preprocessing scripts are publicly available in the official Neural Flows [3] repository, ensuring reproducibility and ease of integration into benchmarking workflows.
Data Description
The dataset is a single file, mimic.pt, which is a processed and serialized PyTorch tensor. This file encapsulates the irregular multivariate time series data derived from the MIMIC-III database. It has been formatted to facilitate ease of use and compatibility with a variety of models and frameworks, enabling the reproduction of experiments and supporting further research in forecasting irregular time series data.
To use this dataset, first clone the t-PatchGNN repository:
git clone https://github.com/usail-hkust/t-PatchGNN.git
Then place the file at the following path inside the repository:
data/mimic/processed/mimic.pt
After these steps, you can directly replicate the results of the benchmark.
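The placement step can also be scripted. The sketch below uses only the Python standard library; the helper name `place_dataset` is hypothetical, and it assumes the target path is relative to the root of the cloned repository.

```python
from pathlib import Path
import shutil


def place_dataset(downloaded: Path, repo_root: Path) -> Path:
    """Copy the downloaded mimic.pt to the location described above.

    `repo_root` is assumed to be the root of the cloned t-PatchGNN repository.
    """
    target = repo_root / "data" / "mimic" / "processed" / "mimic.pt"
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    shutil.copy2(downloaded, target)                  # keep the original download
    return target


# Example call (paths are placeholders for your own download and clone):
# place_dataset(Path("~/Downloads/mimic.pt").expanduser(), Path("t-PatchGNN"))
```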
Additionally, we provide two supplementary files: full_dataset.csv and variable_dict.csv. full_dataset.csv is the CSV-formatted version of mimic.pt, and variable_dict.csv describes the variables used in the dataset.
Usage Notes
How the Data is to Be Used
The dataset, mimic.pt, is designed for tasks related to irregular multivariate time series forecasting. It is particularly suitable for reproducing results from the paper "Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach" [2], as well as advancing research in related fields. Researchers can directly integrate the processed data into the t-PatchGNN model or other forecasting frameworks for experimentation, validation, and benchmarking purposes.
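Before wiring the file into another framework, it can help to inspect what it contains. This description does not specify the internal layout of mimic.pt, so the sketch below makes no assumption about it: the hypothetical helper `describe` only summarizes whatever nested containers and array shapes it finds, and `load_and_describe` wraps `torch.load`.

```python
from pathlib import Path


def describe(obj, max_items=3):
    """Return a short summary of a nested Python object (lists, dicts, tensors)."""
    if hasattr(obj, "shape"):  # e.g. torch.Tensor or numpy.ndarray
        return f"{type(obj).__name__}{tuple(obj.shape)}"
    if isinstance(obj, dict):
        inner = ", ".join(
            f"{k!r}: {describe(v, max_items)}" for k, v in list(obj.items())[:max_items]
        )
        return f"dict[{len(obj)}]{{{inner}}}"
    if isinstance(obj, (list, tuple)):
        inner = ", ".join(describe(v, max_items) for v in obj[:max_items])
        return f"{type(obj).__name__}[{len(obj)}]({inner})"
    return type(obj).__name__


def load_and_describe(path):
    """Load a .pt file on the CPU and return (object, summary string)."""
    import torch  # imported lazily so `describe` works without PyTorch installed

    # Recent PyTorch releases default to weights_only=True; if loading fails for
    # that reason and you trust the file, pass weights_only=False explicitly.
    data = torch.load(Path(path), map_location="cpu")
    return data, describe(data)


# Example call once the file is in place:
# data, summary = load_and_describe("data/mimic/processed/mimic.pt")
# print(summary)
```

Only the first few items of each container are summarized, which keeps the output readable for files holding many records.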
Potential Applications
This dataset is valuable for a variety of applications, especially those involving healthcare and other domains where multivariate time series data is often collected at irregular intervals. It can be utilized to:
- Develop and evaluate models that handle irregularly sampled time series.
- Benchmark algorithms designed for time series forecasting, particularly those incorporating machine learning techniques.
- Explore missing data modeling and strategies for coping with varying sampling rates, both of which are common challenges in real-world time series data.
Known Limitations
- Only a subset of the complete MIMIC-III dataset has been selected for inclusion, which may limit the representativeness of the data for certain tasks or conditions not covered by the chosen subset.
Related Software and Tools
To facilitate the use of the dataset, we provide code and tools for preprocessing and model experimentation. The key software libraries involved include:
- Python: The core programming language for data manipulation and model implementation.
- PyTorch: Used for deep learning model development and training.
- NumPy: Utilized for numerical data processing and manipulation.
Ethics
This project builds upon previously established datasets, which have been de-identified and approved for credentialed distribution. All data used in this research are sourced from the MIMIC-III Clinical Database [1].
Conflicts of Interest
The authors of this project declare that they have no financial, commercial, legal, or professional relationships with other organizations or individuals that could influence this research. There are no conflicts of interest associated with this project.
References
1. Johnson A, Pollard T, Mark R. MIMIC-III Clinical Database (version 1.4). PhysioNet. 2016. Available from: https://doi.org/10.13026/C2XW26.
2. Zhang W, Yin C, Liu H, Zhou X, Xiong H. Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach. In: Proceedings of the Forty-first International Conference on Machine Learning. 2024. Available from: https://openreview.net/forum?id=UZlMXUGI6e.
3. Biloš M, Sommer J, Rangapuram SS, Januschowski T, Günnemann S. Neural Flows: Efficient Alternative to Neural ODEs. Advances in Neural Information Processing Systems. 2021.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/ckn0-3868
DOI (latest version):
https://doi.org/10.13026/rzea-ah40
Project Website:
https://github.com/usail-hkust/t-PatchGNN
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project