Name: MIMIC-III-Ext-tPatchGNN
Published: April 9, 2025
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

Database Credentialed Access

Chenlong Yin, Weijia Zhang

Published: April 9, 2025. Version: 1.0.0

When using this resource, please cite:
Yin, C., & Zhang, W. (2025). MIMIC-III-Ext-tPatchGNN (version 1.0.0). PhysioNet. https://doi.org/10.13026/ckn0-3868.

MLA	Yin, Chenlong, and Weijia Zhang. "MIMIC-III-Ext-tPatchGNN" (version 1.0.0). PhysioNet (2025), https://doi.org/10.13026/ckn0-3868.
Chicago	Yin, Chenlong, and Weijia Zhang. "MIMIC-III-Ext-tPatchGNN" (version 1.0.0). PhysioNet (2025). https://doi.org/10.13026/ckn0-3868.
Harvard	Yin, C., and Zhang, W. (2025) 'MIMIC-III-Ext-tPatchGNN' (version 1.0.0), PhysioNet. Available at: https://doi.org/10.13026/ckn0-3868.
Vancouver	Yin C, Zhang W. MIMIC-III-Ext-tPatchGNN (version 1.0.0). PhysioNet. 2025. Available from: https://doi.org/10.13026/ckn0-3868.

Additionally, please cite the original publication:

Zhang W, Yin C, Liu H, Zhou X, Xiong H. Irregular multivariate time series forecasting: A transformable patching graph neural networks approach. In Forty-first International Conference on Machine Learning 2024 Jul 8.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

MLA	Goldberger, A., et al. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220." (2000).
CHICAGO	Goldberger, A., L. Amaral, L. Glass, J. Hausdorff, P. C. Ivanov, R. Mark, J. E. Mietus, G. B. Moody, C. K. Peng, and H. E. Stanley. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220." (2000).
HARVARD	Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P.C., Mark, R., Mietus, J.E., Moody, G.B., Peng, C.K. and Stanley, H.E., 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
VANCOUVER	Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

This dataset is a curated subset of MIMIC-III (v1.4), formatted to facilitate reproduction of the experiments in the t-PatchGNN paper. It forms part of a benchmark for forecasting irregular multivariate clinical time series: given a set of historical Irregular Multivariate Time Series (IMTS) observations and a set of forecasting queries, the task is to accurately predict the values corresponding to those queries. This requires addressing key challenges such as missing data, variable sampling rates, and complex temporal dependencies. The dataset includes patient records with diverse physiological measurements, each sampled at irregular intervals, reflecting real-world clinical scenarios. It is structured to capture both short-term and long-term temporal patterns, making it well suited for evaluating machine learning models in medical time series forecasting. By providing a standardized benchmark, the dataset aims to advance research in predictive modeling for healthcare and to enable the development of robust algorithms that can handle irregular and sparse clinical data. Its applications extend to areas such as early disease detection, patient risk stratification, and treatment outcome prediction, making it a valuable resource for the medical AI and machine learning communities.

Background

Irregular multivariate time series data are prevalent in real-world applications, particularly in healthcare [1], where data collection often occurs at inconsistent intervals and involves multiple variables. Such data pose significant modeling challenges due to irregular sampling, missing values, and the need to capture both temporal and inter-variable relationships [2].

This research addresses these challenges by proposing a novel methodology, Transformable Patching Graph Neural Networks (t-PatchGNN). The approach leverages graph-based modeling techniques, introducing a patching mechanism that transforms irregular data into structured representations. This enables effective forecasting while preserving the temporal and multivariate characteristics inherent in the data.

To support the evaluation of the proposed methodology, a dataset processed from the publicly available MIMIC-III clinical database is used. MIMIC-III serves as a general-purpose resource for benchmarking methods that handle irregular time series, due to its diverse and complex clinical data. The dataset processing follows the methodology described in "Neural Flows: Efficient Alternative to Neural ODEs" [3], which efficiently handles irregularities in time series data and ensures compatibility with a wide range of model architectures.

This research contributes to the broader field of time series forecasting and predictive analytics by addressing the complexities of irregular sampling. The proposed methods are particularly relevant in domains like healthcare and finance, where such challenges frequently arise, highlighting the importance of innovative tools for accurate and interpretable predictions.

Methods

The dataset was derived from the publicly available MIMIC-III database, which contains comprehensive clinical records of intensive care unit (ICU) patients, including physiological signals, laboratory test results, medication records, and other medical observations. MIMIC-III inherently presents challenges for time-series analysis due to its irregular sampling, missing values, and heterogeneous variable distributions.

Data Selection and Inclusion Criteria

To create MIMIC-III-Ext-tPatchGNN, we applied the following criteria to select a subset of admissions:

- Include only patients in the Metavision system.
- Retain only patients with a single admission.
- Select patients whose admission duration is between 48 hours and 30 days.
- Exclude patients younger than 15 years at the time of admission.
- Remove patients without chart events data.
- Exclude patients with fewer than 50 measurements within the first 48 hours (equivalent to recording only half of the retained variables once in 48 hours).

After applying these criteria, the dataset is restricted to 23,457 patients.
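
The criteria above can be read as a sequence of filters. The following is a minimal sketch, assuming a hypothetical per-admission summary table. The column names (`subject_id`, `dbsource`, `n_admissions`, `los_hours`, `age_years`, `n_chartevents`, `n_meas_48h`) and the toy rows are illustrative only and are not taken from the actual preprocessing scripts. Treating the 48-hour and 30-day bounds as inclusive is also an assumption.

```python
import pandas as pd

def select_admissions(adm: pd.DataFrame) -> pd.DataFrame:
    """Apply the six inclusion criteria to a hypothetical per-admission table."""
    keep = (
        (adm["dbsource"] == "metavision")         # Metavision system only
        & (adm["n_admissions"] == 1)              # patients with a single admission
        & adm["los_hours"].between(48, 30 * 24)   # 48 hours to 30 days (bounds assumed inclusive)
        & (adm["age_years"] >= 15)                # exclude patients younger than 15
        & (adm["n_chartevents"] > 0)              # must have chart events data
        & (adm["n_meas_48h"] >= 50)               # at least 50 measurements in the first 48 hours
    )
    return adm.loc[keep].reset_index(drop=True)

# Toy rows, fabricated purely to exercise each filter once.
toy = pd.DataFrame({
    "subject_id":    [1, 2, 3, 4, 5, 6, 7, 8],
    "dbsource":      ["metavision", "carevue", "metavision", "metavision",
                      "metavision", "metavision", "metavision", "metavision"],
    "n_admissions":  [1, 1, 2, 1, 1, 1, 1, 1],
    "los_hours":     [72, 72, 72, 24, 100, 100, 100, 720],
    "age_years":     [60, 60, 60, 60, 10, 40, 40, 70],
    "n_chartevents": [200, 200, 200, 200, 200, 0, 120, 500],
    "n_meas_48h":    [80, 80, 80, 80, 80, 0, 30, 50],
})

selected = select_admissions(toy)
print(selected["subject_id"].tolist())  # [1, 8]
```

Each rejected toy row fails exactly one criterion, which makes it easy to check the filters one at a time.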

Preprocessing and Feature Engineering

The raw MIMIC-III data underwent a multi-step preprocessing pipeline to standardize and format it for time-series forecasting tasks. The key preprocessing steps included:

Data Extraction and Alignment
- Time-stamped physiological measurements, laboratory results, and vital signs were extracted from relevant tables within MIMIC-III (e.g., CHARTEVENTS, LABEVENTS, INPUTEVENTS, and OUTPUTEVENTS).
- Data entries were aligned to a unified time axis, handling variable sampling rates and irregular time intervals between measurements.
Data Masking
- A mask matrix was introduced to indicate missing values explicitly, preserving information about irregularity while allowing models to differentiate observed versus missing data points.
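
To illustrate the alignment and masking steps, the sketch below places irregular observations from several variables on a shared time axis and builds the accompanying mask matrix. The function name `align_and_mask` and the toy observations are hypothetical, and filling unobserved entries with zero is an assumption. The actual pipeline is the one in the Neural Flows repository [3].

```python
import numpy as np

def align_and_mask(observations, n_vars):
    """Align (time, variable_index, value) triples to a unified time axis.

    Returns times with shape (T,), and values and mask with shape (T, n_vars).
    mask[i, j] is 1.0 where variable j was observed at times[i], else 0.0.
    """
    unique_times = sorted({float(t) for t, _, _ in observations})
    row_of = {t: i for i, t in enumerate(unique_times)}

    values = np.zeros((len(unique_times), n_vars))  # unobserved entries stay 0 (assumption)
    mask = np.zeros((len(unique_times), n_vars))
    for t, var, x in observations:
        values[row_of[float(t)], var] = x
        mask[row_of[float(t)], var] = 1.0
    return np.array(unique_times), values, mask

# Toy observations (hours since admission, variable index, value); fabricated.
obs = [(0.0, 0, 80.0), (0.5, 1, 120.0), (0.5, 0, 82.0), (2.0, 2, 36.6)]
times, values, mask = align_and_mask(obs, n_vars=3)
```

The mask lets a model tell a genuinely observed zero apart from a placeholder for a missing value, which is why it is stored alongside the values.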

The entire dataset processing pipeline was implemented in Python, using libraries such as pandas, NumPy, and PyTorch. The preprocessing scripts are publicly available in the official Neural Flows [3] repository, ensuring reproducibility and ease of integration into benchmarking workflows.

Data Description

The dataset is a single file, mimic.pt, which is a processed and serialized PyTorch tensor. This file encapsulates the irregular multivariate time series data derived from the MIMIC-III database. It has been formatted to facilitate ease of use and compatibility with a variety of models and frameworks, enabling the reproduction of experiments and supporting further research in forecasting irregular time series data.

To use this dataset, first clone the t-PatchGNN repository:

git clone https://github.com/usail-hkust/t-PatchGNN.git

Then place mimic.pt at the following path:

data/mimic/processed/mimic.pt

After these steps, you can directly replicate the results of the benchmark.
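
Before running the benchmark, it can help to confirm that the file sits where the code expects it. The following is a small sketch using only the Python standard library. The helper name `check_mimic_file` is hypothetical, and the sketch assumes the path above is relative to the root of the cloned repository.

```python
from pathlib import Path

# Location given above; assumed to be relative to the repository root.
EXPECTED_RELATIVE_PATH = Path("data") / "mimic" / "processed" / "mimic.pt"

def check_mimic_file(repo_root):
    """Return the expected path of mimic.pt, raising if the file is absent."""
    target = Path(repo_root) / EXPECTED_RELATIVE_PATH
    if not target.is_file():
        raise FileNotFoundError(f"mimic.pt not found; expected it at {target}")
    return target
```

For example, calling `check_mimic_file("t-PatchGNN")` from the directory containing the clone returns the full path once the file is in place, and raises `FileNotFoundError` otherwise.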

Additionally, we provide two supplementary files: full_dataset.csv and variable_dict.csv. full_dataset.csv is the CSV-formatted version of mimic.pt, and variable_dict.csv describes the variables used in the dataset.

Usage Notes

How the Data is to Be Used

The dataset, mimic.pt, is designed for tasks related to irregular multivariate time series forecasting. It is particularly suitable for reproducing results from the paper "Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach" [2], as well as advancing research in related fields. Researchers can directly integrate the processed data into the t-PatchGNN model or other forecasting frameworks for experimentation, validation, and benchmarking purposes.
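
As an illustration of the forecasting setup, the sketch below splits one irregular record into historical observations and forecasting queries at a chosen cut-off time. The function name `split_history_and_queries`, the 24-hour cut-off, and the toy record are assumptions made for this example and are not the benchmark's settings.

```python
import numpy as np

def split_history_and_queries(times, values, mask, cutoff):
    """Split one record at `cutoff` into history and forecasting queries.

    History is every row observed before the cut-off. Each observed entry at or
    after the cut-off becomes one query (time, variable) with a target value.
    """
    is_history = times < cutoff
    history = (times[is_history], values[is_history], mask[is_history])

    future_times = times[~is_history]
    future_values = values[~is_history]
    future_mask = mask[~is_history]

    # Indices of observed entries in the forecast window.
    row_idx, var_idx = np.nonzero(future_mask)
    query_times = future_times[row_idx]
    targets = future_values[row_idx, var_idx]
    return history, query_times, var_idx, targets

# Toy record with two variables (fabricated); times are hours since admission.
times = np.array([1.0, 10.0, 30.0, 40.0])
values = np.array([[80.0, 0.0], [0.0, 120.0], [85.0, 118.0], [0.0, 121.0]])
mask = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]])

history, query_times, query_vars, targets = split_history_and_queries(
    times, values, mask, cutoff=24.0
)
```

A forecasting model would receive `history` together with the `(query_times, query_vars)` pairs and be scored on how closely its predictions match `targets`.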

Potential Applications

This dataset is valuable for a variety of applications, especially those involving healthcare and other domains where multivariate time series data is often collected at irregular intervals. It can be utilized to:

- Develop and evaluate models that handle irregularly sampled time series.
- Benchmark algorithms designed for time series forecasting, particularly those incorporating machine learning techniques.
- Explore missing data modeling and strategies for coping with varying sampling rates, both of which are common challenges in real-world time series data.

Known Limitations

Only a subset of the complete MIMIC-III dataset has been selected for inclusion, which may limit the representativeness of the data for certain tasks or conditions not covered by the chosen subset.

Related Software and Tools

To facilitate the use of the dataset, we provide code and tools for preprocessing and model experimentation. The key software libraries involved include:

- Python: The core programming language for data manipulation and model implementation.
- PyTorch: Used for deep learning model development and training.
- NumPy: Utilized for numerical data processing and manipulation.

Ethics

This project builds upon previously established datasets, which have been de-identified and approved for credentialed distribution. All data used in this research are sourced from the MIMIC-III Clinical Database [1].

Conflicts of Interest

The authors of this project declare that they have no financial, commercial, legal, or professional relationships with other organizations or individuals that could influence this research. There are no conflicts of interest associated with this project.

References

[1] Johnson A, Pollard T, Mark R. MIMIC-III Clinical Database (version 1.4). PhysioNet. 2016. Available from: https://doi.org/10.13026/C2XW26.
[2] Zhang W, Yin C, Liu H, Zhou X, Xiong H. Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach. In: Proceedings of the Forty-first International Conference on Machine Learning. 2024. Available from: https://openreview.net/forum?id=UZlMXUGI6e.
[3] Biloš M, Sommer J, Rangapuram SS, Januschowski T, Günnemann S. Neural Flows: Efficient Alternative to Neural ODEs. Advances in Neural Information Processing Systems. 2021.