Database Credentialed Access
MIMIC-IV-Ext-CEKG: A Process-Oriented Dataset Derived from MIMIC-IV for Enhanced Clinical Insights
Milad Naeimaei Aali, Felix Mannhardt, Pieter Jelle Toussaint
Published: April 8, 2025. Version: 1.0.0
When using this resource, please cite:
Naeimaei Aali, M., Mannhardt, F., & Toussaint, P. J. (2025). MIMIC-IV-Ext-CEKG: A Process-Oriented Dataset Derived from MIMIC-IV for Enhanced Clinical Insights (version 1.0.0). PhysioNet. https://doi.org/10.13026/qr9d-6t52.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Maintaining a healthy population is essential for improving quality of life and overall societal well-being. One way to achieve this is to improve patients' care pathways, which is particularly vital for patients with multiple chronic conditions, who require well-coordinated care across various medical specialties. Process mining is one approach to analyzing and improving the care pathways of such patients. The Clinical Event Knowledge Graph (CEKG) is a recent process mining framework, introduced for patients with multimorbidity, that facilitates standardized interpretation of care pathways by linking event data to ICD-10 and SNOMED-CT. It also supports storing multi-entity event data in an event graph and analyzing the care pathways of multimorbid patients from multiple perspectives. MIMIC-IV is a dataset that facilitates data analysis in healthcare; however, it is not specialized for process mining and requires extensive preprocessing before it can be used for that purpose. This paper contributes MIMIC-IV-Ext-CEKG, a dataset extracted from MIMIC-IV that facilitates using the CEKG framework and other process mining tasks. It describes the dataset's characteristics and how it is extracted from MIMIC-IV. MIMIC-IV-Ext-CEKG facilitates deploying MIMIC-IV for process mining.
Background
To maintain a healthy population, robust and efficient healthcare services are essential, as they play a crucial role in ensuring timely medical interventions and preventive care for everyone. This need intensifies as healthcare services become increasingly expensive, creating the risk, even for wealthy nations, of being unable to provide perfect healthcare for all [1, 2]. This is particularly true for patients with multi-morbidity, who have multiple chronic conditions simultaneously. The number of such patients is growing due to factors like socio-economic challenges and an aging population [3]. These patients require coordinated care from multiple medical specialties and more resources than ordinary patients [4]. As a result, improving healthcare services for them is essential to achieving effective healthcare for all. One way to contribute towards this goal is to enhance the clinical processes for multi-morbid patients.
A clinical process or care pathway defines the steps involved in diagnosing, treating, managing, and monitoring patients [5]. Enhancing a care pathway provides more efficient healthcare delivery, could lead to better perception from the patient, and improves operational performance within healthcare organizations. Process mining [5, 6] is a technique that can be used to achieve this. Process mining leverages event log data [6] to uncover, monitor, and optimize processes. Each event has a clearly defined label (activity), occurs in a specific sequence based on its timestamp [6], and may be linked to particular entities relevant to the process. Healthcare services are multi-faceted and involve many different entities such as patients, admissions, disorders, caregivers, and so forth. Therefore, one way to gain deeper insights into such multi-faceted care pathways is to use event logs in which events are connected to multiple entities. This is known in process mining as "multi-entity event data". The concept is also referred to as an "object-centric event log" [7] or a "multi-dimensional event log" [8], depending on whether the focus is on concrete physical objects or on a more general notion of entities. A recent approach advocates storing multi-entity event data from clinical processes in a Clinical Event Knowledge Graph (CEKG) [9].
The CEKG facilitates standardized interpretation of care pathways by linking event data to clinical terminologies like ICD-10 and SNOMED-CT, and supports the discovery of care pathway models by storing the event data in a graph representation. CEKG enables the synthesis and examination of data from multiple sources, such as event logs, diagnosis data, ICD codes, and SNOMED-CT concepts. By leveraging an event graph, CEKG aims to improve process mining by using meaningful, semantically driven relationships between different nodes representing entities, events, activities, and more. It allows for answering complex and flexible queries relating to clinical domain concepts. Additionally, supporting hierarchical clinical terminology data, such as SNOMED-CT and ICD-10, enhances data navigation through path-based traversals.
In [9], Python scripts for creating a local prototype of a Clinical Event Knowledge Graph (CEKG) were introduced, and these scripts were later extended into an online tool for building and managing CEKGs in [10]. However, a large, ready-to-use dataset for building such CEKGs is still lacking.
This paper aims to fill this gap by presenting MIMIC-IV-Ext-CEKG, a dataset designed to support the creation of CEKGs and process mining analysis from MIMIC. MIMIC-IV [11, 12, 13] is the most recent iteration of the MIMIC dataset series. It contains de-identified health records of patients admitted to the intensive care units at Beth Israel Deaconess Medical Center in Boston, Massachusetts, covering the years 2008 to 2019. This version builds upon MIMIC-III, improving data quality and extending the dataset to reflect more current ICU practices.
The main goal of introducing MIMIC-IV-Ext-CEKG is to deploy it in process mining, particularly in multi-entity or object-centric process mining, with a focus on building CEKG. Given that the MIMIC-IV dataset is widely used in data mining and machine learning approaches, the MIMIC-IV-Ext-CEKG, being derived from MIMIC-IV, can similarly be applied to data mining and machine learning tasks.
Methods
This section presents how to process and transform raw data from MIMIC-IV into the MIMIC-IV-Ext-CEKG format, ensuring its suitability for constructing Clinical Event Knowledge Graphs (CEKG) and broader analytical applications. The complete extraction process, including relevant queries and scripts, is available in the GitHub repository of the dataset [14].
Step 1: Setting Up Infrastructure and Tools for Dataset Creation
This step involves several sub-steps for setting up the necessary infrastructure to create the dataset, including configuring SQL-based RDBMS (relational database management system) platforms such as Google BigQuery or MariaDB, along with Python, and installing the required packages.
Step 2: Preparation of Input Data
The process of creating the dataset involves four main inputs:
Sub-step 1: MIMIC-IV Modules
We utilized various modules from MIMIC-IV, including MIMICIV_HOSP, MIMICIV_ICU, MIMICIV_DERIVED, MIMICIV_ED, and MIMICIV_NOTE. The tables from these modules were categorized based on the nature of the data and the columns into different categories. In some cases, tables were renamed, split, or joined to ensure proper data alignment. For example, the table hcpcsevents is left joined with d_hcpcs on hcpcs = code to include the required columns.
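The join in the example above can be sketched with Python's built-in sqlite3 module. This is a toy illustration, not the extraction script from the repository [14]: the join condition (hcpcs = code) is taken from the text, while the other column names, the codes, and the description are made up for the sketch.

```python
import sqlite3

# In-memory toy database. Only the join keys (hcpcs, code) come from the text;
# the remaining columns and all values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hcpcsevents (subject_id INTEGER, hadm_id INTEGER, hcpcs TEXT);
    CREATE TABLE d_hcpcs (code TEXT, short_description TEXT);
    INSERT INTO hcpcsevents VALUES (1, 100, 'A0001'), (2, 200, 'Z9999');
    INSERT INTO d_hcpcs VALUES ('A0001', 'Example description');
""")

# LEFT JOIN keeps every event row, even when no dictionary entry matches.
rows = conn.execute("""
    SELECT e.subject_id, e.hadm_id, e.hcpcs, d.short_description
    FROM hcpcsevents AS e
    LEFT JOIN d_hcpcs AS d ON e.hcpcs = d.code
    ORDER BY e.subject_id
""").fetchall()

for row in rows:
    print(row)
```

Because it is a left join, the second event is kept with an empty description instead of being dropped.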
Sub-step 2: OHDSI Athena vocabulary system
The OHDSI Athena vocabulary system [15] offers a unified collection of healthcare vocabularies, including ICD-10, SNOMED-CT, and RxNorm, to standardize and facilitate large-scale healthcare data analysis across diverse sources. By integrating with the OMOP Common Data Model, Athena supports consistent, cross-referenced terminology management, enhancing both research and clinical insights [15]. This system was used to map and integrate data from diverse healthcare systems into a unified format at various sub-steps of the dataset creation process. In particular, we created a mapping from ICD codes to SNOMED-CT using the OHDSI platform.
Sub-step 3: SNOMED-CT Concepts
SNOMED-CT is an internationally recognized healthcare terminology that provides a systematic structure for encoding and organizing medical information, encompassing areas such as diseases, symptoms, procedures, clinical findings, and various other healthcare concepts [16]. To enrich the dataset with standardized clinical terminologies, we imported the full versions of the Description, Relationship, and TextDefinition files from the SNOMED-CT RF2 database [17]. These files were preprocessed to create two key tables: one for SNOMED-CT Description and another for SNOMED-CT Relationship, ensuring that clinical concepts were accurately linked and represented.
Sub-step 4: SNOMED-CT Browser
Although not directly imported, the SNOMED-CT Browser [18] was used throughout the dataset creation process to access and verify terminology data. This tool was instrumental in manually checking and confirming the correct mapping of clinical concepts when necessary.
Step 3: Creation of Tables Related to Events
In this subsection, we discuss the sub-steps required for creating all tables related to events.
Sub-step 1: Preprocessing of Categorized Tables
We used the categorized tables from the MIMIC-IV dataset for the following tasks independently:
Task 1: Creating Identifier-Based Time Range (IDTR) Tables. We created tables for every possible combination of MIMIC-IV identifiers, including subject_id, hadm_id, stay_id, and transfer_id. These identifiers function as unique keys to connect and structure data across multiple tables, facilitating in-depth, patient-level analysis while ensuring privacy is maintained. Each table records the maximum and minimum timestamps, as well as the latest and earliest dates, for each unique instance of these identifier combinations.
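A minimal sketch of one such IDTR table, for the (subject_id, hadm_id) combination, is shown below using sqlite3. The source table name (events), its charttime column, and the output column names are assumptions for illustration; the actual queries are in the repository [14].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table pooling timestamps per identifier pair.
    CREATE TABLE events (subject_id INTEGER, hadm_id INTEGER, charttime TEXT);
    INSERT INTO events VALUES
        (1, 100, '2150-01-01 08:00:00'),
        (1, 100, '2150-01-03 17:30:00'),
        (1, 101, '2151-06-10 09:15:00');
""")

# One row per identifier combination with its time range and date range.
# ISO-formatted text timestamps sort correctly with MIN/MAX.
idtr_sh = conn.execute("""
    SELECT subject_id,
           hadm_id,
           MIN(charttime)       AS min_timestamp,
           MAX(charttime)       AS max_timestamp,
           date(MIN(charttime)) AS earliest_date,
           date(MAX(charttime)) AS latest_date
    FROM events
    GROUP BY subject_id, hadm_id
    ORDER BY subject_id, hadm_id
""").fetchall()

for row in idtr_sh:
    print(row)
```

The same GROUP BY pattern would be repeated for the other identifier combinations by changing the grouping columns.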
Task 2: Event Log Source Tables. Since event logs are tables that require at least a timestamp, an activity, and entities, we need to ensure that tables representing event logs include a column for the timestamp, a column for the activity, and columns for entities. We identified three types of tables within the Categorized Tables.
- Timestamp-based Tables: Categorized tables that contain a timestamp column.
- Date-based Tables: Categorized tables that contain a date column.
- Measurement-based Tables: Categorized tables that include columns for measurements, such as the first_day measurement of specific metrics. Examples include columns like first_day_bg, first_day_gcs, first_day_height, first_day_lab, and first_day_rrt.
Sub-step 2: Creating IDTR Enriched Tables
The objective of this sub-step is to use the IDTR tables, created in Sub-step 1, to convert the three identified types of tables within the Categorized Tables (Timestamp-based Tables, Date-based Tables, and Measurement-based Tables) into Event Log Finalized Tables. Event Log Finalized Tables are tables that include timestamps and two entities: subject_id and hadm_id.
- For Timestamp-based tables:
  - If they already contain both subject_id and hadm_id entities, no further action is needed.
  - If a table contains neither or only one of these identifiers, we perform a left join with an IDTR table that has the maximum number of common identifiers with that Timestamp-based table. If an identifier in a Timestamp-based table matches the IDTR table, and the timestamp falls between the minimum and maximum timestamps in the IDTR table, we can then retrieve the missing identifiers from the IDTR tables.
- For Date-based tables:
  - First, we perform a left join on the date-based tables with the SH table, which we created in Sub-step 1 as one of the IDTR tables, under the following conditions:
    - If they already contain both subject_id and hadm_id entities, since the time part is missing in the date_Column, we first create a new table with only distinct combinations of subject_id, hadm_id, and date_Column. We then perform a left join with the SH table.
    - If they have only the hadm_id entity and lack both the subject_id and the time part of the date_Column, we first create a new table with only distinct pairs of hadm_id and date_Column. Then, we perform a left join with the SH table.
    - If they have only the subject_id entity and lack both the hadm_id and the time part of the date_Column, we first create a new table with only distinct pairs of subject_id and date_Column. We then perform a left join with the SH table, matching on subject_id and ensuring that the date_Column falls within a date range from the SH table. This step does not always result in a match, leaving some fields blank. Next, we identify how many times a distinct subject_id and date_Column pair matches multiple hadm_id entries. If there is more than one match, it is considered a duplicate. We then separate the dataset into two tables: one for cases with a single match and another for cases with multiple matches. For cases with a single match, no further action is needed. However, for cases with multiple matches, we assign a ranking number to each match, indicating the encounter number for each distinct combination of subject_id and date_Column. We retain only the highest-ranked match and discard the others. Afterward, we union these two tables back together. For any records still lacking a hadm_id, we assign a default value, such as 0.
  - Finally, for all of these conditions, based on whether the minimum and maximum timestamps are missing, we may have two types of records. For records without minimum and maximum timestamps, we convert date_Column into a timestamp type. For records with minimum and maximum timestamps, we calculate a new timestamp. If adding 12 hours to the date_Column falls within the range, we use it; otherwise, we use the midpoint between the minimum and maximum times for the timestamp. Finally, we merge this final table back with the original table to add the complete timestamp information.
- For measurement-based tables:
- We perform a left join with an IDTR table that has the maximum number of common identifiers with the measurement-based table. Next, we add 24 hours to the minimum timestamp or date from the IDTR table and insert this as a new column in the measurement-based table.
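The range lookup for timestamp-based tables and the timestamp rule for date-based tables described above can be sketched as follows. The function names, the tuple layout of the IDTR rows, and the inclusive range bounds are assumptions made for this sketch; the actual implementation is SQL-based and available in the repository [14].

```python
from datetime import date, datetime, time, timedelta


def find_hadm_id(subject_id, timestamp, idtr_rows):
    """Return the hadm_id whose time range contains the timestamp, else None.

    idtr_rows is a list of (subject_id, hadm_id, min_ts, max_ts) tuples
    (hypothetical layout). Bounds are treated as inclusive (assumption).
    """
    for row_subject, hadm_id, min_ts, max_ts in idtr_rows:
        if row_subject == subject_id and min_ts <= timestamp <= max_ts:
            return hadm_id
    return None


def impute_timestamp(day: date, min_ts=None, max_ts=None) -> datetime:
    """Turn a date-only value into a timestamp following the rule in the text."""
    midnight = datetime.combine(day, time.min)
    if min_ts is None or max_ts is None:
        # No known range: plain conversion of the date to a timestamp.
        return midnight
    noon = midnight + timedelta(hours=12)
    if min_ts <= noon <= max_ts:
        # Date plus 12 hours falls inside the range, so use it.
        return noon
    # Otherwise fall back to the midpoint of the known range.
    return min_ts + (max_ts - min_ts) / 2


idtr = [(1, 100, datetime(2150, 1, 1, 8), datetime(2150, 1, 3, 17, 30))]
print(find_hadm_id(1, datetime(2150, 1, 2, 10), idtr))
print(impute_timestamp(date(2150, 1, 2), idtr[0][2], idtr[0][3]))
```

The fallback to the midpoint keeps the imputed timestamp inside the known range when the range is too short to contain noon of that day.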
Sub-step 3: Creating Cleaned Event Log Tables
In this step, we focus on data cleaning for the tables created in previous steps. This process improves comparability, searchability, and consistency across the tables. Additionally, it facilitates clearer connections between various tables. Two main data cleaning tasks are performed in this step:
- NDC (National Drug Code): Initially, pharmacy data from the tables pharmacy, eMAR, and prescriptions is merged, then unified based on their shared columns through a series of left joins. Next, the NDC codes are exported from these tables, resulting in a total of 5732 codes. Using web-based data crawling techniques with platforms like NDC List [19], HIPAASpace [20], and FDA Report [21], this number is expanded to 7376 by incorporating additional package details. Out of these codes, 6265 were verified with accurate product and package information, while 1087 required manual searches using resources such as GSN NDC, RxNorm, and proprietary databases. A small set of 24 codes remained unverified but were corrected using additional information from the related tables.
- Lab tests related to Chemistry, Hematology, and Blood Gas: We consolidated the Chemistry, Hematology, and Blood_Gas tables into a single table that unifies the different categories of chemistry, hematology, and blood gas data. After this consolidation, key columns such as category, fluid, and label were exported for further refinement. The next phase involved manually refining the data by modifying the fluid descriptions into more specific categories, such as classifying blood into subcategories like Whole Peripheral Blood Sample analysis and Plasma Peripheral Blood Sample analysis. Additionally, the label column was enhanced by adding short synonyms, units of measurement, and defining normal ranges for the values, making the data more user-friendly and comprehensive for analysis. Once the refinement was complete, the processed dataset was imported back into the project.
Sub-step 4: Table Segmentation and Merging
In this step, we segment the tables based on the type of timestamp column and the values related to the nature of the data in the tables. For example, the POE table was converted into ten tables based on different values in its field_name column. Additionally, some tables are merged together. For example, the first_day_bg_time and first_day_bg_art_time tables are merged by adding a column to the combined table, with a value of "NA" for first_day_bg_time records and "ART" for first_day_bg_art_time records.
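Both operations, merging two tables with a source tag and splitting one table by a column value, can be sketched in plain Python. The tag column name (specimen), the helper names, and the sample rows are hypothetical; only the "NA" and "ART" tag values come from the text.

```python
def merge_with_source_tag(bg_rows, bg_art_rows, tag_column="specimen"):
    """Union two row lists, tagging each row with the table it came from."""
    merged = [{**row, tag_column: "NA"} for row in bg_rows]
    merged += [{**row, tag_column: "ART"} for row in bg_art_rows]
    return merged


def split_by_value(rows, column):
    """Segment one table into several, one per distinct value of a column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    return groups


# Invented sample rows standing in for the two first-day blood gas tables.
bg = [{"subject_id": 1, "po2": 295}]
bg_art = [{"subject_id": 1, "po2": 412}]

merged = merge_with_source_tag(bg, bg_art)
by_tag = split_by_value(merged, "specimen")
print(merged)
print(sorted(by_tag))
```

The same split_by_value helper illustrates how a table such as POE could be segmented on its field_name column.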
Sub-step 5: Integration of Activities
In this step, we utilize the preprocessed data from previous steps to identify and catalog clinical activities. We have identified 95 distinct activities, such as Admission_Order, Admission_to_Careunit, antibiotic_administrations, Arterial_blood_sample_analysis_hemoglobin, Coagulation, BP_Measurement, Consultation, and others. Some activities are obtained by merging tables, some by splitting, and some without any changes to the tables. The results are organized into two tables for each activity instance.
The first table, the Activity Instances table, records each occurrence of an activity along with its specifics and contains the following columns:
- subject_id.
- hadm_id.
- Timestamps: The specific time at which the activity was recorded.
- Activity: The name of the activity, such as Blood Gas Test or BP measurement.
- Activity_Synonym: Contains abbreviations of activity labels, e.g., BGT for Blood Gas Test.
- Activity_Properties_ID_aggregation: A unique foreign key ID for each distinct feature and its value.
- Activity_Value_ID: A unique foreign key identifier for each distinct activity instance based on its combined features and values.
The second table, the Activity Properties table, details the properties associated with each activity instance and contains the following columns:
- Activity_Properties_ID: Links back to the first table as a foreign key.
- Activity.
- Activity_Synonym.
- featureName: The attribute name of the feature.
- featureValue: The numeric or descriptive value.
Sub-step 6: Creating the Final Tables
In this sub-step, we create all the tables related to the event log. These tables are listed below:
- Event_Log: This table contains the main event log.
- ActivityAttributes: This table holds the event activities along with their associated activity feature tables.
- ActivitiesDomain: This table includes the domain of event activities, which was manually curated.
- CNM3: This table is part of the Constrained Node Mapping (CNM) used to construct the CEKG. CNM functions help build relationships between nodes with different labels, and they can be derived from data, domain knowledge, expert input, documents, and sometimes machine learning models [9]. The CNM3 table manually maps event activities to SNOMED-CT concepts using the SNOMED-CT Browser.
- CNM4_1: Another CNM table that manually maps event activities to activity domains.
- CNM4_2: This CNM table maps the domain of activities to SNOMED-CT concepts manually, also using the SNOMED-CT Browser.
Step 4: Creation of Tables Related to ICD Codes
Sub-step 1: ICD Code Correction
The MIMIC-IV dataset contains several issues related to its ICD codes, including a single column with 25K non-comparable codes, incorrect formatting, a mix of outdated ICD-9 and ICD-10 codes, and inconsistent levels of ICD code granularity. To address these issues, the goal is to create two columns: one with 1,760 high-level ICD codes and another with 19K comparable ICD codes. The ICD codes are reformatted for accuracy (e.g., “253.5” for Diabetes insipidus) and converted so that only ICD-10 codes are used, with a structure that supports hierarchical relationships (e.g., “E11” for type 2 diabetes and “E11.329” for type 2 diabetes with mild nonproliferative diabetic retinopathy without macular edema). The process involves exporting all ICD codes from MIMIC-IV and correcting them through automated data crawling and manual comparison with the ICD10Data website [22].
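The formatting part of this correction can be illustrated with a small sketch. It assumes that codes arrive without a decimal point and only restores the point and derives the high-level (three-character) code; it does not convert ICD-9 to ICD-10, which in the described process relies on data crawling and manual comparison. The function names are hypothetical.

```python
def format_icd(code: str, version: int) -> str:
    """Insert the decimal point into an undotted ICD code.

    ICD-10 codes and most ICD-9 codes place the point after the third
    character; ICD-9 external-cause codes (E000-E999) place it after the
    fourth.
    """
    code = code.strip().upper().replace(".", "")
    prefix_len = 4 if version == 9 and code.startswith("E") else 3
    if len(code) <= prefix_len:
        return code  # category-level code, nothing after the point
    return f"{code[:prefix_len]}.{code[prefix_len:]}"


def high_level(code: str) -> str:
    """Return the category part of a dotted code, e.g. E11 for E11.329."""
    return code.split(".")[0]


print(format_icd("2535", 9))
print(format_icd("E11329", 10))
print(high_level("E11.329"))
```

Keeping the category part as its own column is what makes a detailed code such as E11.329 comparable with its parent E11.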
Sub-step 2: Implementing ICD Code Corrections
Using the ICD correction mapper that we built in sub-step 1, we correct the ICD codes of the MIMIC-IV.
Sub-step 3: Creating the Final Tables
- ICD: This is the final ICD table with columns icd_code, icd_version, and icd_code_title that we need for the CEKG framework.
- CNM1: This is a CNM that maps Disorders_ID to ICD_Codes. The CEKG framework assumes that, in some cases, disorders may not be mapped to an ICD code, making this CNM essential for the framework to function. Since disorders in MIMIC-IV are coded in ICD, we created hypothetical IDs for each disorder.
- CNM2: This CNM maps the ICD code to SNOMED-CT. For mapping ICD codes to SNOMED-CT, we use the OHDSI mapper built in the input section. However, we had to manually map 8 out of 1760 ICD codes to SNOMED-CT.
- CNM5 Tables: In the CEKG framework, CNM5 maps events (using the Activity_Instance_ID) to new entities, such as patient disorders. However, since the MIMIC-IV dataset lacks direct relationships between events and disorders, this mapping cannot be performed directly. To overcome this, one approach is to leverage domain knowledge and expert input, while another option is to apply machine learning algorithms. In this sub-step, we created tables to facilitate the creation of CNM5.
Step 5: Creation of Tables Related to Entity Activities
Sub-step 1: Attributes Preparation
Each entity can have several attributes, which can either be used as entities themselves or solely as attributes. For example, age, gender, and admission sequence are attributes of the Patient entity, as each patient has an age, a gender, and an admission sequence. Additionally, multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity are attributes of the Admission entity. Similarly, each disorder is an attribute of multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity. These attributes can also have relationships. In this sub-step, we prepare these attributes and their relationships.
Sub-step 2: Creating the Final Tables
- EntitiesAttributes: We create the final entities attributes table using the data prepared in the previous sub-step.
- EntitiesAttributeRel: We create the final entities attributes relationship table using the data prepared in the previous sub-step.
Step 6: Creation of Tables Related to SNOMED-CT Codes
Sub-step 1: Refining SNOMED-CT Data
In this sub-step, we refine the integration of SNOMED-CT data by focusing on the descriptions and relationships in the input step. We specifically consider those SNOMED-CT codes that were mapped from activities, domains, and ICDs, including all related SNOMED-CT codes up to the root concept. This process deepens the dataset by linking each SNOMED-CT code used or mapped in the project to its root concept through hierarchical relationships. This linkage is essential for structured clinical data analysis and ensures that only the necessary SNOMED-CT codes are imported, avoiding the need to import the entire SNOMED-CT system. Additionally, in this sub-step, we add a column labeled level, which is an index used to show the distance of a SNOMED-CT ID from the root SNOMED-CT ID (138875005). Sometimes, there are different paths to navigate from a SNOMED-CT ID to the root SNOMED-CT ID, so it may have more than one level. This index facilitates and enhances the speed of queries.
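The idea of a concept having more than one level can be illustrated with a short sketch that collects every path length from a concept to the root. Only the root ID (138875005) is taken from the text; the concept IDs C1 to C3, the parents mapping, and the function name are hypothetical, and the sketch assumes the parent relationships form an acyclic hierarchy.

```python
ROOT = "138875005"  # SNOMED-CT root concept ID, as given in the text


def concept_levels(concept_id, parents, _cache=None):
    """Return the set of distances from concept_id to the root.

    parents maps each concept ID to its direct parent IDs. A concept that
    reaches the root through paths of different lengths gets several levels.
    """
    if _cache is None:
        _cache = {}
    if concept_id == ROOT:
        return {0}
    if concept_id not in _cache:
        _cache[concept_id] = {
            level + 1
            for parent in parents.get(concept_id, ())
            for level in concept_levels(parent, parents, _cache)
        }
    return _cache[concept_id]


# Hypothetical toy hierarchy: C3 has two parents at different depths.
parents = {
    "C1": [ROOT],
    "C2": ["C1"],
    "C3": ["C1", "C2"],
}

for concept in ("C1", "C2", "C3"):
    print(concept, sorted(concept_levels(concept, parents)))
```

Here C3 reaches the root both through C1 and through C2, so it receives two levels, which corresponds to the multi-level case described above.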
Sub-step 2: Creating the Final Tables
- SCT Node: We create the final SNOMED-CT concept table using the data prepared in the previous sub-step.
- SCT Rel: We create the final table that shows the relationship between SNOMED-CT concepts using the data prepared in the previous sub-step.
Step 7: Creation of Cluster Reference Tables
In some use cases, we do not need to use the entire dataset; we need to select part of the dataset that satisfies our need. In this step, we create two tables that may facilitate clustering of the dataset.
- ClusterReference1: This table includes patients with their number of admissions, number of disorders, gender, anchor age, and life status.
- ClusterReference2: This table includes patients with their disorders (using ICD codes), number of disorders, and number of admissions.
Step 8: Publishing the Dataset
In this step, we rename some tables and their columns for consistency and matching to the CEKG framework. For more detailed information about the final dataset tables, including how to access them, the meaning of each table and column, and their use cases, please refer to the next section.
Data Description
MIMIC-IV-Ext-CEKG contains one module, which consists of 19 tables extracted from the MIMIC-IV dataset. Each of these tables is described below.
B_EventLog
This table is the event log, which can be either a single-entity or multi-entity event log. Entities represent distinct existences. Sometimes, the terms “case notion,” “case,” “object,” and “dimensional” are used interchangeably. The term "multi-entity event log" is sometimes considered equivalent to “object-centric event log” or “multi-dimensional event log.” In the multi-entity event log definition, each entity is defined with its origin and IDs. The table contains several columns:
- Event_ID: Contains the ID of each event.
- Timestamp: Contains the time and date of activities.
- Activity: Consists of the activity label of the event.
- Activity_Synonym: Contains abbreviations of activity labels. For example, BGT for Blood Gas Test.
- Activity_Attributes_ID: A unique foreign key ID for each distinct feature and value. For example:
  - po2=295 → Activity_Attributes_ID=1
  - lactate=3.23 → Activity_Attributes_ID=2
  - Blood pressure=137/79 → Activity_Attributes_ID=3
  - po2=412 → Activity_Attributes_ID=4 (same feature but different value, so a different ID)
  - lactate=0.73 → Activity_Attributes_ID=5 (same feature but different value, so a different ID)
  - po2=295 → Activity_Attributes_ID=1 (same feature and same value, so the same ID)
  - lactate=3.23 → Activity_Attributes_ID=2 (same feature and same value, so the same ID)
- Activity_Instance_ID: A unique foreign key identifier for each distinct activity, considering its features and values. For example:
  - First event: Blood Gas Test: po2=295, lactate=3.23 → Activity_Instance_ID=1
  - Second event: BP_measurement: Blood pressure=137/79 → Activity_Instance_ID=2 (different activity from the first event)
  - Third event: Blood Gas Test: po2=412, lactate=0.73 → Activity_Instance_ID=3 (same activity as the first event but with different feature values)
  - Fourth event: Blood Gas Test: po2=295, lactate=3.23 → Activity_Instance_ID=1 (same activity as the first event with the same feature values)
- Entity1_origin and Entity1_ID: Contain the origin and ID of each instance of the first entity representing the ID of each patient, equivalent to subject_id in MIMIC.
- Entity2_origin and Entity2_ID: Contain the origin and ID of each instance of the second entity representing the ID of each admission, equivalent to hadm_id in MIMIC.
- Entity3_origin and Entity3_ID: Contain the origin and ID of each instance of the third entity representing the ID of each outpatient encounter. This ID is newly created for each distinct outpatient encounter for a patient.
- Entity4_origin and Entity4_ID: Contain the origin and ID of each instance of the fourth entity representing the ID of each Admission_Sequence. This ID is consistent for the patient; for example, for all patients, the first admission is 1, the second is 2, and so on.
- Entity5_origin and Entity5_ID: Contain the origin and ID of each instance of the fifth entity representing the ID of each Outpatient_Sequence. This ID is consistent for the patient; for example, for all patients, the first outpatient visit is 1, the second is 2, and so on.
- temp_patient_id and temp_encounter_id: The temp_patient_id stores the patient IDs, while the temp_encounter_id contains both admission and outpatient IDs. These columns are used when we want to analyze a subset of the data, allowing us to select specific patients, admission IDs, or outpatient IDs, typically after clustering the patients. It is important to note that, whether we use a portion of the table or the entire dataset, these two columns should be excluded from the final event log when building the CEKG.
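The numbering used in the Activity_Attributes_ID and Activity_Instance_ID examples can be reproduced with a short sketch. The function name and the in-memory event representation are assumptions; the sketch only illustrates how identical feature/value pairs and identical activity instances share an ID, not how the IDs are stored in the published tables.

```python
def assign_ids(events):
    """Assign attribute IDs and instance IDs in order of first appearance.

    events is a list of (activity, [(feature, value), ...]) tuples.
    Returns one (instance_id, [attribute_ids]) entry per event, plus the
    mapping from (feature, value) pairs to attribute IDs.
    """
    attribute_ids = {}
    instance_ids = {}
    assigned = []
    for activity, attributes in events:
        ids_for_event = []
        for pair in attributes:
            # Same feature and same value -> same ID; otherwise a new ID.
            if pair not in attribute_ids:
                attribute_ids[pair] = len(attribute_ids) + 1
            ids_for_event.append(attribute_ids[pair])
        # An instance is identified by its activity and all its feature values.
        instance_key = (activity, tuple(sorted(attributes)))
        if instance_key not in instance_ids:
            instance_ids[instance_key] = len(instance_ids) + 1
        assigned.append((instance_ids[instance_key], ids_for_event))
    return assigned, attribute_ids


# The four events from the example above.
events = [
    ("Blood Gas Test", [("po2", "295"), ("lactate", "3.23")]),
    ("BP_measurement", [("Blood pressure", "137/79")]),
    ("Blood Gas Test", [("po2", "412"), ("lactate", "0.73")]),
    ("Blood Gas Test", [("po2", "295"), ("lactate", "3.23")]),
]

assigned, attribute_ids = assign_ids(events)
for (activity, _), (instance_id, attr_ids) in zip(events, assigned):
    print(activity, instance_id, attr_ids)
```

The fourth event reuses both the attribute IDs and the instance ID of the first event, matching the example in the column descriptions.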
C_EntitiesAttributes
This table contains the attributes of our entities. Each entity can have several attributes, which can either be used as entities themselves or only as attributes.
For example, age, gender, and admission sequence are attributes of the Patient entity, as each patient has an age, a gender, and an admission sequence. Additionally, multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity are attributes of the Admission entity. Similarly, each disorder is an attribute of multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity.
- Origin: This column shows the type of attribute.
- ID: This column shows the ID of the attribute.
- Name: This column contains a mix of synonyms for origins and IDs.
- Value: This column contains the value of the attribute, if it exists.
- Category: This column has the value "absolute" for all attributes that are only used for data analysis.
- temp_patient_id and temp_encounter_id: These serve the same purpose as in the B_EventLog table.
D_EntitiesAttributeRel
This table shows the relationship between entities and their attributes.
- Origin1: This column contains the origin of the first entity or entity attribute.
- ID1: This column contains the ID of the first entity or entity attribute.
- Origin2: This column contains the origin of the second entity or entity attribute.
- ID2: This column contains the ID of the second entity or entity attribute.
- temp_patient_id and temp_encounter_id: These serve the same purpose as in the B_EventLog table.
E_ActivityAttributes
This table shows the activity attributes.
- Activity_Attributes_ID: This column contains a foreign key that relates to the event log table.
- Activity: This column shows the activity, corresponding to the Activity column in the event log table.
- Activity_Synonym: This column shows the synonym for the activity, with a corresponding column of the same name in the event log table.
- Activity_Attribute: This column contains the attributes.
- Activity_Attribute_Value: This column contains the values of the attributes.
- temp_patient_id and temp_encounter_id: These serve the same purpose as in the B_EventLog table.
F_ActivitiesDomain
The F_ActivitiesDomain table contains the activity domains and consists of only one column.
G_ICD
This table contains a subset of our ICD codes.
- ICD_Origin: This column contains values for all ICD entries. It is an auxiliary column used solely for data analysis.
- ICD_Code: This column shows the ICD codes.
- ICD_Version: This column shows the version of the ICD codes.
- ICD_Code_Title: This column shows the titles of the ICD codes.
H_SCT_Node
This table contains a subset of our SNOMED-CT concept codes.
- SCT_ID: This column contains the SNOMED-CT ID.
- SCT_Code: This column is an auxiliary column used in this table, not related to SNOMED-CT terminology.
- SCT_DescriptionA_Type1: This column shows the description of SNOMED-CT IDs with their semantic tag in parentheses.
- SCT_DescriptionA_Type2: This column shows the description of SNOMED-CT IDs without their semantic tag in parentheses.
- SCT_DescriptionB: This column shows another description of SNOMED-CT IDs, which exists only for some of them.
- SCT_Semantic_Tags: This column contains the semantic tags of SNOMED-CT IDs.
- SCT_Type: This column categorizes SNOMED-CT into three categories: root (only one ID, 138875005), top-level concept (we have 18 SNOMED-CTs), and concept (all other IDs besides root and top-level concepts).
- SCT_Level: This index shows the distance of a SNOMED-CT ID from the root SNOMED-CT ID (138875005). A SNOMED-CT ID may have more than one level.
H_SCT_REL
This table shows the relationships between SNOMED-CT concepts.
SCT_ID_1:
The ID of the first SNOMED-CT concept node.
SCT_Code_1:
The code of the first SNOMED-CT concept node.
SCT_ID_2:
The ID of the second SNOMED-CT concept node.
SCT_Code_2:
The code of the second SNOMED-CT concept node.
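The remark that a SNOMED-CT ID may have more than one level follows from the hierarchy being a directed acyclic graph: a concept with several parents can be reached from the root by paths of different lengths. The following is a minimal Python sketch of that idea, not the authors' procedure. It assumes that each H_SCT_REL row points from a child (SCT_ID_1) to its parent (SCT_ID_2); that direction is an assumption and should be checked against the data. The IDs "A" and "B" are placeholders, not real SNOMED-CT concepts, and `levels_from_root` is a hypothetical helper name.

```python
from collections import defaultdict

ROOT = "138875005"  # root SNOMED-CT concept ID, as given in the table description


def levels_from_root(relations, root=ROOT, max_depth=64):
    """Return {concept_id: set of path lengths from the root}.

    `relations` is an iterable of (child_id, parent_id) pairs. Reading
    SCT_ID_1 as the child and SCT_ID_2 as the parent is an assumption.
    `max_depth` is a safety stop in case the input contains a cycle.
    """
    children = defaultdict(set)
    for child, parent in relations:
        children[parent].add(child)

    levels = defaultdict(set)
    levels[root].add(0)
    frontier = {root}
    depth = 0
    # Walk the hierarchy one level at a time, recording every depth at
    # which a concept is reached.
    while frontier and depth < max_depth:
        depth += 1
        next_frontier = set()
        for node in frontier:
            for child in children[node]:
                levels[child].add(depth)
                next_frontier.add(child)
        frontier = next_frontier
    return dict(levels)


# Toy hierarchy: "B" is a direct child of the root and also a child of "A".
toy_relations = [("A", ROOT), ("B", "A"), ("B", ROOT)]
toy_levels = levels_from_root(toy_relations)
```

In this toy hierarchy, "B" is reached at depth 1 (directly under the root) and at depth 2 (through "A"), so it carries two levels.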
I_CNM1
This table shows the constrained node mappings derived from the MIMIC-IV dataset, which relate each Disorder_ID (an attribute of multimorbidity) to ICD codes.
Disorder_ID:
This column shows the disorder attribute identifier.
ICD_Code:
This column contains the ICD code.
J_CNM2
This table shows the constrained node mappings derived from "OHDSI Athena" for relating ICD codes to SNOMED-CT.
ICD_Code:
This column contains the ICD codes.
SCT_ID:
This column contains the SNOMED-CT IDs.
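Because I_CNM1 and J_CNM2 share the ICD_Code column, they can be chained to go from a disorder attribute to SNOMED-CT concepts. The sketch below illustrates this with plain Python rows standing in for the two tables. All row values are invented placeholders, `disorder_to_sct` is a hypothetical helper, and the assumption that ICD_Code is formatted identically in both tables should be verified before relying on the join.

```python
from collections import defaultdict


def disorder_to_sct(i_cnm1_rows, j_cnm2_rows):
    """Map each Disorder_ID to the set of SCT_IDs reachable through ICD_Code."""
    icd_to_sct = defaultdict(set)
    for row in j_cnm2_rows:
        icd_to_sct[row["ICD_Code"]].add(row["SCT_ID"])

    result = defaultdict(set)
    for row in i_cnm1_rows:
        # ICD codes with no entry in J_CNM2 contribute nothing, but the
        # disorder still appears in the result with an empty set.
        result[row["Disorder_ID"]] |= icd_to_sct.get(row["ICD_Code"], set())
    return dict(result)


# Placeholder rows shaped like I_CNM1 and J_CNM2 (values are not real codes).
i_cnm1 = [
    {"Disorder_ID": "D1", "ICD_Code": "ICD_X"},
    {"Disorder_ID": "D1", "ICD_Code": "ICD_Y"},
    {"Disorder_ID": "D2", "ICD_Code": "ICD_Z"},
]
j_cnm2 = [
    {"ICD_Code": "ICD_X", "SCT_ID": "S1"},
    {"ICD_Code": "ICD_Y", "SCT_ID": "S2"},
    {"ICD_Code": "ICD_Y", "SCT_ID": "S3"},
]
mapping = disorder_to_sct(i_cnm1, j_cnm2)
```

Rows in this shape can be produced with `csv.DictReader` if the tables are read from CSV files. Disorders whose ICD codes have no SNOMED-CT mapping, such as "D2" here, end up with an empty set, which makes unmapped codes easy to spot.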
K_CNM3
This table shows the constrained node mappings, derived through manual search, that relate activities to SNOMED-CT concepts.
Activity:
This column shows the activity, corresponding to the "Activity" column in the event log table.
Activity_Synonym:
This column shows the synonym for the activity, with a corresponding column of the same name in the event log table.
SCT_ID:
This column contains the SNOMED-CT IDs.
SCT_Code:
This column contains the SNOMED-CT codes.
L_CNM4_1
This table shows the constrained node mappings, derived through manual search, that relate activities to domains.
Activity:
This column shows the activity, corresponding to the Activity column in the event log table.
Activity_Synonym:
This column shows the synonym for the activity, with a corresponding column of the same name in the event log table.
Activity_Domain:
This column shows the domain of activities.
L_CNM4_2
This table shows the constrained node mappings, derived through manual search, that relate the domains of activities to SNOMED-CT concepts.
Activity_Domain:
This column shows the domain of activities.
SCT_ID:
This column contains the SNOMED-CT IDs.
SCT_Code:
This column contains the SNOMED-CT codes.
M_CNM5_Activity_Instance_ID
This table contains a list of Activity_Instance_ID values.
Activity_Instance_ID:
This column contains the activity instance identifiers. This foreign key can be linked to the event log table.
M_CNM5_Activity_Instance_ID_with_features_Entities
This table contains the features related to Activity_Instance_ID.
Activity_Instance_ID:
This column contains the activity instance identifiers. This foreign key can be linked to the event log table.
Activity:
This column shows the activity, corresponding to the "Activity" column in the event log table.
Activity_Synonym:
This column shows the synonym for the activity, corresponding to the "Activity_Synonym" column in the event log table.
Activity_Attribute:
This column contains the attributes.
Activity_Attribute_Value:
This column contains the values of the attributes.
Entity1_ID:
Contains the ID of each patient, equivalent to subject_id in MIMIC.
Entity2_ID:
Contains the ID of each admission, equivalent to hadm_id in MIMIC.
M_CNM5_Activity_Instance_ID_with_class_Entities
This table contains the classes related to Activity_Instance_ID.
Activity_Instance_ID:
This column contains the activity instance identifiers. This foreign key can be linked to the event log table.
Disorder_Name:
This column contains the name of disorder attributes.
Entity1_ID:
Contains the ID of each patient, equivalent to subject_id in MIMIC.
Entity2_ID:
Contains the ID of each admission, equivalent to hadm_id in MIMIC.
M_CNM5_class
This table shows the relationship between Disorder_Name and Disorder_ID.
Disorder_Name:
This column contains the name of disorder attributes.
Disorder_ID:
This column contains the identifiers of disorder attributes.
N_ClusterReference1
This is the first table that can be used for clustering patients.
temp_patient_id:
It stores the patient IDs.
Entity1_ID:
Contains the ID of each patient, equivalent to subject_id in MIMIC.
Morbid_num:
Number of disorders across all admissions.
Admission_num:
Number of admissions.
gender:
Gender of patients.
anchor_age:
Anchor age of patients.
dod:
Mortality status of patients (whether they died or not).
N_ClusterReference2
This is the second table that can be used for clustering patients.
temp_patient_id:
It stores the patient IDs.
temp_encounter_id:
It can contain both admission and outpatient IDs.
Entity1_ID:
Contains the ID of each patient, equivalent to subject_id in MIMIC.
Entity2_ID:
Contains the ID of each admission, equivalent to hadm_id in MIMIC.
ICD10_Code:
This column shows the ICD-10 codes.
ICD10_Code_Root:
This column shows the root concept of the ICD-10 codes.
ICD10_Code_title:
This column shows the titles of the ICD-10 codes.
ICD10_Code_Root_title:
This column shows the titles of the root concept of the ICD-10 codes.
Morbid_num:
Number of disorders across all admissions.
Admission_num:
Number of admissions.
Usage Notes
The MIMIC-IV-Ext-CEKG event log, generated in this study, is derived from various modules of the MIMIC-IV dataset. Therefore, authorized access is essential to use the MIMIC-IV-Ext-CEKG dataset.
Although this dataset is specifically designed for constructing clinical event knowledge graphs and working with tools developed for that purpose, it can also be applied to a wide range of process mining tools. For instance, the B_EventLog and E_ActivityAttributes tables can be employed in any process mining or data mining tasks. Moreover, since the MIMIC-IV dataset is extensively utilized in data mining and machine learning applications, the MIMIC-IV-Ext-CEKG, being derived from MIMIC-IV, can likewise be employed for data mining and machine learning tasks.
However, the possibility of training machine learning models on MIMIC-IV-Ext-CEKG has not yet been examined. The tables M_CNM5_Activity_Instance_ID, M_CNM5_Activity_Instance_ID_with_features_Entities, M_CNM5_Activity_Instance_ID_with_class_Entities, and M_CNM5_class were created to facilitate this. Additionally, these tables can be adapted for training time series analysis models by incorporating timestamps from the B_EventLog table.
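As a rough illustration of how the M_CNM5 tables could feed a learning task, the Python sketch below pivots the long-format attribute rows (one row per Activity_Attribute) into one feature dictionary per Activity_Instance_ID and attaches the disorder names as labels. It is a sketch under stated assumptions, not a tested pipeline: the row values are invented, `build_examples` is a hypothetical helper, and treating an instance as possibly having several disorder names is an assumption. Adding timestamps would require a further join on the event log table, which is not shown here.

```python
from collections import defaultdict


def build_examples(feature_rows, class_rows):
    """Pivot attribute rows into one feature dict per activity instance
    and attach the disorder names recorded for that instance."""
    features = defaultdict(dict)
    for row in feature_rows:
        inst = row["Activity_Instance_ID"]
        features[inst]["Activity"] = row["Activity"]
        features[inst][row["Activity_Attribute"]] = row["Activity_Attribute_Value"]

    labels = defaultdict(set)
    for row in class_rows:
        labels[row["Activity_Instance_ID"]].add(row["Disorder_Name"])

    # Keep only instances that have at least one label.
    return [
        {
            "Activity_Instance_ID": inst,
            "features": feats,
            "labels": sorted(labels[inst]),
        }
        for inst, feats in sorted(features.items())
        if inst in labels
    ]


# Invented rows shaped like the features and class tables.
feature_rows = [
    {"Activity_Instance_ID": "1", "Activity": "Lab test",
     "Activity_Attribute": "value", "Activity_Attribute_Value": "7.1"},
    {"Activity_Instance_ID": "1", "Activity": "Lab test",
     "Activity_Attribute": "unit", "Activity_Attribute_Value": "mmol/L"},
    {"Activity_Instance_ID": "2", "Activity": "Admit",
     "Activity_Attribute": "ward", "Activity_Attribute_Value": "ICU"},
]
class_rows = [
    {"Activity_Instance_ID": "1", "Disorder_Name": "Disorder A"},
    {"Activity_Instance_ID": "1", "Disorder_Name": "Disorder B"},
]
examples = build_examples(feature_rows, class_rows)
```

Attribute values are kept as strings here; converting them to numeric or categorical encodings depends on the attribute and is left to the modeling step.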
The MIMIC-IV-Ext-CEKG also addressed several existing data cleaning issues in MIMIC-IV:
- Corrected ICD-9 codes and converted them to ICD-10, as ICD-9 is outdated, while ICD-10 offers an improved and updated hierarchical structure that can be converted to ICD-11.
- Corrected ICD-10 codes based on titles.
- Corrected all NDC codes in the "prescriptions" table to identify repetitive codes, enhance comparability and searchability, and accurately capture product and package codes, proprietary names, dosages, and route names. This also improved connections with GSN and product codes.
- Merged the labevents and d_labitems tables, then corrected all fluid names, label names, units, and normal ranges. Fluid descriptions were modified to reflect narrower, more specific categories than the original labels. For example, blood was categorized as "Peripheral Blood Sample Analysis - Whole," "Peripheral Blood Sample Analysis - Plasma," etc. Additionally, short synonyms were added to labels, units of measurement were included, and normal ranges were defined.
While the MIMIC-IV-Ext-CEKG dataset enhances the original MIMIC-IV, users should be aware of limitations that may affect their research. First, the dataset retains the practice of the original dataset of altering MIMIC dates by applying a patient-specific offset. This alteration could affect the analysis of precise timing in relation to external events or records. Second, to support process-oriented analyses, the dataset includes only data from the MIMIC-IV modules that provide time and date details (mimiciv_hosp, mimiciv_icu, and mimiciv_derived). This selective inclusion may leave out other significant activities that lack precise timestamps, thereby reducing the breadth of clinical activities represented in the dataset.
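The effect of the patient-specific date offset can be illustrated with a small Python sketch. The timestamps and offsets below are invented and do not reflect how MIMIC chooses its offsets; the point is only that a constant shift per patient preserves intervals within one patient but not the timing between different patients.

```python
from datetime import datetime, timedelta


def shift(events, offset_days):
    """Apply one constant offset (in days) to every timestamp of a patient."""
    return [t + timedelta(days=offset_days) for t in events]


# Invented "true" timestamps for two patients.
patient_a = [datetime(2015, 3, 1, 8, 0), datetime(2015, 3, 4, 8, 0)]
patient_b = [datetime(2015, 3, 2, 9, 0), datetime(2015, 3, 3, 9, 0)]

# Invented patient-specific offsets.
shifted_a = shift(patient_a, 36500)
shifted_b = shift(patient_b, 40000)

# Within one patient, the time between events is unchanged.
within_patient_preserved = (
    patient_a[1] - patient_a[0] == shifted_a[1] - shifted_a[0]
)

# Across patients, the gap between events is distorted by the offsets.
across_patients_preserved = (
    patient_b[0] - patient_a[0] == shifted_b[0] - shifted_a[0]
)
```

Here `within_patient_preserved` is `True` and `across_patients_preserved` is `False`, which is why durations and orderings inside a single care pathway remain usable while alignment to calendar dates or to other patients is not.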
Ethics
MIMIC-IV-Ext-CEKG is extracted from MIMIC-IV, MIMIC-IV-ED, and MIMIC-IV-Note and is covered by the same IRB approval as those datasets.
Acknowledgements
This dataset is built upon the MIMIC-IV dataset, and we extend our deepest gratitude to the authors and creators of MIMIC-IV for their invaluable work. We would also like to express special thanks to Alistair Johnson for his guidance and for answering questions related to the possibility of creating this dataset based on MIMIC-IV.
Conflicts of Interest
The author(s) declare no conflicts of interest.
References
- Xiong X, Li VJ, Huang B, Huo Z. Equality and social determinants of spatial accessibility, availability, and affordability to primary health care in Hong Kong, a descriptive study from the perspective of spatial analysis. BMC Health Serv Res. 2022;22(1):1364.
- Caballo B, Dey S, Prabhu P, Seal B, Chu P. The effects of socioeconomic status on the quality and accessibility of healthcare services. Across The Spectrum of Socioeconomics: Issue IV. 2021;236:1-16.
- Marengoni A, Angleman S, Melis R, Mangialasche F, Karp A, Garmen A, Meinow B, Fratiglioni L. Aging with multimorbidity: a systematic review of the literature. Ageing Res Rev. 2011;10(4):430-9.
- Soley-Bori M, Ashworth M, Bisquera A, Dodhia H, Lynch R, Wang Y, Fox-Rushby J. Impact of multimorbidity on healthcare costs and utilisation: a systematic review of the UK literature. Br J Gen Pract. 2021;71(702):e39-e46.
- Munoz-Gama J, Martin N, Fernandez-Llatas C, Johnson OA, Sepúlveda M, Helm E, Galvez-Yanjari V, Rojas E, Martinez-Millana A, Aloini D, et al. Process mining for healthcare: Characteristics and challenges. J Biomed Inform. 2022;127:103994.
- Van der Aalst W. Data science in action. Springer; 2016.
- Van der Aalst WMP. Object-centric process mining: dealing with divergence and convergence in event data. In: Software Engineering and Formal Methods: 17th International Conference, SEFM 2019; 2019 Sep 18-20; Oslo, Norway. Springer; 2019. p. 3-25.
- Esser S, Fahland D. Multi-dimensional event data in graph databases. J Data Semant. 2021;10(1):109-41.
- Naeimaei Aali M, Mannhardt F, Toussaint PJ. Clinical Event Knowledge Graphs: Enriching Healthcare Event Data with Entities and Clinical Concepts-Research Paper. In: Proceedings of the International Conference on Process Mining; 2023. Springer; p. 296-308.
- Naeimaei Aali M, Mannhardt F, Toussaint PJ. The CEKG: A Tool for Constructing Event Graphs in the Care Pathways of Multi-Morbid Patients. In: Proceedings of the Doctoral Consortium and Demo Track at the International Conference on Process Mining 2024, co-located with the 6th International Conference on Process Mining (ICPM 2024); 2024. CEUR-WS.org; vol 3783, p. 6.
- Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.
- Johnson A, Bulgarelli L, Pollard T, Gow B, Moody B, Horng S, Celi L A, Mark R. MIMIC-IV (version 3.1). PhysioNet. 2024. Available from: https://doi.org/10.13026/kpb9-mt58.
- Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
- Naeimaei Aali M. MIMIC-IV-Ext-CEKG [Internet]. Available from: https://github.com/mnaeimaei/MIMIC-IV-Ext-CEKG
- Observational Health Data Sciences and Informatics (OHDSI). The OHDSI Collaborative [Internet]. Available from: https://ohdsi.org
- SNOMED International, College of American Pathologists. SNOMED CT [Internet]. 2021. Available from: https://www.snomed.org/
- National Library of Medicine. SNOMED CT International [Internet]. Available from: https://www.nlm.nih.gov/healthit/snomedct/international.html
- SNOMED International. SNOMED CT Browser [Internet]. Available from: https://browser.ihtsdotools.org/
- National Drug Code List. NDC List [Internet]. Available from: https://www.ndclist.com/
- HIPAA Space. HIPAA Space [Internet]. Available from: https://www.hipaaspace.com/
- FDA. FDA Report [Internet]. Available from: https://fda.report/
- ICD10Data.com. ICD-10 Data [Internet]. Available from: https://www.ICD10Data.com/
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/qr9d-6t52
DOI (latest version):
https://doi.org/10.13026/wqvp-h188
Topics:
mimic
process mining
multi entity process mining
object centric event log
clinical event knowledge graph
Project Website:
https://github.com/mnaeimaei/MIMIC-IV-Ext-CEKG
Files
To access the files for this project, you must:
- be a credentialed user
- complete required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project