Database Credentialed Access

MIMIC-IV-Ext-CEKG: A Process-Oriented Dataset Derived from MIMIC-IV for Enhanced Clinical Insights

Milad Naeimaei Aali Felix Mannhardt Pieter Jelle Toussaint

Published: April 8, 2025. Version: 1.0.0


When using this resource, please cite:
Naeimaei Aali, M., Mannhardt, F., & Toussaint, P. J. (2025). MIMIC-IV-Ext-CEKG: A Process-Oriented Dataset Derived from MIMIC-IV for Enhanced Clinical Insights (version 1.0.0). PhysioNet. https://doi.org/10.13026/qr9d-6t52.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Maintaining a healthy population is essential for improving quality of life and overall societal well-being. One way to achieve this is to improve patients' care pathways. This is particularly vital for patients with multiple chronic conditions, who require well-coordinated care across various medical specialties. Process mining is one approach to analyzing and improving the care pathways of such patients. The Clinical Event Knowledge Graph (CEKG) is a recent process mining framework introduced for patients with multimorbidity. It facilitates standardized interpretation of care pathways by linking event data to ICD-10 and SNOMED-CT, stores multi-entity event data in an event graph, and supports analyzing the care pathways of multimorbid patients from multiple perspectives. MIMIC-IV is a dataset that facilitates data analysis in healthcare; however, it is not specialized for process mining and requires extensive preprocessing before it can be used for that purpose. This paper contributes the MIMIC-IV-Ext-CEKG dataset, extracted from MIMIC-IV, which facilitates using the Clinical Event Knowledge Graph framework and other process mining tasks. It describes the dataset's characteristics and how it is extracted from MIMIC-IV. MIMIC-IV-Ext-CEKG facilitates deploying MIMIC-IV for process mining.


Background

To maintain a healthy population, robust and efficient healthcare services are essential, as they play a crucial role in ensuring timely medical interventions and preventive care for everyone. This need intensifies as healthcare services become increasingly expensive, creating the risk, even for wealthy nations, of being unable to provide perfect healthcare for all [1, 2]. This is particularly true for patients with multimorbidity, who have multiple chronic conditions simultaneously. The number of such patients is growing due to factors like socio-economic challenges and an aging population [3]. These patients require coordinated care from multiple medical specialties and more resources than ordinary patients [4]. As a result, improving healthcare services for them is essential to achieving effective healthcare for all. One way to contribute to this goal is to enhance the clinical processes for multimorbid patients.

A clinical process or care pathway defines the steps involved in diagnosing, treating, managing, and monitoring patients [5]. Enhancing a care pathway provides more efficient healthcare delivery, could lead to a better patient experience, and improves operational performance within healthcare organizations. Process mining [5, 6] is a technique that can be used to achieve this. Process mining leverages event log data [6] to uncover, monitor, and optimize processes. Each event has a clearly defined label (activity), occurs in a specific sequence based on its timestamp [6], and may be linked to particular entities relevant to the process. Healthcare services are multi-faceted and involve many different entities such as patients, admissions, disorders, caregivers, and so forth. Therefore, one approach to gaining deeper insights into such multi-faceted care pathways is to use event logs in which events are connected to multiple entities. This is known in process mining as "multi-entity event data". The concept is also referred to as an "object-centric event log" [7] or a "multi-dimensional event log" [8], depending on whether the focus is on concrete physical objects or on a more general notion of entities. A recent approach advocates storing multi-entity event data from clinical processes in a Clinical Event Knowledge Graph (CEKG) [9].

The CEKG facilitates standardized interpretation of care pathways by linking event data to clinical terminologies such as ICD-10 and SNOMED-CT, and it supports discovering care pathway models by storing the event data in a graph representation. CEKG enables the synthesis and examination of data from multiple sources, such as event logs, diagnosis data, ICD codes, and SNOMED-CT concepts. By leveraging an event graph, CEKG aims to improve process mining by using meaningful, semantically driven relationships between different nodes representing entities, events, activities, and more. It allows for answering complex and flexible queries relating to clinical domain concepts. Additionally, its support for hierarchical clinical terminologies, such as SNOMED-CT and ICD-10, enhances data navigation through path-based traversals.

In [9], Python scripts for creating a local prototype of a Clinical Event Knowledge Graph (CEKG) were introduced, and these scripts were later extended into an online tool for building and managing CEKGs in [10]. However, a large, ready-to-use dataset for building such CEKGs is still lacking.

This paper aims to fill this gap by presenting MIMIC-IV-Ext-CEKG, a dataset designed to support the creation of CEKGs and process mining analysis from MIMIC. MIMIC-IV [11, 12, 13] is the most recent iteration of the MIMIC dataset series. It contains de-identified health records of patients admitted to the intensive care units at Beth Israel Deaconess Medical Center in Boston, Massachusetts, covering the years 2008 to 2019. This version builds upon MIMIC-III, improving data quality and extending the dataset to reflect more current ICU practices.

The main goal of introducing MIMIC-IV-Ext-CEKG is to deploy it in process mining, particularly in multi-entity or object-centric process mining, with a focus on building CEKG. Given that the MIMIC-IV dataset is widely used in data mining and machine learning approaches, the MIMIC-IV-Ext-CEKG, being derived from MIMIC-IV, can similarly be applied to data mining and machine learning tasks.


Methods

This section presents how to process and transform raw data from MIMIC-IV into the MIMIC-IV-Ext-CEKG format, ensuring its suitability for constructing Clinical Event Knowledge Graphs (CEKG) and broader analytical applications. The complete extraction process, including relevant queries and scripts, is available in the GitHub repository of the dataset [14].

Step 1: Setting Up Infrastructure and Tools for Dataset Creation

This step involves several sub-steps for setting up the necessary infrastructure to create the dataset, including configuring SQL-based RDBMS (relational database management system) platforms such as Google BigQuery or MariaDB, along with Python, and installing the required packages.

Step 2: Preparation of Input Data

The process of creating the dataset involves four main inputs:

Sub-step 1: MIMIC-IV Modules

We utilized various modules from MIMIC-IV, including MIMICIV_HOSP, MIMICIV_ICU, MIMICIV_DERIVED, MIMICIV_ED, and MIMICIV_NOTE. The tables from these modules were categorized based on the nature of the data and the columns into different categories. In some cases, tables were renamed, split, or joined to ensure proper data alignment. For example, the table hcpcsevents is left joined with d_hcpcs on hcpcs = code to include the required columns.
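
The left join mentioned above can be illustrated with a minimal sketch. It is not the authors' extraction code (their queries are in the repository [14]); the rows are toy data, and every column name other than hcpcs and code, as well as the helper left_join, is a hypothetical name chosen for the example.

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Left join two lists of dicts; unmatched left rows get None for right columns."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    # Columns contributed by the right table (the join key itself is not repeated).
    right_cols = [c for c in (right_rows[0] if right_rows else {}) if c != right_key]
    joined = []
    for row in left_rows:
        matches = index.get(row[left_key])
        if matches:
            for match in matches:
                joined.append({**row, **{c: match[c] for c in right_cols}})
        else:
            joined.append({**row, **{c: None for c in right_cols}})
    return joined


# Toy rows; the codes and descriptions are placeholders, not real HCPCS entries.
hcpcsevents = [
    {"subject_id": 1, "hadm_id": 10, "hcpcs": "A0001"},
    {"subject_id": 2, "hadm_id": 20, "hcpcs": "Z9999"},  # no dictionary entry
]
d_hcpcs = [{"code": "A0001", "short_description": "toy description"}]

enriched = left_join(hcpcsevents, d_hcpcs, "hcpcs", "code")
```

Because it is a left join, every event row is kept, and rows whose code has no dictionary entry simply carry empty description fields.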

Sub-step 2: OHDSI Athena vocabulary system

The OHDSI Athena vocabulary system [15] offers a unified collection of healthcare vocabularies, including ICD-10, SNOMED-CT, and RxNorm, to standardize and facilitate large-scale healthcare data analysis across diverse sources. By integrating with the OMOP Common Data Model, Athena supports consistent, cross-referenced terminology management, enhancing both research and clinical insights [15]. This system was used to map and integrate data from diverse healthcare systems into a unified format at various sub-steps of the dataset creation process. In particular, we created a mapping from ICD codes to SNOMED-CT using the OHDSI platform.

Sub-step 3: SNOMED-CT Concepts

SNOMED-CT is an internationally recognized healthcare terminology that provides a systematic structure for encoding and organizing medical information, encompassing areas such as diseases, symptoms, procedures, clinical findings, and various other healthcare concepts [16]. To enrich the dataset with standardized clinical terminologies, we imported the full versions of the Description, Relationship, and TextDefinition files from the SNOMED-CT RF2 database [17]. These files were preprocessed to create two key tables: one for SNOMED-CT Description and another for SNOMED-CT Relationship, ensuring that clinical concepts were accurately linked and represented.

Sub-step 4: SNOMED-CT Browser

Although not directly imported, the SNOMED-CT Browser [18] was used throughout the dataset creation process to access and verify terminology data. This tool was instrumental in manually checking and confirming the correct mapping of clinical concepts when necessary.

Step 3: Creation of Tables Related to Events

In this subsection, we discuss the sub-steps required for creating all tables related to events.

Sub-step 1: Preprocessing of Categorized Tables

We used the categorized tables from the MIMIC-IV dataset for the following tasks independently:

Task 1: Creating Identifier-Based Time Range (IDTR) Tables. We created tables for every possible combination of MIMIC-IV identifiers, including subject_id, hadm_id, stay_id, and transfer_id. These identifiers function as unique keys to connect and structure data across multiple tables, facilitating in-depth, patient-level analysis while ensuring privacy is maintained. Each table records the maximum and minimum timestamps, as well as the latest and earliest dates, for each unique instance of these identifier combinations.
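
As a rough illustration of Task 1, the sketch below builds one time-range table per identifier combination from a list of toy event rows. It assumes that every row carries all four identifiers (possibly empty) and a single timestamp column; the function names, the timestamp column name, and the reading of "SH" as the subject_id/hadm_id combination are assumptions made for this example.

```python
from datetime import datetime
from itertools import combinations

ID_COLUMNS = ("subject_id", "hadm_id", "stay_id", "transfer_id")


def build_idtr(rows, id_cols, time_col="timestamp"):
    """Return {identifier tuple: (min_time, max_time)} for one identifier combination."""
    ranges = {}
    for row in rows:
        key = tuple(row.get(c) for c in id_cols)
        if any(value is None for value in key):
            continue  # skip rows that lack one of the identifiers
        t = row[time_col]
        lo, hi = ranges.get(key, (t, t))
        ranges[key] = (min(lo, t), max(hi, t))
    return ranges


def build_all_idtr(rows, id_columns=ID_COLUMNS):
    """Build one IDTR table for every non-empty combination of identifiers."""
    tables = {}
    for size in range(1, len(id_columns) + 1):
        for combo in combinations(id_columns, size):
            tables[combo] = build_idtr(rows, combo)
    return tables


# Toy rows, not real MIMIC-IV records.
rows = [
    {"subject_id": 1, "hadm_id": 10, "stay_id": 100, "transfer_id": 1000,
     "timestamp": datetime(2150, 1, 1, 8, 0)},
    {"subject_id": 1, "hadm_id": 10, "stay_id": 100, "transfer_id": 1001,
     "timestamp": datetime(2150, 1, 3, 18, 30)},
    {"subject_id": 1, "hadm_id": 11, "stay_id": None, "transfer_id": 1002,
     "timestamp": datetime(2151, 5, 2, 9, 0)},
]

tables = build_all_idtr(rows)
sh = tables[("subject_id", "hadm_id")]  # assumed to correspond to the "SH" table
```

With four identifiers this yields 15 tables; in an SQL setting the same result would typically come from a GROUP BY over each identifier combination with MIN and MAX of the timestamp.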

Task 2: Event Log Source Tables. Since event logs are tables that require at least a timestamp, an activity, and entities, we need to ensure that tables representing event logs include a column for the timestamp, a column for the activity, and columns for entities. We identified three types of tables within the Categorized Tables.

  • Timestamp-based Tables: Categorized tables that contain a timestamp column.
  • Date-based Tables: Categorized tables that contain a date column.
  • Measurement-based Tables: Categorized tables that include columns for measurements, such as the first_day measurement of specific metrics. Examples include columns like first_day_bg, first_day_gcs, first_day_height, first_day_lab, and first_day_rrt.

Sub-step 2: Creating IDTR Enriched Tables

The objective of this sub-step is to use the IDTR tables, created in Sub-step 1, to convert the three identified types of tables within the Categorized Tables (Timestamp-based Tables, Date-based Tables, and Measurement-based Tables) into Event Log Finalized Tables. Event Log Finalized Tables are tables that include timestamps and two entities: subject_id and hadm_id.

  • For Timestamp-based tables:
    • If they already contain both subject_id and hadm_id entities, no further action is needed.
    • If a table contains neither or only one of these identifiers, we perform a left join with an IDTR table that has the maximum number of common identifiers with that Timestamp-based table. If an identifier in a Timestamp-based table matches the IDTR table, and the timestamp falls between the minimum and maximum timestamps in the IDTR table, we can then retrieve the missing identifiers from the IDTR tables.
  • For Date-based tables:
    • First, we perform a left join on the date-based tables with the SH table, which we created in Sub-step 1 as one of the IDTR tables, under the following conditions:
      • If they already contain both subject_id and hadm_id entities, since the time part is missing in the date_Column, we first create a new table with only distinct combinations of subject_id, hadm_id, and date_Column. We then perform a left join with the SH table.
      • If they have only the hadm_id entity and lack both the subject_id and the time part of the date_Column, we first create a new table with only distinct pairs of hadm_id and date_Column. Then, we perform a left join with the SH table.
      • If they have only the subject_id entity and lack both the hadm_id and the time part of the date_Column, we first create a new table with only distinct pairs of subject_id and date_Column. We then perform a left join with the SH table, matching on subject_id and ensuring that the date_Column falls within a date range from the SH table. This step does not always result in a match, leaving some fields blank. Next, we identify how many times a distinct subject_id and date_Column pair matches multiple hadm_id entries. If there is more than one match, it is considered a duplicate. We then separate the dataset into two tables: one for cases with a single match and another for cases with multiple matches. For cases with a single match, no further action is needed. However, for cases with multiple matches, we assign a ranking number to each match, indicating the encounter number for each distinct combination of subject_id and date_Column. We retain only the highest-ranked match and discard the others. Afterward, we union these two tables back together. For any records still lacking a hadm_id, we assign a default value, such as 0.
    • Finally, for all of these conditions, based on whether the minimum and maximum timestamps are missing, we may have two types of records. For records without minimum and maximum timestamps, we convert date_Column into a timestamp type. For records with minimum and maximum timestamps, we calculate a new timestamp. If adding 12 hours to the date_Column falls within the range, we use it; otherwise, we use the midpoint between the minimum and maximum times for the timestamp. Finally, we merge this final table back with the original table to add the complete timestamp information.
  • For measurement-based tables:
    • We perform a left join with an IDTR table that has the maximum number of common identifiers with the measurement-based table. Next, we add 24 hours to the minimum timestamp or date from the IDTR table and insert this as a new column in the measurement-based table.
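
The timestamp rules described in this sub-step can be sketched as two small functions. This is a minimal sketch under stated assumptions, not the authors' implementation: the function and parameter names are hypothetical, "converting date_Column into a timestamp type" is taken to mean midnight of that date, and the range bounds are the minimum and maximum timestamps looked up from the matching IDTR table.

```python
from datetime import date, datetime, time, timedelta


def date_to_timestamp(day, range_start=None, range_end=None):
    """Turn a date-only value into a timestamp (rule for date-based tables)."""
    midnight = datetime.combine(day, time.min)
    if range_start is None or range_end is None:
        # No minimum/maximum timestamps available: plain type conversion.
        return midnight
    noon = midnight + timedelta(hours=12)
    if range_start <= noon <= range_end:
        return noon
    # Noon falls outside the known range: use the midpoint of the range instead.
    return range_start + (range_end - range_start) / 2


def first_day_timestamp(range_start):
    """Timestamp for measurement-based tables: range minimum plus 24 hours."""
    return range_start + timedelta(hours=24)


# Toy values for illustration.
inside = date_to_timestamp(
    date(2150, 1, 2), datetime(2150, 1, 1, 8, 0), datetime(2150, 1, 3, 18, 0)
)
outside = date_to_timestamp(
    date(2150, 1, 2), datetime(2150, 1, 2, 14, 0), datetime(2150, 1, 2, 20, 0)
)
no_range = date_to_timestamp(date(2150, 1, 2))
```

Here inside resolves to noon on the given date, outside falls back to the midpoint of the six-hour range, and no_range keeps midnight because no range is known.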

Sub-step 3: Creating Cleaned Event Log Tables

In this step, we focus on data cleaning for the tables created in previous steps. This process improves comparability, searchability, and consistency across the tables. Additionally, it facilitates clearer connections between various tables. Two main data cleaning tasks are performed in this step:

  • NDC (National Drug Code). Initially, pharmacy data from the tables pharmacy, eMAR, and prescriptions is merged, then unified based on their shared columns through a series of left joins. Next, the NDC codes are exported from these tables, resulting in a total of 5732 codes. Using web-based data crawling techniques with platforms like NDC List [19], HIPAASpace [20], and FDA Report [21], this number is expanded to 7376 by incorporating additional package details. Out of these codes, 6265 were verified with accurate product and package information, while 1087 required manual searches using resources such as GSN NDC, RxNorm, and proprietary databases. A small set of 24 codes remained unverified but were corrected using additional information from the related tables.
  • Lab tests related to Chemistry, Hematology, and Blood Gas: We consolidated the Chemistry, Hematology, and Blood_Gas tables into a single table that unifies the different categories of chemistry, hematology, and blood gas data. After this consolidation, key columns such as category, fluid, and label were exported for further refinement. The next phase involved manually refining the data by modifying the fluid descriptions into more specific categories, such as classifying blood into subcategories like Whole Peripheral Blood Sample analysis and Plasma Peripheral Blood Sample analysis. Additionally, the label column was enhanced by adding short synonyms, units of measurement, and defining normal ranges for the values, making the data more user-friendly and comprehensive for analysis. Once the refinement was complete, the processed dataset was imported back into the project.

Sub-step 4: Table Segmentation and Merging

In this step, we segment the tables based on the type of timestamp column and the values related to the nature of the data in the tables. For example, the POE table was converted into ten tables based on different values in its field_name column. Additionally, some tables are merged together. For example, the first_day_bg_time and first_day_bg_art_time tables are merged by adding a column to the combined table, with a value of "NA" for first_day_bg_time records and "ART" for first_day_bg_art_time records.
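
The two operations in this sub-step, splitting one table by the values of a column and merging two tables while recording their origin, can be sketched as follows. The rows are toy data, the field_name values and the tag column name sample_type are hypothetical, and only the tag values "NA" and "ART" come from the description above.

```python
from collections import defaultdict


def split_by_column(rows, column):
    """Segment one table into several, one per distinct value of `column`."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[column]].append(row)
    return dict(tables)


def merge_with_tag(tagged_tables, tag_column):
    """Stack several tables and record each row's source in `tag_column`."""
    merged = []
    for tag, rows in tagged_tables:
        merged.extend({**row, tag_column: tag} for row in rows)
    return merged


# Toy POE-like rows; the field_name values are placeholders.
poe = [
    {"poe_id": "1-1", "field_name": "Admit to"},
    {"poe_id": "1-2", "field_name": "Transfer to"},
    {"poe_id": "2-1", "field_name": "Admit to"},
]
segments = split_by_column(poe, "field_name")

# Toy first-day blood gas rows, merged with a source tag.
first_day_bg = [{"stay_id": 100, "po2": 295}]
first_day_bg_art = [{"stay_id": 100, "po2": 412}, {"stay_id": 101, "po2": 300}]
merged_bg = merge_with_tag(
    [("NA", first_day_bg), ("ART", first_day_bg_art)], "sample_type"
)
```

Tagging rows during the merge keeps the information that the table names carried before they were combined.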

Sub-step 5: Integration of Activities

In this step, we utilize the preprocessed data from previous steps to identify and catalog clinical activities. We have identified 95 distinct activities, such as Admission_Order, Admission_to_Careunit, antibiotic_administrations, Arterial_blood_sample_analysis_hemoglobin, Coagulation, BP_Measurement, Consultation and others. Some activities are obtained by merging tables, some by splitting, and some without any changes to the tables. The results are organized into two tables for each activity instance.

The first table, the Activity Instances table, records each occurrence of an activity along with its specifics and contains the following columns:

  • subject_id.
  • hadm_id.
  • Timestamps: The specific time at which the activity was recorded.
  • Activity: The name of the activity, such as Blood Gas Test or BP measurement.
  • Activity_Synonym: Contains abbreviations of activity labels, e.g., BGT for Blood Gas Test.
  • Activity_Properties_ID_aggregation: A unique foreign key ID for each distinct feature and its value.
  • Activity_Value_ID: A unique foreign key identifier for each distinct activity instance based on its combined features and values.

The second table, the Activity Properties table, details the properties associated with each activity instance and contains the following columns:

  • Activity_Properties_ID: Links back to the first table as a foreign key.
  • Activity.
  • Activity_Synonym.
  • featureName: The attribute name of the feature.
  • featureValue: The numeric or descriptive value.

Sub-step 6: Creating the Final Tables

In this sub-step, we create all the tables related to the event log. These tables are listed below:

  • Event_Log: This table contains the main event log.
  • ActivityAttributes: This table holds the event activities along with their associated activity feature tables.
  • ActivitiesDomain: This table includes the domain of event activities, which was manually curated.
  • CNM3: This table is part of the Constrained Node Mapping (CNM) used to construct the CEKG. CNM functions help build relationships between nodes with different labels, and they can be derived from data, domain knowledge, expert input, documents, and sometimes machine learning models [9]. The CNM3 table manually maps event activities to SNOMED-CT concepts using the SNOMED-CT Browser.
  • CNM4_1: Another CNM table that manually maps event activities to activity domains.
  • CNM4_2: This CNM table maps the domain of activities to SNOMED-CT concepts manually, also using the SNOMED-CT Browser.

Step 4: Creation of Tables Related to ICD Codes

Sub-step 1: ICD Code Correction

The MIMIC-IV dataset contains several issues related to its ICD codes, including a single column with 25K non-comparable codes, incorrect formatting, a mix of outdated ICD-9 and ICD-10 codes, and inconsistent levels of ICD code granularity. To address these issues, we create two columns: one with 1,760 high-level ICD codes and another with 19K comparable ICD codes. The ICD codes are reformatted for accuracy (e.g., “253.5” for Diabetes insipidus), ensuring only ICD-10 codes are used, with a structure that supports hierarchical relationships (e.g., “E11” for type 2 diabetes and “E11.329” for type 2 diabetes with mild nonproliferative diabetic retinopathy without macular edema). The process involves exporting all ICD codes from MIMIC-IV and correcting them through automated data crawling and manual comparison with the ICD10Data website [22].
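
One part of this correction, restoring the decimal point and deriving the high-level code, can be sketched with the conventional dot-placement rule for diagnosis codes. This is an illustration only: it does not reproduce the authors' correction mapper, which relied on data crawling and manual comparison, and it does not convert ICD-9 codes to ICD-10. The function names are hypothetical, and the input is assumed to be a diagnosis code stored without a dot.

```python
def format_icd(code, version):
    """Insert the decimal point into an undotted ICD diagnosis code.

    Assumes the usual convention: the point follows the third character,
    except for ICD-9 E-codes, where it follows the fourth.
    """
    code = code.strip().upper().replace(".", "")
    head = 4 if version == 9 and code.startswith("E") else 3
    if len(code) <= head:
        return code  # already a category-level code, nothing to split
    return f"{code[:head]}.{code[head:]}"


def high_level(code, version):
    """Return the category part of the code (everything before the point)."""
    return format_icd(code, version).split(".")[0]


examples = {
    ("2535", 9): format_icd("2535", 9),
    ("E11329", 10): format_icd("E11329", 10),
    ("E8490", 9): format_icd("E8490", 9),
}
```

For instance, the undotted ICD-9 code 2535 becomes 253.5, and the ICD-10 code E11329 becomes E11.329 with E11 as its high-level code.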

Sub-step 2: Implementing ICD Code Corrections

Using the ICD correction mapper that we built in sub-step 1, we correct the ICD codes of the MIMIC-IV.

Sub-step 3: Creating the Final Tables

  • ICD: This is the final ICD table with columns icd_code, icd_version, and icd_code_title that we need for the CEKG framework.
  • CNM1: This is a CNM that maps Disorders_ID to ICD_Codes. The CEKG framework assumes that, in some cases, disorders may not be mapped to an ICD code, making this CNM essential for the framework to function. Since disorders in MIMIC-IV are coded in ICD, we created hypothetical IDs for each disorder.
  • CNM2: This CNM maps ICD codes to SNOMED-CT. For this mapping, we use the OHDSI mapper built in the input step. However, 8 of the 1,760 ICD codes had to be mapped to SNOMED-CT manually.
  • CNM5 Tables: In the CEKG framework, CNM5 maps events (using the Activity_Instance_ID) to new entities, such as patient disorders. However, since the MIMIC-IV dataset lacks direct relationships between events and disorders, this mapping cannot be performed directly. To overcome this, one approach is to leverage domain knowledge and expert input, while another option is to apply machine learning algorithms. In this sub-step, we created tables to facilitate the creation of CNM5.

Step 5: Creation of Tables Related to Entity Activities

Sub-step 1: Attributes Preparation

Each entity can have several attributes, which can either be used as entities themselves or solely as attributes. For example, age, gender, and admission sequence are attributes of the Patient entity, as each patient has an age, a gender, and an admission sequence. Additionally, multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity are attributes of the Admission entity. Similarly, each disorder is an attribute of multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity. These attributes can also have relationships. In this sub-step, we prepare these attributes and their relationships.

Sub-step 2: Creating the Final Tables

  • EntitiesAttributes: We create the final entities attributes table using the data prepared in the previous sub-step.
  • EntitiesAttributeRel: We create the final entities attributes relationship table using the data prepared in the previous sub-step.

Step 6: Creation of Tables Related to SNOMED-CT Codes

Sub-step 1: Refining SNOMED-CT Data

In this sub-step, we refine the integration of SNOMED-CT data by focusing on the descriptions and relationships in the input step. We specifically consider those SNOMED-CT codes that were mapped from activities, domains, and ICDs, including all related SNOMED-CT codes up to the root concept. This process deepens the dataset by linking each SNOMED-CT code used or mapped in the project to its root concept through hierarchical relationships. This linkage is essential for structured clinical data analysis and ensures that only the necessary SNOMED-CT codes are imported, avoiding the need to import the entire SNOMED-CT system. Additionally, in this sub-step, we add a column labeled level, which is an index used to show the distance of a SNOMED-CT ID from the root SNOMED-CT ID (138875005). Sometimes, there are different paths to navigate from a SNOMED-CT ID to the root SNOMED-CT ID, so it may have more than one level. This index facilitates and enhances the speed of queries.
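
The two ideas in this sub-step, keeping only the mapped concepts plus their ancestors and recording every possible distance to the root, can be sketched as below. The sketch assumes the hierarchy is available as a child-to-parents dictionary of is-a relationships without cycles. Apart from the root ID given in the text, the concept IDs are small placeholder integers, not real SNOMED-CT identifiers, and the function names are hypothetical.

```python
ROOT = 138875005  # SNOMED-CT root concept ID, as given in the text


def ancestors_closure(seeds, parents):
    """Return the seeds plus every concept reachable by following is-a links upward."""
    keep, stack = set(), list(seeds)
    while stack:
        concept = stack.pop()
        if concept in keep:
            continue
        keep.add(concept)
        stack.extend(parents.get(concept, ()))
    return keep


def levels(concept, parents, cache=None):
    """Return the set of path lengths from `concept` up to ROOT."""
    cache = {} if cache is None else cache
    if concept in cache:
        return cache[concept]
    if concept == ROOT:
        result = {0}
    else:
        result = {
            distance + 1
            for parent in parents.get(concept, ())
            for distance in levels(parent, parents, cache)
        }
    cache[concept] = result
    return result


# Toy hierarchy: concept 4 has two parents, so it is reachable by two paths.
parents = {1: [ROOT], 2: [ROOT], 3: [1], 4: [3, 2], 5: [2]}

needed = ancestors_closure({4}, parents)  # concept 5 is never imported
levels_of_4 = levels(4, parents)
```

In this toy hierarchy concept 4 reaches the root in either two or three steps, which is why a single concept can carry more than one level.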

Sub-step 2: Creating the Final Tables

  • SCT Node: We create the final SNOMED-CT concept table using the data prepared in the previous sub-step.
  • SCT Rel: We create the final table that shows the relationship between SNOMED-CT concepts using the data prepared in the previous sub-step.

Step 7: Creation of Cluster Reference Tables

In some use cases, we do not need to use the entire dataset; we need to select part of the dataset that satisfies our need. In this step, we create two tables that may facilitate clustering of the dataset.

  • ClusterReference1: This table includes patients with their number of admissions, number of disorders, gender, anchor age, and life status.
  • ClusterReference2: This table includes patients with their disorders (using ICD codes), number of disorders, and number of admissions.
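
A minimal sketch of how such reference tables could be derived and then used to select a subset of patients is given below. The input layout (one row per patient, admission, and ICD code), the output field names, and the selection threshold of at least two disorders are assumptions made for this example rather than details of the published tables.

```python
from collections import defaultdict


def build_cluster_reference(diagnoses):
    """Summarise (subject_id, hadm_id, icd_code) rows per patient."""
    admissions = defaultdict(set)
    disorders = defaultdict(set)
    for subject_id, hadm_id, icd_code in diagnoses:
        admissions[subject_id].add(hadm_id)
        disorders[subject_id].add(icd_code)
    return {
        subject_id: {
            "n_admissions": len(admissions[subject_id]),
            "n_disorders": len(disorders[subject_id]),
            "disorders": sorted(disorders[subject_id]),
        }
        for subject_id in admissions
    }


# Toy diagnosis rows, not real patient data.
diagnoses = [
    (1, 10, "E11"), (1, 10, "I10"), (1, 11, "E11"),
    (2, 20, "I10"),
]
reference = build_cluster_reference(diagnoses)

# Example selection: patients with at least two distinct disorders (assumed threshold).
selected = [sid for sid, row in reference.items() if row["n_disorders"] >= 2]
```

The selected patient IDs could then be used to filter the other tables down to the part of the dataset that a given analysis needs.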

Step 8: Publishing the Dataset

In this step, we rename some tables and their columns for consistency and matching to the CEKG framework. For more detailed information about the final dataset tables, including how to access them, the meaning of each table and column, and their use cases, please refer to the next section.


Data Description

MIMIC-IV-Ext-CEKG contains one module with 19 tables extracted from the MIMIC-IV dataset. Each of these tables is described below.

B_EventLog

This table is the event log, which can be either a single-entity or multi-entity event log. Entities represent distinct existences. The terms “case notion,” “case,” “object,” and “dimension” are sometimes used interchangeably with “entity.” The term "multi-entity event log" is sometimes considered equivalent to “object-centric event log” or “multi-dimensional event log.” In the multi-entity event log definition, each entity is defined with its origin and IDs. The table contains several columns:

  • Event_ID: Contains the ID of each event.
  • Timestamp: Contains the time and date of activities.
  • Activity: Consists of the activity label of the event.
  • Activity_Synonym: Contains abbreviations of activity labels. For example, BGT for Blood Gas Test.
  • Activity_Attributes_ID: A unique foreign key ID for each distinct feature and value. For example:
    • po2=295 → Activity_Attributes_ID=1
    • lactate=3.23 → Activity_Attributes_ID=2
    • Blood pressure=137/79 → Activity_Attributes_ID=3
    • po2=412 → Activity_Attributes_ID=4 (same feature but different value, so a different ID)
    • lactate=0.73 → Activity_Attributes_ID=5 (same feature but different value, so a different ID)
    • po2=295 → Activity_Attributes_ID=1 (same feature and same value, so the same ID)
    • lactate=3.23 → Activity_Attributes_ID=2 (same feature and same value, so the same ID)
  • Activity_Instance_ID: A unique foreign key identifier for each distinct activity, considering its features and values. For example:
    • First event: Blood Gas Test: po2=295, lactate=3.23 → Activity_Instance_ID=1
    • Second event: BP_measurement: Blood pressure=137/79 → Activity_Instance_ID=2 (different activity from the first event)
    • Third event: Blood Gas Test: po2=412, lactate=0.73 → Activity_Instance_ID=3 (same activity as the first event but with different feature values)
    • Fourth event: Blood Gas Test: po2=295, lactate=3.23 → Activity_Instance_ID=1 (same activity as the first event with the same feature values)
  • Entity1_origin and Entity1_ID: Contain the origin and ID of each instance of the first entity representing the ID of each patient, equivalent to subject_id in MIMIC.
  • Entity2_origin and Entity2_ID: Contain the origin and ID of each instance of the second entity representing the ID of each admission, equivalent to hadm_id in MIMIC.
  • Entity3_origin and Entity3_ID: Contain the origin and ID of each instance of the third entity representing the ID of each outpatient encounter. This ID is newly created for each distinct outpatient encounter for a patient.
  • Entity4_origin and Entity4_ID: Contain the origin and ID of each instance of the fourth entity representing the ID of each Admission_Sequence. This ID is consistent for the patient; for example, for all patients, the first admission is 1, the second is 2, and so on.
  • Entity5_origin and Entity5_ID: Contain the origin and ID of each instance of the fifth entity representing the ID of each Outpatient_Sequence. This ID is consistent for the patient; for example, for all patients, the first outpatient visit is 1, the second is 2, and so on.
  • temp_patient_id and temp_encounter_id: The temp_patient_id stores the patient IDs, while the temp_encounter_id contains both admission and outpatient IDs. These columns are used when we want to analyze a subset of the data, allowing us to select specific patients, admission IDs, or outpatient IDs, typically after clustering the patients. It is important to note that, whether we use a portion of the table or the entire dataset, these two columns should be excluded from the final event log when building the CEKG.
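
The numbering scheme illustrated for Activity_Attributes_ID and Activity_Instance_ID above can be reproduced with a short sketch. It assumes IDs are handed out in order of first appearance; the function name and the in-memory representation are hypothetical, and how the per-event attribute IDs are stored in a single column is not covered here.

```python
def assign_ids(events):
    """Assign attribute and instance IDs in order of first appearance.

    `events` is a list of (activity, {feature: value}) tuples in log order.
    Returns one (Activity_Instance_ID, [Activity_Attributes_ID, ...]) per event.
    """
    attribute_ids = {}  # (feature, value) -> Activity_Attributes_ID
    instance_ids = {}   # (activity, feature/value set) -> Activity_Instance_ID
    result = []
    for activity, features in events:
        attr_ids = []
        for pair in features.items():
            if pair not in attribute_ids:
                attribute_ids[pair] = len(attribute_ids) + 1
            attr_ids.append(attribute_ids[pair])
        key = (activity, frozenset(features.items()))
        if key not in instance_ids:
            instance_ids[key] = len(instance_ids) + 1
        result.append((instance_ids[key], attr_ids))
    return result


# The four example events from the column descriptions above.
events = [
    ("Blood Gas Test", {"po2": 295, "lactate": 3.23}),
    ("BP_measurement", {"Blood pressure": "137/79"}),
    ("Blood Gas Test", {"po2": 412, "lactate": 0.73}),
    ("Blood Gas Test", {"po2": 295, "lactate": 3.23}),
]
ids = assign_ids(events)
```

The fourth event receives the same instance ID and attribute IDs as the first, since both the activity and all feature values are identical.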

C_EntitiesAttributes

This table contains the attributes of our entities. Each entity can have several attributes, which can either be used as entities themselves or only as attributes.

For example, age, gender, and admission sequence are attributes of the Patient entity, as each patient has an age, a gender, and an admission sequence. Additionally, multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity are attributes of the Admission entity. Similarly, each disorder is an attribute of multimorbidity, treated multimorbidity, untreated multimorbidity, and new multimorbidity.

  • Origin: This column shows the type of attribute.
  • ID: This column shows the ID of the attribute.
  • Name: This column contains a mix of synonyms for origins and IDs.
  • Value: This column contains the value of the attribute, if it exists.
  • Category: This column has the value "absolute" for all attributes that are only used for data analysis.
  • temp_patient_id and temp_encounter_id: These serve the same purpose as in the B_EventLog table.

D_EntitiesAttributeRel

This table shows the relationship between entities and their attributes.

  • Origin1: This column contains the origin of the first entity or entity attribute.
  • ID1: This column contains the ID of the first entity or entity attribute.
  • Origin2: This column contains the origin of the second entity or entity attribute.
  • ID2: This column contains the ID of the second entity or entity attribute.
  • temp_patient_id and temp_encounter_id: These serve the same purpose as in the B_EventLog table.

E_ActivityAttributes

This table shows the activity attributes.

  • Activity_Attributes_ID: This column contains a foreign key that relates to the event log table.
  • Activity: This column shows the activity, corresponding to the Activity column in the event log table.
  • Activity_Synonym: This column shows the synonym for the activity, with a corresponding column of the same name in the event log table.
  • Activity_Attribute: This column contains the attributes.
  • Activity_Attribute_Value: This column contains the values of the attributes.
  • temp_patient_id and temp_encounter_id: These serve the same purpose as in the B_EventLog table.

F_ActivitiesDomain

The F_ActivitiesDomain table contains the activity domains and consists of only one column.

G_ICD

This table contains a subset of our ICD codes.

  • ICD_Origin: This column contains values for all ICD entries. It is an auxiliary column used solely for data analysis.
  • ICD_Code: This column shows the ICD codes.
  • ICD_Version: This column shows the version of the ICD codes.
  • ICD_Code_Title: This column shows the titles of the ICD codes.

H_SCT_Node

This table contains a subset of our SNOMED-CT concept codes.

  • SCT_ID: This column contains the SNOMED-CT ID.
  • SCT_Code: This column is an auxiliary column used in this table, not related to SNOMED-CT terminology.
  • SCT_DescriptionA_Type1: This column shows the description of SNOMED-CT IDs with their semantic tag in parentheses.
  • SCT_DescriptionA_Type2: This column shows the description of SNOMED-CT IDs without their semantic tag in parentheses.
  • SCT_DescriptionB: This column shows another description of SNOMED-CT IDs, which exists only for some of them.
  • SCT_Semantic_Tags: This column contains the semantic tags of SNOMED-CT IDs.
  • SCT_Type: This column assigns each SNOMED-CT ID to one of three categories: root (a single ID, 138875005), top-level concept (18 IDs), and concept (all IDs other than the root and the top-level concepts).
  • SCT_Level: This index shows the distance of a SNOMED-CT ID from the root SNOMED-CT ID (138875005). A SNOMED-CT ID may have more than one level.

H_SCT_REL

This table shows the relationships between SNOMED-CT concepts.

  • SCT_ID_1: The ID of the first SNOMED-CT concept node.
  • SCT_Code_1: The code of the first SNOMED-CT concept node.
  • SCT_ID_2: The ID of the second SNOMED-CT concept node.
  • SCT_Code_2: The code of the second SNOMED-CT concept node.
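
The note under H_SCT_Node that a SNOMED-CT ID may have more than one level can be illustrated with the relationship table: a concept with several parents can be reached from the root along paths of different lengths. The following is a minimal sketch, not the procedure used to build SCT_Level. It assumes that each (SCT_ID_1, SCT_ID_2) pair reads as child to parent, that the hierarchy is acyclic, and it uses placeholder IDs ("A" to "D") rather than real SNOMED-CT concepts.

```python
from collections import defaultdict
from functools import lru_cache

ROOT = "138875005"  # root SNOMED-CT ID named in H_SCT_Node

# Hypothetical (SCT_ID_1, SCT_ID_2) pairs, read here as child -> parent.
relations = [
    ("A", ROOT),
    ("B", ROOT),
    ("C", "A"),
    ("D", "C"),
    ("D", "B"),
]

parents = defaultdict(set)
for child, parent in relations:
    parents[child].add(parent)

@lru_cache(maxsize=None)
def levels(node):
    """All distinct path lengths from the root to `node` (assumes no cycles)."""
    if node == ROOT:
        return frozenset({0})
    return frozenset(
        level + 1 for p in parents.get(node, ()) for level in levels(p)
    )

print(sorted(levels("D")))  # "D" is reachable via B and via A -> C
```

In this toy hierarchy, "D" has two parents at different depths, so it receives two levels, which mirrors how a single SCT_ID could appear with more than one SCT_Level value.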

I_CNM1

This table shows the constrained node mappings derived from the MIMIC-IV dataset, which relate each Disorder_ID (an attribute of multimorbidity) to its ICD codes.

  • Disorder_ID: This column shows the disorder attribute identifier.
  • ICD_Code: This column contains the ICD code.

J_CNM2

This table shows the constrained node mappings derived from "OHDSI Athena" for relating ICD codes to SNOMED-CT.

  • ICD_Code: This column contains the ICD codes.
  • SCT_ID: This column contains the SNOMED-CT IDs.
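
Because I_CNM1 and J_CNM2 share the ICD_Code column, they can be chained to relate a disorder attribute to SNOMED-CT concepts. The sketch below shows one way this could be done in plain Python. The sample rows, including the disorder identifiers and SCT_ID values, are made up for illustration, and the assumption that ICD codes are written identically in both tables should be checked against the actual files.

```python
from collections import defaultdict

# Hypothetical sample rows with the column names of I_CNM1 and J_CNM2.
i_cnm1 = [
    {"Disorder_ID": "D1", "ICD_Code": "I10"},
    {"Disorder_ID": "D2", "ICD_Code": "E119"},
]
j_cnm2 = [
    {"ICD_Code": "I10", "SCT_ID": "1001"},
    {"ICD_Code": "I10", "SCT_ID": "1002"},
]

def disorders_to_sct(cnm1_rows, cnm2_rows):
    """Map each Disorder_ID to the set of SCT_IDs reachable through its ICD codes."""
    icd_to_sct = defaultdict(set)
    for row in cnm2_rows:
        icd_to_sct[row["ICD_Code"]].add(row["SCT_ID"])

    result = defaultdict(set)
    for row in cnm1_rows:
        # Disorders whose ICD code has no SNOMED-CT mapping keep an empty set.
        result[row["Disorder_ID"]] |= icd_to_sct.get(row["ICD_Code"], set())
    return dict(result)

mapping = disorders_to_sct(i_cnm1, j_cnm2)
```

Keeping unmapped disorders in the result with an empty set makes gaps in the ICD to SNOMED-CT mapping visible instead of silently dropping them.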

K_CNM3

This table shows the constrained node mappings derived manually by searching to relate activities to SNOMED-CT concepts.

  • Activity: This column shows the activity, corresponding to the "Activity" column in the event log table.
  • Activity_Synonym: This column shows the synonym for the activity, with a corresponding column of the same name in the event log table.
  • SCT_ID: This column contains the SNOMED-CT IDs.
  • SCT_Code: This column contains the SNOMED-CT codes.

L_CNM4_1

This table shows the constrained node mappings derived manually by searching to relate activities to domains.

  • Activity: This column shows the activity, corresponding to the Activity column in the event log table.
  • Activity_Synonym: This column shows the synonym for the activity, with a corresponding column of the same name in the event log table.
  • Activity_Domain: This column shows the domain of activities.

L_CNM4_2

This table shows the constrained node mappings derived manually by searching to relate the domain of activities to SNOMED-CT concepts.

  • Activity_Domain: This column shows the domain of activities.
  • SCT_ID: This column contains the SNOMED-CT IDs.
  • SCT_Code: This column contains the SNOMED-CT codes.

M_CNM5_Activity_Instance_ID

This table contains a list of Activity_Instance_IDs.

  • Activity_Instance_ID: This column contains the activity instance identifiers. This foreign key can be linked to the event log table.

M_CNM5_Activity_Instance_ID_with_features_Entities

This table contains the features related to Activity_Instance_ID.

  • Activity_Instance_ID: This column contains the activity instance identifiers. This foreign key can be linked to the event log table.
  • Activity: This column shows the activity, corresponding to the "Activity" column in the event log table.
  • Activity_Synonym: This column shows the synonym for the activity, corresponding to the "Activity_Synonym" column in the event log table.
  • Activity_Attribute: This column contains the attributes.
  • Activity_Attribute_Value: This column contains the values of the attributes.
  • Entity1_ID: Contains the ID of each patient, equivalent to subject_id in MIMIC.
  • Entity2_ID: Contains the ID of each admission, equivalent to hadm_id in MIMIC.

M_CNM5_Activity_Instance_ID_with_class_Entities

This table contains the classes related to Activity_Instance_ID.

  • Activity_Instance_ID: This column contains the activity instance identifiers. This foreign key can be linked to the event log table.
  • Disorder_Name: This column contains the name of disorder attributes.
  • Entity1_ID: Contains the ID of each patient, equivalent to subject_id in MIMIC.
  • Entity2_ID: Contains the ID of each admission, equivalent to hadm_id in MIMIC.

M_CNM5_class

This table shows the relationship between Disorder_Name and Disorder_ID.

  • Disorder_Name: This column contains the name of disorder attributes.
  • Disorder_ID: This column contains the identifiers of disorder attributes.

N_ClusterReference1

This is the first table that can be used for clustering patients.

  • temp_patient_id: It stores the patient IDs.
  • Entity1_ID: Contains the ID of each patient, equivalent to subject_id in MIMIC.
  • Morbid_num: Number of disorders across all admissions.
  • Admission_num: Number of admissions.
  • gender: Gender of patients.
  • anchor_age: Anchor age of patients.
  • dod: Mortality status of patients (whether they died or not).

N_ClusterReference2

This is the second table that can be used for clustering patients.

  • temp_patient_id: It stores the patient IDs.
  • temp_encounter_id: It can contain both admission and outpatient IDs.
  • Entity1_ID: Contains the ID of each patient, equivalent to subject_id in MIMIC.
  • Entity2_ID: Contains the ID of each admission, equivalent to hadm_id in MIMIC.
  • ICD10_Code: This column shows the ICD-10 codes.
  • ICD10_Code_Root: This column shows the root concept of the ICD-10 codes.
  • ICD10_Code_title: This column shows the titles of the ICD-10 codes.
  • ICD10_Code_Root_title: This column shows the titles of the root concept of the ICD-10 codes.
  • Morbid_num: Number of disorders across all admissions.
  • Admission_num: Number of admissions.

Usage Notes

The MIMIC-IV-Ext-CEKG event log generated in this study is derived from various modules of the MIMIC-IV dataset. Therefore, authorized access to the MIMIC-IV dataset is essential to effectively use MIMIC-IV-Ext-CEKG.

Although this dataset is specifically designed for constructing clinical event knowledge graphs and working with tools developed for that purpose, it can also be applied to a wide range of process mining tools. For instance, the B_EventLog and E_ActivityAttributes tables can be employed in any process mining or data mining tasks. Moreover, since the MIMIC-IV dataset is extensively utilized in data mining and machine learning applications, the MIMIC-IV-Ext-CEKG, being derived from MIMIC-IV, can likewise be employed for data mining and machine learning tasks.

However, the possibility of training machine learning models on MIMIC-IV-Ext-CEKG has not yet been examined. The tables M_CNM5_Activity_Instance_ID, M_CNM5_Activity_Instance_ID_with_features_Entities, M_CNM5_Activity_Instance_ID_with_class_Entities, and M_CNM5_class were created to facilitate this. Additionally, these tables can be adapted for training time series analysis models by incorporating timestamps from the B_EventLog table.
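
As a rough illustration of how the M_CNM5 tables might be combined into training examples, the sketch below pivots the long-format feature rows into one feature dictionary per Activity_Instance_ID and pairs it with the Disorder_ID labels. It uses the documented column names, but the sample rows are invented, and the approach is only one possible preparation step, not a tested pipeline from this study.

```python
from collections import defaultdict

# Hypothetical sample rows using the documented column names.
features = [
    {"Activity_Instance_ID": "ai1", "Activity_Attribute": "unit", "Activity_Attribute_Value": "mg/dL"},
    {"Activity_Instance_ID": "ai1", "Activity_Attribute": "flag", "Activity_Attribute_Value": "abnormal"},
    {"Activity_Instance_ID": "ai2", "Activity_Attribute": "unit", "Activity_Attribute_Value": "mmol/L"},
]
classes = [
    {"Activity_Instance_ID": "ai1", "Disorder_Name": "Hypertension"},
    {"Activity_Instance_ID": "ai2", "Disorder_Name": "Diabetes"},
]
class_ids = [
    {"Disorder_Name": "Hypertension", "Disorder_ID": "D1"},
    {"Disorder_Name": "Diabetes", "Disorder_ID": "D2"},
]

def build_examples(feature_rows, class_rows, class_id_rows):
    """Return {Activity_Instance_ID: (feature dict, sorted list of Disorder_IDs)}."""
    wide = defaultdict(dict)
    for row in feature_rows:
        wide[row["Activity_Instance_ID"]][row["Activity_Attribute"]] = row["Activity_Attribute_Value"]

    name_to_id = {r["Disorder_Name"]: r["Disorder_ID"] for r in class_id_rows}
    labels = defaultdict(set)
    for row in class_rows:
        labels[row["Activity_Instance_ID"]].add(name_to_id[row["Disorder_Name"]])

    return {k: (dict(v), sorted(labels.get(k, ()))) for k, v in wide.items()}

examples = build_examples(features, classes, class_ids)
```

Labels are kept as a list because an activity instance may be associated with more than one disorder; a downstream model would still need its own encoding of the categorical feature values.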

The MIMIC-IV-Ext-CEKG also addressed several existing data cleaning issues in MIMIC-IV:

  • Corrected ICD-9 codes and converted them to ICD-10, as ICD-9 is outdated, while ICD-10 offers an improved and updated hierarchical structure that can be converted to ICD-11.
  • Corrected ICD-10 codes based on titles.
  • Corrected all NDC codes in the "prescriptions" table to identify repetitive codes, enhance comparability and searchability, and accurately capture product and package codes, proprietary names, dosages, and route names. This also improved connections with GSN and product codes.
  • Merged the labevents and d_labitems tables, then corrected all fluid names, label names, units, and normal ranges. Fluid descriptions were modified to reflect narrower, more specific categories than the original labels. For example, blood was categorized as "Peripheral Blood Sample Analysis - Whole," "Peripheral Blood Sample Analysis - Plasma," etc. Additionally, short synonyms were added to labels, units of measurement were included, and normal ranges were defined.
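
To make the idea of capturing product and package codes from an NDC concrete, the sketch below splits a code into its segments. It is not the correction procedure used for this dataset. It assumes codes follow the 11-digit, 5-4-2 layout (labeler, product, package), that leading zeros may have been lost when a code was stored as a number, and that empty or all-zero values mean "missing". The function name split_ndc11 is hypothetical.

```python
def split_ndc11(raw):
    """Split an 11-digit NDC into labeler (5), product (4), and package (2) segments.

    Returns None for values that cannot be read as an 11-digit code.
    """
    digits = str(raw).strip()
    if not digits.isdigit() or len(digits) > 11 or int(digits) == 0:
        return None
    digits = digits.zfill(11)  # restore leading zeros lost in numeric storage
    return {
        "labeler": digits[:5],
        "product": digits[5:9],
        "package": digits[9:],
    }

print(split_ndc11("12345678901"))
print(split_ndc11(2345678901))  # stored as a number, leading zero restored
print(split_ndc11("0"))         # treated as missing
```

Normalizing codes to one fixed-width form before comparing them is what makes duplicate or near-duplicate codes easy to detect.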

While the MIMIC-IV-Ext-CEKG dataset enhances the original MIMIC-IV, users should be aware of certain limitations that may affect their research. First, the dataset retains the practice of the original dataset, in which MIMIC dates are altered by applying a patient-specific offset. This alteration could affect analyses that depend on precise timing in relation to external events or records. Second, the dataset is curated to support process-oriented analyses and includes only data from the MIMIC-IV modules that provide time and date details (the mimiciv_hosp, mimiciv_icu, and mimiciv_derived modules). This selective inclusion may leave out other significant activities that lack precise timestamps, thereby reducing the overall breadth of clinical activities represented in the dataset. Understanding these constraints helps users anticipate potential challenges when working with the MIMIC-IV-Ext-CEKG dataset.
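
The practical effect of a patient-specific date offset can be shown with a small example: the time between events of the same patient is unchanged, while the absolute dates no longer match the calendar. The events and the offset value below are hypothetical and serve only to illustrate this property.

```python
from datetime import datetime, timedelta

def shift(events, offset_days):
    """Apply one constant offset to every timestamp of a single patient."""
    delta = timedelta(days=offset_days)
    return [(name, ts + delta) for name, ts in events]

def gaps(events):
    """Durations between consecutive events."""
    return [later[1] - earlier[1] for earlier, later in zip(events, events[1:])]

# Hypothetical events and offset for one patient.
original = [
    ("Admission", datetime(2020, 3, 1, 8, 0)),
    ("Lab test", datetime(2020, 3, 1, 14, 30)),
    ("Discharge", datetime(2020, 3, 4, 10, 0)),
]
shifted = shift(original, offset_days=40000)

print(gaps(original) == gaps(shifted))  # durations within the patient are preserved
print(shifted[0][1].year)               # the absolute date is no longer meaningful
```

Analyses of durations and event order within one patient are therefore unaffected, whereas comparisons across patients by calendar date, or against external records, are not reliable.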


Ethics

MIMIC-IV-Ext-CEKG is extracted from MIMIC-IV, MIMIC-IV-ED, and MIMIC-IV-Note and exists under the same IRB as they do.


Acknowledgements

This dataset is built upon the MIMIC-IV dataset, and we extend our deepest gratitude to the authors and creators of MIMIC-IV for their invaluable work. We would also like to express special thanks to Alistair Johnson for his guidance and for answering questions related to the possibility of creating this dataset based on MIMIC-IV.


Conflicts of Interest

The author(s) declare no conflicts of interest.


References

  1. Xiong X, Li VJ, Huang B, Huo Z. Equality and social determinants of spatial accessibility, availability, and affordability to primary health care in Hong Kong, a descriptive study from the perspective of spatial analysis. BMC Health Serv Res. 2022;22(1):1364.
  2. Caballo B, Dey S, Prabhu P, Seal B, Chu P. The effects of socioeconomic status on the quality and accessibility of healthcare services. Across The Spectrum of Socioeconomics: Issue IV. 2021;236:1-16.
  3. Marengoni A, Angleman S, Melis R, Mangialasche F, Karp A, Garmen A, Meinow B, Fratiglioni L. Aging with multimorbidity: a systematic review of the literature. Ageing Res Rev. 2011;10(4):430-9.
  4. Soley-Bori M, Ashworth M, Bisquera A, Dodhia H, Lynch R, Wang Y, Fox-Rushby J. Impact of multimorbidity on healthcare costs and utilisation: a systematic review of the UK literature. Br J Gen Pract. 2021;71(702):e39-e46.
  5. Munoz-Gama J, Martin N, Fernandez-Llatas C, Johnson OA, Sepúlveda M, Helm E, Galvez-Yanjari V, Rojas E, Martinez-Millana A, Aloini D, et al. Process mining for healthcare: Characteristics and challenges. J Biomed Inform. 2022;127:103994.
  6. Van der Aalst W. Process mining: data science in action. Springer; 2016.
  7. Van der Aalst WMP. Object-centric process mining: dealing with divergence and convergence in event data. In: Software Engineering and Formal Methods: 17th International Conference, SEFM 2019; 2019 Sep 18-20; Oslo, Norway. Springer; 2019. p. 3-25.
  8. Esser S, Fahland D. Multi-dimensional event data in graph databases. J Data Semant. 2021;10(1):109-41.
  9. Naeimaei Aali M, Mannhardt F, Toussaint PJ. Clinical Event Knowledge Graphs: Enriching Healthcare Event Data with Entities and Clinical Concepts-Research Paper. In: Proceedings of the International Conference on Process Mining; 2023. Springer; p. 296-308.
  10. Naeimaei Aali M, Mannhardt F, Toussaint PJ. The CEKG: A Tool for Constructing Event Graphs in the Care Pathways of Multi-Morbid Patients. In: Proceedings of the Doctoral Consortium and Demo Track at the International Conference on Process Mining 2024, co-located with the 6th International Conference on Process Mining (ICPM 2024); 2024. CEUR-WS.org; vol 3783, p. 6.
  11. Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.
  12. Johnson A, Bulgarelli L, Pollard T, Gow B, Moody B, Horng S, Celi L A, Mark R. MIMIC-IV (version 3.1). PhysioNet. 2024. Available from: https://doi.org/10.13026/kpb9-mt58.
  13. Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 2000;101(23):e215–e220.
  14. Naeimaei Aali M. MIMIC-IV-Ext-CEKG [Internet]. Available from: https://github.com/mnaeimaei/MIMIC-IV-Ext-CEKG
  15. Observational Health Data Sciences and Informatics (OHDSI). The OHDSI Collaborative [Internet]. Available from: https://ohdsi.org
  16. SNOMED International, College of American Pathologists. SNOMED CT [Internet]. 2021. Available from: https://www.snomed.org/
  17. National Library of Medicine. SNOMED CT International [Internet]. Available from: https://www.nlm.nih.gov/healthit/snomedct/international.html
  18. SNOMED International. SNOMED CT Browser [Internet]. Available from: https://browser.ihtsdotools.org/
  19. National Drug Code List. NDC List [Internet]. Available from: https://www.ndclist.com/
  20. HIPAA Space. HIPAA Space [Internet]. Available from: https://www.hipaaspace.com/
  21. FDA. FDA Report [Internet]. Available from: https://fda.report/
  22. ICD10Data.com. ICD-10 Data [Internet]. Available from: https://www.ICD10Data.com/

Parent Projects
MIMIC-IV-Ext-CEKG: A Process-Oriented Dataset Derived from MIMIC-IV for Enhanced Clinical Insights was derived from: Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
