Database Credentialed Access
Northwestern ICU (NWICU) database
Dana Moukheiber, William Temps, Bhadrappa Molgi, Yikuan Li, Alice Lu, Prasanth Nannapaneni, Abdulrahman Chahin, Sicheng Hao, Felipe Torres Fabregas, Leo Anthony Celi, Adrian Wong, Maxwell Lloyd, Xavier Borrat Frigola, Hyung-Chul Lee, Daniel Schneider, Tom Pollard, Yuan Luo, Abel Kho, Roger Mark
Published: Nov. 19, 2024. Version: 0.1.0
When using this resource, please cite:
Moukheiber, D., Temps, W., Molgi, B., Li, Y., Lu, A., Nannapaneni, P., Chahin, A., Hao, S., Torres Fabregas, F., Celi, L. A., Wong, A., Lloyd, M., Borrat Frigola, X., Lee, H., Schneider, D., Pollard, T., Luo, Y., Kho, A., & Mark, R. (2024). Northwestern ICU (NWICU) database (version 0.1.0). PhysioNet. https://doi.org/10.13026/s84w-1829.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Retrospective medical data collection is essential for advancing patient care, offering insights and supporting the development of health technology. The Medical Information Mart for Intensive Care (MIMIC)-III database has been instrumental in providing comprehensive critical care data to facilitate research efforts. Building upon the foundation of MIMIC-III, MIMIC-IV introduced a modular data structure and fostered an open-source community to enhance the data. Here we introduce the Northwestern ICU (NWICU) database, a critical care dataset collected from a network of hospitals located in Chicago and harmonized with the MIMIC data structure.
Background
Northwestern Medicine (NM) is a network of twelve hospitals located in Chicago and the surrounding area. NM originally used a variety of electronic medical record (EMR) systems across the network, but in 2018 migrated all the hospitals to the same EMR platform, Epic. As an essential element of routine medical care, the EMR collects data on patients, admissions, diagnoses, patient status, procedures, medications, and all the other aspects of patient care.
To make EMR data available for research and quality assurance, the EMR system feeds selected data into a relational Enterprise Data Warehouse (NM EDW). The NM EDW was the source for all data supplied by NM. In order to protect patient privacy, the EDW is accessible only to the technical personnel who maintain it, database analysts affiliated with NM, and selected researchers with CITI certification in human-subjects research working on IRB-approved research studies. Raw data extracted from the EDW is stored on encrypted file servers protected by firewalls and accessible only to authorized NM and Northwestern University (NU) personnel.
NM actively pursues the use of the valuable trove of EMR data for research purposes. Recent advances in machine learning and predictive modeling, for example, have been the focus of numerous projects between Northwestern Medicine, the Feinberg School of Medicine, and other departments within Northwestern University. In general, more data means better and more accurate results, and data from any single medical center has inherent limits. To expand these limits and realize greater benefits from collected data, NM has previously joined cooperative organizations including the Center for Health Information Partnerships (CHIP) [1], the CAPriCORN [2] network of 32 hospitals, and the All of Us research program [3] (previously known as the Precision Medicine Initiative).
By building the Northwestern ICU database, NM expands the collaborative network to provide an even broader and geographically diverse resource for researchers, particularly in the context of understanding and combating the COVID-19 pandemic. Furthermore, the adoption of standard terminologies for data coding enhances data interoperability, facilitating seamless data exchange and collaboration among healthcare institutions and researchers. While there are already several public ICU datasets available for research, including HiRID [4] and the Salzburg database [5], these do not follow the MIMIC structure.
Methods
We introduce the Northwestern ICU (NWICU) database, a large, freely available, harmonized, COVID-rich ICU database comprising de-identified health-related data from Northwestern Memorial HealthCare (NMHC) from 2020 to 2022. The project is a collaboration between MIT and Northwestern University (NU), with NU granted access to NMHC medical data through its affiliation with the Northwestern Medicine hospital network. The database offers a valuable addition to the research community's resources [6].
Approval for participation in the project was obtained from the NU Institutional Review Board (IRB), which granted a waiver of informed consent due to the minimal risk involved and the impracticality of obtaining specific consent from the large patient population. Additionally, the NM Data Steward authorized the release of data from over 25,000 patients from the NM Enterprise Data Warehouse (EDW). Understanding the sensitivity of shared data, NM has implemented rigorous technical and administrative controls to safeguard security and patient privacy.
The dataset was created through a three-step process:
1. Acquisition
The NM EMR uses its own proprietary non-relational database management system. A set of Extract, Transform, and Load (ETL) processes transfer selected data from the EMR into a separate relational EDW daily. The EDW, as a relational database, comprises numerous tables, organized into separate databases (schemas) for administration and access control. With appropriate access permissions, these databases are interoperable, allowing tables from one database to be linked with tables in other databases.
In alignment with common data warehouse practices, and as part of the 2018 restructuring, the NM EDW reorganized its tables to facilitate research. In contrast to the primarily transactional nature of the EMR, EDW tables are categorized as Fact and Dimension. Fact tables primarily store events (such as encounters, admissions, diagnosis events, procedure orders, and medication orders), while Dimension tables describe persistent attributes of entities (patients, procedure names, medication formulary). Additionally, the NM EDW includes auxiliary tables not directly related to patient care, such as a list of International Classification of Diseases codes (ICD-9 and ICD-10). In response to the COVID-19/SARS-CoV-2 pandemic, a COVID-19 data mart was established within the EDW, providing convenient access to information about COVID-19 patients, lab tests, results, treatment, and more.
Clinical cohort: NM provided data on all patients admitted to an ICU and subsequently discharged in the years 2020 and 2021, regardless of their COVID-19 status. Patients were excluded if they were under 18 years old at earliest admission or if they were marked as restricted. The NM EDW includes a cohorts facility, consisting of a set of persistent tables, each defining a cohort of patients for research. A table in the cohorts facility provided the master patient list for this project. All other tables used inner join operations to limit their patient populations to those in the defined cohort. Patient data was extracted from the NM EDW using Structured Query Language (SQL) and Python.
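The cohort restriction described above can be illustrated with a small pandas sketch. This is not the extraction code used for NWICU: the table names (`cohort`, `encounters`) and the `patient_key` column are hypothetical stand-ins for the NM EDW objects, which are not documented here.

```python
import pandas as pd

# Hypothetical master patient list, standing in for the cohort table.
cohort = pd.DataFrame({"patient_key": [101, 102, 104]})

# Hypothetical fact table that also holds patients outside the cohort.
encounters = pd.DataFrame({
    "patient_key": [101, 102, 103, 104, 105],
    "encounter_type": ["ICU", "ICU", "ED", "ICU", "Clinic"],
})

# An inner join keeps only rows whose patient_key appears in the cohort,
# so every downstream table is limited to the same patient population.
cohort_encounters = encounters.merge(cohort, on="patient_key", how="inner")
print(cohort_encounters)
```

Applying the same join to every extracted table guarantees that no table can contain a patient who is absent from the master list.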
Out-of-hospital mortality: At present there is no provision available to collect information on deaths occurring outside an NM hospital admission.
External data sources: During the development of MIMIC-IV, ICD-10 long titles were acquired from the Centers for Medicare & Medicaid Services (CMS) [10]. ICD codes acquired from CMS records each year were merged with the previously obtained ICD long titles from CMS, and newly added ICD-10 codes from 2020 to 2023 were included. NM EDW, in contrast, has an internal table defining all ICD-9 and ICD-10 diagnosis codes.
In addition to the NM EDW, NMHC data includes data from LOINC [7], which provides a standard system used to describe lab tests and vital signs. The RxNav API [8] was used to verify the RxNorm and NDC drug codes in the NM EDW and supply missing codes. SNOMED CT codes were verified through Athena [9].
2. Transformation and Harmonization
The structure of the NM EMR system is dramatically different from the relational NM EDW, which in turn is substantially different from MIMIC. Data were extracted from various NM EDW tables and transformed both in structure and content in order to be compatible with the MIMIC data structures and definitions.
Efforts were made to identify data items using standard ontologies such as SNOMED CT [11] for procedures, RxNorm [12] and the US FDA National Drug Code (NDC) [13] for medications, and LOINC [7] for lab tests. In some cases, this was infeasible, for example because the names were ambiguous or could not be located in the target ontology.
3. De-identification
The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor method [14] was adopted for de-identification. This defines protected health information (PHI) and 18 categories of information which must be removed in order for a data set to be considered de-identified. The following were considered while building the NWICU database.
- Patient, admission, and ICU stay identifiers are replaced by cryptographic random numbers. Each patient is assigned a unique subject_id; each hospital admission is assigned a unique hadm_id; and each ICU stay is assigned a unique stay_id. All of these are generated independently.
- Each individual patient is assigned a random date shift value representing a number of days. Dates related to that patient are de-identified by adding this number of days to each date. The shift value is fixed and consistent for each individual but is independent between individuals. No provision was made to allow analyses of effects based on dates such as seasonal variation, weekends, or holidays.
- Some structured fields such as lab test results were found to be defined as free text which occasionally included protected information. These fields are subjected to filtering: dates are removed; numbers are allowed; and any remaining text is allowed only if it matches any of a set of regular expressions or is explicitly specified in a list of permitted text.
- Free-text fields such as procedure names are processed by a specialized subroutine to identify and remove any restricted information.
- While not considered protected health information (PHI) or personally-identifiable information (PII), hospital identifiers and health-care providers’ names are removed.
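The first three steps in the list above can be sketched in Python as follows. This is an illustrative simplification, not the NWICU de-identification code: the identifier width, the permitted-text list, and the regular expressions are all assumptions chosen for the example, and the sketch blanks a whole field rather than removing only the date portion.

```python
import re
import secrets
from datetime import date, timedelta

def new_identifier(used):
    """Draw an unused identifier from a cryptographic random source."""
    while True:
        # Eight-digit range is an assumption made for this sketch.
        candidate = secrets.randbelow(90_000_000) + 10_000_000
        if candidate not in used:
            used.add(candidate)
            return candidate

def shift_date(original, shift_days):
    """Add a patient's fixed shift (in days) to one of that patient's dates."""
    return original + timedelta(days=shift_days)

# Hypothetical allow-list and patterns for free-text lab results.
PERMITTED_TEXT = {"negative", "positive", "not detected"}
PERMITTED_PATTERNS = [re.compile(r"^[<>]=?\s*\d+(\.\d+)?$")]
NUMBER = re.compile(r"^-?\d+(\.\d+)?$")

def filter_lab_text(value):
    """Keep plain numbers and permitted text; blank out everything else."""
    text = value.strip()
    if NUMBER.match(text):
        return text
    if text.lower() in PERMITTED_TEXT:
        return text
    if any(pattern.match(text) for pattern in PERMITTED_PATTERNS):
        return text
    return ""

used_ids = set()
subject_id = new_identifier(used_ids)
shift = secrets.randbelow(3650) + 1  # shift range is an assumption
admit, discharge = date(2020, 4, 1), date(2020, 4, 9)
print(subject_id, shift_date(admit, shift), shift_date(discharge, shift))
print(repr(filter_lab_text("seen by Dr. Smith 03/04/2021")))
```

Because the same shift is applied to every date for a patient, intervals such as length of stay are preserved, while calendar effects such as weekday or season are not.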
Data Description
Northwestern ICU database (NWICU)
NWICU retains the MIMIC-IV organizational structure, with a hospital module and an ICU module, and aligns its data structure closely with that of MIMIC-IV. The primary goal of these modules is to emphasize the data's source or provenance. This database is configured to ensure no overlap among identifiers across institutions. The database includes instances and terminologies that were mapped to common standards across both institutions, promoting data interoperability for seamless analysis and research.
Hosp Module
The hospital module contains data sourced from the extensive Electronic Health Record (EHR) systems of both BIDMC and NMHC hospitals, including patient demographics, admission details, laboratory test results, billed diagnoses, prescriptions, and electronic medication administration records. The subject_id serves as a unique identifier for each patient, consistently used to associate data with individual patients across all tables. It is generated as a cryptographic random number to ensure data privacy and integrity across institutions. The hadm_id uniquely identifies each hospital admission, with duplicate subject_id entries indicating multiple admissions for the same patient. Through the shared subject_id identifier, the admissions table links admission records with patient profiles, allowing for a thorough review of a patient's hospitalization history in the dataset.
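As a minimal sketch of this linkage, the example below joins toy patients and admissions tables on subject_id. Only `subject_id` and `hadm_id` come from the description above; the `gender` and `admittime` columns follow MIMIC-IV naming and are assumptions that should be checked against the released files.

```python
import pandas as pd

# Toy tables; gender and admittime are assumed MIMIC-IV style columns.
patients = pd.DataFrame({"subject_id": [1001, 1002], "gender": ["F", "M"]})
admissions = pd.DataFrame({
    "subject_id": [1001, 1001, 1002],
    "hadm_id": [5001, 5002, 5003],
    "admittime": pd.to_datetime(["2020-03-01", "2020-11-15", "2021-01-20"]),
})

# Attach patient attributes to each admission and order each patient's history.
history = (
    admissions.merge(patients, on="subject_id", how="left")
    .sort_values(["subject_id", "admittime"])
)

# Repeated subject_id values correspond to multiple admissions.
admissions_per_patient = admissions.groupby("subject_id")["hadm_id"].nunique()
print(history)
print(admissions_per_patient)
```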
The labevents table stores event instances related to laboratory tests and measurements, including SARS-CoV-2 tests. ICD-9/ICD-10 diagnostic codes, including markers related to long COVID and COVID-19, are also linked to individual patient records and hospital admission information. This linkage enables the exploration and analysis of both short-term and long-term effects of COVID-19 on patients in the ICU. The hospital module includes dimension tables, prefixed with d_, that facilitate the translation of in-house laboratory codes (item_ids) and ICD-9/ICD-10 billing diagnostic codes into clinically meaningful concepts.
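The role of a dimension table can be shown with a short lookup sketch. The column names `itemid`, `label`, and `value` follow MIMIC-IV conventions and are assumptions here, as are the toy rows; the released NWICU files should be consulted for the actual names.

```python
import pandas as pd

# Toy dimension table: one row per in-house lab code.
labitems = pd.DataFrame({
    "itemid": [1, 2],
    "label": ["SARS-CoV-2 RNA", "Creatinine"],
})

# Toy event table: one row per lab measurement.
labevents = pd.DataFrame({
    "subject_id": [1001, 1001, 1002],
    "hadm_id": [5001, 5001, 5003],
    "itemid": [1, 2, 1],
    "value": ["Detected", "1.1", "Not detected"],
})

# Translate in-house codes into readable labels through the dimension table.
labeled = labevents.merge(labitems, on="itemid", how="left")

# Select SARS-CoV-2 tests by label rather than by opaque code.
covid_tests = labeled[labeled["label"].str.contains("SARS-CoV-2", regex=False)]
print(covid_tests)
```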
We also provide a comprehensive list of prescribed medications for patients as part of the prescriptions table, which includes COVID-19 drugs like remdesivir and SARS-CoV-2 vaccines. The emar (Electronic Medication Administration Record) table extends the prescriptions table by not only detailing the prescriptions but also recording whether and when these medications were actually administered to patients. Once a medication order is entered into the EMR system, the order may be carried out as scheduled, or it may be delayed, modified, or canceled. This tracking is captured in the event_txt field.
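One way to summarize what happened to medication orders is to tabulate event_txt per medication, as in the sketch below. Only the `event_txt` field name comes from the description above; the `medication` column and the status strings ("Administered", "Delayed", "Canceled") are hypothetical values chosen for illustration.

```python
import pandas as pd

# Toy emar rows; medication names and status strings are illustrative.
emar = pd.DataFrame({
    "hadm_id": [5001, 5001, 5003, 5003],
    "medication": ["remdesivir", "remdesivir", "remdesivir", "heparin"],
    "event_txt": ["Administered", "Delayed", "Administered", "Canceled"],
})

# Count each administration outcome per medication.
status_counts = (
    emar.groupby(["medication", "event_txt"]).size().unstack(fill_value=0)
)
print(status_counts)
```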
ICU Module
The ICU module houses data extracted from the clinical information systems used in the ICUs of both institutions, encompassing details such as ICU admissions, recorded procedures, and charted vital signs.
Within this module, the icustays table describes ICU stays within a patient's admission. Each patient's admission includes at least one ICU stay, but it may include multiple stays. When the time gap between two ICU stays is less than 24 hours, these stays are consolidated into a single entry. The table also includes details on admission and discharge times in the ICU, along with care units, including a specialized ICU overflow unit. During the COVID-19 pandemic, ICU overflow denotes a situation where hospital intensive care units (ICUs) have exceeded their capacity due to a surge in COVID-19 patients requiring critical care. This overflow happens when the number of patients needing intensive care for COVID-19 exceeds the available ICU beds.
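The 24-hour consolidation rule can be sketched as a single pass over time-ordered stays. This is one plausible reading of the rule, not the project's build script; the function name `merge_close_stays` and the tuple representation of a stay are assumptions made for the example.

```python
from datetime import datetime, timedelta

def merge_close_stays(stays, max_gap=timedelta(hours=24)):
    """Merge (intime, outtime) stays whose gap to the previous stay is under max_gap."""
    merged = []
    for intime, outtime in sorted(stays):
        if merged and intime - merged[-1][1] < max_gap:
            # Extend the previous stay instead of starting a new entry.
            merged[-1] = (merged[-1][0], max(merged[-1][1], outtime))
        else:
            merged.append((intime, outtime))
    return merged

stays = [
    (datetime(2021, 1, 1, 8), datetime(2021, 1, 3, 10)),
    (datetime(2021, 1, 3, 20), datetime(2021, 1, 5, 12)),  # 10 hours after the first
    (datetime(2021, 1, 8, 9), datetime(2021, 1, 9, 9)),    # several days later
]
for intime, outtime in merge_close_stays(stays):
    print(intime, outtime)
```

In this example the first two stays collapse into one entry because they are separated by 10 hours, while the third remains a separate stay.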
Additionally, the procedureevents table captures instances of procedures performed during ICU stays, including those related to COVID-19 management, such as ventilation and chest X-ray imaging. In parallel, the chartevents table documents vital sign values collected throughout an ICU stay. The charted data encompass numeric vital signs such as heart rate, respiratory rate, and oxygen saturation in arterial blood by pulse oximetry. For both procedures and vital signs, specific concepts are identified by unique item_id values, and these item_id values are linked to the d_items table for detailed reference.
Standard Terminologies
We map labs, prescriptions, vital signs, and procedures to standard terminologies, including LOINC, RxNorm, and SNOMED, while also emphasizing relevant COVID-related terms and markers. The mappings follow the Simple Standard for Sharing Ontological Mappings (SSSOM) data model [15] and are featured on the MIMIC Code Repository [17]. In particular, COVID-related terminologies are indicated in the comments column following the SSSOM data model. Furthermore, COVID terminology and markers are featured on the MIMIC website [16], covering various aspects such as lab tests, medications, ICD-9/ICD-10 codes, care units, and common procedures.
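A mapping file in SSSOM's tab-separated form can be read with the Python standard library, as sketched below. The sample rows are invented for illustration: the `NWICU:` subject identifiers are hypothetical, and the comment column is assumed to be named `comment`, as in the SSSOM slot list. The actual mapping files may differ.

```python
import csv
import io

# Invented sample in SSSOM TSV form; metadata lines start with '#'.
header = ["subject_id", "subject_label", "predicate_id",
          "object_id", "object_label", "mapping_justification", "comment"]
rows_in = [
    ["NWICU:lab_001", "SARS-CoV-2 PCR", "skos:exactMatch", "LOINC:94500-6",
     "SARS-CoV-2 RNA in respiratory specimen",
     "semapv:ManualMappingCuration", "COVID-related"],
    ["NWICU:vital_001", "Heart rate", "skos:exactMatch", "LOINC:8867-4",
     "Heart rate", "semapv:ManualMappingCuration", ""],
]
sample = "# curie_map:\n#   LOINC: https://loinc.org/\n"
sample += "\n".join("\t".join(r) for r in [header] + rows_in) + "\n"

def read_sssom(handle):
    """Parse SSSOM TSV rows, skipping the commented metadata block."""
    lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))

mappings = read_sssom(io.StringIO(sample))

# Select mappings flagged as COVID-related in the comment column.
covid_mappings = [m for m in mappings if "covid" in m["comment"].lower()]
for m in covid_mappings:
    print(m["subject_id"], m["predicate_id"], m["object_id"])
```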
Usage Notes
The data, including standard terminology mappings within the dataset, were collected as part of routine clinical practice and may contain idiosyncrasies that reflect the nature of clinical operations. Implausible values could potentially exist in the database as artifacts of the archival process. Researchers are strongly advised to adhere to best practice guidelines when conducting data analysis. More comprehensive documentation on the data items, the transformation process, and mappings to standard terminologies such as LOINC, RxNorm, and SNOMED, as well as specific COVID-related markers or terms, is accessible on the MIMIC website [16]. This resource supports thorough understanding and utilization of the dataset for COVID-related research and analysis. We will also share example analyses (Jupyter notebooks) along with the mappings on the MIMIC Code Repository [17] to support ongoing research and improvement of standard ontologies. A crosswalk package (Python package) for seamless data integration is accessible on the crosswalk medical repository [18].
Release Notes
Version 0.1.0 is the first public release of the harmonized multi-center COVID-rich ICU database. The database contains twelve tables. The hosp module includes the patients, admissions, d_icd_diagnoses, diagnoses_icd, labitems, labevents, prescriptions, and emar tables. The icu module includes the icustays, d_items, chartevents, and procedureevents tables. We release the NWICU dataset and provide build scripts on the MIMIC Code Repository. The scripts used to harmonize the MIMIC-IV v3.0 dataset with the NWICU data to create MIMIC-NW 2020-2022, facilitating ongoing updates to MIMIC, will be released on the MIMIC Code Repository [17].
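The table list above can be written down as a small manifest, which is handy for checking that a local copy is complete. The table names come from the release notes; the `module/table` key format and the `missing_tables` helper are assumptions made for this sketch and do not reflect the actual file layout.

```python
# Tables named in the release notes, grouped by module.
TABLES = {
    "hosp": ["patients", "admissions", "d_icd_diagnoses", "diagnoses_icd",
             "labitems", "labevents", "prescriptions", "emar"],
    "icu": ["icustays", "d_items", "chartevents", "procedureevents"],
}

def missing_tables(available):
    """Return expected 'module/table' keys that are absent from available."""
    expected = {f"{module}/{table}"
                for module, tables in TABLES.items() for table in tables}
    return sorted(expected - set(available))

print(sum(len(tables) for tables in TABLES.values()))
print(missing_tables(["hosp/patients", "icu/icustays"]))
```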
Ethics
The project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was de-identified.
The NU IRB approved participation in the project and granted a waiver of informed consent based on the minimal risk and impracticality of obtaining specific consent from the large number of patients. Release of data from the NM EDW was additionally approved by the NM Data Steward.
Acknowledgements
This project is funded by the NIH National Library of Medicine under contract number 75N97020C00013. We would like to thank the Beth Israel Deaconess Medical Center for their continued collaboration and support of MIMIC, as well as Northwestern Memorial HealthCare for their contributions.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
1. Northwestern University, Feinberg School of Medicine. Center for Health Information Partnerships (CHIP). [Online]. Available: https://www.feinberg.northwestern.edu/sites/chip/index.html. Accessed October 10, 2023.
2. CAPriCORN Collaborative Clinical Data Research Network. [Online]. Available: https://www.capricorncdrn.org/. Accessed October 10, 2023.
3. All of Us Research Program, National Institutes of Health (NIH). [Online]. Available: https://allofus.nih.gov/. Accessed October 10, 2023.
4. PhysioNet. HiRID, a high time-resolution ICU dataset, Version 1.1.1. [Online]. Available: https://physionet.org/content/hirid/1.1.1/. Accessed October 10, 2023.
5. PhysioNet. SICdb - Salzburg Intensive Care database, Version 1.0.6. [Online]. Available: https://physionet.org/content/sicdb/1.0.6/. Accessed October 10, 2023.
6. Alberto IR, Alberto NR, Ghosh AK, Jain B, Jayakumar S, Martinez-Martin N, McCague N, Moukheiber D, Moukheiber L, Moukheiber M, Moukheiber S. The impact of commercial health datasets on medical research and health-care algorithms. The Lancet Digital Health. 2023 May 1;5(5):e288-94.
7. LOINC (Logical Observation Identifiers Names and Codes). [Online]. Available: https://loinc.org/. Accessed October 10, 2023.
8. National Library of Medicine (NLM). [Online]. Available: https://www.nlm.nih.gov/. Accessed October 10, 2023.
9. ATHENA - OHDSI Vocabulary and NLP Data Repository. [Online]. Available: https://athena.ohdsi.org/search-terms/start. Accessed October 10, 2023.
10. Centers for Medicare & Medicaid Services (CMS). [Online]. Available: https://www.cms.gov/. Accessed October 10, 2023.
11. SNOMED International. [Online]. Available: https://www.snomed.org/. Accessed October 10, 2023.
12. National Library of Medicine (NLM). RxNorm - National Library of Medicine. [Online]. Available: https://www.nlm.nih.gov/research/umls/rxnorm/index.html. Accessed October 10, 2023.
13. U.S. Food and Drug Administration (FDA). National Drug Code (NDC) Directory. [Online]. Available: https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory. Accessed October 10, 2023.
14. U.S. Department of Health & Human Services (HHS). De-Identification of Protected Health Information. [Online]. Available: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html. Accessed October 10, 2023.
15. Mapping Commons. Schema and Ontology Matching Framework. [Online]. Available: https://mapping-commons.github.io/sssom/. Accessed October 10, 2023.
16. MIMIC - Medical Information Mart for Intensive Care. [Online]. Available: https://mimic.mit.edu. Accessed October 10, 2023.
17. MIT-LCP. MIMIC Code Repository. [Online]. Available: https://github.com/MIT-LCP/mimic-code. Accessed October 10, 2023.
18. MIT-LCP. cwmed. GitHub repository. [Online]. Available: https://github.com/MIT-LCP/cwmed. Accessed October 12, 2023.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 0.1.0):
https://doi.org/10.13026/s84w-1829
DOI (latest version):
https://doi.org/10.13026/4je9-db46
Files
To access the files for this project, you must:
- be a credentialed user
- complete required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project