Database Credentialed Access
Establishment of a Chinese critical care database from electronic healthcare records in a tertiary care medical center
Senjun Jin , Lin Chen , Kun Chen , Zhongheng Zhang
Published: Jan. 19, 2023. Version: 1.0
When using this resource, please cite:
Jin, S., Chen, L., Chen, K., & Zhang, Z. (2023). Establishment of a Chinese critical care database from electronic healthcare records in a tertiary care medical center (version 1.0). PhysioNet. https://doi.org/10.13026/901c-yv54.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
The medical specialty of critical care, or intensive care, provides emergency medical care to patients suffering from life-threatening complications and injuries. The specialty is characterized by the generation of a large amount of high-granularity data in routine practice. The data comprise hourly vital signs, ventilator waveforms, medical orders, and laboratory results. Currently, these data are archived in the hospital information system primarily for routine clinical practice. However, data scientists have noticed that in-depth mining of such big data may provide insights into the pathophysiology of underlying diseases and healthcare practices. Clinical questions related to risk factors, predictive analytics, cost-effectiveness, and causal inference can be addressed with a critical care database. Several openly accessible critical care databases have been established, and they have generated hundreds of publications in scientific journals. However, such work is still in its infancy in China. China has a large patient population, which contributes to the generation of large healthcare databases in hospitals. The establishment and sharing of such a database can help to promote open data science and the collaborative discovery of new scientific knowledge. In this data descriptor, we report the establishment of an openly accessible critical care database generated from a hospital information system. The database comprises 8180 unique hospital admissions for 7638 individual patients from January 2012 to May 2022. The database contains 11 plain text relational tables that can be linked by Hospital_ID.
Background
Critically ill patients managed in the intensive care unit (ICU) are usually monitored closely for organ dysfunction and are treated intensively with a variety of supportive modalities [1,2]. Vital signs are recorded, laboratory tests are ordered and medical treatments are adjusted more frequently for these patients than for patients treated in the general ward. Such intensive daily management produces a huge amount of information, including medical orders, imaging studies, laboratory findings and waveform signals. The data generation mechanisms may reflect key factors related to the healthcare system, the pathophysiology of the underlying disease, and patients' preferences and cultures [3]. Thus, in-depth mining of such large databases, through risk factor analysis, predictive analytics and causal inference [4,5], can provide more insight into clinical research questions. Translating the knowledge obtained from data mining into clinical practice may improve clinical outcomes [6,7].
Most published scientific reports in the critical care research community do not make their original raw data freely accessible, partly because of confidentiality issues. The unwillingness to share data makes it difficult to reproduce the reported results. Furthermore, exploration of such a large database by a single research group could be biased and limited. Thus, strenuous efforts have been made to encourage the scientific community to share raw data, which is also supported by the open data campaign [8,9]. Several openly accessible critical care databases have been established, mainly reflecting the healthcare systems of western countries [10-12]. China is a large country with a huge patient population, and its hospital information systems are distinct from those of western countries. However, hospital information systems in Chinese hospitals are mainly used for clinical practice and are far less developed for research purposes. Data sharing is still in its infancy in the Chinese critical care community, which significantly impairs the transparency of scientific work and international collaboration. To the best of our knowledge, two critical care databases have been established in China, which focus on pediatric critically ill patients and on patients with infections [13,14]. Here, we report the establishment of a large critical care database comprising high-granularity data generated from the information system of a tertiary care university hospital. Details of the database are reported here to encourage new research through secondary analysis of the database.
Methods
Study setting and population
The study was conducted in Zhejiang Provincial People's Hospital, Zhejiang, China from January 2012 to May 2022. All patients admitted to the ICU of the hospital were eligible. There were two ICUs in the hospital: one was the comprehensive central ICU and the other was the emergency ICU (EICU). There was no exclusion criterion in enrolling subjects because we believed that patients who were excluded by a particular study might be eligible for another study. Thus, we included all records in the information system related to ICU stays. The study was approved by the ethics committee of Zhejiang Provincial People's Hospital (approval number: QT2022185). Informed consent was waived as determined by the institutional review board, due to the retrospective design of the study. The study was conducted in accordance with the Declaration of Helsinki.
Database structure and development
The database is distributed as comma-separated value (CSV) files that can be imported into any relational database system. We recommend the R package tidyverse for managing the relational data because it streamlines the workflow from data management to statistical analysis and the training of machine learning models [15]. Each file contains a single table, which is explained in the subsequent sections. For large files, we recommend the data.table package to process the tabular data.
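As an alternative to the R workflow recommended above, the following is a minimal sketch of importing one CSV file into a relational database, here SQLite through the Python standard library. The helper name `load_csv_into_sqlite`, the choice to store every column as text, and the sample rows are assumptions for illustration and are not part of the dataset.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn, table_name, csv_file):
    """Create `table_name` from the CSV header and insert every row as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    conn.commit()
    return header


# Tiny in-memory stand-in for HospitalTransfer.csv (values are invented).
sample = io.StringIO(
    "patient_SN,Hospital_ID,TransferIn_dateTime\n"
    "p1,ZY|1,0.0\n"
    "p1,ZY|1,2.5\n"
)
conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(conn, "HospitalTransfer", sample)
n_rows = conn.execute('SELECT COUNT(*) FROM "HospitalTransfer"').fetchone()[0]
print(n_rows)
```

For a real file, the in-memory sample would be replaced by something like `open("HospitalTransfer.csv", newline="", encoding="utf-8")`; the file encoding is an assumption and may need adjusting.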
Each individual subject is identified by a series number (patient_SN) consisting of digits and letters, such as "3c74cf74c36241b7082ec35e458279dc". Each hospital stay is denoted by a Hospital_ID, for example "ZY|360812" or "IP|20190500469". Individual ICU stays can be identified from the HospitalTransfer table, which contains the intrahospital transfer events for each subject. All tables use Hospital_ID to identify an individual hospital stay, and the HospitalTransfer table can be used to determine the ICU stays linked to the same patient and/or hospitalization.
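The linkage described above can be sketched with plain dictionaries: rows from any table are grouped by Hospital_ID, and admissions are grouped by patient_SN. This is an illustrative sketch only; the helper names and the sample rows (including the Sex values and ICD-10 codes) are invented for the example.

```python
import csv
import io
from collections import defaultdict


def index_by_hospital_id(rows):
    """Group row dicts by their Hospital_ID value."""
    index = defaultdict(list)
    for row in rows:
        index[row["Hospital_ID"]].append(row)
    return index


def admissions_per_patient(rows):
    """Map each patient_SN to the set of Hospital_ID values seen for it."""
    admissions = defaultdict(set)
    for row in rows:
        admissions[row["patient_SN"]].add(row["Hospital_ID"])
    return admissions


# Invented stand-ins for a few rows of EMR.csv and Diagnosis.csv.
emr = list(csv.DictReader(io.StringIO(
    "patient_SN,Hospital_ID,Sex\n"
    "a1,ZY|1,Male\n"
    "a1,IP|2,Male\n"
    "b2,ZY|3,Female\n"
)))
diagnosis = list(csv.DictReader(io.StringIO(
    "patient_SN,Hospital_ID,ICD10_code\n"
    "a1,ZY|1,A41.9\n"
    "a1,ZY|1,J18.9\n"
    "b2,ZY|3,I21.9\n"
)))

admissions = admissions_per_patient(emr)
diagnosis_index = index_by_hospital_id(diagnosis)

# Attach the diagnosis codes of each admission to the matching EMR row.
codes_by_admission = {
    row["Hospital_ID"]: [d["ICD10_code"] for d in diagnosis_index.get(row["Hospital_ID"], [])]
    for row in emr
}
print(codes_by_admission)
```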
Deidentification
All tables are deidentified in accordance with the Health Insurance Portability and Accountability Act (HIPAA). All protected information is removed, including addresses, dates of birth, dates of hospital admission, discharge and medical orders, personal numbers (e.g., phone, social security and hospital numbers), and exact age on admission. When creating the dataset, patients were randomly assigned unique identifiers (patient_SN and Hospital_ID), and the original hospital identifiers were not retained. As a result, the identifiers in the database cannot be linked back to the original, identifiable data. All doctor, nurse and pharmacist identifiers have also been removed to protect the privacy of contributing providers. Dates in the free text were replaced with asterisks. Other information that may help to identify individual patients, such as names, IDs, addresses and phone numbers, was removed or replaced with asterisks.
Data Description
The database comprises 8180 unique hospital admissions for 7638 individual patients from January 2012 to May 2022. Table 1 shows the baseline demographics of hospital admissions. Of these admissions, 2965 were for female patients and 5215 for male patients. The median length of hospital stay was 17 days (Q1 to Q3: 10 to 28). Male patients had a slightly longer hospital stay (median 18 vs. 16 days).
Table 1. Demographics and discharge status of the 8180 hospital admissions in the database.
Variables | Total (n = 8180) | Female (n = 2965) | Male (n = 5215) | p
Age, n (%) | | | | < 0.001
(0,18] | 35 (0) | 14 (0) | 21 (0) |
(18,45] | 1012 (12) | 339 (11) | 673 (13) |
(45,65] | 2609 (32) | 859 (29) | 1750 (34) |
(65,75] | 1952 (24) | 709 (24) | 1243 (24) |
(75,90] | 2044 (25) | 836 (28) | 1208 (23) |
(90,150] | 528 (6) | 208 (7) | 320 (6) |
Days Hospital Stay, Median (Q1, Q3) | 17 (10, 28) | 16 (10, 26) | 18 (10, 28) | < 0.001
Status On Discharge, n (%) | | | | 0.901
Cured | 5666 (73) | 2050 (73) | 3616 (73) |
Dead | 438 (6) | 157 (6) | 281 (6) |
Not cured | 1202 (16) | 437 (16) | 765 (15) |
Unknown | 444 (6) | 153 (5) | 291 (6) |
The number of hospital admissions for ICU patients increased markedly after 2018 because the number of beds in both the comprehensive ICU and the emergency ICU was expanded in that year (Figure 1). The distribution of hospital length of stay is shown in Figure 2, restricted to patients with a length of stay (LOS) < 60 days.
Classes of data
The data are organized into tables. There are a total of 10 tables comprising patient demographic data, medical orders, laboratory findings, imaging studies, microbiology and hospital transfer events (Table 2). Each table is described in more detail below to promote reuse of the database.
Table 2. A general description of the tables in the database.
Tables | MD5 hash | Size (bytes) | Description
Diagnosis.csv | 3b7ca8b430b16d9ebbd1317cb06cc87b | 25582236 | Diagnosis
DrugSens.csv | 2f79976765464593b8eed552221c6359 | 136436217 | Sensitivity to antibiotics for cultured bacteria
EMR.csv | 5c6048462b1dc6d44a47687fb34bbc65 | 13730602 | Electronic medical records for each hospital admission
ExamReport.csv | f771ad05ec45b2b65105744b2c29907f | 52649246 | Examination report including CT, ultrasound and MRI
HospitalTransfer.csv | 168fef171980a7cacbe2ed79d1dd63ba | 1441407 | Intrahospital transfer events
Lab.csv | 4939ebb2155bbcfaff37da8e78f8cc4a | 1828953993 | Laboratory findings
Medication.csv | 4e9d16531c6cfb2a7d9aafc21f465e16 | 277782777 | Medication events
MedOrder.csv | 297d4e9f5e7e9ba8c3dca5005b8d7684 | 207348204 | Medical order
MicrobiologyCulture.csv | af996f30325ec74eefbf7a25d6680473 | 39362619 | Microbiology culture
VitalSign.csv | 5224b7450833a2ac441d16b459502f32 | 607364251 | Vital signs
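The MD5 hashes listed in Table 2 can be used to check that downloaded files are intact. The following is a minimal sketch using the Python standard library; the helper names `md5_of_file` and `verify_files` are illustrative, and only two of the published checksums are reproduced here to keep the example short.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(expected, directory="."):
    """Return {filename: True/False}, or None when the file is not found."""
    results = {}
    for name, expected_md5 in expected.items():
        path = os.path.join(directory, name)
        if os.path.exists(path):
            results[name] = md5_of_file(path) == expected_md5
        else:
            results[name] = None
    return results


# Two of the checksums published in Table 2.
EXPECTED_MD5 = {
    "EMR.csv": "5c6048462b1dc6d44a47687fb34bbc65",
    "HospitalTransfer.csv": "168fef171980a7cacbe2ed79d1dd63ba",
}

# Files that are not present in the current directory are reported as None.
print(verify_files(EXPECTED_MD5, "."))
```

Reading in chunks keeps memory use flat even for the largest file, Lab.csv, which is about 1.8 GB.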
Electronic medical record
The EMR.csv table contains data related to each hospital admission (Table 3). The patient_SN is a unique ID for each individual patient and Hospital_ID is a unique ID for each hospital admission. If a patient was discharged or died within 24 hours, the data were recorded in a separate table in the source system, so there are separate columns describing the chief complaint and admission status for those short hospital stays. We provide both English and Chinese descriptions of the chief complaint. The history of present illness recorded in the Med_history column is longer, and the original Chinese text is kept so that natural language processing algorithms can be applied.
Table 3. Variables in the EMR table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
Sex | Sex
ChiefComplain_24hr | Chief complaint for patients who were discharged within 24 hours after hospital admission
AdmissionStatus_24hr | Admission status for patients who were discharged within 24 hours after hospital admission
ChiefComplain_24hr_dead | Chief complaint for patients who died within 24 hours after hospital admission
AdmissionStatus_24hr_dead | Admission status for patients who died within 24 hours after hospital admission
ChiefComplain | Chief complaint in Chinese
Med_history | Medical history in text
PastHistory | Past history/comorbidities
StatusOnDischarge | Status on discharge
DiagnosisOnDeath | Diagnosis on death
StatusOnDischarge_Desc | Status on discharge described in text
DischargeTime | Discharge time in days, relative to hospital admission time as time zero
DaysHospitalStay | Days of hospital stay
ChiefComplain_Eng | Chief complaint in English
Age_cut | Age category
Diagnosis table
The Diagnosis table contains information related to the diagnoses for a hospital stay (Table 4). The Diagnosis_Desc column provides a free text description of the diagnosis. ICD10_code is the standard ICD-10 code. The codes can be processed with the icd package in R (https://github.com/cran/icd). The functionality of the package includes, but is not limited to, finding patient comorbidities based on ICD-10 codes, calculating Charlson and Van Walraven scores, and a comprehensive test suite that increases confidence in the accurate processing of ICD codes.
Table 4. Variables in the Diagnosis table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
Diagnosis_Desc | Description of diagnosis in free text
ICD10_code | ICD-10 code
ICD10_name | ICD-10 name for the diagnosis
Diagnosis_DateTime | Time of diagnosis in days, relative to hospital admission time as time zero
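For users who do not work in R, a rough first look at the Diagnosis table can be obtained by grouping codes into three-character ICD-10 categories. The sketch below is not a substitute for the icd package: it assumes that codes in ICD10_code begin with a letter followed by a digit (for example "A41.9"), which has not been verified against the actual file, and the helper names and sample rows are invented.

```python
from collections import Counter


def icd10_category(code):
    """Return the three-character ICD-10 category (e.g. 'A41'), or None."""
    code = (code or "").strip().upper()
    if len(code) < 3 or not (code[0].isalpha() and code[1].isdigit()):
        return None
    return code[:3]


def admissions_per_category(rows):
    """Count distinct hospital admissions per three-character category."""
    seen = set()
    counts = Counter()
    for row in rows:
        category = icd10_category(row.get("ICD10_code"))
        key = (row["Hospital_ID"], category)
        if category is None or key in seen:
            continue
        seen.add(key)
        counts[category] += 1
    return counts


# Invented example rows using the Diagnosis table column names.
rows = [
    {"Hospital_ID": "ZY|1", "ICD10_code": "A41.9"},
    {"Hospital_ID": "ZY|1", "ICD10_code": "A41.0"},   # same admission, same category
    {"Hospital_ID": "IP|2", "ICD10_code": "a41.9 "},  # normalised before counting
    {"Hospital_ID": "IP|2", "ICD10_code": ""},        # missing code is skipped
]
category_counts = admissions_per_category(rows)
print(category_counts)
```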
Hospital transfer table
The HospitalTransfer table contains information related to intrahospital transfer events (Table 5). The time and department of each transfer event are given in respective columns. To protect patients’ privacy, all date time information is recorded as days relative to hospital admission. Since the EICU is in the emergency department, the department names denoted by “Emergency medical department” or “Emergency Department” refer to the EICU.
Table 5. Explanation for variables in the HospitalTransfer table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
TransferIn_dateTime | Time of transfer in, in days relative to hospital admission
TransferOut_dateTime | Time of transfer out, in days relative to hospital admission
TransferTo_Dept_Eng | Department transferred to
TransferFrom_Dept_Eng | Department transferred from
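One possible way to derive ICU stays from the HospitalTransfer table is sketched below. It rests on two assumptions that should be checked against the data: that each row describes a stay in TransferTo_Dept_Eng lasting from TransferIn_dateTime to TransferOut_dateTime, and that the central ICU appears under a label such as "ICU" (a hypothetical value). The two emergency department labels are those stated above to refer to the EICU. All sample rows are invented.

```python
# Department labels treated as ICU, compared case-insensitively.
ICU_DEPARTMENTS = {
    "emergency medical department",  # stated in the text to refer to the EICU
    "emergency department",          # stated in the text to refer to the EICU
    "icu",                           # hypothetical label for the central ICU
}


def parse_days(value):
    """Convert a relative time in days to float, or None if not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def icu_stays(rows, icu_departments=ICU_DEPARTMENTS):
    """Return one record per transfer row whose destination is an ICU."""
    stays = []
    for row in rows:
        department = (row.get("TransferTo_Dept_Eng") or "").strip().lower()
        if department not in icu_departments:
            continue
        start = parse_days(row.get("TransferIn_dateTime"))
        end = parse_days(row.get("TransferOut_dateTime"))
        length = end - start if start is not None and end is not None else None
        stays.append({
            "patient_SN": row.get("patient_SN"),
            "Hospital_ID": row.get("Hospital_ID"),
            "start_day": start,
            "end_day": end,
            "length_days": length,
        })
    return stays


# Invented example rows using the HospitalTransfer column names.
rows = [
    {"patient_SN": "a1", "Hospital_ID": "ZY|1", "TransferIn_dateTime": "0.5",
     "TransferOut_dateTime": "3.5", "TransferTo_Dept_Eng": "Emergency Department"},
    {"patient_SN": "a1", "Hospital_ID": "ZY|1", "TransferIn_dateTime": "3.5",
     "TransferOut_dateTime": "10.0", "TransferTo_Dept_Eng": "Cardiology"},
    {"patient_SN": "b2", "Hospital_ID": "ZY|3", "TransferIn_dateTime": "1.0",
     "TransferOut_dateTime": "", "TransferTo_Dept_Eng": "ICU"},
]
stays = icu_stays(rows)
print(stays)
```

Rows with a missing transfer-out time are kept with an unknown length rather than dropped, so that ongoing or incompletely recorded stays remain visible.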
Lab table
The Lab table contains data related to laboratory findings (Table 6). There are 11,082,482 records in the dataset, covering 214 types of laboratory items. Seventeen types of samples were tested: whole blood, plasma, urine, serum, arterial blood, stool, venous blood, catheter orifice, ascites, bile, dialysate, CK blood sample (kaolin-activated TEG channel), cerebrospinal fluid, bone marrow, deep venous catheter, sputum and gastric juice. The sample collection time is also recorded in days relative to the hospital admission time.
Table 6. Explanation for variables in the Lab table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
Lab_category | Category of lab item
Lab_time | Time of lab in days relative to hospital admission
Lab_results | Results of the lab finding
Unit_measure | Unit of measurement
LabSampleCollect_time | Sample collection time in days relative to hospital admission
Lab_itemName_Eng | Name of lab item
Lab_Sample_Eng | Sample name
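Because times are stored in days relative to admission, a window such as the first 24 hours corresponds to values from 0 up to, but not including, 1. The sketch below selects the numeric results of one laboratory item in such a window. The item names "Creatinine" and "Sodium" and all sample values are hypothetical; the actual strings in Lab_itemName_Eng should be inspected first.

```python
import csv
import io


def numeric_results(rows, item_name, start_day=0.0, end_day=1.0):
    """Return (Hospital_ID, collection time, value) for one item in a time window."""
    selected = []
    for row in rows:
        if row.get("Lab_itemName_Eng") != item_name:
            continue
        try:
            collected = float(row["LabSampleCollect_time"])
            value = float(row["Lab_results"])
        except (KeyError, TypeError, ValueError):
            continue  # skip missing or non-numeric entries such as "<0.5"
        if start_day <= collected < end_day:
            selected.append((row["Hospital_ID"], collected, value))
    return selected


# Invented stand-in for a few rows of Lab.csv.
sample = io.StringIO(
    "Hospital_ID,Lab_itemName_Eng,Lab_results,LabSampleCollect_time\n"
    "ZY|1,Creatinine,88.0,0.25\n"
    "ZY|1,Creatinine,<0.5,0.5\n"
    "ZY|1,Creatinine,120.0,2.0\n"
    "ZY|1,Sodium,140,0.25\n"
)
first_day = numeric_results(csv.DictReader(sample), "Creatinine")
print(first_day)
```

Passing the `csv.DictReader` directly streams the file row by row, which matters for Lab.csv given its size.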
Microbiology culture table
The MicrobiologyCulture table contains information related to microbiology culture results (Table 7). It provides the sample, culture finding, culture time and a description of each microbiology culture.
Table 7. Explanation for variables in the MicrobiologyCulture table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
MicrobiologyCulture_finding | Microbiology culture finding
MicrobiologyCulture_time | Microbiology culture time in days relative to hospital admission
MicrobiologyCulture_sample_Eng | Microbiology culture sample
MicrobiologyCulture_Category_Eng | Microbiology culture category
MicrobiologyCulture_DESC_Eng | Description of microbiology culture
Drug sensitivity table
The DrugSens table contains information related to the drug sensitivity of cultured bacteria (Table 8). It provides the sample, microorganism, culture time and drug name. The negative and positive values in the DrugSens_result column refer to the results of the extended-spectrum β-lactamase (ESBL) test or the D-test.
Table 8. Explanation for variables in the DrugSens table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
Drug_Code | Code of the drug for sensitivity analysis
DrugSens_result | Results of the drug sensitivity test
MIC | Minimum inhibitory concentration
DrugSens_time | Time of the results in days, relative to hospital admission time as time zero
Drug_name_Eng | Name of the tested drug
DrugSens_Microbiology_Eng | Microorganism tested
DrugSens_Category_Eng | Category of the test
DrugSens_sample_Eng | Sample name
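A simple way to get an overview of the DrugSens table is to tally results by microorganism and drug. The sketch below makes no assumption about which values DrugSens_result takes in the real data and just counts whatever strings occur; the organism, drug and result values in the sample rows are invented.

```python
from collections import Counter


def tally_sensitivity(rows):
    """Count rows per (microorganism, drug, result) combination."""
    counts = Counter()
    for row in rows:
        key = (
            (row.get("DrugSens_Microbiology_Eng") or "").strip(),
            (row.get("Drug_name_Eng") or "").strip(),
            (row.get("DrugSens_result") or "").strip(),
        )
        counts[key] += 1
    return counts


# Invented example rows using the DrugSens column names.
rows = [
    {"DrugSens_Microbiology_Eng": "Escherichia coli", "Drug_name_Eng": "Meropenem", "DrugSens_result": "S"},
    {"DrugSens_Microbiology_Eng": "Escherichia coli", "Drug_name_Eng": "Meropenem", "DrugSens_result": "S"},
    {"DrugSens_Microbiology_Eng": "Escherichia coli", "Drug_name_Eng": "Meropenem", "DrugSens_result": "R"},
]
tally = tally_sensitivity(rows)
for key, count in sorted(tally.items()):
    print(key, count)
```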
Examination report table
The ExamReport table contains information related to a variety of medical examinations, including computed tomography (CT), X-ray and ultrasound (Table 9). The images themselves are not available in the current dataset; instead, we include the free text descriptions and conclusions of these examinations.
Table 9. Explanation for variables in the ExamReport table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
ExamReport_Category | Category of examination
ExamReport_DESC | Description of the examination in free text
ExamReport_finding | Result finding
ExamReport_time | Time of the examination results in days, relative to hospital admission time as time zero
ExamReport_item_Eng | Name of the examination
Medical order table
The MedOrder table contains information related to the medical orders prescribed by clinicians (Table 10). The table provides both regular and stat medical orders (MedOrder_Type). The content of each medical order can be found in the MedOrder_DESC column.
Table 10. Explanation for variables in the MedOrder table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
MedOrder_Type | Type of medical order: regular or stat
MedOrder_DESC | Description of medical order in free text
MedOrder_Start_DateTime | Start time of medication in days relative to hospital admission
MedOrder_Stop_DateTime | Stop time of medication in days relative to hospital admission
Medication table
The medication table provides data on the medication orders prescribed by clinicians (Table 11). This table is designed specifically for medication orders, containing columns for drug dose, frequency, unit of drug dose and route of administration.
Table 11. Explanation for variables in the Medication table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
Med_category | Category of medication
SingleDose | Single dose
Med_Freq | Frequency of administration
Med_unit | Unit of measurement
Med_startTime | Start time of medication in days relative to hospital admission
Med_stopTime | Stop time of medication in days relative to hospital admission
Med_route_Eng | Route of administration
Med_DESC_Eng | Medication name in text
Vital sign table
The VitalSign table provides vital sign data for each hospital admission (Table 12). The VitalSign_DESC column provides categories of vital signs including diastolic blood pressure, temperature, heart rate and respiratory rate.
Table 12. Explanation for variables in the VitalSign table
Variables | Explanation
patient_SN | Patient series number: unique to each individual subject
Hospital_ID | Unique to each hospital admission
VitalSign_DESC | Vital sign description
VitalSign_value | Vital sign value
VitalSign_unit | Vital sign unit of measurement
VitalSign_time | Vital sign measurement time in days relative to hospital admission
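The VitalSign table is in long format, with one row per measurement. A common first step is to summarise each vital sign per hospital admission, as in the following sketch. The label "heart rate" and all values in the sample rows are assumptions for illustration; the exact strings used in VitalSign_DESC should be checked in the file.

```python
from collections import defaultdict
from statistics import mean


def summarise_vitals(rows):
    """Return count, min, max and mean per (Hospital_ID, VitalSign_DESC)."""
    grouped = defaultdict(list)
    for row in rows:
        try:
            value = float(row["VitalSign_value"])
        except (KeyError, TypeError, ValueError):
            continue  # skip missing or non-numeric values
        grouped[(row["Hospital_ID"], row["VitalSign_DESC"])].append(value)
    return {
        key: {"n": len(values), "min": min(values), "max": max(values), "mean": mean(values)}
        for key, values in grouped.items()
    }


# Invented example rows using the VitalSign column names.
rows = [
    {"Hospital_ID": "ZY|1", "VitalSign_DESC": "heart rate", "VitalSign_value": "80"},
    {"Hospital_ID": "ZY|1", "VitalSign_DESC": "heart rate", "VitalSign_value": "100"},
    {"Hospital_ID": "ZY|1", "VitalSign_DESC": "heart rate", "VitalSign_value": ""},
    {"Hospital_ID": "ZY|1", "VitalSign_DESC": "temperature", "VitalSign_value": "37.5"},
]
summary = summarise_vitals(rows)
print(summary[("ZY|1", "heart rate")])
```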
Usage Notes
The dataset can be used for a variety of studies related to critical care medicine, including predictive analytics, external validation of models, identification of risk factors and epidemiological studies. Code to generate the dataset is available on GitHub [17]. We will continue to expand the code to support the research community, and we welcome other researchers to contribute code for useful data extraction. For further details, please read our associated data descriptor paper [18].
Limitations
Because the dataset was created from electronic healthcare records, it contains some missing data and errors, reflecting the real clinical setting. Studies of temporal trends cannot be performed because the real calendar dates were removed.
Release Notes
This is version 1.0 (initial release)
Ethics
The study was approved by the ethics committee of Zhejiang Provincial People's Hospital (approval number: QT2022185). Informed consent was waived as determined by the institutional review board, due to the retrospective design of the study. The study was conducted in accordance with the Declaration of Helsinki.
Acknowledgements
S.J. received funding from Youth Talents Project of Health Commission of Zhejiang Province (Project number: 2019323925). Z.Z. received funding from Yilu “Gexin” - Fluid Therapy Research Fund Project (YLGX-ZZ-2020005), Health Science and Technology Plan of Zhejiang Province (2021KY745).
Conflicts of Interest
There is no competing interest to declare.
References
1. Elias, K. M., Moromizato, T., Gibbons, F. K. & Christopher, K. B. Derivation and validation of the acute organ failure score to predict outcome in critically ill patients: a cohort study. Crit Care Med 43, 856–864 (2015).
2. Yehya, N. & Wong, H. R. Adaptation of a Biomarker-Based Sepsis Mortality Risk Stratification Tool for Pediatric Acute Respiratory Distress Syndrome. Crit Care Med 46, e9–e16 (2018).
3. Chu, C. D. et al. Trends in Chronic Kidney Disease Care in the US by Race and Ethnicity, 2012-2019. JAMA Netw Open 4, e2127014 (2021).
4. Höfler, M. Causal inference based on counterfactuals. BMC Med Res Methodol 5, 28 (2005).
5. Zhang, Z., Chen, L., Xu, P. & Hong, Y. Predictive analytics with ensemble modeling in laparoscopic surgery: A technical note. Laparoscopic, Endoscopic and Robotic Surgery (2022) doi:10.1016/j.lers.2021.12.003.
6. Valik, J. K. et al. Validation of automated sepsis surveillance based on the Sepsis-3 clinical criteria against physician record review in a general hospital population: observational study using electronic health records data. BMJ Qual Saf 29, 735–745 (2020).
7. Zhang, Z. et al. Analytics with artificial intelligence to advance the treatment of acute respiratory distress syndrome. J Evid Based Med 13, 301–312 (2020).
8. Forero, D. A., Curioso, W. H. & Patrinos, G. P. The importance of adherence to international standards for depositing open data in public repositories. BMC Res Notes 14, 405 (2021).
9. Shahin, M. H. et al. Open Data Revolution in Clinical Research: Opportunities and Challenges. Clin Transl Sci 13, 665–674 (2020).
10. Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data 5, 180178 (2018).
11. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
12. Thoral, P. J. et al. Sharing ICU Patient Data Responsibly Under the Society of Critical Care Medicine/European Society of Intensive Care Medicine Joint Data Science Collaboration: The Amsterdam University Medical Centers Database (AmsterdamUMCdb) Example. Crit Care Med 49, e563–e577 (2021).
13. Zeng, X. et al. PIC, a paediatric-specific intensive care database. Sci Data 7, 14 (2020).
14. Xu, P. et al. Critical Care Database Comprising Patients With Infection. Front Public Health 10, 852410 (2022).
15. Wickham, H. et al. Welcome to the Tidyverse. Journal of Open Source Software 4, 1686 (2019).
16. Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, E215-220 (2000).
17. Code used to generate the CSV files of the Chinese critical care database on Github. https://github.com/zh-zhang1984/ZhejiangProvinceICU/blob/main/ZhejiangProvinceICU.md [Accessed: 18 January 2023]
18. Senjun J, Lin C, Kun C, Hu C, Hu S, and Zhongheng Z. Establishment of a Chinese critical care database from electronic healthcare records in a tertiary care medical center. https://doi.org/10.1038/s41597-023-01952-3. Sci Data (2023).
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0):
https://doi.org/10.13026/901c-yv54
DOI (latest version):
https://doi.org/10.13026/3h21-rc35
Topics:
database
china
critical care