COVID Data for Shared Learning (CDSL): A comprehensive, multimodal COVID-19 dataset from HM Hospitales

Álvaro Ritoré Andreea M Oprescu Alberto Estirado Bronchalo Miguel Ángel Armengol de la Hoz

Published: Oct. 25, 2024. Version: 1.0.0


When using this resource, please cite:
Ritoré, Á., Oprescu, A. M., Estirado Bronchalo, A., & Armengol de la Hoz, M. Á. (2024). COVID Data for Shared Learning (CDSL): A comprehensive, multimodal COVID-19 dataset from HM Hospitales (version 1.0.0). PhysioNet. https://doi.org/10.13026/1176-6c44.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

COVID Data for Shared Learning (CDSL) is a multimodal database comprising de-identified medical data from 4,479 patients who were hospitalized with confirmed or suspected COVID-19 in the Spanish 'HM Hospitales' group from 2019-12-26 to 2021-02-13. The database provides tabular demographic, diagnostic, clinical and treatment information, as well as radiological images in JPG format, namely chest X-ray and computed tomography scans. The primary goal of CDSL is to develop a comprehensive toolkit to support researchers and institutions in building multimodal models for prediction, classification and optimization. The CDSL database was shared with the international research community at the onset of the COVID-19 pandemic to promote worldwide collaboration and, ultimately, to guide policy decisions and support effective response efforts and studies on the disease.


Background

The COVID-19 pandemic posed significant challenges in terms of treatment approaches due to limited early-stage information and the need for real-time decision-making [1]. While randomized trials were underway to assess various treatment strategies, retrospective data analyses were instrumental in guiding care decisions [2]. Unlike most datasets available at the pandemic onset [3], CDSL combined imaging data with structured clinical data to provide a more complete understanding of COVID-19. Sharing high-resolution data globally was essential to leverage international expertise and help overcome these limitations [4]. Most notably, CDSL was the first multimodal open COVID-19 repository to be shared among researchers worldwide, providing detailed structured health data and valuable medical images to advance clinical knowledge and improve patient management during the pandemic. CDSL was originally published online on the 'HM Hospitales' website (the first CDSL instance, available since April 2020), where access to the database could be requested [5].


Methods

The CDSL database contains Electronic Health Records (EHR) and radiological images collected from the Spanish private hospital network 'HM Hospitales'. Multiple sources were utilized throughout the patient hospitalization process: the Emergency Department (ED) information system, the Hospital EHR database, the Intensive Care Unit (ICU) information system, and the Radiology information system.

A unique admission identifier (patient_id) was assigned to each patient, allowing data from CDSL's different tables to be connected. Overall, protected health information (PHI) was de-identified in compliance with the Health Insurance Portability and Accountability Act (HIPAA) to ensure patient privacy.

Electronic Health Records

Firstly, data for patients who were hospitalized with confirmed or suspected COVID-19 in the Spanish 'HM Hospitales' group were extracted from internal databases. Secondly, data were prepared and organized in tables linked to the patient_id. Finally, sensitive information such as identifiers and dates was de-identified. The resulting EHR tables are provided in CSV format. The number of transformation steps was kept to a minimum so that the tables stay close to the raw data gathered at HM Hospitales, which supports data validation while improving usability.

In total, CDSL includes structured EHR on demographics, diagnoses, vital signs, medications and laboratory results for 4,479 unique hospitalized patients admitted from December 2019 to February 2021.

Radiological Images

Original chest radiographs (X-rays) and computed tomography (CT) scans were collected and stored in DICOM format with integrated metadata. As part of the de-identification process, these were then converted into JPG files, and relevant metadata was extracted and exported as a separate file, following the approach of the MIMIC-CXR-JPG repository [6]. The main steps of the JPG conversion were: extracting pixel values with the pydicom library and normalizing them to the range [0, 255], applying the VOI LUT (if available) to enhance the visualization of the pixel value ranges relevant for diagnosis, handling inverted pixel values using PhotometricInterpretation, and saving the images in JPG format with a quality factor of 95.
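The conversion steps listed above can be sketched as follows. This is a minimal illustration, not the code used to build CDSL (that code is in the repository cited in the Usage Notes). The function names to_uint8 and dicom_to_jpg are hypothetical, the order of the steps is an assumption, and the use of Pillow for writing the JPG is also an assumption, since the text only names pydicom.

```python
import numpy as np


def to_uint8(pixels, invert=False):
    """Rescale an array linearly to [0, 255] and optionally invert it."""
    arr = np.asarray(pixels, dtype=np.float64)
    lo, hi = arr.min(), arr.max()
    if hi > lo:
        arr = (arr - lo) / (hi - lo) * 255.0
    else:
        # Constant image: avoid dividing by zero.
        arr = np.zeros_like(arr)
    if invert:
        arr = 255.0 - arr
    # Casting truncates the fractional part.
    return arr.astype(np.uint8)


def dicom_to_jpg(dicom_path, jpg_path, quality=95):
    """Convert one DICOM file to JPG (hypothetical helper, assumed step order)."""
    # Imported here so that to_uint8 can be used without these packages.
    import pydicom
    from PIL import Image
    try:
        from pydicom.pixels import apply_voi_lut  # pydicom 3.x
    except ImportError:
        from pydicom.pixel_data_handlers.util import apply_voi_lut  # pydicom 2.x

    ds = pydicom.dcmread(dicom_path)
    # apply_voi_lut returns the input unchanged when the file carries no
    # VOI LUT or windowing information.
    pixels = apply_voi_lut(ds.pixel_array, ds)
    # MONOCHROME1 means low values are displayed as white, so flip it.
    invert = getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1"
    image = Image.fromarray(to_uint8(pixels, invert))
    image.save(jpg_path, format="JPEG", quality=quality)
```

The rescaling is kept in its own function so it can be checked on a small array without any DICOM file at hand.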

The rationale for using JPG files instead of DICOM is to enhance accessibility for non-medical researchers by simplifying the complex DICOM format. Additionally, JPG files significantly reduce storage size through lossy compression, making them more efficient to store and process. This conversion facilitates broader integration and use in computer vision research, where JPG is a commonly used format.

CDSL comprises a total of 1,444,764 radiological images, consisting of 4,608 X-rays and 1,440,156 CT scans, along with de-identified metadata in a structured format.

Anonymization

Anonymization procedures were implemented to ensure patient privacy by de-identifying information from tabular data and images:

  • Information belonging to the 18 identifying categories defined by HIPAA has been removed.
  • Unique patient identifiers were de-identified.
  • Dates were randomly shifted using a single date shift for each patient_id while preserving the time of day and day of the week, thereby maintaining chronological consistency. As a result, if two different measurements are separated by 2 hours in the raw data, the final time difference in the provided tables will also be 2 hours. However, temporal consistency does not apply to distinct patients; two patients admitted in 2150 were not necessarily admitted in the same actual year.
  • DICOM metadata, including Unique Identifiers (UIDs), patient identifiers, and dates, have been completely removed during the conversion to JPG format. De-identified information for the radiological images is available in a separate tabular file.
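The date-shifting rule can be illustrated with a short sketch. It assumes that the per-patient offset is a whole number of weeks, which is one simple way to keep both the time of day and the day of the week. The helper names, the offset range and the seed are hypothetical and are not taken from the CDSL pipeline.

```python
import random
from datetime import datetime, timedelta


def make_patient_shift(rng, max_weeks=7000):
    """Draw one random offset for a patient, in whole weeks.

    A multiple of 7 days keeps the time of day and the day of the week.
    The range is a hypothetical choice, not the one used for CDSL.
    """
    return timedelta(weeks=rng.randint(1, max_weeks))


def shift_events(events, shift):
    """Apply the same offset to every timestamp of one patient."""
    return [ts + shift for ts in events]


rng = random.Random(0)  # fixed seed only to make the example repeatable
raw = [datetime(2020, 3, 14, 8, 30), datetime(2020, 3, 14, 10, 30)]
shifted = shift_events(raw, make_patient_shift(rng))

# The 2-hour gap between the two measurements is unchanged.
print(shifted[1] - shifted[0])  # 2:00:00
```

Because each patient gets an independent offset, shifted dates can only be compared within a patient, never across patients.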

Data Description

The CDSL database provides tabular demographic, diagnostic, clinical and treatment information, as well as radiological images in JPG format, namely chest X-ray and computed tomography scans.

Overview

CDSL is composed of 6 structured relational tables linked by the patient_id identifier, a folder with additional tables, a folder containing radiological images, and the DICOM metadata file, more specifically:

  • patient_01.csv: The patient table contains information regarding demographics, admission to the hospital, ED and, if admitted, to ICU, as well as first and last vital sign measurements.
    • Demographics: age, sex.
    • Hospital admission information: COVID-related diagnosis (diag_inpat), hospital admission date (admission_d_inpat), hospital discharge date (discharge_date), hospital discharge destination (destin_discharge).
    • ICU Admission Information: dates of admission and discharge from ICU (icu_date_in, icu_date_out), number of days in ICU (icu_days), number of times the patient was admitted to the ICU during their hospital stay (icu_n_ing).
    • ED Information: date and time of emergency visit that led to admission (admission_date_emerg, time_admission_emerg), diagnosis given in the ED (diag_emerg), and the department where the patient was treated (department_emerg). Initial vital signs taken in the ED include the time of first assessment (time_constant_first_emerg), blood pressure (bp_max_first_emerg, bp_min_first_emerg), body temperature (temp_first_emerg), heart rate (hr_first_emerg), oxygen saturation level (sat_02_first_emerg), blood glucose level (glu_first_emerg), and urine output (diuresis_first_emerg). The last vital signs taken before leaving the ED are also recorded: the time of last assessment (time_constant_last_emerg), blood pressure (bp_max_last_emerg, bp_min_last_emerg), body temperature (temp_last_emerg), heart rate (hr_last_emerg), oxygen saturation level (sat_02_last_emerg), and blood glucose level (glu_last_emerg). The patient's destination upon leaving the ED is recorded (destin_emerg); the majority of these patients were admitted to the hospital (inpatient).
    • Additional information:
      • ant_admission_date_in - the date of any previous admissions for the patient.
      • ant_diag_inpat - the diagnosis given during any previous inpatient admissions.
      • mechvent - whether the patient was on mechanical ventilation during their stay.
  • diagnosis_er_02.csv: This table includes ED diagnoses, if any, according to the ICD-10 (International Classification of Diseases, 10th Revision), along with procedures. It's important to note that COVID-19 was not included in the ICD-10 version available at the time of CDSL registrations.
    • diagnosis_er_adm_id - unique identifier for each ED admission diagnosis set (one row per patient).
    • dia_ppal - principal diagnosis.
    • dia_02 to dia_12 - secondary diagnoses.
    • proc_01 to proc_05 - codes for up to 5 procedures performed during the emergency visit, according to the ICD-10-PCS (International Classification of Diseases, 10th Revision, Procedure Coding System).
  • diagnosis_hosp_03.csv: This table comprises records of hospital (inpatient) diagnoses (ICD-10) and procedures.
    • diagnosis_hosp_adm_id - unique identifier for each hospital admission diagnosis set (one row per patient).
    • dia_ppal, poad_ppal - principal diagnosis and its Present on Admission (POA) indicator ('Y': Yes; 'N': No; 'U': Insufficient documentation; 'W': Clinically undetermined; '1': Exempt).
    • dia_02 to dia_19, poad_02 to poad_19 - secondary diagnoses and their corresponding POA indicators.
    • proc_01 to proc_20 - codes for up to 20 procedures performed during the hospital admission.
    • neo_01 to neo_06 - codes for up to 6 neoplasms diagnosed during the hospital admission, according to the ICD-O-3 (International Classification of Diseases for Oncology, Third Edition).
  • vital_signs_04.csv: The vital_signs table provides basic records of vital signs collected during admission, including their date and time of registration (224,146 records).
    • vital_sign_id - unique identifier for each vital sign record.
    • constants_ing_date - date when the constant was recorded.
    • constants_ing_time - time when the constant was recorded.
    • bp_max_ing - maximum blood pressure.
    • bp_min_ing - minimum blood pressure.
    • temp_ing - body temperature.
    • hr_ing - heart rate.
    • sat_02_ing - oxygen saturation.
    • sat_02_ing_obs - observations (in Spanish) related to the oxygen saturation measurement.
    • glu_ing - blood glucose level.
  • medication_05.csv: This table lists all the medications administered to a given patient during hospital admission (115,649 records).
    • medication_id - unique identifier for each medication record.
    • drug_comercial_name - the commercial name of the drug administered (in Spanish).
    • id_atc5 - identifier for the ATC5 classification of the drug.
    • id_atc7 - identifier for the ATC7 classification of the drug.
    • daily_avrg_dose - average daily dose administered.
    • drug_start_date - the date when the administration of the drug started.
    • drug_end_date - the date when the administration of the drug ended.
  • lab_06.csv: This table contains 786,984 laboratory tests conducted in both the ED and the hospital setting during the patient's stay.
    • lab_id - unique identifier for each laboratory test record.
    • lab_number - Identifier of the laboratory request (it starts with "I" for Inpatient episode and "U" for emergency episode).
    • lab_date - date of request to the laboratory.
    • time_lab - time of request to the laboratory (only for emergency).
    • item_lab - laboratory test item.
    • val_result - result value of the laboratory test (numerical).
    • result_text - result value of the laboratory test (free text).
    • ud_result - unit of measurement for the laboratory test result.
    • ref_values - reference range values for interpreting the laboratory test result.
  • additional_tables:
    • ICD-10 code dictionary (2021).
    • Unique description of atc5 categories (in Spanish) from the medication_05 table.
    • Unique description of atc7 categories (in Spanish) from the medication_05 table.
  • IMAGES: This folder contains the set of available chest X-rays and CT scans in JPG format. A total of 1,444,764 JPG files belonging to 1,859 distinct patients are provided.
  • CDSL-1.0.0-dicom-metadata.csv - Relevant metadata from the original DICOM files.
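As a small illustration of how the tables link through patient_id, the sketch below groups laboratory rows by patient and labels each request as inpatient or emergency from the first letter of lab_number. It assumes that lab_06.csv carries a patient_id column, is comma-delimited and has a header row; the function name is hypothetical and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# "I" and "U" prefixes of lab_number, as described for lab_06.csv.
EPISODE_BY_PREFIX = {"I": "inpatient", "U": "emergency"}


def labs_by_patient(lab_file):
    """Group lab rows by patient_id and tag each with its episode type."""
    grouped = defaultdict(list)
    for row in csv.DictReader(lab_file):
        prefix = row["lab_number"][:1]
        row["episode"] = EPISODE_BY_PREFIX.get(prefix, "unknown")
        grouped[row["patient_id"]].append(row)
    return dict(grouped)


# Tiny made-up sample standing in for lab_06.csv (values are not real data).
sample = io.StringIO(
    "patient_id,lab_number,item_lab,val_result\n"
    "1,U0001,ITEM_A,1.2\n"
    "1,I0002,ITEM_B,3.4\n"
    "2,I0003,ITEM_A,0.9\n"
)
labs = labs_by_patient(sample)
print(len(labs["1"]), labs["1"][0]["episode"])  # 2 emergency
```

With the real file, the in-memory sample would be replaced by an opened file handle, for example open("lab_06.csv", newline=""), after checking the actual delimiter and encoding.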

JPG files are organized in individual subdirectories following the pattern files/pNN/pXXXXXXXX/sZZZZZZZZ/ZZZZZZZZ_VVVV, where:

  • NN is the first two digits of the patient_id,
  • XXXXXXXX is the patient_id,
  • ZZZZZZZZ is the study_id, and
  • VVVV is the view number.

An example of the JPG directory structure is illustrated below:

IMAGES/
  p10/
    p10484178/
      s15410429/
        15410429_0001.jpg
      s45095325/
        45095325_0001.jpg
        ...
        45095325_2117.jpg
  p47/
    p47285296/
      s10370535/
        10370535_0001.jpg
      s63709389/
        63709389_0001.jpg
      s74623196/
        74623196_0001.jpg

Here we show two patients, one under the p10 directory and another under the p47 directory. Patient p10484178 has two studies: s15410429, containing one image, and s45095325, with 2,117 images. Patient p47285296, under the p47 directory, has three studies, each with only one image. Study identifiers are random, so their numbering doesn't follow any chronological order. However, temporal consistency is maintained among study times for each patient. In general, studies with a single image are X-rays, while CT studies contain hundreds of images.
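The folder pattern can be turned into a small path helper. This is a sketch under stated assumptions: the function name image_path is hypothetical, the root folder is left as a parameter, and the .jpg extension and four-digit view number follow the example tree above.

```python
from pathlib import PurePosixPath


def image_path(patient_id, study_id, view, root="IMAGES"):
    """Build the relative path of one JPG from its identifiers.

    Identifiers are handled as strings so that any leading zeros survive.
    """
    patient_id = str(patient_id)
    study_id = str(study_id)
    return PurePosixPath(
        root,
        f"p{patient_id[:2]}",           # pNN: first two digits of patient_id
        f"p{patient_id}",               # pXXXXXXXX
        f"s{study_id}",                 # sZZZZZZZZ
        f"{study_id}_{view:04d}.jpg",   # ZZZZZZZZ_VVVV
    )


print(image_path("10484178", "15410429", 1))
# IMAGES/p10/p10484178/s15410429/15410429_0001.jpg
```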

Metadata files

The CDSL-1.0.0-dicom-metadata.csv file provides metadata from the original DICOM files, containing the following information:

  • Folders id - Following the previous folder scheme pattern, the columns patient_group_folder_id (pNN), patient_folder_id (pXXXXXXXX) and study_id (sZZZZZZZZ) are provided.
  • patient_id - Unique patient identifier, connecting the medical image to the other structured tables within the CDSL dataset.
  • image_id - Unique identifier for each JPG (JPG filename).
  • StudyDate - Anonymized date for the medical image.
  • StudyTime - Time of the study in hours, minutes and seconds. This has not been de-identified to keep information on the hour of the day.
  • Modality - Type of radiological image: computed radiography (CR) or digital radiography (DX) for X-ray images, and computed tomography (CT) for CT scans.
  • StudyDescription - Brief description of the medical image.
  • BodyPart - Recategorized 'BodyPartExamined' element from DICOM metadata.
  • CodeValue - A unique code within the 'Procedure Code Sequence' DICOM element.
  • CodeMeaning - The human readable description associated with the corresponding CodeValue.
  • ViewPosition - View position for X-ray images: AP (Anteroposterior), PA (Posteroanterior) and LL (Lateral). An 'Unknown' value is set for CT images.
  • PatientPosition - The way in which the patient was positioned during the CT: HFS (Head First Supine), FFS (Feet First Supine), FFP (Feet First Prone). An 'Unknown' value is set for radiographs.
  • SpatialResolution - Specific spatial resolution, namely the ability of an imaging system to distinguish separate objects in an image. These could be 0.148, 0.125 or 'Unknown'.

Except for the folders id, patient_id, and image_id, the remaining columns were directly extracted from their respective DICOM data elements, such as 'Modality' (0008, 0060). A minor data cleaning process was conducted: inconsistent values between 'StudyDescription' and the columns 'Modality' and 'CodeMeaning' were rectified by selecting the 'Modality' value for incorrect 'StudyDescription' entries; unknown values in the original 'BodyPartExamined' column were imputed with values from the 'StudyDescription' column.
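A short sketch of how the metadata file might be summarized per study is given below. It relies only on the study_id and Modality columns described above and assumes a comma-delimited file with a header row. The function name is hypothetical and the sample rows, including the study identifiers, are made up.

```python
import csv
import io
from collections import Counter

XRAY_MODALITIES = {"CR", "DX"}  # modality codes listed for X-ray images


def summarize_studies(metadata_file):
    """Return {study_id: (kind, number_of_images)} from the metadata rows."""
    counts = Counter()
    kinds = {}
    for row in csv.DictReader(metadata_file):
        study = row["study_id"]
        counts[study] += 1
        modality = row["Modality"]
        if modality in XRAY_MODALITIES:
            kinds[study] = "X-ray"
        elif modality == "CT":
            kinds[study] = "CT"
        else:
            kinds[study] = "other"
    return {study: (kinds[study], n) for study, n in counts.items()}


# Made-up rows standing in for CDSL-1.0.0-dicom-metadata.csv.
sample = io.StringIO(
    "study_id,image_id,Modality\n"
    "s00000001,00000001_0001,DX\n"
    "s00000002,00000002_0001,CT\n"
    "s00000002,00000002_0002,CT\n"
)
summary = summarize_studies(sample)
print(summary["s00000002"])  # ('CT', 2)
```

Classifying by Modality is more dependable than counting images per study, since the one-image rule for X-rays is described only as a general tendency.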


Usage Notes

Access requests will be assessed by the HM Hospitales Data Science Commission and, where appropriate, reviewed by the HM Hospitales Clinical Research Ethics Committee before access is granted. Because the data include patients' clinical care information, ethical, moral, and care principles must be given the utmost regard when handling and processing them. An early version of the CDSL dataset was made available on the 'HM Hospitales' website [5].

The deployment code for the CDSL database is publicly available in an open-source repository to foster interdisciplinary collaboration among the research community [7]. The code used to generate the tables in PostgreSQL, extract the metadata and convert DICOM files to JPG is provided in the CovidHM_V3-code folder.

CDSL comprises raw data collected within HM Hospitales, with a few modifications including de-identification procedures, minor data cleaning, and translation of content to English. While the majority of its content is in English, certain table columns contain values in Spanish, such as 'drug_comercial_name' in the medication table. We encourage researchers to contribute code corrections, improvements and optimizations that address data architecture issues and support cleaning activities. If you need guidance on a particular question, we recommend opening an issue in the CDSL Code Repository [7].


Release Notes

CDSL v1.0.0: First release on PhysioNet. The dataset comprises the following tabular data: patient, diagnosis_er, diagnosis_hosp, vital_signs, medication and lab from a total of 4,479 hospitalized patients with confirmed or suspected COVID-19, along with 1,444,764 radiology images and relevant metadata, specifically 4,608 radiographs and 1,440,156 CT scans.

CDSL alpha release: An early version of the CDSL dataset was made available on the 'HM Hospitales' website [5], under the name "Covid Data Saves Lives".


Ethics

CDSL has received approval for release from HM Hospitales Clinical Research Ethics Committee. Authorization for the publication, access, and use of the CDSL dataset on PhysioNet has been granted with the internal authorization code auth_opendata_cdsl3_170624.


Acknowledgements

The authors would like to thank HM Hospitales for their groundbreaking open data initiative, demonstrating the feasibility of data democratization not only in Europe but also in Spain.


Conflicts of Interest

The author(s) have no conflicts of interest to declare.


References

  1. Dong Y, Shamsuddin A, Campbell H, Theodoratou E. Current COVID-19 treatments: Rapid review of the literature. J Glob Health. 2021 Apr 24;11:10003.
  2. Loo WK, Hasikin K, Suhaimi A, Yee PL, Teo K, Xia K, et al. Systematic Review on COVID-19 Readmission and Risk Factors: Future of Machine Learning in COVID-19 Readmission Studies. Front Public Health. 2022 May 23;10.
  3. Sauer CM, Dam TA, Celi LA, Faltys M, de la Hoz MAA, Adhikari L, et al. Systematic Review and Comparison of Publicly Available ICU Data Sets—A Decision Guide for Clinicians and Data Scientists. Crit Care Med. 2022 Jun 2;50(6):e581–8.
  4. Tanwar AS, Evangelatos N, Venne J, Ogilvie LA, Satyamoorthy K, Brand A. Global Open Health Data Cooperatives Cloud in an Era of COVID-19 and Planetary Health. OMICS. 2021 Mar 1;25(3):169–75.
  5. HM Hospitales. https://www.hmhospitales.com/prensa/notas-de-prensa/comunicado-covid-data-save-lives [Accessed 27/05/2024]
  6. Johnson A, Lungren M, Peng Y, Lu Z, Mark R, Berkowitz S, Horng S. MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet. 2019. https://doi.org/10.13026/8360-t248
  7. CDSL Code Repository. https://github.com/theonesp/covidhm-code/tree/master/CovidHM_V3-code [Accessed 14/07/2024]

Access

Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.

License (for files):
PhysioNet Contributor Review Health Data License 1.5.0

Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
