Database Contributor Review

CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools

Eulalia Farre Maduell, Salvador Lima-Lopez, Santiago Andres Frid, Artur Conesa, Elisa Asensio, Antonio Lopez-Rueda, Helena Arino, Elena Calvo, Maria Jesús Bertran, Maria Angeles Marcos, Montserrat Nofre Maiz, Laura Tañá Velasco, Antonia Marti, Ricardo Farreres, Xavier Pastor, Xavier Borrat Frigola, Martin Krallinger

Published: April 20, 2024. Version: 1.0.1


When using this resource, please cite:
Farre Maduell, E., Lima-Lopez, S., Frid, S. A., Conesa, A., Asensio, E., Lopez-Rueda, A., Arino, H., Calvo, E., Bertran, M. J., Marcos, M. A., Nofre Maiz, M., Tañá Velasco, L., Marti, A., Farreres, R., Pastor, X., Borrat Frigola, X., & Krallinger, M. (2024). CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools (version 1.0.1). PhysioNet. https://doi.org/10.13026/x7ed-9r91.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.


Background

There is a pressing need to enable access to annotated electronic health records (EHRs) for the development and evaluation of clinical NLP resources, with the aim of implementing de-identification tools and detecting medical variables of interest. This is particularly true for non-English EHRs, for which only a limited number of resources have been published. Given the large volume of hospital data generated in Spanish-speaking countries and the potential for adapting NLP technologies originally developed for Spanish content to other Romance languages (with more than 900 million native speakers), the release of clinical records in Spanish is now essential.

The COVID-19 pandemic demonstrated the urgent need for systems capable of processing and analyzing high volumes of unstructured data locked in large collections of clinical narratives in order to identify patterns, trends, and actionable clinical insights.

This corpus was created to address the demand for these data and to foster the development of clinical NLP tools able to cope with the particularities of real-world clinical language (complex medical jargon, use of abbreviated expressions, typos and spelling errors or ungrammatical sentences).

CARMEN-I is a collaboration between clinical experts from the University Hospital Clínic of Barcelona (HCB) and researchers in AI and NLP from the NLP4BIA team at the Barcelona Supercomputing Center (BSC). It consists of a publicly released set of real clinical records. It is the first corpus of complete, publicly released de-identified EHRs in Spanish covering not only COVID-19, but also a range of comorbidities including cancer and cardiovascular diseases. Sensitive data items have been identified, masked and replaced following the approach of previous efforts in Spanish such as HitzalMed [1].


Methods

CARMEN-I consists of 2,000 documents selected from the EHRs of 6,811 patients with COVID-19. Documents were written in the Hospital Clínic of Barcelona, one of the main tertiary hospitals in Spain, sampled over a two-year period (March 2020 to March 2022).

The dataset’s Personal Health Identifiers (PHI) have been carefully anonymised following a protocol created in cooperation with clinicians, linguists, and Natural Language Processing (NLP) experts. Based on the annotation guidelines of the MEDDOCAN anonymisation corpus [2], the reports were reviewed by linguists, who verified the annotation criteria and amended suggestions provided by automatic anonymisation models. With support from computational linguists from the NLP4BIA team, clinicians then verified that all sensitive information in the annotated documents and in the new resynthesized version was masked or hidden from plain sight. Here, masking consisted of replacing the annotation with its semantic class (for instance, the annotation 3/2/2022 was replaced by a label “DATE”).
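The masking step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `mask_text`, the `(start, end, label)` tuple layout and the example sentence are assumptions made for illustration, and the spans are assumed not to overlap.

```python
from typing import List, Tuple

# Hypothetical annotation layout: (start, end, label), with character
# offsets into the text. Spans are assumed not to overlap.
Annotation = Tuple[int, int, str]


def mask_text(text: str, annotations: List[Annotation]) -> str:
    """Replace every annotated span with its semantic class label."""
    # Work from the last span to the first so that earlier offsets
    # remain valid after each replacement.
    for start, end, label in sorted(annotations, reverse=True):
        text = text[:start] + label + text[end:]
    return text


# Invented example sentence; "3/2/2022" occupies characters 11 to 19.
example = "Ingreso el 3/2/2022 por neumonía."
masked = mask_text(example, [(11, 19, "DATE")])
print(masked)  # Ingreso el DATE por neumonía.
```

Replacing spans from right to left avoids having to recompute offsets when a label is shorter or longer than the text it replaces.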

After this process, clinicians at the HCB reviewed each masked report and further assessed the documents before including them in the corpus. To this end, they first validated the annotation of sensitive items as correct, and then checked whether the report met extra criteria. These criteria take into account indirect sensitive data, especially from a clinical point of view (e.g., an uncommon combination of comorbidities or extremely rare diseases). This entire process, including the criteria followed, can be found in the Anonymization Protocol, a document published together with the corpus and available on Zenodo in Spanish [3] and English [4].

In addition, a second version of the data was created in which synthetic equivalents replaced the original sensitive items. These replacements were generated using different rule-based systems specific to each type of sensitive data and custom-created gazetteers (lists of terms). Special attention was paid to creating credible replacements. As an example, the replacement process for dates uses complex logic to maintain a coherent timeline within the document. The process is as follows: initially, a rule-based system and gazetteers are employed to parse dates into smaller parts such as days, months and years, as well as separators like slashes or dots. Written dates are transformed into numerical forms using Spanish and Catalan month names from the gazetteer. Dates are then categorized into specific types (e.g., year-only, month and year) to facilitate the replacement process. Subsequently, modifications are applied to days, months, and years in each document using randomly chosen values to ensure temporal coherence and document-specific anonymization. The modifications may be positive or negative, shifting the date into the future or past. Lastly, Python's datetime library is used to calculate the modified dates while preventing illogical dates (e.g., day 32 or month 13) from being generated.
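The date replacement logic can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it applies a single per-document offset in days rather than separate modifications to days, months and years, it assumes day-first dates, and it only handles full dates. The names `MONTHS`, `document_offset` and `shift_dates`, the tiny gazetteer and the offset range are all hypothetical.

```python
import random
import re
from datetime import date, timedelta

# Hypothetical mini-gazetteer; a real one would list every Spanish and
# Catalan month name.
MONTHS = {"enero": 1, "febrero": 2, "marzo": 3,
          "gener": 1, "febrer": 2, "març": 3}

# Day-first numeric dates with a repeated separator, e.g. 01/01/2020.
NUMERIC = re.compile(r"\b(\d{1,2})([/.-])(\d{1,2})\2(\d{4})\b")
# Written dates such as "3 de marzo de 2020".
WRITTEN = re.compile(r"\b(\d{1,2}) de (\w+) de (\d{4})\b", re.IGNORECASE)


def document_offset(doc_id: str, max_days: int = 3650) -> int:
    """Pick one random offset per document, positive or negative."""
    rng = random.Random(doc_id)  # seeded so a document always gets the same shift
    return rng.randint(1, max_days) * rng.choice((-1, 1))


def shift_dates(text: str, offset_days: int) -> str:
    """Shift every recognised date by the same number of days."""
    delta = timedelta(days=offset_days)

    def numeric(m: re.Match) -> str:
        day, sep, month, year = int(m[1]), m[2], int(m[3]), int(m[4])
        try:
            shifted = date(year, month, day) + delta
        except ValueError:
            return m[0]  # not a real calendar date (e.g. day 32): leave as is
        return f"{shifted.day:02d}{sep}{shifted.month:02d}{sep}{shifted.year}"

    def written(m: re.Match) -> str:
        month = MONTHS.get(m[2].lower())
        if month is None:
            return m[0]  # month name not in the gazetteer
        try:
            shifted = date(int(m[3]), month, int(m[1])) + delta
        except ValueError:
            return m[0]
        return f"{shifted.day:02d}/{shifted.month:02d}/{shifted.year}"

    return WRITTEN.sub(written, NUMERIC.sub(numeric, text))


print(shift_dates("Ingreso 01/01/2020, alta 10/01/2020.", 5))
# Ingreso 06/01/2020, alta 15/01/2020.
```

Because every date in a document moves by the same amount, intervals between events (here, nine days between admission and discharge) are preserved, which is one way to keep the timeline coherent. Building the shifted value through `datetime.date` means an impossible date can never be produced.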

Finally, a subset of 500 documents was selected and annotated with relevant clinical concept types (diseases, symptoms/findings, procedures, drugs, species and humans) for the development and benchmarking of information extraction systems. The annotation strategy followed for these entity types was based on publicly released corpora created by the NLP4BIA team, such as DisTEMIST [5] for diseases, MedProcNER/ProcTEMIST [6] for clinical procedures or LivingNER [7] for species. An annotation guideline summarizing the rules for all six entity types is available on Zenodo in Spanish [8] and English [9]. The annotation took place with the collaboration of HCB clinicians, who contributed their knowledge of the hospital setting.

More details about the dataset’s creation, de-identification and annotation process will be provided in a soon-to-be-published article. This project was approved by the HCB Ethics Research Committee. Individual patient consent was waived because the project did not impact clinical care and all protected health information was anonymized.


Data Description

CARMEN-I is a collection of 2,000 anonymized clinical records written in a university hospital (HCB). Specifically, the texts were written between March 1, 2020 and March 1, 2022.

From a clinical point of view, the corpus includes patients with different presentations of COVID-19, mostly in severe form. In addition, since many of the patients had comorbidities and underlying conditions, the corpus also contains diseases that cause immunosuppression (patients undergoing treatment for cancer and organ transplant, treatment with corticosteroids, patients infected with HIV), respiratory diseases (asthma, COPD), cardiovascular diseases, geriatric complexity, and others.

From a linguistic point of view, the clinical records present typical characteristics of electronic health records, namely a large number of ad-hoc acronyms and abbreviations, typos, repetitions and incomplete sentences, among others. Additionally, since the documents come from a Catalan hospital, some are written in both Spanish and Catalan, often mixed in the same document and sentence. It has been estimated that around 15% of the documents include Catalan to some extent. The language of each document has been classified and is made available in an accompanying .tsv file.

CARMEN-I includes five types of medical documents: discharge reports (in Spanish, “informe de alta”, or IA); referral letters (“informe de traslado”, or IT); death reports (“informe de exitus”, or IE); progress notes (“curso clínico”, or CC); and imaging reports (“informe de radiología”, or IR). Most documents are discharge, referral and imaging reports, with a few progress notes and death reports included due to their interest for the annotation process.

Discharge and referral letters are composed of the following sections: Medical History (Antecedentes); Progress Notes (Evolución); Physical Exam (Exploración Clínica); Medical Tests (Exploración Complementaria); Surgery (Intervención Quirúrgica); Treatment Plan (Plan Terapéutico); Current Problem (Proceso Actual); Imaging description (Radiografía); and Follow-up (Seguimiento).

Corpus Versions

The corpus is presented in two different versions and both versions include annotations for sensitive and clinical entities.

In total, the corpus includes 2,000 documents, classified as follows: 1,201 imaging reports; 617 discharge reports; 172 referral letters; 5 death reports; and 5 progress notes. Discharge reports and referral letters are not presented in full. Instead, they were divided into sections as stated above due to their excessive length, resulting in: 189 medical histories; 154 progress notes; 72 physical exams; 61 medical tests; 25 surgery reports; 31 treatment plans; 176 current problems; 40 imaging; and 41 follow-up sections.

As for the sensitive data annotation, the corpus includes 18 different labels with a total of 8,228 annotated items. The most common label type is dates, with 5,384 annotations, followed by patient ages with 815 and patient genders with 458. The least common type is websites, with only one annotation.

A total of 500 documents in the collection include clinical concept annotations for the development of named entity recognition systems. There are 26,545 annotations in total for 6 different concept types: 5,402 diseases; 8,137 findings; 6,520 procedures; 3,547 drugs; 1,564 species; and 1,375 humans.
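The figures reported in this section can be cross-checked with simple arithmetic. The sketch below copies the counts from the text. Reading the per-section counts as a breakdown of the discharge reports and referral letters is an interpretation of the text, which the totals appear to support (617 + 172 = 789).

```python
# Counts copied from the description above.
document_counts = {
    "imaging reports": 1201, "discharge reports": 617,
    "referral letters": 172, "death reports": 5, "progress notes": 5,
}
section_counts = {
    "medical histories": 189, "progress notes": 154, "physical exams": 72,
    "medical tests": 61, "surgery reports": 25, "treatment plans": 31,
    "current problems": 176, "imaging": 40, "follow-up": 41,
}
concept_counts = {
    "diseases": 5402, "findings": 8137, "procedures": 6520,
    "drugs": 3547, "species": 1564, "humans": 1375,
}

total_documents = sum(document_counts.values())
total_sections = sum(section_counts.values())
total_concepts = sum(concept_counts.values())

assert total_documents == 2000
# Sections add up to the discharge reports plus the referral letters.
assert total_sections == (document_counts["discharge reports"]
                          + document_counts["referral letters"])
assert total_concepts == 26545
print(total_documents, total_sections, total_concepts)  # 2000 789 26545
```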

File format and structure

The reports are offered as .txt files, with the annotations being available as stand-off files (i.e. separate files) in multiple formats. The CARMEN-I text files are offered in two versions: with masked sensitive data (e.g. '01/01/2020' becomes 'FECHAS'; `masked/` folder) and with replaced sensitive data (e.g. '01/01/2020' becomes '03/07/2013'; `replaced/` folder). The annotations are offered in brat's standoff .ann format [10] as well as in tab-separated value files (.tsv). Each .tsv file contains the following columns: name (associated filename), tag (annotation label), span (start and end character position in text), text (annotation content). Additionally, brat configuration files (annotation.conf and visual.conf) are also provided. For more information about the `.ann` format please visit brat's website [11]. Annotations for Personal Health Identifiers and clinical entities are given separately.
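As an illustration of how the `.ann` files might be read, the sketch below parses text-bound annotations in the brat standoff format, where each such line has the form `ID<TAB>LABEL START END<TAB>TEXT` and discontinuous spans are separated by `;`. The `Entity` type, the `parse_ann` function and the sample content are made up for this example. `FECHAS` is the masking label mentioned above, while `ENFERMEDAD` is only a placeholder label.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, ...]]  # one (start, end) pair per fragment
    text: str


def parse_ann(content: str) -> List[Entity]:
    """Collect the text-bound annotations (IDs starting with 'T')."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations and other annotation kinds
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations list several "start end" pairs
        # separated by semicolons.
        spans = [tuple(int(n) for n in part.split())
                 for part in offsets.split(";")]
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Invented sample content, not taken from the corpus.
sample = (
    "T1\tFECHAS 11 19\t3/2/2022\n"
    "T2\tENFERMEDAD 24 32;40 45\tneumonía grave\n"
    "#1\tAnnotatorNotes T1\tnota de ejemplo\n"
)
entities = parse_ann(sample)
for entity in entities:
    print(entity.ann_id, entity.label, entity.spans, entity.text)
```

The same information is also available in the .tsv files, which may be more convenient when brat-specific features are not needed.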

All filenames follow the same naming convention: CARMEN-I_{report_type}_{section_type}_{number}.{extension}. For instance: CARMEN-I_IA_ANTECEDENTES_2.txt.
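A filename following this convention can be split into its parts with a regular expression. The sketch below is one possible approach. The pattern and the `parse_filename` helper are assumptions, as is the idea that a section type containing several words would be joined with underscores. The second filename in the example is hypothetical.

```python
import re
from typing import Optional

# CARMEN-I_{report_type}_{section_type}_{number}.{extension}
# The section part is matched greedily so that it may itself contain
# underscores (an assumption about multi-word section names).
FILENAME = re.compile(
    r"^CARMEN-I_(?P<report_type>[A-Z]{2})_(?P<section_type>.+)"
    r"_(?P<number>\d+)\.(?P<extension>\w+)$"
)


def parse_filename(name: str) -> Optional[dict]:
    """Return the filename components, or None if the name does not match."""
    match = FILENAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    return parts


print(parse_filename("CARMEN-I_IA_ANTECEDENTES_2.txt"))
# {'report_type': 'IA', 'section_type': 'ANTECEDENTES', 'number': 2, 'extension': 'txt'}

# Hypothetical filename, used only to show a multi-word section.
print(parse_filename("CARMEN-I_IT_PROCESO_ACTUAL_15.ann"))
```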

Possible report types are: CC (curso clínico, or clinical notes), IA (informe de alta, or discharge report), IT (informe de traslado, or transfer report), IE (informe de exitus, or death report) and IR (informe de radiología, or radiology report).

Discharge and transfer reports are divided into sections. Possible section types are antecedentes (medical history), evolución (progress notes), exploración clínica (physical examination), exploración complementaria (medical tests), intervención quirúrgica (surgery), plan terapéutico (treatment plan), proceso actual (current problem), radiografía (imaging), and seguimiento (follow-up).

Finally, the dataset includes a file called “CARMEN1_mappings.tsv”, in which every file is classified in two aspects: its language (“es” for Spanish, “cat” for Catalan, “bi” for bilingual texts that include a mix of both languages) and whether it has clinical concept recognition annotations (either “True” or “False”).
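The mappings file could be used, for example, to select documents by language or to find those with clinical concept annotations. The sketch below reads a small in-memory sample rather than the real file. The column order in `FIELDS`, the absence of a header row and the sample filenames are all assumptions, so the actual layout of CARMEN1_mappings.tsv should be checked first.

```python
import csv
import io
from collections import Counter

# Assumed column order. If the real file starts with a header row,
# drop `fieldnames` and use the header names instead.
FIELDS = ("filename", "language", "has_ner")


def load_mappings(handle) -> list:
    """Read the mappings into dictionaries with a boolean `has_ner` flag."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    return [
        {
            "filename": row["filename"],
            "language": row["language"],      # "es", "cat" or "bi"
            "has_ner": row["has_ner"] == "True",
        }
        for row in reader
    ]


# Invented sample rows, not taken from the corpus.
sample = io.StringIO(
    "CARMEN-I_IA_ANTECEDENTES_2\tes\tTrue\n"
    "CARMEN-I_IA_EVOLUCION_7\tbi\tFalse\n"
)
rows = load_mappings(sample)
language_counts = Counter(row["language"] for row in rows)
ner_files = [row["filename"] for row in rows if row["has_ner"]]
print(language_counts, ner_files)
```

With the real file, `load_mappings` would be given an open file handle, for instance one opened with `encoding="utf-8"` and `newline=""`.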

All in all, the folder structure is the following:

  • txt/
    • masked/
    • replaced/
  • ann/
    • masked/
      • anon/
      • ner/
    • replaced/
      • anon/
      • ner/
  • tsv/
    • masked/
    • replaced/
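Given this layout, the files belonging to one document can be located by combining the version folder with the annotation type. The sketch below assumes that a text file and its annotation files share the same stem and that the annotation files use the `.ann` extension. The `document_paths` helper and the `carmen-i` root folder name are hypothetical.

```python
from pathlib import PurePosixPath

VERSIONS = ("masked", "replaced")


def document_paths(root: str, stem: str, version: str = "replaced") -> dict:
    """Build the expected paths for one document in the given version."""
    if version not in VERSIONS:
        raise ValueError(f"version must be one of {VERSIONS}")
    base = PurePosixPath(root)
    return {
        "txt": base / "txt" / version / f"{stem}.txt",
        # Sensitive-data annotations and clinical-entity annotations
        # live in separate subfolders.
        "anon": base / "ann" / version / "anon" / f"{stem}.ann",
        "ner": base / "ann" / version / "ner" / f"{stem}.ann",
    }


paths = document_paths("carmen-i", "CARMEN-I_IA_ANTECEDENTES_2", "masked")
for kind, path in paths.items():
    print(kind, path)
```

Only documents flagged as having clinical concept annotations would be expected to have a file under `ner/`.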

Usage Notes

CARMEN-I is intended for use as a gold standard to train and test NLP tools under development. It is not suitable for clinical research because data related to persons, dates, ages, locations, centers, etc. have been completely substituted by other values. Users who detect any expression that could potentially identify a person are obliged to notify the CARMEN-I authors immediately at [13].


Release Notes

Version 1.0.1: The corpus's post-processing has been improved and some incorrect entities have been fixed. The description document has also been updated: the text statistics and some editorial errors in the methodology have been corrected, and the parts on anonymization have been rewritten for clarity. Finally, links to the anonymization guidelines have been added, and the English and Spanish protocols have been included.


Ethics

This project was approved by the HCB Ethics Research Committee. Individual patient consent was waived because the project did not impact clinical care and all protected health information was anonymized.


Acknowledgements

We thank the following people for their participation in the creation of CARMEN-I: firstly, all the clinical specialists in the HCB who kindly contributed their expertise during the height of the COVID-19 pandemic: Helena Ariño, Elisa Asensio, Elena Calvo, Maria Ángeles Marcos; also, the Universitat de Barcelona’s CLiC group for their valuable contribution: Montse Nofre, Laura Tañá, and Maria Antònia Martí. Finally, Ricard Farreres from Words for Knowledge IT, for his assistance in preprocessing the text documents.

We would also like to acknowledge the funding of the Spanish Government’s Encargo del PlanTL to the BSC.


Conflicts of Interest

The authors declare no conflict of interest.


References

  1. Salvador Lima-López, Naiara Perez, Laura García-Sardiña, and Montse Cuadros. 2020. HitzalMed: Anonymisation of Clinical Text in Spanish. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 7038–7043, Marseille, France. European Language Resources Association.
  2. Montserrat Marimon, Aitor Gonzalez-Agirre, Ander Intxaurrondo, Heidy Rodríguez, Jose Antonio Lopez Martin, Marta Villegas, and Martin Krallinger. Automatic De-Identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).
  3. Salvador Lima-López, Eulàlia Farré-Maduell, Luis Gasco Sánchez, Antonio López Rueda, Santiago Frid, Artur Conesa, Xavier Pastor, and Martin Krallinger. “CARMEN-I: Anonymization Protocol for Clinical Reports in Spanish” [Online]. Zenodo, Available on: zenodo.org/doi/10.5281/zenodo.10171660 [Last accessed: 21-Nov-2023]
  4. Salvador Lima-López, Eulàlia Farré-Maduell, Luis Gasco Sánchez, Antonio López Rueda, Santiago Frid, Artur Conesa, Xavier Pastor, and Martin Krallinger. “CARMEN-I: Anonymization Protocol for Clinical Reports in English” [Online]. Zenodo, Available on: zenodo.org/doi/10.5281/zenodo.10171681 [Last accessed: 21-Nov-2023]
  5. Antonio Miranda Escalada, Luis Gascó, Salvador Lima-López, Eulàlia Farré-Maduell, Darryl Estrada, Anastasios Nentidis, Anastasia Krithara, Georgios Katsimpras, Georgios Paliouras, and Martin Krallinger. "Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources." In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings. 2022.
  6. Salvador Lima-López, Eulàlia Farré-Maduell, Luis Gascó, Anastasios Nentidis, Anastasia Krithara, Georgios Katsimpras, Georgios Paliouras, and Martin Krallinger. "Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023." In Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum. 2023.
  7. Antonio Miranda-Escalada, Eulàlia Farré-Maduell, Salvador Lima-López, Darryl Estrada, Luis Gascó, and Martin Krallinger. "Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of LivingNER shared task and resources." Procesamiento del Lenguaje Natural (2022).
  8. Salvador Lima-López, Eulàlia Farré-Maduell, and Martin Krallinger. “CARMEN-I: Clinical Entities Annotation Guidelines in Spanish” [Online]. Zenodo, Available on: zenodo.org/doi/10.5281/zenodo.10171539 [Last accessed: 21-Nov-2023]
  9. Salvador Lima-López, Eulàlia Farré-Maduell, and Martin Krallinger. “CARMEN-I: Clinical Entities Annotation Guidelines in English” [Online]. Zenodo, Available on: zenodo.org/doi/10.5281/zenodo.10171646 [Last accessed: 21-Nov-2023]
  10. Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun'ichi Tsujii (2012). brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012.
  11. brat standoff format. [Online]. Available on: https://brat.nlplab.org/standoff.html. [Last accessed: 19-Jul-2023]
  12. Creative Commons. (n.d.). Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Retrieved June 25, 2023, from https://creativecommons.org/licenses/by-sa/4.0/
  13. Hospital Clínic. (n.d.). Email communication: Notification of personal data finding in the corpus [Email to infosic@clinic.cat].

Access

Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.

License (for files):
PhysioNet Contributor Review Health Data License 1.5.0

Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Versions
  • 1.0 - Nov. 2, 2023
  • 1.0.1 - April 20, 2024
