Database Open Access

Synthetic Mention Corpora for Disease Entity Recognition and Normalization

Kuleen Sasse John David Osborne

Published: Feb. 3, 2025. Version: 1.0.0


When using this resource, please cite:
Sasse, K., & Osborne, J. D. (2025). Synthetic Mention Corpora for Disease Entity Recognition and Normalization (version 1.0.0). PhysioNet. https://doi.org/10.13026/p5pn-ty93.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Named Entity Recognition (NER) and Entity Normalization (EN) are fundamental tasks in information extraction, particularly in the biomedical and clinical domains. NER identifies textual mentions of entities, while EN maps these mentions to unique identifiers within a structured vocabulary. However, the biomedical domain presents unique challenges for NER, including the diverse and inconsistent lexical representations of biomedical concepts, such as non-standard terminology, abbreviations, complex phrases, and frequent misspellings in clinical texts. Additionally, rare entities are often underrepresented in training datasets and may lack detailed descriptions or synonyms in knowledge graphs, limiting the quality of training data for Disease Entity Recognition (DER) and Disease Entity Normalization (DEN). To address this, we present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, a dataset comprising 128,000 synthetic disease mentions generated using a fine-tuned LLaMa-2-13B-Chat model. These mentions are derived from the Unified Medical Language System (UMLS) disorder group. This corpus aims to enhance the development of more robust systems for disease entity identification and linking in biomedical and clinical text, addressing current limitations in training data availability.


Background

Named Entity Recognition (NER) and Entity Normalization (EN) of those entities to a vocabulary are core tasks in information extraction. NER identifies a mention of an entity in a text span or spans. EN assigns that mention a unique identifier from a given vocabulary. In the biomedical and clinical space, the NER task is complicated by the diverse lexical forms that biomedical concepts can take, including non-standard names, abbreviations, complex conditions, and misspellings in clinical text [1]. Normalization of these entities is also challenging due to the size of vocabularies like SNOMED CT [2] (over 360K concepts) or meta-vocabularies like the Unified Medical Language System (UMLS) [3], with 3.38 million concepts. Moreover, training corpora can suffer from a "long tail" of entities, where the majority of mentions are concentrated in a small percentage of the total terms and in a small fraction of the language that can describe those terms. This phenomenon has been documented multiple times. Fung et al. [4] analyzed counts of normalized entities from the Electronic Health Records (EHRs) of multiple institutions and found that only 21% of terms were needed to cover 95% of all counts. In Portelli et al. [5], all three medical normalization datasets (SMM4H [6], CADEC [7], and their proprietary dataset) exhibit the long-tail issue with MedDRA as the vocabulary, since only the top 100-150 terms are represented.

For this dataset, we focus specifically on Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), which exhibit particularly diverse mention descriptions. Furthermore, because disease frequencies vary widely, the number of mentions in training datasets falls off extremely quickly as a disease becomes rarer. This falloff leaves many diseases with only a handful of mentions, or none at all, while common diseases receive the majority of mentions. The result is a lack of high-quality, comprehensive training datasets, as human annotation is both expensive and time-consuming.


Methods

LLaMa-2 13B Chat [8] was selected as the LLM to generate synthetic disease mentions, based on its open-source architecture and cost. To create our fine-tuning dataset, we leveraged the SemEval 2015 Task 14 [9] dataset, a version of the MIMIC II [10] corpus in which each disease span in the text is labelled and normalized to UMLS. For each disease span in the dataset, we created an instruction-output pair. To create the instruction, we selected one of two instruction templates, depending on whether the mention had a definition in UMLS.

Additionally, the prompt includes the UMLS alternative preferred English name and the first UMLS-provided definition, if one is available. To create the output, we took the sentence containing the disease span along with the two sentences before and after it. We then supervised fine-tuned our LLaMa-2 13B Chat model on this training dataset with a cross-entropy loss, using Quantized Low-Rank Adaptation (Q-LoRA) [11]. The model was trained at 4-bit precision for 2 epochs with a learning rate of 2e-5 and a Q-LoRA adapter rank of 64.
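The exact instruction templates are not reproduced here; the following is a minimal sketch of how such an instruction-output pair could be assembled, where the template wording, function names, and field choices are illustrative assumptions rather than the original pipeline.

```python
def build_instruction(preferred_name, definition=None):
    """Pick one of two hypothetical templates, depending on whether
    UMLS provides a definition for the concept."""
    if definition:
        return ("Write a short clinical note mentioning the disease "
                f"'{preferred_name}', defined as: {definition}")
    return ("Write a short clinical note mentioning the disease "
            f"'{preferred_name}'.")

def build_output(sentences, idx):
    """Context window for the output: the sentence containing the
    disease span plus up to two sentences before and after it."""
    return " ".join(sentences[max(0, idx - 2): idx + 3])
```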

For each UMLS disease concept within the Disorders Semantic Group, we generated 5 different versions of the mention with LLaMa-2 13B, both to increase the chance of generating correctly formatted enclosing <1CUI> tags and to provide diversity in training examples. These tags associate the disease entity with the corresponding spans of the generated text. To obtain initial labels for each synthetic disease mention, we used fuzzy string matching from the Python regex package, which extends typical regex syntax with an edit-distance feature, to extract the disease name from the generated mention. We searched each synthetic mention for the disease name used in the prompt that generated it, and removed any note where the mention could not be identified by fuzzy matching within a budget of 4 edits. After this step, we were left with around 128,000 mentions covering 47,654 of the 53,432 Disorder Group Concept Unique Identifiers (CUIs) available. For each CUI, an average of about 3 unique mentions were generated. All notes were unique and were not copies of the training data.
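The regex package expresses this with fuzzy patterns such as `(?:mention){e<=4}`. A standard-library-only sketch of the same idea, scanning sliding windows with a Levenshtein edit distance and the 4-edit budget described above, might look like this (function names are ours, not from the original pipeline, and the brute-force window scan is for illustration, not efficiency):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_find(mention, text, budget=4):
    """Return the best-matching window of `text` within `budget` edits
    of `mention`, or None if no window qualifies (note discarded)."""
    m, best = len(mention), None
    # Matching windows may be slightly shorter or longer than the mention.
    for length in range(max(1, m - budget), m + budget + 1):
        for start in range(len(text) - length + 1):
            window = text[start:start + length]
            d = levenshtein(mention.lower(), window.lower())
            if d <= budget and (best is None or d < best[0]):
                best = (d, window)
    return None if best is None else best[1]
```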


Data Description

The corpus consists of a single dataset, SYNTHETIC_MENTIONS.csv: a comma-separated values (CSV) file with two columns, "cui" and "matched_output".

"cui" column contains the UMLS CUI for disease highlighted in the synthetic mention

The "matched_output" column contains the synthetic mention with the disease name highlighted between two tags <1CUI> and </1CUI>.

Descriptive Statistics

  • 128,945 rows
  • 47,654 unique CUIs covered
  • Generations per CUI: range 1-7; median 3; mean 2.7
  • Character length of generations: range 273-3,355; median 750; mean 745
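These statistics can be recomputed directly from the CSV with the standard library. A small sketch (the function name and returned keys are our own choices):

```python
from collections import Counter
from statistics import mean, median

def describe(rows):
    """Summarize an iterable of (cui, matched_output) pairs."""
    per_cui = Counter()   # generations per CUI
    lengths = []          # character length of each generation
    n = 0
    for cui, text in rows:
        per_cui[cui] += 1
        lengths.append(len(text))
        n += 1
    gens = list(per_cui.values())
    return {
        "rows": n,
        "unique_cuis": len(per_cui),
        "gen_range": (min(gens), max(gens)),
        "gen_median": median(gens),
        "len_mean": round(mean(lengths)),
    }
```

Feeding it `(row["cui"], row["matched_output"])` pairs from a `csv.DictReader` over SYNTHETIC_MENTIONS.csv should reproduce the figures above.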

Usage Notes

Data can be used directly with any CSV loader in any programming language/software.

Possible applications of this project include:

  1. Training better NER models for clinical and biomedical disease extraction
  2. Training better NEN models for clinical and biomedical disease normalization
  3. Evaluating NER and NEN models on a new dataset

Limitations

  1. Hallucinations
  2. Lower quality than real notes
  3. Possible lack of realistic context around the disease span
  4. Not all CUIs are covered

These limitations make the synthetic notes lower quality than gold-standard annotated notes, and they could affect the performance of models not tested in our work.


Ethics

This data was generated using a large language model trained on deidentified notes from SemEval 2015 Task 14. That dataset was derived from discharge summaries and radiology reports in the MIMIC II corpus. The MIMIC II database was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). The requirement for individual patient consent was waived because the study did not impact clinical care and all protected health information was deidentified.

LLMs are known to hallucinate false or misleading information; therefore, not all rows in the dataset may contain correct medical information. In addition, the entire dataset is synthetic; any resemblance to actual people is unintentional.

LLM training and generation processes were conducted in a secure environment to ensure data safety and privacy. Only authors had access to the cluster and the data before, during, and after the project.


Acknowledgements

We would like to thank UAB Research Computing for use of their HPC infrastructure.

This dataset was created through funding from NIH grants P30AR072583 and R01AG057684. GPU compute was supported by an NVIDIA GPU Grant Program award, "Improving Information Extraction in Biomedical Text".


Conflicts of Interest

N/A


References

  1. Leaman R, Islamaj Doğan R, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013 Nov 15;29(22):2909–17.
  2. El-Sappagh S, Franda F, Ali F, Kwak KS. SNOMED CT standard ontology based on the ontology for general medical science. BMC Medical Informatics and Decision Making. 2018 Aug 31;18(1):76.
  3. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D267-270.
  4. Fung KW, McDonald C, Srinivasan S. The UMLS-CORE project: a study of the problem list terminologies used in large healthcare institutions. Journal of the American Medical Informatics Association. 2010 Nov 1;17(6):675–80.
  5. Portelli B, Scaboro S, Santus E, Sedghamiz H, Chersoni E, Serra G. Generalizing over Long Tail Concepts for Medical Term Normalization. In: Goldberg Y, Kozareva Z, Zhang Y, editors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing [Internet]. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022 [cited 2025 Jan 30]. p. 8580–91. Available from: https://aclanthology.org/2022.emnlp-main.588/
  6. Gonzalez-Hernandez G, Klein AZ, Flores I, Weissenbacher D, Magge A, O’Connor K, et al., editors. Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task [Internet]. Barcelona, Spain (Online): Association for Computational Linguistics; 2020 [cited 2025 Jan 30]. Available from: https://aclanthology.org/2020.smm4h-1.0/
  7. Karimi S, Metke-Jimenez A, Kemp M, Wang C. Cadec: A corpus of adverse drug event annotations. J Biomed Inform. 2015 Jun;55:73–81.
  8. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models [Internet]. arXiv; 2023 [cited 2025 Jan 30]. Available from: http://arxiv.org/abs/2307.09288
  9. Elhadad N, Pradhan S, Gorman S, Manandhar S, Chapman W, Savova G. SemEval-2015 Task 14: Analysis of Clinical Text. In: Nakov P, Zesch T, Cer D, Jurgens D, editors. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) [Internet]. Denver, Colorado: Association for Computational Linguistics; 2015 [cited 2025 Jan 30]. p. 303–10. Available from: https://aclanthology.org/S15-2051/
  10. Saeed M, Villarroel M, Reisner A, Clifford G, Lehman L wei, Moody G, et al. MIMIC-II Clinical Database [Internet]. PhysioNet; [cited 2025 Jan 30]. Available from: https://physionet.org/content/mimic-ii/2.6.0/
  11. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized LLMs [Internet]. arXiv; 2023 [cited 2025 Jan 30]. Available from: http://arxiv.org/abs/2305.14314

Parent Projects
This dataset was derived from the MIMIC II Clinical Database [10]; please cite it when using this project.
Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
Open Data Commons Attribution License v1.0


Files

Total uncompressed size: 93.0 MB.

Name Size Modified
LICENSE.txt 19.9 KB 2025-02-03
README.md 929 B 2024-11-19
SHA256SUMS.txt 312 B 2025-02-03
SYNTHETIC_MENTIONS.csv 93.0 MB 2024-10-08
dd.txt 343 B 2024-11-19