Database Credentialed Access
EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records
Yeonsu Kwon , Jiho Kim , Gyubok Lee , Seongsu Bae , Daeun Kyung , Wonchul Cha , Tom Pollard , Alistair Johnson , Edward Choi
Published: Aug. 19, 2024. Version: 1.0.0
When using this resource, please cite:
Kwon, Y., Kim, J., Lee, G., Bae, S., Kyung, D., Cha, W., Pollard, T., Johnson, A., & Choi, E. (2024). EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records (version 1.0.0). PhysioNet. https://doi.org/10.13026/2nb5-nf74.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Electronic Health Records (EHRs) are integral for storing comprehensive patient medical records, combining structured data (e.g., medications) with detailed clinical notes (e.g., physician notes). These elements are essential for straightforward data retrieval and provide deep, contextual insights into patient care. However, they often suffer from discrepancies due to unintuitive EHR system designs and human errors, posing serious risks to patient safety. To address this, we developed EHRCon, a new dataset and task specifically designed to ensure data consistency between structured tables and unstructured notes in EHRs. EHRCon was crafted in collaboration with healthcare professionals using the MIMIC-III EHR dataset, and includes manual annotations of 3,943 entities across 105 clinical notes checked against database entries for consistency. EHRCon has two versions, one using the original MIMIC-III schema, and another using the OMOP CDM schema, in order to increase its applicability and generalizability.
Background
Electronic Health Records (EHRs) are digital datasets comprising the rich information of a patient's medical history within hospitals. These records integrate both structured data (e.g., medications, diagnoses) and detailed clinical notes (e.g., physician notes). The structured data facilitates straightforward retrieval and analysis of essential information, while clinical notes provide in-depth, contextual insights into the patient's condition. These two forms of data are interconnected and provide complementary information throughout the diagnostic and treatment processes. For example, a practitioner might start by reviewing test results stored in the database, then determine a diagnosis and formulate a treatment plan, which are documented in the clinical notes. These notes are subsequently used to update the structured data in the database.
However, inconsistencies can arise between the two sets of data for several reasons. One primary issue is that EHR interfaces are often designed with a focus on administrative and financial tasks, which makes it difficult to accurately document clinical information [1]. Additionally, overburdened practitioners might unintentionally introduce errors by importing incorrect medication lists, copying and pasting outdated records, or entering inaccurate test results [2, 3, 4]. These errors can lead to significant discrepancies between the structured data and clinical notes in the EHR, potentially jeopardizing patient safety and leading to legal complications [5].
Manual scrutiny of these records is both time-intensive and costly, underscoring the necessity for automated interventions. Despite this need, previous studies on consistency checking between tables and text have primarily focused on single claims and small-scale single tables [6, 7, 8, 9]. These approaches are not designed for the complex and large-scale nature of EHRs, which requires more comprehensive and scalable solutions.
Methods
Annotators begin by carefully reviewing the clinical notes, utilizing web searches and discussions with GPT-4 (0613). Through this process, they identify entities and relevant information within the notes. The identified entities are then classified into Type 1 and Type 2. For each entity, annotators use the Item Search Tool to find the relevant items in the database, then select the items and tables associated with the entity. If none of the retrieved items match the entity, the annotators manually find and match the appropriate items. Following this, the annotators extract information related to the entity from the notes (e.g., dates, values, units) and use it to generate SQL queries. Finally, the annotators execute the generated queries and review the results to label the entity as either CONSISTENT or INCONSISTENT. If a query yields no results, the SQL conditions are sequentially masked and executed to pinpoint the source of the inconsistency. When annotators encounter a corner case that is not addressed in the existing instructions, they update the instructions after thorough discussion. Upon completing all annotations, the annotators engage in a post-processing phase to ensure high-quality data. This phase involves additional annotation of entities according to the final labeling instructions, as well as the removal of any misaligned entities.
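For illustration only, the following is a minimal Python sketch of the condition-masking step described above. The table, column names, and values are hypothetical placeholders, not the annotators' actual tooling or the exact MIMIC-III schema.

import sqlite3

# Hypothetical sketch: drop one WHERE condition at a time; if the
# relaxed query now returns rows, the dropped condition points to the
# column responsible for the inconsistency.
conditions = {
    "itemid = ?": 51221,
    "charttime = ?": "2101-10-20 12:00:00",
    "valuenum = ?": 34.2,
}

def find_inconsistent_conditions(conn, conditions):
    suspects = []
    for masked in conditions:
        kept = {c: v for c, v in conditions.items() if c != masked}
        sql = "SELECT COUNT(*) FROM labevents WHERE " + " AND ".join(kept)
        (count,) = conn.execute(sql, list(kept.values())).fetchone()
        if count > 0:  # rows reappear once this condition is removed
            suspects.append(masked)
    return suspects

conn = sqlite3.connect("mimic3.db")  # assumed local MIMIC-III SQLite build
print(find_inconsistent_conditions(conn, conditions))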
Data Description
We propose a new task and dataset called EHRCon, which is designed to verify the consistency between clinical notes and large-scale relational databases in EHRs. We collaborated closely with practitioners to design labeling instructions based on their insights and expertise, authentically reflecting real hospital environments. Based on these labeling instructions, trained human annotators used the MIMIC-III EHR dataset [10] to manually compare 3,943 entities mentioned in 105 clinical notes against the corresponding table contents, annotating each entity as CONSISTENT or INCONSISTENT. Our dataset also offers interpretability by including detailed information about the specific tables and columns where inconsistencies occurred. Moreover, it contains two versions, one based on the original MIMIC-III schema and another based on its OMOP CDM [11] implementation, allowing us to incorporate various schema types and enhance its generalizability.
1. File Structure
The folder structure is shown below:
base
├─ original
│  ├─ test
│  │  ├─ discharge
│  │  │  ├─ discharge_label
│  │  │  │  └─ EHRCon_{hadm_id}_data.pkl
│  │  │  └─ discharge_test.csv
│  │  ├─ nursing
│  │  └─ physician
│  └─ valid
└─ processed
Where:
- base is the root directory. Within this directory, there are two main subdirectories: original and processed.
- The original directory includes data for consistency checks against the original MIMIC notes.
- The processed directory contains data that has been filtered to remove information not present in the EHR tables.
- Both directories (original and processed) follow the same structure, containing test and valid subdirectories, which in turn include discharge, nursing, and physician directories with the relevant label files (EHRCon_{hadm_id}_data.pkl) and CSV files. The CSV files contain the clinical notes.
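As an illustrative sketch (assuming the dataset has been extracted to a local directory named base, and that the nursing and physician directories follow the same *_label layout shown for discharge), the label files can be enumerated like this:

from pathlib import Path

# Sketch: count the label files under each branch of the layout above.
# The root path and the nursing/physician label-directory names are
# assumptions extrapolated from the discharge example.
root = Path("base")
for version in ("original", "processed"):
    for split in ("test", "valid"):
        for note_type in ("discharge", "nursing", "physician"):
            label_dir = root / version / split / note_type / f"{note_type}_label"
            n_files = len(list(label_dir.glob("EHRCon_*_data.pkl")))
            print(f"{version}/{split}/{note_type}: {n_files} label files")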
2. Structure of EHRCon (.pkl)
Each file contains the following variables:
- Row_id: The ROW_ID of the clinical note
- entity: The entity that is the subject of the consistency check
- data: The values related to the entity, extracted from the clinical note and mapped to the columns of the table
- label: Indicates whether the content related to the entity in the note is consistent with the table, with details on which columns have inconsistencies, if any are found
- errors: The number of columns with inconsistencies
- position: The line number of the entity in the note
- source: The data source (mimic)
- entity_type:
  - 1: entities with numerical values
  - 2: entities without values but whose existence can be verified in the database
  - 3: entities with string values
- all_queries: A list of all value pairs checked to determine the label for the entity
The overall structure of each file is shown below:
{"Row_id": [{"entity": {"data": [{'table_name1': {'column_name1': 'value'}}],
                        "label": '"charttime" and "valuenum" are inconsistency',
                        "errors": 2},
             "position": '4',
             "source": 'mimic',
             "entity_type": '1',
             "all_queries": […]}]}
3. How to load the dataset?
The data can be loaded into Python with the following commands:
import pickle
import pandas as pd

# Load the annotation labels for one admission
# (replace {hadm_id} with an actual admission ID)
with open("EHRCon_{hadm_id}_data.pkl", "rb") as f:
    data = pickle.load(f)

# Load the corresponding clinical notes
notes = pd.read_csv("discharge_test.csv")
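As a minimal sketch of working with a loaded file (the hadm_id and the exact nesting are placeholders inferred from the structure illustrated in Section 2), the annotations can be iterated as follows:

import pickle

# Sketch: iterate over one admission's annotations and print each
# entity's error count and label. Assumes the nested structure shown
# in Section 2; the file name uses a hypothetical hadm_id.
META_FIELDS = {"position", "source", "entity_type", "all_queries"}

with open("EHRCon_12345_data.pkl", "rb") as f:
    data = pickle.load(f)

for row_id, records in data.items():
    for record in records:
        for key, ann in record.items():
            if key in META_FIELDS:
                continue  # per-record metadata, not an entity entry
            print(row_id, key, ann["errors"], ann["label"])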
Usage Notes
We meticulously compared the clinical notes with their corresponding data records and annotated our findings. The resulting dataset has the potential to support the development of EHR systems, as well as algorithms that promote patient safety and quality of care. The dataset lays the groundwork for advances in automated and dependable healthcare documentation systems. Detailed information about the dataset and how to use it can be found on the EHRCon website, the GitHub repository, and in the paper [12, 13].
Limitations
Despite the careful design of our dataset, several limitations exist. First, EHRCon is derived from MIMIC-III, which has been preprocessed to protect patient privacy and to facilitate reuse. This preprocessing may have introduced inconsistencies that do not occur within actual EHR data. Second, while we took great care in the annotation process, there are limitations in our ability to verify the contents of all clinical notes in MIMIC-III. To cover a broader range of cases, more scalable methods are needed.
Release Notes
1.0.0 - Initial Release
Ethics
The authors have no ethics statement to declare.
Acknowledgements
This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant (No. RS-2019-II190075) and a Korea Medical Device Development Fund grant (Project Number: 1711138160, KMDF_PR_20200901_0097), funded by the Korea government (MOTIE, MOHW, MFDS).
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Luis Bernardo Villa and Ivan Cabezas. A review on usability features for designing electronic health records. In 2014 IEEE 16th international conference on e-health networking, applications and services (Healthcom), pages 49–54. IEEE, 2014.
- Sigall K Bell, Tom Delbanco, Joann G Elmore, Patricia S Fitzgerald, Alan Fossa, Kendall Harcourt, Suzanne G Leveille, Thomas H Payne, Rebecca A Stametz, Jan Walker, et al. Frequency and types of patient-reported errors in electronic health record ambulatory care notes. JAMA network open, 3(6):e205867–e205867, 2020.
- Thomas H Payne, W David Alonso, J Andrew Markiel, Kevin Lybarger, Ross Lordon, Meliha Yetisgen, Jennifer M Zech, and Andrew A White. Using voice to create inpatient progress notes: effects on note timeliness, quality, and physician satisfaction. JAMIA open, 1(2):218–226, 2018.
- Siddhartha Yadav, Noora Kazanji, Narayan KC, Sudarshan Paudel, John Falatko, Sandor Shoichet, Michael Maddens, and Michael A Barnes. Comparison of accuracy of physical examination findings in initial progress notes between paper charts and a newly implemented electronic health record. Journal of the American Medical Informatics Association, 24(1):140–144, 2017.
- Hassan A Aziz and Ola Asaad Alsharabasi. Electronic health records uses and malpractice risks. Clinical Laboratory Science, 28(4):250–255, 2015.
- Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. Feverous: Fact extraction and verification over unstructured and structured information. arXiv preprint arXiv:2106.05707, 2021.
- Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164, 2019.
- Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, and Vivek Srikumar. INFOTABS: Inference on tables as semi-structured data. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2309–2324, Online, July 2020. Association for Computational Linguistics.
- Nancy X. R. Wang, Diwakar Mahajan, Marina Danilevsky, and Sara Rosenthal. SemEval-2021 task 9: Fact verification and evidence finding for tabular data in scientific documents (SEM-TAB-FACTS). In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 317–326, Online, August 2021.
- Alistair E. W. Johnson, Tom J. Pollard, and Roger G. Mark. MIMIC-III clinical database (version 1.4), 2016.
- Erica Voss, Rupa Makadia, Amy Matcho, Qianli Ma, Chris Knoll, Martijn Schuemie, Frank Defalco, Ajit Londhe, Vivienne Zhu, and Patrick Ryan. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. Journal of the American Medical Informatics Association (JAMIA), 22, 2015.
- EHRCon GitHub Repository. http://github.com/dustn1259/EHRCon [Accessed: 19 Aug 2024]
- Yeonsu Kwon, Jiho Kim, Gyubok Lee, Seongsu Bae, Daeun Kyung, Wonchul Cha, Tom Pollard, Alistair Johnson, Edward Choi. EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records. 24 Jun 2024. https://arxiv.org/abs/2406.16341
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/2nb5-nf74
DOI (latest version):
https://doi.org/10.13026/x1ea-np24
Project Website:
https://ehrcon.github.io