Database Credentialed Access
RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain
Pavel Blinov, Aleksandr Nesterov, Galina Zubkova, Arina Reshetnikova, Vladimir Kokh, Chaitanya Shivade
Published: April 1, 2022. Version: 1.0.0
When using this resource, please cite:
Blinov, P., Nesterov, A., Zubkova, G., Reshetnikova, A., Kokh, V., & Shivade, C. (2022). RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain (version 1.0.0). PhysioNet. https://doi.org/10.13026/gxzd-cf80.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
There is a shortage of medical text resources for the Russian language. This is a substantial obstacle to the research and development of state-of-the-art deep learning NLP models. To mitigate this issue, we translated the MedNLI data from English to Russian. The RuMedNLI task became part of RuMedBench, the Russian Medical Language Understanding Benchmark.
MedNLI is a dataset based on MIMIC-III records and annotated by doctors performing a natural language inference (NLI) task grounded in the medical history of patients. RuMedNLI is the full Russian-language counterpart of MedNLI.
Background
The rapid development of Natural Language Processing (NLP) technologies calls for more challenging tasks to test how well models "understand" text. One recent candidate for such a task is Natural Language Inference (NLI). Given a pair of texts as input, the task is to determine their relationship. The relation is restricted to the three most basic forms:
- entailment (the second text can be viewed as a consistent continuation of the first one);
- contradiction (the second text conflicts with the first one);
- neutral (the texts are unrelated, so neither entailment nor contradiction applies).
By measuring an NLP method's accuracy on the NLI task, we can gauge its ability to understand text.
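As a minimal sketch of how accuracy on the three-way NLI task could be computed, the snippet below compares predicted relations with gold relations. The function name `nli_accuracy` and the toy labels are illustrative assumptions, not part of the dataset or its tooling.

```python
from typing import Sequence

# The three relations used in the NLI task.
LABELS = ("entailment", "contradiction", "neutral")


def nli_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of text pairs whose predicted relation matches the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Reject labels outside the three allowed relations.
    unknown = (set(gold) | set(predicted)) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example (hypothetical labels): three of four predictions match.
gold = ["entailment", "neutral", "contradiction", "neutral"]
pred = ["entailment", "neutral", "neutral", "neutral"]
print(nli_accuracy(gold, pred))
```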
The task becomes harder in a domain-specific setting such as medical text, which is full of slang phrases, abbreviations, drug names, etc. This calls for specialized datasets and models. Romanov and Shivade [1] designed such a dataset for the NLI task in the medical domain, called MedNLI. The dataset is based on MIMIC-III [2] electronic health records: text extracts from the Past Medical History section were annotated by doctors with one of three relations (entailment, contradiction, or neutral).
The MedNLI data was originally presented in English [3]. Given the strong language dependency of NLP methods, this limits the applicability of the corpus to other languages. To lessen this issue and introduce the data to a broader research community, we present the Russian Medical Natural Language Inference (RuMedNLI) dataset [4].
Methods
RuMedNLI is the full counterpart of MedNLI in the Russian language. To translate the original MedNLI dataset, we used the following procedure.
First, each text was independently processed by two automatic translation services, Google [5] and DeepL [6]. We chose these services based on a preliminary manual quality check. Both services implement modern neural machine translation methods, but their approaches differ, which adds diversity to the translations and is valuable for the next step.
The second step was the assembly of the final translation. Each pair of translations was thoroughly reviewed by a native Russian speaker with proficiency in English, though without specialist medical education. To compensate for the lack of domain knowledge, the reviewer was allowed to consult the medical team in complex cases. The medical team consisted of three members with broad experience in the field:
- One with a purely medical background and 8+ years of therapeutic work experience in Russian clinics (MD, PhD Medicine);
- One pharmacist with international work experience in the pharmaceutical industry;
- One specialist in medical and biological cybernetics (MD).
The typical correction scenario involved selecting the best translation candidate as a base and modifying it accordingly. There were several distinct groups of errors and corrections:
- Direct translation errors. Many errors arise because translation is performed in a general context rather than a medical one. For example, the phrase "delivery was at term" is interpreted as a message about parcel delivery instead of childbirth. Another example is the word "floor", whose clinical sense in "On transfer to floor, pt. denies cp" (a hospital ward) differs from its literal sense in "He was found by his son on the floor". All such texts were rephrased into a correct and more natural form.
- Abbreviation errors. The texts are full of contractions (pt - patient; cp - chest pain; 36 yo M - 36-year-old man; etc.), medical abbreviations (CABG, HTN, DM2, CKD, etc.), and procedure and drug names. Most of these terms were initially mistranslated and had to be fixed or adapted to appropriate Russian terms.
- Measurement corrections. All measurement values were converted to metric units, the primary system used in Russia. Here are some examples:
  - body temperature values from Fahrenheit to Celsius;
  - lengths from feet to meters;
  - weights from pounds to kilograms;
  - vehicular speeds in accident descriptions from miles per hour to kilometers per hour;
  - blood groups from the ABO system to numeric, etc.
- Cultural and language phenomena replacement. We took the liberty of adjusting some texts to better reflect the realities of Russian-speaking countries. For example, baseball is popular primarily in the US, so we replaced it with football.
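The measurement corrections listed above were applied during manual review, not by a script. Purely as an illustration of the arithmetic involved, the sketch below uses the standard conversion factors. All function names are hypothetical, and the Roman-numeral blood group mapping is an assumption about the numeric convention meant, since the text does not spell it out.

```python
def fahrenheit_to_celsius(degrees_f: float) -> float:
    """Convert a body temperature from Fahrenheit to Celsius."""
    return (degrees_f - 32.0) * 5.0 / 9.0


def feet_to_meters(feet: float) -> float:
    """Convert a length from feet to meters (1 ft = 0.3048 m exactly)."""
    return feet * 0.3048


def pounds_to_kilograms(pounds: float) -> float:
    """Convert a weight from pounds to kilograms (1 lb = 0.45359237 kg exactly)."""
    return pounds * 0.45359237


def mph_to_kmh(miles_per_hour: float) -> float:
    """Convert a speed from miles per hour to kilometers per hour."""
    return miles_per_hour * 1.609344


# Assumption: "numeric" refers to the Roman-numeral notation commonly used in Russia.
ABO_TO_NUMERIC = {"O": "I", "A": "II", "B": "III", "AB": "IV"}

print(round(fahrenheit_to_celsius(101.3), 1))  # fever value in Celsius
print(round(pounds_to_kilograms(180), 1))      # body weight in kilograms
print(ABO_TO_NUMERIC["AB"])                    # blood group in numeric notation
```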
It is worth mentioning that of the 14,930 unique English text fragments (premises, entailment, contradiction, and neutral sentences), only 4,820 translations were left unchanged; the remaining 10,110 were corrected in some way. Based on this, we estimate that automatic translation for the medical domain produced an acceptable result for about 32.3% of the fragments (4,820 of 14,930).
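The 32.3% figure follows directly from the counts in the paragraph above. A short check, with variable names chosen only for illustration:

```python
# Counts reported in the text.
total_fragments = 14_930
unchanged = 4_820
corrected = 10_110

# The two groups should account for every fragment.
assert unchanged + corrected == total_fragments

# Share of machine translations accepted without any correction.
unchanged_share = unchanged / total_fragments
print(f"{unchanged_share:.1%}")  # prints 32.3%
```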
Data Description
The dataset consists of three JSON Lines files:
- ru_mli_train_v1.jsonl: the train set;
- ru_mli_dev_v1.jsonl: the development set;
- ru_mli_test_v1.jsonl: the test set.
As in the original data, each line in the train/development/test set is a JSON object with the following keys:
- gold_label: the NLI label, one of entailment, contradiction, or neutral;
- sentence1: premise statement;
- sentence2: hypothesis statement;
- sentence1_parse: constituency parse of the premise using the Stanford parser;
- sentence2_parse: constituency parse of the hypothesis using the Stanford parser;
- sentence1_binary_parse: binary parse of the premise using the Stanford parser;
- sentence2_binary_parse: binary parse of the hypothesis using the Stanford parser.
Moreover, each record is enriched with translated versions of the original sentence1 and sentence2 values:
- ru_sentence1: premise statement in Russian;
- ru_sentence2: hypothesis statement in Russian.
The three NLI labels are defined as follows:
- entailment: hypothesis can be inferred from the premise;
- contradiction: hypothesis can NOT be inferred from the premise;
- neutral: inference relation of the premise and the hypothesis is undetermined.
An example of the dataset is shown below:
{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb94b8-66c7-11e7-a8dc-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has elevated Cr", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN Cr)))))", "sentence2_binary_parse": "( Patient ( has ( elevated Cr ) ) )", "gold_label": "entailment", "ru_sentence1": "Анализы отличались содержанием креатина 1,7 (исходный уровень 0,5 по старым записям) и лактата 2,4.", "ru_sentence2": "У пациента повышен креатин"}
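A minimal sketch of reading records in this format with the Python standard library follows. The helper names (`read_records`, `label_counts`, `load_split`) are hypothetical, and the sketch assumes that each non-empty line is one JSON object and that the file was obtained through credentialed access and saved locally.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, List

# Keys this sketch relies on; the parse keys are present too but not needed here.
REQUIRED_KEYS = ("gold_label", "sentence1", "sentence2", "ru_sentence1", "ru_sentence2")


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines input, yielding one dict per non-empty line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise KeyError(f"line {line_number}: missing keys {missing}")
        yield record


def label_counts(records: Iterable[dict]) -> Counter:
    """Count how often each gold label occurs."""
    return Counter(record["gold_label"] for record in records)


def load_split(path: str) -> List[dict]:
    """Load one split, e.g. a local copy of ru_mli_train_v1.jsonl (path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return list(read_records(handle))
```

Calling `load_split("ru_mli_dev_v1.jsonl")` and passing the result to `label_counts` would give the label distribution of the development set. Opening the file with UTF-8 encoding keeps the Cyrillic text in `ru_sentence1` and `ru_sentence2` intact.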
Usage Notes
We hope our work will be helpful for the Russian and multilingual medical research community. The RuMedNLI dataset provides a starting point for development of Russian-language models, and supports the assessment of models for a well-established NLI task.
Although our dataset is likely to contain some errors, we hope that it nurtures research and development. We intend to release improved versions of this dataset as required, and we also encourage the community to share complementary datasets.
Ethics
The authors declare no ethics concerns.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Romanov, A., & Shivade, C. (2018). Lessons from Natural Language Inference in the Clinical Domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 1586-1596).
- Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.
- Shivade, C. (2019). MedNLI - A Natural Language Inference Dataset For The Clinical Domain (version 1.0.0). PhysioNet. https://doi.org/10.13026/C2RS98.
- Blinov, P., Reshetnikova, A., Nesterov, A., Zubkova, G., Kokh, V. RuMedBench: A Russian Medical Language Understanding Benchmark. arXiv:2201.06499v1 [Preprint]. 2022 [cited 2022 Feb 18]: [10 p.]. Available from: https://arxiv.org/abs/2201.06499.
- Google Translate [Internet]. [cited 2022 February 18]. Available from: http://translate.google.com.
- DeepL Translator [Internet]. [cited 2022 February 18]. Available from: https://www.deepl.com/translator.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/gxzd-cf80
DOI (latest version):
https://doi.org/10.13026/8nyz-cn98
Topics:
natural language inference
recognizing textual entailment
russian language
Files
To access the files, you must:
- be a credentialed user
- complete required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project