Name: Endoscapes2023, A Critical View of Safety and Surgical Scene Segmentation Dataset for Laparoscopic Cholecystectomy
Published: Dec. 11, 2024
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

Database Restricted Access

Pietro Mascagni , Deepak Alapatt , Aditya Murali , Armine Vardazaryan , Alain Garcia Vazquez , Nariaki Okamoto , Guido Costamagna , Didier Mutter , Jacques Marescaux , Bernard Dallemagne , Nicolas Padoy

Published: Dec. 11, 2024. Version: 1.0.0

When using this resource, please cite:

MLA	Mascagni, Pietro, et al. "Endoscapes2023, A Critical View of Safety and Surgical Scene Segmentation Dataset for Laparoscopic Cholecystectomy" (version 1.0.0). PhysioNet (2024), https://doi.org/10.13026/czwq-jh81.
APA	Mascagni, P., Alapatt, D., Murali, A., Vardazaryan, A., Garcia Vazquez, A., Okamoto, N., Costamagna, G., Mutter, D., Marescaux, J., Dallemagne, B., & Padoy, N. (2024). Endoscapes2023, A Critical View of Safety and Surgical Scene Segmentation Dataset for Laparoscopic Cholecystectomy (version 1.0.0). PhysioNet. https://doi.org/10.13026/czwq-jh81.
Chicago	Mascagni, Pietro, Alapatt, Deepak, Murali, Aditya, Vardazaryan, Armine, Garcia Vazquez, Alain, Okamoto, Nariaki, Costamagna, Guido, Mutter, Didier, Marescaux, Jacques, Dallemagne, Bernard, and Nicolas Padoy. "Endoscapes2023, A Critical View of Safety and Surgical Scene Segmentation Dataset for Laparoscopic Cholecystectomy" (version 1.0.0). PhysioNet (2024). https://doi.org/10.13026/czwq-jh81.
Harvard	Mascagni, P., Alapatt, D., Murali, A., Vardazaryan, A., Garcia Vazquez, A., Okamoto, N., Costamagna, G., Mutter, D., Marescaux, J., Dallemagne, B., and Padoy, N. (2024) 'Endoscapes2023, A Critical View of Safety and Surgical Scene Segmentation Dataset for Laparoscopic Cholecystectomy' (version 1.0.0), PhysioNet. Available at: https://doi.org/10.13026/czwq-jh81.
Vancouver	Mascagni P, Alapatt D, Murali A, Vardazaryan A, Garcia Vazquez A, Okamoto N, Costamagna G, Mutter D, Marescaux J, Dallemagne B, Padoy N. Endoscapes2023, A Critical View of Safety and Surgical Scene Segmentation Dataset for Laparoscopic Cholecystectomy (version 1.0.0). PhysioNet. 2024. Available from: https://doi.org/10.13026/czwq-jh81.

Additionally, please cite the original publication:

Murali, A., Alapatt, D., Mascagni, P., Vardazaryan, A., Garcia, A., Okamoto, N., ... & Padoy, N. (2023). Latent graph representations for critical view of safety assessment. IEEE Transactions on Medical Imaging.

Please include the standard citation for PhysioNet:

APA	Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
MLA	Goldberger, A., et al. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220." (2000).
CHICAGO	Goldberger, A., L. Amaral, L. Glass, J. Hausdorff, P. C. Ivanov, R. Mark, J. E. Mietus, G. B. Moody, C. K. Peng, and H. E. Stanley. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220." (2000).
HARVARD	Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P.C., Mark, R., Mietus, J.E., Moody, G.B., Peng, C.K. and Stanley, H.E., 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
VANCOUVER	Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Minimally invasive image-guided surgery heavily relies on vision. Deep learning models for surgical video analysis can support surgeons in visual tasks such as assessing the critical view of safety (CVS) in laparoscopic cholecystectomy, potentially contributing to surgical safety and efficiency. However, the performance, reliability, and reproducibility of such models are deeply dependent on the availability of data with high-quality annotations. To this end, we release Endoscapes2023, a dataset comprising 201 laparoscopic cholecystectomy videos with regularly spaced frames annotated with segmentation masks of surgical instruments and hepatocystic anatomy, as well as assessments of the criteria defining the CVS by three trained surgeons following a public protocol. Endoscapes2023 enables the development of models for object detection, semantic and instance segmentation, and CVS prediction, contributing to safe laparoscopic cholecystectomy.

Background

Image-guided procedures such as laparoscopic, endoscopic, and radiological interventions heavily rely on imaging [1]. Such minimally-invasive approaches have shown significant value for patients and healthcare systems. However, the heuristic nature of human visual perception can lead to erroneous interpretations of endoscopic and radiological images, contributing to serious adverse events that were rarer during traditional, open surgery procedures [2].

In laparoscopic cholecystectomy (LC), an abdominal surgical procedure performed by most surgeons, human heuristics can trick operators into believing that the funnel-shaped structure in continuity with the gallbladder is the cystic duct when, at times, it is the common bile duct [2]. On these occasions, surgeons can unintentionally transect the common bile duct, thereby causing a major bile duct injury (BDI), an adverse event resulting in a threefold increase in mortality at 1 year as well as major costs for surgeons and healthcare systems [3, 4].

To prevent this visual perceptual illusion, which causes 97% of major BDI [2], Strasberg et al. proposed in 1995 the so-called critical view of safety (CVS) to conclusively identify the cystic duct [5] (further described below in Methods, CVS annotation). Currently, all guidelines on safe LC recommend obtaining the CVS [6-8]; however, the rate of BDI remains stable today [9]. A potential reason for this discrepancy is that CVS assessment is qualitative and subject to observer interpretation [10, 11].

Over the past decade, surgical data science [12, 13] teams from academia as well as from industry have been actively developing deep learning models for safe laparoscopic cholecystectomy, spanning from workflow analysis to intraoperative guidance and postoperative documentation [14-17]. Automated CVS assessment was first tackled by our prior work presenting DeepCVS, a method for joint anatomical segmentation and CVS prediction [17]. For this initial work, 2854 images showing anterior views of hepatocystic anatomy were hand-picked from 201 LC videos and annotated by a surgeon with a binary assessment of the three criteria defining the CVS.

A subset of 402 images balancing optimal and suboptimal CVS achievement was further annotated with semantic segmentation masks. This initial dataset, while instrumental in enabling a proof-of-concept study, was characterized by two critical limitations: (1) the frame selection process introduced bias during both training and evaluation and (2) the small dataset size limited testing to only 571 images for CVS assessment and 80 images for semantic segmentation.

Endoscapes2023 is a vastly expanded dataset from the same 201 videos used for DeepCVS development. Following a published annotation protocol [18], the segment of the video that is relevant for CVS assessment was first isolated, then frames were extracted from this segment at 1 frame per second; this process eliminates the selection bias of the DeepCVS dataset.

Recognizing the need for robust public datasets to foster research and development for surgical safety, our surgical data science team of computer scientists and surgeons has built and publicly released Endoscapes2023, the first sizable dataset with frame-wise annotations of CVS criteria assessments by three independent surgeons and scene segmentation.

Methods

Data collection

Surgical videos were prospectively recorded at the Institute of Image-Guided Surgery (IHU-Strasbourg, France) between February 10, 2015, and June 7, 2019. As detailed in prior work [17], 201 full-length endoscopic videos of standard LC, each corresponding to a unique patient, were retrospectively collected following patients' written consent. The procedures were performed by 22 surgeons (10 attending surgeons and 12 surgical residents) in patients older than 18 years operated on for benign conditions. Videos were not further selected. Video preprocessing steps included merging files of the same case, converting to 30 frames per second (FPS), and re-encoding at a resolution of 480x854 with the H.264 codec.

The Ethics Committee of the University of Strasbourg approved the data collection within the “SafeChole – Surgical data science for safe laparoscopic cholecystectomy” study (ID F20200730144229). According to the French MR004 reference methodology and GDPR, all data was collected with patient consent and can be shared for research purposes following de-identification.

Video editing and frame extraction

A surgeon annotated the time of the first incision on the hepatocystic triangle (“Start” timestamp) and the time when the first clip was applied on either the cystic duct or the cystic artery (“End” timestamp) to define the region of the video of interest for CVS assessment. Since in most cases some dissection is necessary before the CVS criteria can be visualized, a surgeon also annotated the moment when any of the 3 CVS criteria became evaluable (i.e., when the anatomical structures defining a CVS criterion become visible, herein referred to as the “Criterion x evaluable” timestamp). While of potential interest for CVS assessment, the segment of video before the “Criterion x evaluable” timestamp most likely does not show any CVS criterion having been achieved and hence can be excluded from explicit annotation to decrease the annotation cost. Finally, the 201 LC videos were edited down to shorter video clips showing only the segment from the “Criterion x evaluable” timestamp to the “End” timestamp. The average duration of these video clips is 283.1 ± 275.9 seconds. Extracting 1 frame per second from these video clips resulted in a dataset of 58813 frames.

CVS annotation

According to Strasberg et al. [5], CVS is defined as the view of 2 and only 2 tubular structures, the cystic duct and the cystic artery, connected to the gallbladder (2-structure criterion, C1), a hepatocystic triangle cleared from fat and connective tissues (hepatocystic triangle criterion, C2) and the lower part of the gallbladder separated from the liver bed (cystic plate criterion, C3). For consistency, each of these 3 CVS criteria was assessed as either achieved or not, in a binary fashion; if 3 out of 3 criteria were achieved, then CVS was considered obtained [10].

Video annotation with binary assessments of CVS criteria was performed on all 201 LC cases by a trained surgeon using Excel (Microsoft Corporation, USA) spreadsheets. CVS was achieved in 84 (41.8%) of the included procedures, with C1, C2, and C3 being achieved in 170 (84.6%), 122 (60.7%), and 111 (55.2%) of the cases respectively.

Additionally, one frame every 5 seconds was sampled for manual annotation of CVS criteria by three independent surgeons trained to annotate CVS criteria. For computing statistics and evaluating models, the most common assessment of each of the CVS criteria in each image was defined as the consensus ground truth. In the resulting dataset of 11090 frames, named Endoscapes-CVS201, CVS was achieved in 678 (6.1%) of the annotated frames, with C1, C2, and C3 being achieved in 1899 (17.1%), 1380 (12.4%), and 2124 (19.2%) of the frames respectively. The inter-rater agreement (Cohen’s kappa) for CVS, C1, C2, and C3 was 0.38, 0.33, 0.53, and 0.44, respectively; Cohen’s kappa values were first computed between each pair of annotators, then averaged across the three possible pairs for each of the three criteria and for overall CVS.
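
The consensus and agreement computations described above can be sketched as follows. This is a minimal illustration on made-up toy labels, not the authors' code: the function names are hypothetical, labels are assumed to be binary (0/1), and the handling of the degenerate case where chance agreement equals 1 is a choice made here for the sketch.

```python
from itertools import combinations


def majority_vote(labels):
    """Return 1 if more than half of the binary labels are 1, else 0."""
    return int(sum(labels) * 2 > len(labels))


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of positive labels.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined,
        # and this sketch treats it as perfect agreement (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over every pair of raters."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy labels for one criterion on six frames (one row per annotator).
toy = [
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
]
consensus = [majority_vote(frame) for frame in zip(*toy)]
agreement = mean_pairwise_kappa(toy)
```

With three annotators and binary labels, the majority vote is equivalent to rounding the mean of the three labels to the nearest integer.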

Spatial annotation

The gallbladder, the cystic duct, the cystic artery, the cystic plate, the dissection in the hepatocystic triangle (i.e., the “windows” between the cystic duct and the cystic artery, and between the cystic artery and the cystic plate) and surgical tools are the elements in the surgical scene evaluated for CVS assessment.

These 6 classes were manually segmented by 5 trained annotators, 3 surgeons and 2 computer scientists with domain-specific expertise, on 1 frame every 30 seconds, resulting in a dataset of 1933 frames with semantic segmentation. A subset of these, 493 frames from 50 LC cases selected as described in the associated technical report [19], are publicly released as Endoscapes-Seg50. Finally, a dataset including 1933 frames with bounding boxes automatically extracted from the segmentation masks is being released as Endoscapes-BBox201.
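
As an illustration of how a bounding box can be derived from a segmentation mask, the sketch below computes the tight box around the nonzero pixels of a binary instance mask. It is not the extraction procedure used for Endoscapes-BBox201, which is not detailed here; the function name is hypothetical, the mask is assumed to be a list of rows of 0/1 values, and the (x, y, width, height) output with pixel-inclusive width and height is an assumed convention in the COCO style.

```python
def mask_to_bbox(mask):
    """Return (x, y, width, height) of the tight box around nonzero pixels.

    `mask` is a list of rows (height x width) of 0/1 values.
    Returns None when the mask contains no nonzero pixel.
    """
    # Indices of rows that contain at least one foreground pixel.
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Indices of columns that contain at least one foreground pixel.
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    x_min, y_min = cols[0], rows[0]
    return (x_min, y_min, cols[-1] - x_min + 1, rows[-1] - y_min + 1)


# Made-up 4x5 mask with a small foreground blob.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_bbox(toy_mask)
```

For full-resolution masks, the same logic is usually written with vectorized array operations rather than Python loops.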

The number and percentage of the 1933 frames that contain each spatial class is as follows: Gallbladder (1821, 94.2%), Cystic Duct (1478, 76.5%), Cystic Artery (1020, 52.8%), Cystic Plate (701, 36.3%), Hepatocystic Triangle Dissection (683, 35.3%), Tool (1771, 91.6%). A frame is considered to contain a spatial class if there is at least one bounding box with said class label.

Similarly, the number and percentage of the 493 frames that contain each spatial class is as follows: Gallbladder (466, 94.5%), Cystic Duct (374, 75.9%), Cystic Artery (283, 57.4%), Cystic Plate (192, 38.9%), Hepatocystic Triangle Dissection (146, 29.6%), Tool (450, 91.3%). A frame is considered to contain a spatial class if there is at least one pixel in the semantic segmentation mask with said class label.
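
The per-class frame counts above follow from a simple presence rule, sketched below for semantic masks. This is a hedged illustration on made-up data: the function names are hypothetical, each mask is assumed to be a 2-D grid of integer class ids, and class id 0 is assumed to denote background.

```python
from collections import Counter


def classes_in_mask(mask, ignore=(0,)):
    """Set of class ids with at least one pixel in a 2-D class-id mask."""
    return {cid for row in mask for cid in row} - set(ignore)


def class_frequencies(masks):
    """Map class id -> (frame count, percentage of frames containing it)."""
    counts = Counter()
    for mask in masks:
        counts.update(classes_in_mask(mask))
    total = len(masks)
    return {cid: (n, round(100 * n / total, 1)) for cid, n in counts.items()}


# Four made-up 2x2 masks; id 0 is treated as background (an assumption).
toy_masks = [
    [[0, 1], [1, 2]],
    [[0, 0], [0, 1]],
    [[0, 0], [0, 0]],
    [[2, 2], [2, 2]],
]
freqs = class_frequencies(toy_masks)
```

The same percentage formula reproduces the reported figures, for example 466 of 493 frames rounds to 94.5%.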

Data Description

We have published a technical report [19] which, besides describing dataset splits used in prior works and reporting extensive baselines to benchmark a variety of models, provides a comprehensive description of the file structure and annotation format; we summarize the key points here.

All extracted frames at 1 frame per second can be found in a directory entitled ‘all’.

The frames are also divided into training, validation, and testing subsets in correspondingly named folders to facilitate model development.

Semantic segmentation masks for the 50-video subset, namely Endoscapes-Seg50, are included in a folder named ‘semseg’. Instance segmentation masks are stored in a folder named ‘insseg’; each instance segmentation mask consists of two files – one NPY file containing a stack of binary masks where each binary mask represents an instance, as well as one CSV file containing the label of each instance.

Bounding box annotations and CVS annotations are provided for each data split (training/validation/testing) in two different COCO-style JSON files: ‘annotation_coco.json’ for the bounding boxes, and ‘annotation_ds_coco.json’ for the CVS labels (see [19] for more details).
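
A minimal sketch of reading such a file and grouping bounding boxes per image is given below. It assumes the standard COCO keys (`images`, `annotations`, `categories`, `file_name`, `image_id`, `category_id`, `bbox`); the exact fields present in the Endoscapes files, in particular how CVS labels are stored in ‘annotation_ds_coco.json’, should be checked against [19]. The function names and the toy record, including its category name, are made up for illustration.

```python
import json
from collections import defaultdict


def load_coco(path):
    """Read a COCO-style JSON file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def index_coco(coco):
    """Group COCO annotations by the file name of the image they belong to."""
    id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
    cat_names = {cat["id"]: cat["name"] for cat in coco.get("categories", [])}
    per_image = defaultdict(list)
    for ann in coco.get("annotations", []):
        # Fall back to the raw id when no category name is available.
        label = cat_names.get(ann["category_id"], ann["category_id"])
        per_image[id_to_name[ann["image_id"]]].append((label, ann["bbox"]))
    return dict(per_image)


# Synthetic stand-in for 'annotation_coco.json'; all values are made up.
toy = {
    "images": [{"id": 1, "file_name": "1_29375.jpg"}],
    "categories": [{"id": 99, "name": "example_class"}],
    "annotations": [
        {"image_id": 1, "category_id": 99, "bbox": [10, 20, 30, 40]},
    ],
}
boxes = index_coco(toy)
```

On the real data, `index_coco(load_coco(path))` would be called with the path to one split's ‘annotation_coco.json’.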

Lastly, all CVS annotations are additionally summarized in a single CSV file called ‘all_metadata.csv’. This file contains the CVS criteria labels by each annotator for each frame as well as the consensus CVS criteria labels (average of 3 annotators) for each frame.

We describe the file structure below.

$DATA_HOME
└── train # All training frames at 1 fps
    ├── 1_29375.jpg # SYNTAX: ${VIDEO_ID}_{FRAME_NUM}.jpg
    ├── ...
    ├── 120_85800.jpg
    ├── annotation_coco.json # Annotation File with Bounding Boxes for 1 frame every 30s
    ├── annotation_ds_coco.json # Annotation File with CVS labels for 1 frame every 5s
└── val # All validation frames at 1 fps
    ├── 121_14850.jpg
    ├── ...
    ├── 161_34325.jpg
    ├── annotation_coco.json
    ├── annotation_ds_coco.json
└── test # All testing frames at 1 fps
    ├── 162_5850.jpg
    ├── ...
    ├── 201_46125.jpg
    ├── annotation_coco.json
    ├── annotation_ds_coco.json
└── train_seg # Training frames for EndoscapesSeg50 at 1 fps
    ├── 4_21725.jpg
    ├── ...
    ├── 119_58000.jpg
    ├── annotation_coco.json # Annotation File with Instance Segmentation Masks for 1 frame every 30s
└── val_seg # Validation frames for EndoscapesSeg50 at 1 fps
    ├── 126_10825.jpg
    ├── ...
    ├── 159_60875.jpg
    ├── annotation_coco.json
└── test_seg # Testing frames for EndoscapesSeg50 at 1 fps
    ├── 165_22925.jpg
    ├── ...
    ├── 189_34875.jpg
    ├── annotation_coco.json
└── insseg # Instance segmentation masks SYNTAX: ${VIDEO_ID}_{FRAME_NUM}.npy/csv
    ├── 5_5900.npy # stack of instance masks, where each channel corresponds to an object instance
    ├── 5_5900.csv # label of each slice of numpy mask
    ├── ...
└── semseg # Semantic segmentation masks SYNTAX: ${VIDEO_ID}_{FRAME_NUM}.png
    ├── 5_5900.png # 480x854x3 image with each element representing the class_id of each pixel
    ├── ...
└── all_metadata.csv # CSV file with CVS annotations by each annotator
└── vid_cvs.csv # CSV file with Video-level CVS assessments
└── train_vids.txt # 120 training videos for Endoscapes-CVS201 and Endoscapes-BBox201
└── val_vids.txt # 41 validation videos for Endoscapes-CVS201 and Endoscapes-BBox201
└── test_vids.txt # 40 testing videos for Endoscapes-CVS201 and Endoscapes-BBox201
└── train_seg_vids.txt # 30 training videos for Endoscapes-Seg50
└── val_seg_vids.txt # 10 validation videos for Endoscapes-Seg50
└── test_seg_vids.txt # 10 testing videos for Endoscapes-Seg50
└── seg_label_map.txt # map from class name to id; note that background is ignored for object detection/instance segmentation; for these tasks, id 0 becomes cystic plate, ...
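
The `${VIDEO_ID}_{FRAME_NUM}` naming convention shown above can be parsed as in the following sketch, which groups frames by video. The helper names are hypothetical, the set of accepted extensions is an assumption based on the listing, and the file name `1_29400.jpg` is made up for the example.

```python
import re
from collections import defaultdict

# Matches names such as '1_29375.jpg' or '5_5900.npy'.
FRAME_RE = re.compile(r"^(\d+)_(\d+)\.(jpg|png|npy|csv)$")


def parse_frame_name(name):
    """Split '${VIDEO_ID}_{FRAME_NUM}.ext' into (video_id, frame_num)."""
    m = FRAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(m.group(1)), int(m.group(2))


def group_by_video(names):
    """Map video id -> sorted frame numbers, skipping non-frame files."""
    frames = defaultdict(list)
    for name in names:
        if FRAME_RE.match(name):
            vid, num = parse_frame_name(name)
            frames[vid].append(num)
    return {vid: sorted(nums) for vid, nums in frames.items()}


listing = ["1_29375.jpg", "120_85800.jpg", "1_29400.jpg", "annotation_coco.json"]
grouped = group_by_video(listing)
```

Grouping by video id in this way keeps all frames of a video together, which matters because the splits are defined at the video level.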

Usage Notes

The plethora of unlabeled images extracted from the most critical part of LC procedures, the many images with CVS criteria assessments, the low-frequency frames with bounding box annotations, and the relatively small subset with expensive segmentation masks not only effectively simulate realistic annotation budgets but also enable a huge diversity of experiments concerning mixed-supervision (e.g., combining image-level and instance-level annotations), semi-supervision (i.e., using both labeled and unlabeled data to train), and temporal modeling.

Release Notes

Version 1.0.0: First version of this dataset

Ethics

Acknowledgements

This study was partially supported by French State Funds managed by the Agence Nationale de la Recherche (ANR) under grants ANR-20-CHIA-0029-01 (National AI Chair AI4ORSafety) and ANR-10-IAHU-02 (IHU Strasbourg).

Conflicts of Interest

The Authors report no conflicts of interest.

References

1. Mascagni P, Longo F, Barberio M, Seeliger B, Agnus V, Saccomandi P, Hostettler A, Marescaux J, Diana M. New intraoperative imaging technologies: Innovating the surgeon’s eye toward surgical precision. Journal of surgical oncology. 2018 Aug;118(2):265-82.
2. Way LW, Stewart L, Gantert W, Liu K, Lee CM, Whang K, Hunter JG. Causes and prevention of laparoscopic bile duct injuries: analysis of 252 cases from a human factors and cognitive psychology perspective. Annals of surgery. 2003 Apr 1;237(4):460-9.
3. Rogers Jr SO, Gawande AA, Kwaan M, Puopolo AL, Yoon C, Brennan TA, Studdert DM. Analysis of surgical errors in closed malpractice claims at 4 liability insurers. Surgery. 2006 Jul 1;140(1):25-33.
4. Berci G, Hunter J, Morgenstern L, Arregui M, Brunt M, Carroll B, Edye M, Fermelia D, Ferzli G, Greene F, Petelin J. Laparoscopic cholecystectomy: first, do no harm; second, take care of bile duct stones. Surgical endoscopy. 2013 Apr;27:1051-4.
5. Strasberg SM, Hertl M, Soper NJ. An analysis of the problem of biliary injury during laparoscopic cholecystectomy. Journal of the American College of Surgeons. 1995;180(1):101-25.
6. Brunt LM, Deziel DJ, Telem DA, Strasberg SM, Aggarwal R, Asbun H, Bonjer J, McDonald M, Alseidi A, Ujiki M, Riall TS. Safe cholecystectomy multi-society practice guideline and state-of-the-art consensus conference on prevention of bile duct injury during cholecystectomy. Surgical endoscopy. 2020 Jul;34:2827-55.
7. Conrad C, Wakabayashi G, Asbun HJ, Dallemagne B, Demartines N, Diana M, Fuks D, Giménez ME, Goumard C, Kaneko H, Memeo R. IRCAD recommendation on safe laparoscopic cholecystectomy. Journal of Hepato‐Biliary‐Pancreatic Sciences. 2017 Nov;24(11):603-15.
8. Wakabayashi G, Iwashita Y, Hibi T, Takada T, Strasberg SM, Asbun HJ, Endo I, Umezawa A, Asai K, Suzuki K, Mori Y. Tokyo Guidelines 2018: surgical management of acute cholecystitis: safe steps in laparoscopic cholecystectomy for acute cholecystitis (with videos). Journal of Hepato‐biliary‐pancreatic Sciences. 2018 Jan;25(1):73-86.
9. Pucher PH, Brunt LM, Davies N, Linsk A, Munshi A, Rodriguez HA, Fingerhut A, Fanelli RD, Asbun H, Aggarwal R. SAGES Safe Cholecystectomy Task Force. Outcome trends and safety measures after 30 years of laparoscopic cholecystectomy: a systematic review and pooled data analysis. Surg Endosc. 2018 May;32(5):2175-83.
10. Mascagni P, Fiorillo C, Urade T, Emre T, Yu T, Wakabayashi T, Felli E, Perretta S, Swanstrom L, Mutter D, Marescaux J. Formalizing video documentation of the Critical View of Safety in laparoscopic cholecystectomy: a step towards artificial intelligence assistance to improve surgical safety. Surgical endoscopy. 2020 Jun;34:2709-14.
11. Nijssen MA, Schreinemakers JM, Meyer Z, Van Der Schelling GP, Crolla RM, Rijken AM. Complications after laparoscopic cholecystectomy: a video evaluation study of whether the critical view of safety was reached. World journal of surgery. 2015 Jul;39:1798-803.
12. Maier-Hein L, Vedula SS, Speidel S, Navab N, Kikinis R, Park A, Eisenmann M, Feussner H, Forestier G, Giannarou S, Hashizume M. Surgical data science for next-generation interventions. Nature Biomedical Engineering. 2017 Sep;1(9):691-6.
13. Maier-Hein L, Eisenmann M, Sarikaya D, März K, Collins T, Malpani A, Fallert J, Feussner H, Giannarou S, Mascagni P, Nakawala H. Surgical data science–from concepts toward clinical translation. Medical image analysis. 2022 Feb 1;76:102306.
14. Mascagni P, Alapatt D, Sestini L, Altieri MS, Madani A, Watanabe Y, Alseidi A, Redan JA, Alfieri S, Costamagna G, Boškoski I. Computer vision in surgery: from potential to clinical value. npj Digital Medicine. 2022 Oct 28;5(1):163.
15. Madani A, Namazi B, Altieri MS, Hashimoto DA, Rivera AM, Pucher PH, Navarrete-Welton A, Sankaranarayanan G, Brunt LM, Okrainec A, Alseidi A. Artificial intelligence for intraoperative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy. Annals of surgery. 2022 Aug 1;276(2):363-9.
16. Mascagni P, Alapatt D, Laracca GG, Guerriero L, Spota A, Fiorillo C, Vardazaryan A, Quero G, Alfieri S, Baldari L, Cassinotti E. Multicentric validation of EndoDigest: a computer vision platform for video documentation of the critical view of safety in laparoscopic cholecystectomy. Surgical Endoscopy. 2022 Nov;36(11):8379-86.
17. Mascagni P, Vardazaryan A, Alapatt D, Urade T, Emre T, Fiorillo C, Pessaux P, Mutter D, Marescaux J, Costamagna G, Dallemagne B. Artificial intelligence for surgical safety: automatic assessment of the critical view of safety in laparoscopic cholecystectomy using deep learning. Annals of surgery. 2022 May 1;275(5):955-61.
18. Mascagni P, Alapatt D, Garcia A, Okamoto N, Vardazaryan A, Costamagna G, Dallemagne B, Padoy N. Surgical data science for safe cholecystectomy: a protocol for segmentation of hepatocystic anatomy and assessment of the critical view of safety. arXiv preprint arXiv:2106.10916. 2021 Jun 21.
19. Murali A, Alapatt D, Mascagni P, Vardazaryan A, Garcia A, Okamoto N, Costamagna G, Mutter D, Marescaux J, Dallemagne B, Padoy N. The endoscapes dataset for surgical scene segmentation, object detection, and critical view of safety assessment: official splits and benchmark. arXiv preprint arXiv:2312.12429. 2023 Dec 19.