ProstaTD: Bridging Surgical Triplet from Classification to Fully Supervised Detection

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Surgical Triplet, Endoscopy, Benchmark, Dataset, Evaluation
Abstract:

Surgical triplet detection is a critical task in surgical video analysis, with significant implications for performance assessment and the training of novice surgeons. However, existing datasets such as CholecT50 lack precise spatial bounding box annotations, and image-level triplet classification alone is insufficient for practical applications. Bounding box annotations are essential to make the task clinically meaningful, as they provide the spatial context necessary for accurate analysis and improved model generalizability. To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet activity. The dataset comprises 71,775 video frames and 196,490 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 60 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. To facilitate future general-purpose surgical annotation, we developed two tailored labeling tools that improve the efficiency and scalability of our annotation workflows. In addition, we created a surgical triplet detection evaluation toolkit that enables standardized and reproducible performance assessment across studies. ProstaTD is the largest and most diverse surgical triplet dataset to date, moving the field from simple classification to full detection with precise spatial and temporal boundaries, thereby providing a robust foundation for fair benchmarking.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ProstaTD, a large-scale dataset for surgical triplet detection in robot-assisted prostatectomy, featuring 71,775 frames and 196,490 annotated triplet instances with bounding boxes across 21 multi-institutional surgeries. It resides in the 'Large-Scale Multi-Institutional Datasets with Bounding Box Annotations' leaf, which contains only one sibling paper. This represents a relatively sparse research direction within the broader taxonomy of eight total papers, suggesting the work addresses an emerging need for spatially annotated surgical triplet datasets beyond existing image-level classification resources like CholecT50.

The taxonomy reveals three main branches: Dataset Development, Methodological Approaches, and Clinical Applications. ProstaTD sits within Dataset Development, adjacent to 'Holistic Surgical Scene Understanding with Pixel-Wise Recognition' (one paper) and separate from methodological leaves addressing deep learning frameworks, disentanglement, and adversarial robustness (three papers total). The dataset's multi-institutional scope and bounding box annotations position it as infrastructure enabling the methodological innovations in neighboring branches, while its prostatectomy focus distinguishes it from broader endoscopic surgery datasets. The sparse population of its leaf suggests limited prior work specifically combining large-scale triplet detection with precise spatial annotations across institutions.

Among 29 candidates examined, the analysis identified potential overlap for all three contributions. The core dataset contribution (10 candidates examined, 1 refutable) shows the most novelty, though one prior work appears to provide similar multi-institutional triplet annotations. The annotation tools contribution (9 candidates, 1 refutable) and evaluation toolkit (10 candidates, 2 refutable) face more substantial prior work, with existing open-source labeling frameworks and benchmark protocols identified. These statistics reflect a focused semantic search rather than exhaustive coverage, indicating that within the examined scope, the dataset's scale and domain specificity appear more distinctive than its tooling and evaluation components.

Based on the limited search of 29 candidates, the work's primary novelty appears to lie in its domain-specific scale and multi-institutional scope for prostatectomy triplet detection with spatial annotations. The sparse taxonomy leaf (one sibling) and contribution-level statistics suggest the dataset addresses a genuine gap, though the annotation tools and benchmarking components encounter more established prior work. This assessment reflects top-K semantic matches and may not capture domain-specific precedents outside the search scope.

Taxonomy

- Core-task Taxonomy Papers: 8
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 29
- Refutable Papers: 4

Research Landscape Overview

Core task: Surgical triplet detection in robot-assisted prostatectomy videos involves identifying and linking three key elements within surgical video frames: instruments, verbs (actions), and targets (anatomical structures). The field's structure, as reflected in the taxonomy, organizes around three main branches: Dataset Development and Annotation Infrastructure, which focuses on creating large-scale, richly annotated video corpora with bounding boxes and triplet labels; Methodological Approaches for Triplet Recognition and Detection, encompassing algorithmic innovations from deep learning architectures to disentanglement strategies; and Clinical Applications and Surgical Practice, which bridges technical advances to real-world surgical workflows and training.

Representative works like ProstaTD Dataset[3] exemplify the dataset development effort by providing multi-institutional annotations, while studies such as Triplet Disentanglement[7] illustrate methodological refinements that decompose the triplet recognition problem into more tractable sub-tasks. Guardian[1] and AI Endoscopy Surgery[2] demonstrate how these technical foundations support clinical decision-making and intraoperative guidance. A particularly active line of work centers on scaling annotation quality and diversity across institutions, addressing challenges such as inter-annotator variability and the choice between pixel-wise and bounding-box granularity, as seen in Pixel-wise Surgical[4]. Methodological debates revolve around whether to treat triplet detection as a unified end-to-end problem or to disentangle instrument detection, action recognition, and target localization into separate stages.

ProstaTD Bridging[0] sits within the Dataset Development and Annotation Infrastructure branch, specifically targeting large-scale multi-institutional datasets with bounding box annotations. Compared to ProstaTD Dataset[3], which established foundational annotation protocols, ProstaTD Bridging[0] appears to extend this infrastructure by addressing cross-institutional harmonization and bridging gaps in annotation consistency. This positions the work as a natural evolution in dataset maturity, complementing methodological advances like those in Deep Learning Action[6] and supporting broader clinical integration efforts exemplified by Triple-console Telesurgery[5] and Single Port Prostatectomy[8].

Claimed Contributions

ProstaTD dataset for surgical triplet detection

The authors present ProstaTD, the first large-scale dataset enabling fully supervised surgical triplet detection at the procedure level. It contains 71,775 frames with 196,490 annotated triplet instances from 21 multi-institutional surgeries, featuring precise bounding boxes and clinically defined temporal boundaries for each triplet.

10 retrieved papers · Can Refute
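As a concrete illustration of what a "fully supervised" triplet-detection label entails, the sketch below shows one plausible per-frame annotation record pairing each (instrument, verb, target) triplet with a bounding box and a clinically defined phase. All field names and vocabulary here are hypothetical; the actual ProstaTD release format is not described in this report.

```python
import json

# Hypothetical annotation record for a single video frame.
# Field names and label vocabulary are illustrative only,
# not the actual ProstaTD release schema.
record = {
    "video_id": "surgery_01",
    "frame_index": 4213,
    "phase": "vesicourethral_anastomosis",  # clinically defined temporal segment
    "triplets": [
        {
            "instrument": "needle_driver",
            "verb": "grasp",
            "target": "prostate",
            "bbox_xyxy": [412, 188, 590, 334],  # pixel coordinates of the box
        }
    ],
}

print(json.dumps(record, indent=2))
```

The key difference from classification-only datasets like CholecT50 is the `bbox_xyxy` field: every triplet instance is localized in the frame, which is what makes detection-style evaluation possible.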
Open-source annotation tools for surgical triplet labeling

The authors developed two dedicated annotation applications (Triplet-labelme and SurgLabel) specifically designed for surgical triplet annotation. These tools support single-frame triplet editing and high-throughput batch labeling, and will be released as open source to facilitate large-scale annotation across diverse surgical procedures.

9 retrieved papers · Can Refute
Evaluation toolkit and benchmark for surgical triplet detection

The authors introduce an evaluation toolkit (ivtdmetrics) tailored for surgical triplet detection benchmarking, supporting metrics such as mAP at various IoU thresholds, precision, recall, and F1-score. They also provide comprehensive benchmarks using state-of-the-art models and propose TDnet as a baseline method.

10 retrieved papers · Can Refute
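The metrics named above can be made concrete with a minimal sketch: match predictions to ground truth one-to-one at a fixed IoU threshold, then count true and false positives to obtain precision, recall, and F1. This is an illustrative reimplementation under those assumptions, not the actual ivtdmetrics API, and it omits the score sweep needed for mAP.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    triplet_id: int   # class index of the (instrument, verb, target) triplet
    box: tuple        # (x1, y1, x2, y2) in pixels
    score: float = 1.0

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def prf1(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth
    (highest score first, same triplet class, IoU >= threshold),
    then precision / recall / F1 over the matched pairs."""
    preds = sorted(preds, key=lambda d: d.score, reverse=True)
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, iou_thr
        for i, g in enumerate(gts):
            if i in matched or g.triplet_id != p.triplet_id:
                continue
            ov = iou(p.box, g.box)
            if ov >= best_iou:
                best, best_iou = i, ov
        if best is not None:
            matched.add(best)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A full detection benchmark would additionally compute per-class average precision by sweeping the confidence threshold and averaging over several IoU thresholds, as the toolkit description suggests.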

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ProstaTD dataset for surgical triplet detection


Contribution

Open-source annotation tools for surgical triplet labeling


Contribution

Evaluation toolkit and benchmark for surgical triplet detection
