Fig 4 - uploaded by Yonghao Long
Confusion matrices of three methods, visualized by color brightness: (a) Baseline (SR), (b) TMRNet − (LR), and (c) TMRNet (LR+MS). In each matrix, the X- and Y-axes indicate the predicted phase label and the ground truth, respectively; element (x, y) represents the empirical probability of predicting phase x given a ground truth of phase y. Normalization is conducted along the X-axis, so the probabilities in each row sum to 1.

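The row normalization described in the caption can be sketched in a few lines of NumPy. This is a minimal illustration following the caption's conventions, not the authors' code; phase labels are assumed to be integers in [0, num_phases).

```python
import numpy as np

def phase_confusion_matrix(y_true, y_pred, num_phases):
    """Confusion matrix where entry (y, x) is the empirical probability
    of predicting phase x given ground-truth phase y (rows sum to 1)."""
    cm = np.zeros((num_phases, num_phases), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1.0
    row_sums = cm.sum(axis=1, keepdims=True)
    # Avoid division by zero for phases absent from the ground truth.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
```

Plotting such a matrix with a brightness colormap reproduces the kind of visualization shown in Fig. 4.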

Source publication
Article
Full-text available
Automatic surgical workflow recognition is a key component for developing context-aware computer-assisted systems in the operating theatre. Previous works either jointly modeled the spatial features with short fixed-range temporal information, or separately learned visual and long temporal cues. In this paper, we propose a novel end-to-end temporal...

Context in source publication

Context 1
... order to more comprehensively analyze the contribution of the different supportive cues, we further visualize the confusion matrices to show the detailed results at the phase level. The matrices with the ResNet backbone are illustrated in Fig. 4. We observe that from (a) to (c), the probability percentages on the diagonals (recall) consistently increase, while noisy misclassification patterns gradually decrease. In particular, the incorrect recognition of P5 and P6 as P4, and of P7 as P1, is greatly alleviated by providing long-range multi-scale ...
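The phase-level reading described above (diagonal = recall, off-diagonal = misclassification patterns such as P5 → P4) can be extracted programmatically from a row-normalized confusion matrix. A small illustrative helper, not taken from the paper:

```python
import numpy as np

def recall_and_top_confusion(norm_cm):
    """From a row-normalized confusion matrix, return per-phase recall
    (the diagonal) and, for each phase, its most frequent wrong prediction."""
    recall = np.diag(norm_cm).copy()
    off = norm_cm.copy()
    np.fill_diagonal(off, -1.0)          # mask the diagonal
    top_confused = off.argmax(axis=1)    # most common misclassification target
    return recall, top_confused
```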

Similar publications

Preprint
Full-text available
Automatic surgical workflow recognition is a key component for developing context-aware computer-assisted systems in the operating theatre. Previous works either jointly modeled the spatial features with short fixed-range temporal information, or separately learned visual and long temporal cues. In this paper, we propose a novel end-to-end temporal...

Citations

... Surgical phase recognition, the recognition of transitions between high-level stages of surgery, is a fundamental component in advancing these objectives. It has gained considerable attention, with numerous approaches proposed [1,4,7,8,16,17,21]. While surgical phase recognition is important across all surgical methods, the predominant focus of research has been on minimally invasive surgery (MIS), leaving open surgery phase recognition comparatively underexplored. ...
Preprint
Full-text available
Surgical phase recognition has gained significant attention due to its potential to offer solutions to numerous demands of the modern operating room. However, most existing methods concentrate on minimally invasive surgery (MIS), leaving surgical phase recognition for open surgery understudied. This discrepancy is primarily attributed to the scarcity of publicly available open surgery video datasets for surgical phase recognition. To address this issue, we introduce a new egocentric open surgery video dataset for phase recognition, named EgoSurgery-Phase. This dataset comprises 15 hours of real open surgery videos spanning 9 distinct surgical phases, all captured using an egocentric camera attached to the surgeon's head. In addition to video, EgoSurgery-Phase offers eye-gaze data. As far as we know, it is the first publicly available real open surgery video dataset for surgical phase recognition. Furthermore, inspired by the notable success of masked autoencoders (MAEs) in video understanding tasks (e.g., action recognition), we propose a gaze-guided masked autoencoder (GGMAE). Considering that the regions where surgeons' gaze focuses are often critical for surgical phase recognition (e.g., the surgical field), in our GGMAE the gaze information acts as an empirical semantic-richness prior guiding the masking process, promoting better attention to semantically rich spatial regions. GGMAE significantly improves over the previous state-of-the-art recognition method (by 6.4% in Jaccard) and the masked autoencoder-based method (by 3.1% in Jaccard) on EgoSurgery-Phase. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.
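The gaze-guided masking idea in the abstract can be sketched as biased patch sampling. This is a hypothetical illustration of the general principle only; the exact GGMAE sampling scheme is not specified here, and the heatmap format and masking ratio are assumptions:

```python
import numpy as np

def gaze_guided_mask(gaze_heatmap, mask_ratio=0.75, rng=None):
    """Sample a patch mask biased by a gaze heatmap: patches with low
    gaze salience are masked with higher probability, so gaze-focused
    (semantically rich) regions tend to stay visible to the encoder.

    gaze_heatmap: 1-D array of non-negative per-patch gaze weights.
    Returns a boolean array, True = masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = gaze_heatmap.size
    n_mask = int(round(mask_ratio * n))
    # Masking probability is inversely related to gaze salience.
    inv = gaze_heatmap.max() - gaze_heatmap + 1e-6
    p = inv / inv.sum()
    masked_idx = rng.choice(n, size=n_mask, replace=False, p=p)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    return mask
```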
... Surgical instrument segmentation (SIS) is a crucial task in surgical vision, aimed at precisely delineating surgical instruments in operative scenes. It provides vital assistance to surgeons and facilitates the development of advanced computer-assisted operation systems (Shademan et al. 2016;Jin et al. 2021;Liu et al. 2021;Jian et al. 2020;Yue et al. 2023;Zhang and Tao 2020). Existing deep learning methods for SIS have achieved impressive results through the design and training of specialist models featuring task-specific components. ...
Article
The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to inferior generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM’s pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code is available at https://github.com/wenxi-yue/SurgicalSAM.
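The prototype-based class prompting described above can be caricatured as follows. This is an illustrative sketch under stated assumptions, not the SurgicalSAM implementation: contrastive prototype learning and the SAM decoder are omitted, and matching is reduced to a cosine-similarity lookup.

```python
import numpy as np

def class_prompt_embedding(image_feat, prototypes):
    """Pick the class whose learned prototype is most similar to the
    image feature (cosine similarity) and return that prototype as a
    prompt embedding, replacing explicit point/box prompts."""
    img = image_feat / np.linalg.norm(image_feat)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = protos @ img                 # cosine similarity per class
    cls = int(sims.argmax())
    return cls, prototypes[cls]
```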
... The first AI-based approach was proposed by Twinanda et al. [9] to segment laparoscopic cholecystectomy videos into seven surgical phases. Inspired by their work and after the introduction of a large, annotated dataset called Cholec80, other researchers have attempted numerous techniques to improve the performance of phase detection [10][11][12][13][14]. Similar approaches have been applied to video recordings of other procedures [15,16]. ...
Article
Full-text available
Background Video-based review is paramount for operative performance assessment but can be laborious when performed manually. Hierarchical Task Analysis (HTA) is a well-known method that divides any procedure into phases, steps, and tasks. HTA requires large datasets of videos with consistent definitions at each level. Our aim was to develop an AI model for automated segmentation of phases, steps, and tasks for laparoscopic cholecystectomy videos using a standardized HTA. Methods A total of 160 laparoscopic cholecystectomy videos were collected from a publicly available dataset known as cholec80 and from our own institution. All videos were annotated for the beginning and ending of a predefined set of phases, steps, and tasks. Deep learning models were then separately developed and trained for the three levels using a 3D Convolutional Neural Network architecture. Results Four phases, eight steps, and nineteen tasks were defined through expert consensus. The training set for our deep learning models contained 100 videos with an additional 20 videos for hyperparameter optimization and tuning. The remaining 40 videos were used for testing the performance. The overall accuracy for phases, steps, and tasks were 0.90, 0.81, and 0.65 with the average F1 score of 0.86, 0.76 and 0.48 respectively. Control of bleeding and bile spillage tasks were most variable in definition, operative management, and clinical relevance. Conclusion The use of hierarchical task analysis for surgical video analysis has numerous applications in AI-based automated systems. Our results show that our tiered method of task analysis can successfully be used to train a DL model.
... It is challenging to achieve such high efficiency without compressing model parameters and sacrificing model performance. Although some representative works, such as TMRNet 19 and Trans-SVNet 20, achieved promising results with versatile models, their dependence on considerable computational resources constrains their potential for clinical application. To date, how to effectively address this dilemma for successfully deploying AI models in the operating room is still an open question. ...
Article
Full-text available
Recent advancements in artificial intelligence have witnessed human-level performance; however, AI-enabled cognitive assistance for therapeutic procedures has not been fully explored nor pre-clinically validated. Here we propose AI-Endo, an intelligent surgical workflow recognition suite for endoscopic submucosal dissection (ESD). Our AI-Endo is trained on high-quality ESD cases from an expert endoscopist, spanning a decade and consisting of 201,026 labeled frames. The learned model demonstrates outstanding performance on validation data, including cases from relatively junior endoscopists with various skill levels, procedures conducted with different endoscopy systems and therapeutic skills, and cohorts from international multi-centers. Furthermore, we integrate our AI-Endo with the Olympus endoscopic system and validate the AI-enabled cognitive assistance system with animal studies in live ESD training sessions. Dedicated data analysis from surgical phase recognition results is summarized in an automatically generated report for skill assessment.
... C. Evaluation of the spatiotemporal aggregation network 1) Comparisons with state-of-the-art methods: We compared our LS-SAT network with several state-of-the-art methods in online surgical phase recognition, including 1) ResNet-50 [24], which serves as the backbone of our spatial feature extractor; 2) SV-RCNet [9], an LSTM-based temporal aggregation architecture; 3) TMRNet [30], a memory bank-based long-range temporal aggregation network; 4) TeCNO [10], a multi-stage TCN-based hierarchical refinement network; 5) OperA [11], a transformer-based spatial feature aggregation network; 6) Trans-SVNet [12], a transformer-based spatiotemporal aggregation network. For ResNet-50 [24], SV-RCNet [9], TeCNO [10], TMRNet [30], and Trans-SVNet [12], we implemented them using their open-sourced code. As for OperA [11], we implemented it based on the network architecture and settings described in the original paper. ...
... The experimental results on two datasets demonstrated that our proposed online surgical phase recognition method achieved smoother surgical phase prediction results compared to frame-wise methods [24] and short-range temporal aggregation networks [9], [30]. This outcome is highly advantageous for the implementation of AR guidance as it prevents potential interference to ophthalmologists resulting from incorrect superposition of visual cues caused by frequent misrecognition of surgical phases. ...
Preprint
Phacoemulsification cataract surgery (PCS) is a routine procedure conducted using a surgical microscope, heavily reliant on the skill of the ophthalmologist. While existing PCS guidance systems extract valuable information from surgical microscopic videos to enhance intraoperative proficiency, they suffer from non-phase-specific guidance, leading to redundant visual information. In this study, our major contribution is the development of a novel phase-specific augmented reality (AR) guidance system, which offers tailored AR information corresponding to the recognized surgical phase. Leveraging the inherent quasi-standardized nature of PCS procedures, we propose a two-stage surgical microscopic video recognition network. In the first stage, we implement a multi-task learning structure to segment the surgical limbus region and extract limbus-region-focused spatial features for each frame. In the second stage, we propose the long-short spatiotemporal aggregation transformer (LS-SAT) network to model local fine-grained and global temporal relationships, and combine the extracted spatial features to recognize the current surgical phase. Additionally, we collaborate closely with ophthalmologists to design AR visual cues by utilizing techniques such as limbus ellipse fitting and regional restricted normal cross-correlation rotation computation. We evaluated the network on publicly available and in-house datasets, with comparison results demonstrating its superior performance compared to related works. Ablation results further validated the effectiveness of the limbus-region-focused spatial feature extractor and the combination of temporal features. Furthermore, the developed system was evaluated in a clinical setup, with results indicating remarkable accuracy and real-time performance, underscoring its potential for clinical applications.
... This information is vital when the data window used for estimation does not contain sufficient distinctive information to generate reliable outputs. An example in Laparoscopic Cholecystectomy (LC) operations is the limited inter-phase and high intra-phase variance among frames, which refers to weak dissimilarities of signals in different classes and substantial differences in the same classes, see Fig. 3 [44], [60]. These frames can be recognized correctly by aggregating related temporal information. ...
... Even though a particular method may be effective for one type of operation, it may not be suitable for another. While limited inter-phase and high intra-phase variance problems are reported in LC [29], [44], [60], the strong similarity of frames needs to be considered in cataract surgeries [48]. In the SG procedure, an extreme class-imbalance problem is reported [46]. ...
... Additionally, more complex models are also considered. Jin et al. [44] compared their model with ResNet and ResNeSt [89] backbones and reported improved results with ResNeSt. Zhang et al. [56] made an ablation study for feature extraction with a CNN-based network R(2 + 1)D [90], a BERT [91] based transformer network and a hybrid model with their combinations. ...
Article
Full-text available
Objective: In the last two decades, there has been a growing interest in exploring surgical procedures with statistical models to analyze operations at different semantic levels. This information is necessary for developing context-aware intelligent systems, which can assist the physicians during operations, evaluate procedures afterward, or help the management team to effectively utilize the operating room. The objective is to extract reliable patterns from surgical data for the robust estimation of surgical activities performed during operations. The purpose of this paper is to review the state-of-the-art deep learning methods that have been published after 2018 for analyzing surgical workflows, with a focus on phase and step recognition. Methods: Three databases, IEEE Xplore, Scopus, and PubMed, were searched, and additional studies were added through a manual search. After the database search, 343 studies were screened and a total of 45 studies were selected for this review. Conclusion: The use of temporal information is essential for identifying the next surgical action. Contemporary methods mainly used RNNs, hierarchical CNNs, and Transformers to preserve long-distance temporal relations. The lack of large publicly available datasets for various procedures is a great challenge for the development of new and robust models. While supervised learning strategies are used to show proof-of-concept, self-supervised, semi-supervised, or active learning methods are used to mitigate dependency on annotated data. Significance: The present study provides a comprehensive review of recent methods in surgical workflow analysis, summarizes commonly used architectures and datasets, and discusses challenges.
Preprint
Full-text available
Objective: In the last two decades, there has been a growing interest in exploring surgical procedures with statistical models to analyze operations at different semantic levels. This information is necessary for developing context-aware intelligent systems, which can assist the physicians during operations, evaluate procedures afterward, or help the management team to effectively utilize the operating room. The objective is to extract reliable patterns from surgical data for the robust estimation of surgical activities performed during operations. The purpose of this paper is to review the state-of-the-art deep learning methods that have been published after 2018 for analyzing surgical workflows, with a focus on phase and step recognition. Methods: Three databases, IEEE Xplore, Scopus, and PubMed, were searched, and additional studies were added through a manual search. After the database search, 343 studies were screened and a total of 44 studies were selected for this review. Conclusion: The use of temporal information is essential for identifying the next surgical action. Contemporary methods mainly used RNNs, hierarchical CNNs, and Transformers to preserve long-distance temporal relations. The lack of large publicly available datasets for various procedures is a great challenge for the development of new and robust models. While supervised learning strategies are used to show proof-of-concept, self-supervised, semi-supervised, or active learning methods are used to mitigate dependency on annotated data. Significance: The present study provides a comprehensive review of recent methods in surgical workflow analysis, summarizes commonly used architectures and datasets, and discusses challenges.
... Memory Models. Memory models have been widely used in many video analysis and episodic tasks, such as action recognition [59,29], video object segmentation [38,43,34], video captioning [44], reinforcement learning [22], physical reasoning [21], and code generation [35]. These works utilize an external memory to store features of prolonged sequences, which can be time-indexed for historical information integration. ...
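The external, time-indexed feature memory these works describe can be sketched as a fixed-capacity ring buffer. This is a generic illustration of the pattern, assumed rather than taken from any of the cited implementations:

```python
from collections import deque

class FeatureMemoryBank:
    """Fixed-capacity external memory storing per-frame features,
    retrievable by recency for historical information integration."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest entry automatically.
        self.buf = deque(maxlen=capacity)

    def write(self, t, feat):
        """Append the feature of frame t (oldest entries drop off)."""
        self.buf.append((t, feat))

    def read_last(self, k):
        """Return the features of the k most recent frames, oldest first."""
        return [feat for _, feat in list(self.buf)[-k:]]
```

A long-range temporal model would then attend over `read_last(k)` when classifying the current frame.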
Preprint
Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefiting from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between the existing object-centric model and the fully supervised state-of-the-art and outperform several unsupervised trackers.
... Surgical instrument segmentation (SIS) is a crucial task in surgical vision, aimed at precisely delineating surgical instruments in operative scenes. It provides vital assistance to surgeons and facilitates the development of advanced computer-assisted operation systems (Shademan et al. 2016;Jin et al. 2021;Liu et al. 2021). Existing deep learning methods for SIS have achieved impressive results through the design and training of specialist models featuring task-specific components. ...
Preprint
The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to poor generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM's pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code will be released at https://github.com/wenxi-yue/SurgicalSAM.
... The dataset of videos was first down-sampled to 1 frame per second (FPS). Down-sampling was done to reduce memory usage, and 1 FPS has been found to be sufficient for phase segmentation in prior literature [12]. To develop and train the CV algorithm, supervised learning was employed [13]. ...
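The 1 FPS down-sampling mentioned above amounts to keeping roughly every src_fps-th frame. A minimal index-based sketch, assuming a constant source frame rate (reading the actual video frames, e.g. with OpenCV, is omitted):

```python
def downsample_indices(num_frames, src_fps, target_fps=1.0):
    """Indices of the frames to keep when reducing a constant-rate
    video from src_fps to target_fps (e.g. 25 FPS -> 1 FPS)."""
    step = src_fps / target_fps   # frames to skip between kept frames
    indices, pos = [], 0.0
    while int(pos) < num_frames:
        indices.append(int(pos))
        pos += step
    return indices
```

For a 25 FPS video, this keeps frames 0, 25, 50, ... so each kept frame represents one second of footage.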
Article
Full-text available
Background Automation of surgical phase recognition is a key effort toward the development of Computer Vision (CV) algorithms, for workflow optimization and video-based assessment. CV is a form of Artificial Intelligence (AI) that allows interpretation of images through a deep learning (DL)-based algorithm. The improvements in Graphic Processing Unit (GPU) computing devices allow researchers to apply these algorithms for recognition of content in videos in real-time. Edge computing, where data is collected, analyzed, and acted upon in close proximity to the collection source, is essential to meet the demands of workflow optimization by providing real-time algorithm application. We implemented a real-time phase recognition workflow and demonstrated its performance on 10 Robotic Inguinal Hernia Repairs (RIHR) to obtain phase predictions during the procedure. Methods Our phase recognition algorithm was developed with 211 videos of RIHR originally annotated into 14 surgical phases. Using these videos, a DL model with a ResNet-50 backbone was trained and validated to automatically recognize surgical phases. The model was deployed to a GPU, the Nvidia® Jetson Xavier™ NX edge computing device. Results This model was tested on 10 inguinal hernia repairs from four surgeons in real-time. The model was improved using post-recording processing methods such as phase merging into seven final phases (peritoneal scoring, mesh placement, preperitoneal dissection, reduction of hernia, out of body, peritoneal closure, and transitionary idle) and averaging of frames. Predictions were made once per second with a processing latency of approximately 250 ms. The accuracy of the real-time predictions ranged from 59.8 to 78.2% with an average accuracy of 68.7%. Conclusion A real-time phase prediction of RIHR using a CV deep learning model was successfully implemented.
This real-time CV phase segmentation system can be useful for monitoring surgical progress and be integrated into software to provide hospital workflow optimization. Graphical abstract
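The post-recording smoothing mentioned in the abstract (averaging of frames) can be approximated with a sliding-window majority vote over the per-second predictions. This is a hedged sketch of the general technique, not the authors' exact pipeline:

```python
from collections import Counter

def smooth_predictions(preds, window=5):
    """Replace each per-second phase prediction with the majority vote
    over a centered window, suppressing isolated misclassifications."""
    half = window // 2
    out = []
    for i in range(len(preds)):
        # Window is clipped at the sequence boundaries.
        segment = preds[max(0, i - half): i + half + 1]
        out.append(Counter(segment).most_common(1)[0][0])
    return out
```

Such smoothing trades a small amount of latency (half a window) for far fewer spurious phase transitions, which matters when predictions drive real-time workflow tools.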