Introduction

The quality of a surgeon’s intraoperative activity (skill-level) can now be reliably assessed through videos of surgical procedures and artificial intelligence (AI) systems1,2,3. With these AI-based skill assessments on the cusp of informing high-stakes decisions on a global scale such as the credentialing of surgeons4,5, it is critical that they are unbiased—reliably reflecting the true skill-level of all surgeons equally6,7. However, it remains an open question whether such surgical AI systems exhibit a bias against certain surgeon sub-cohorts. Without an examination and mitigation of these systems’ algorithmic bias, they may unjustifiably rate surgeons differently, erroneously delaying (or hastening) the credentialing of surgeons, and thus placing patients’ lives at risk8,9.

A surgeon typically masters multiple skills (e.g., needle handling and driving) necessary for surgery10,11,12. To reliably automate the assessment of such skills, multiple AI systems (one for each skill) are often developed (Fig. 1a). To test the robustness of these systems, they are typically deployed on data from multiple hospitals13. We argue that the bias of any one of these systems, which manifests as a discrepancy in its performance across surgeon sub-cohorts (e.g., novices vs. experts), is akin to one of many light bulbs in an electric circuit connected in series (Fig. 1b). With a single defective light bulb influencing the entire circuit, just one biased AI system is enough to disadvantage a surgeon sub-cohort. Therefore, the deployment of multiple AI systems across multiple hospitals, a common practice in healthcare, necessitates that we examine and mitigate the bias of all such systems collectively. Doing so will ethically guide the impending implementation of AI-augmented global surgeon credentialing programs14,15.

Fig. 1: Mitigating bias of multiple surgical AI systems across multiple hospitals.

a Multiple AI systems assess the skill-level of multiple surgical activities (e.g., needle handling and needle driving) from videos of intraoperative surgical activity. These AI systems are often deployed across multiple hospitals. b To examine bias, we stratify these systems' performance (e.g., AUC) across different sub-cohorts of surgeons (e.g., novices vs. experts). The bias of one of many AI systems is akin to a light bulb in an electric circuit connected in series: similar to how one defective light bulb leads to a defective circuit, one biased AI system is sufficient to disadvantage a surgeon sub-cohort. c To mitigate bias, we teach an AI system, through a strategy referred to as TWIX, to complement its skill assessments with predictions of the importance of video frames based on ground-truth annotations provided by human experts.

Previous studies have focused on algorithmic bias exclusively against patients, demonstrating that AI systems systematically underestimate the pain level of Black patients16 and falsely predict that female Hispanic patients are healthy17. The study of bias in video-based AI systems has also gained traction, in the context of automated video interviews18, algorithmic hiring19, and emotion recognition20. Previous work has not, however, investigated the bias of AI systems applied to surgical videos21, thereby overlooking its effect on surgeons. Further, previous attempts to mitigate such bias are either ineffective22,23,24 or are limited to a single AI system deployed in a single hospital25,26,27, casting doubt on their wider applicability. As such, previous studies do not attempt, nor demonstrate the effectiveness of a strategy, to mitigate the bias exhibited by multiple AI systems across multiple hospitals.

In this study, we examine the bias exhibited by a family of surgical AI systems—SAIS3—developed to assess the binary skill-level (low vs. high skill) of multiple surgical activities from videos. Through experiments on data from three geographically-diverse hospitals, we show that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance, and an overskilling bias, erroneously upgrading surgical performance, at different rates across surgeon sub-cohorts. To mitigate such bias, we leverage a strategy—TWIX28—that teaches an AI system to complement its skill assessments with a prediction of the importance of video frames, as provided by human experts (Fig. 1c). We show that TWIX can mitigate the underskilling and overskilling bias across hospitals and simultaneously improve the performance of AI systems for all surgeons. Our findings inform the ethical implementation of impending AI-augmented global surgeon credentialing programs.

Results

SAIS exhibits underskilling bias across hospitals

In the context of skill assessment, we refer to the erroneous downgrading of surgical performance as underskilling. An underskilling bias is exhibited when such underskilling occurs at different rates across surgeon sub-cohorts. For binary skill assessment (low vs. high skill), which is the focus of our study, this bias is reflected by a discrepancy in the negative predictive value (NPV) of SAIS (see Methods, Fig. 6). We therefore present SAIS’ NPV for surgeons who have performed a different number of robotic surgeries during their lifetime (expert: caseload >100), those operating on prostate glands of different volumes, and those operating on cancer of different severity (Gleason score) (Fig. 2). Note that membership in these groups is fluid, as surgeons often have little say over, for example, the characteristics of the prostate gland they operate on. Please refer to the Methods section for our motivation behind selecting these groups and sub-cohorts.

Fig. 2: SAIS exhibits an underskilling bias across hospitals.

SAIS is tasked with assessing the skill-level of a needle handling and b needle driving. A discrepancy in the negative predictive value across surgeon sub-cohorts reflects an underskilling bias. Note that SAIS is always trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. To examine bias, we stratify SAIS' performance based on the total number of robotic surgeries performed by a surgeon during their lifetime (caseload), the volume of the prostate gland, and the severity of the prostate cancer (Gleason score). The results are an average, and error bars reflect the standard error, across ten Monte Carlo cross-validation folds.

We found that SAIS exhibits an underskilling bias across hospitals (see Methods for a description of the data and Table 2 for the number of video samples). This is evidenced by, for example, the discrepancy in the negative predictive value across the two surgeon sub-cohorts operating on prostate glands of different volumes (≤49 ml and >49 ml). For example, when assessing the skill-level of needle handling at USC (Fig. 2a), SAIS achieved NPV ≈ 0.71 and 0.75 for the two sub-cohorts, respectively. Such an underskilling bias consistently appears across hospitals, where NPV ≈ 0.80 and 0.93 at St. Antonius Hospital (SAH), and NPV ≈ 0.73 and 0.88 at Houston Methodist Hospital (HMH). These findings extend to when SAIS assessed the skill-level of the second surgical activity of needle driving (see Fig. 2b).

Overskilling bias

While our emphasis has been on the underskilling bias, we demonstrate that SAIS also exhibits an overskilling bias, where it erroneously upgrades surgical performance (see Supplementary Note 2).

Multi-class skill assessment

Although the emphasis of this study is on binary skill assessment, a decision driven primarily by the need to inspect the fairness of a previously-developed and soon-to-be-deployed AI system (SAIS), a growing number of studies have focused on multi-class skill assessment15. As such, we conducted a limited experiment to examine whether such a setup, in which needle handling is identified as either low, intermediate, or high skill, also results in algorithmic bias (see Supplementary Note 3). We found that both the underskilling and overskilling biases extend to this setting.

Underskilling bias persists even after controlling for potential confounding factors

Confounding factors may be responsible for the apparent underskilling bias29,30. It is possible that the underskilling bias against surgeons with different caseloads (Fig. 2b) is driven by SAIS’ dependence on caseload, as a proxy, for skill assessment. For example, SAIS may have latched onto the effortlessness of expert surgeons’ intraoperative activity, as opposed to the strict skill assessment criteria (see Methods), as predictive of high-skill activity. However, after controlling for caseload, we found that SAIS’ outputs remain highly predictive of skill-level (odds ratio = 2.27), suggesting that surgeon caseload, or experience, plays a relatively smaller role in assessing skill31 (see Methods). To further check if SAIS was latching onto caseload-specific features in surgical videos, we retrained it on data with an equal number of samples from each class (low vs. high skill) and surgeon caseload group (novice vs. expert) and found that the underskilling bias still persists. This suggests that SAIS is unlikely to be dependent on unreliable caseload-specific features.

Examining bias across multiple AI systems and hospitals prevents misleading bias findings

With multiple AI systems deployed on the same group of surgeons across hospitals, we claim that examining the bias of only one of these AI systems can lead to misleading bias findings. Here, we provide evidence in support of this claim by focusing on the surgeon caseload group (also applies to other groups).

Multiple AI systems

We found that, had we examined bias for only needle handling, we would have erroneously assumed that SAIS disadvantaged novice surgeons exclusively. While SAIS did exhibit an underskilling bias against novice surgeons at USC when assessing the skill-level of needle handling, it exhibited this bias against expert surgeons when assessing the skill-level of the second surgical activity of needle driving. For example, SAIS achieved NPV ≈ 0.71 and 0.75 for novice and expert surgeons, respectively, for needle handling (Fig. 2a), whereas it achieved NPV ≈ 0.85 and 0.75 for these two sub-cohorts, for needle driving (Fig. 2b).

Multiple hospitals

We also found that, had we examined bias on data only from USC, we would have erroneously assumed that SAIS disadvantaged expert surgeons exclusively. While SAIS did exhibit an underskilling bias against expert surgeons at USC when assessing the skill-level of needle driving, it exhibited this bias against novice surgeons, to an even greater extent, at HMH. For example, SAIS achieved NPV ≈ 0.85 and 0.75 for novice and expert surgeons, respectively, at USC, whereas it achieved NPV ≈ 0.57 and 0.80 for these two sub-cohorts at HMH (Fig. 2b).

TWIX mitigates underskilling bias across hospitals

Although we demonstrated, in a previous study, that SAIS was able to generalize to data from different hospitals, we are acutely aware that AI systems are not perfect. They can, for example, depend on unreliable features as a shortcut to performing a task, otherwise known as spurious correlations32. We similarly hypothesized that SAIS, as a video-based AI system, may be latching onto unreliable temporal features (i.e., video frames) to perform skill assessment. At the very least, SAIS could be focusing on frames which are irrelevant to the task at hand and which could hinder its performance.

To test this hypothesis, we opted for an approach that directs an AI system’s focus onto frames deemed relevant (by human experts) while performing skill assessment. The intuition is that by learning to focus on features deemed most relevant by human experts, an AI system is less likely to latch onto unreliable features in a video when assessing surgeon skill. To that end, we leverage a strategy entitled training with explanations—TWIX28—(see Methods). We present the performance of SAIS for the disadvantaged surgeon sub-cohorts before and after adopting TWIX when assessing the skill-level of needle handling (Fig. 3a) and needle driving (Fig. 3b).

Fig. 3: TWIX mitigates the underskilling bias across hospitals.

We present the average performance of SAIS on the most disadvantaged sub-cohort (worst-case NPV) before and after adopting TWIX, indicating the percent change. An improvement in the worst-case NPV is considered bias mitigation. SAIS is tasked with assessing the skill-level of a needle handling and b needle driving. Note that SAIS is trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. Results are an average across ten Monte Carlo cross-validation folds.

We found that TWIX mitigates the underskilling bias exhibited by SAIS. This is evidenced by the improvement in SAIS’ worst-case negative predictive value for the disadvantaged surgeon sub-cohorts after having adopted TWIX. For example, when SAIS was tasked with assessing the skill-level of needle handling at USC (Fig. 3a), the worst-case NPV increased by 2% for the disadvantaged surgeon sub-cohort (novice) in the surgeon caseload group (see Fig. 2 to identify disadvantaged sub-cohorts). This finding was even more pronounced when SAIS was tasked with assessing the skill-level of needle driving at USC (Fig. 3b), with improvements in the worst-case NPV by up to 32%.

We observed that TWIX, despite being adopted while SAIS was trained on data exclusively from USC, also mitigates bias when SAIS is deployed on data from other hospitals. This is evidenced by the improvements in SAIS’ performance for the disadvantaged surgeon sub-cohorts at SAH and, occasionally, at HMH. In cases where we observed a decrease in the worst-case performance, we found that this was associated with an overall decrease in the performance of SAIS (Fig. 4). We hypothesize that this reduction in performance is driven by the variability in the execution of surgical activity by surgeons across hospitals.

Fig. 4: TWIX can improve AI system performance while mitigating bias across hospitals.

The performance (AUC) of SAIS before and after having adopted TWIX when assessing the skill-level of a needle handling and b needle driving. Note that SAIS is trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. The results are an average across ten Monte Carlo cross-validation folds and the shaded area represents one standard error.

Overskilling bias

Empirically, we discovered that while various strategies mitigated the underskilling bias, they exacerbated the overskilling bias (more details in a forthcoming section). In contrast, we found that TWIX avoids this negative unintended effect. Specifically, we found that TWIX also mitigates the overskilling bias (see Supplementary Note 4).

Deploying TWIX with multiple AI systems and hospitals prevents misleading findings about its effectiveness

As with examining algorithmic bias, it is equally critical to measure the effectiveness of a bias mitigation strategy across multiple AI systems and hospitals in order to avoid misleading findings. We now provide evidence in support of this claim.

Multiple AI systems

We found that, had we not adopted TWIX for needle driving skill assessment, we would have underestimated its effectiveness. Specifically, while TWIX mitigated the underskilling bias at USC when SAIS assessed the skill-level of needle handling (system 1), the magnitude of this mitigation increased when SAIS assessed the skill-level of the distinct activity of needle driving (system 2). For example, for the disadvantaged surgeon sub-cohort in the caseload group, the worst-case NPV improved by 2% for needle handling (Fig. 3a) and 20% for needle driving (Fig. 3b), reflecting a 10-fold increase in the effectiveness of TWIX as a bias mitigation strategy.

Multiple hospitals

We found that, had we not adopted TWIX and deployed SAIS in other hospitals, we would have overestimated its effectiveness. Specifically, while TWIX mitigated the underskilling bias at USC when SAIS assessed the skill-level of needle driving, the magnitude of this mitigation decreased when SAIS was deployed on data from SAH. For example, for the disadvantaged surgeon sub-cohort in the prostate volume group, the worst-case NPV improved by 19% at USC but only by 1% at SAH (Fig. 3b).

Baseline bias mitigation strategies induce collateral damage

A strategy for mitigating a particular type of bias can exacerbate another, leading to collateral damage and eroding its effectiveness. To investigate this, we adapted two additional strategies that have, in the past, proven effective in mitigating bias33,34. These include training an AI system with additional data (TWAD) and pre-training an AI system first with surgical videos (VPT) (see Methods for in-depth description). We compare their ability to mitigate bias to that of TWIX (Table 1 and Supplementary Note 5).

Table 1 Baseline strategies mitigate bias inconsistently.

We found that while baseline strategies were effective in mitigating the underskilling bias, and even more so than TWIX, they dramatically worsened the overskilling bias exhibited by SAIS. For example, VPT almost negated its improvement in the underskilling bias (7.7%) by exacerbating the overskilling bias (7.0%). In contrast, TWIX consistently mitigated both the underskilling and overskilling bias, albeit more moderately, resulting in an average improvement in the worst-case performance by 3.0% and 4.0%, respectively. The observed consistency in TWIX’s effect on bias is an appealing property whose implications we discuss later.

TWIX can improve AI system performance while mitigating bias across hospitals

Trustworthy AI systems must exhibit both robust and fair behavior35. Although it has been widely documented that mitigating algorithmic bias can come at the expense of AI system performance36, recent work has cast doubt on this trade-off37,38,39. We explored this trade-off in the context of TWIX, and present SAIS’ performance for all surgeons across hospitals (Fig. 4). This performance is reflected by the area under the receiver operating characteristic curve (AUC), before and after having adopted TWIX.

We found that TWIX can improve the performance of AI systems while mitigating bias. This is evidenced by the improvement in the performance of SAIS both for the disadvantaged surgeon sub-cohorts (see earlier Fig. 3) and on average for all surgeons. For example, when tasked with assessing the skill-level of needle driving at USC (Fig. 3b), TWIX improved the worst-case NPV by 32%, 19%, and 20% for the surgeon groups of caseload, prostate volume, and Gleason score, respectively, thereby mitigating the underskilling bias, and also improved SAIS’ performance from AUC = 0.821 → 0.843 (Fig. 4b).

Deployment of SAIS in a training environment

Our study informs the future implementation of AI-augmented surgeon credentialing programs. We can, however, already begin to assess the skills of surgical trainees in a training environment today. To foster a fair learning environment for surgical trainees, it is critical that these AI-based skill assessments reflect the true skill-level of all trainees equally. To measure this, and as a proof of concept, we deployed SAIS on video samples of the needle handling activity performed by medical students without prior robotic experience on a robot otherwise used in surgical procedures (see Methods) (Fig. 5).

Fig. 5: SAIS can be used today to assess the skill-level of surgical trainees.

a SAIS exhibits an underskilling bias against male medical students when assessing the skill-level of needle handling. b TWIX improves the worst-case NPV, and thus mitigates the underskilling bias. c TWIX simultaneously improves SAIS' ability to perform skill assessment.

We discovered that our findings from when SAIS was deployed on video samples of live surgical procedures transferred to the training environment. Specifically, we first found that SAIS exhibits an underskilling bias against male medical students (Fig. 5a). Consistent with earlier findings, we also found that TWIX mitigates this underskilling bias (Fig. 5b) and simultaneously improves SAIS’ ability to assess the skill-level of needle handling (Fig. 5c).

Discussion

Recently-developed surgical AI systems can reliably assess multiple surgeon skills across hospitals. The impending deployment of such systems for the purpose of credentialing surgeons and training medical students necessitates that they do not disadvantage any particular sub-cohort. However, until now, it has remained an open question whether such surgical AI systems exhibit algorithmic bias.

In this study, we examined and mitigated the bias exhibited by a family of surgical AI systems—SAIS—that assess the skill-level of multiple surgical activities through video. To prevent misleading bias findings, we demonstrated the importance of examining the collective bias exhibited by all AI systems deployed on the same group of surgeons and across multiple hospitals. We then leveraged a strategy—TWIX—which not only mitigates such bias for the majority of surgeon groups and hospitals, but can also improve the performance of AI systems for all surgeons.

As it pertains to the study and mitigation of algorithmic bias, previous work is limited in three main ways. First, it has not examined the algorithmic bias of AI systems applied to the data modality of surgical videos6,40 nor against surgeons41,42, thereby overlooking an important stakeholder within medicine. Second, previous work has not studied bias in the real clinical setting characterized by multiple AI systems deployed on the same group of surgeons and across multiple hospitals, with a single exception43. Third, previous work has not demonstrated the effectiveness of a bias mitigation strategy across multiple stakeholders and hospitals33.

When it comes to bias mitigation, we found that TWIX mitigated algorithmic bias more consistently than baseline strategies that have, in the past, proven effective in other scientific domains and with other AI systems. This consistency is reflected by a simultaneous decrease in algorithmic bias of different forms (underskilling and overskilling), of multiple AI systems (needle handling and needle driving skill assessment), and across hospitals. We do appreciate, however, that it is unlikely for a single bias mitigation strategy to be effective all the time. The limitations of this reactive approach to bias mitigation might prompt calls for a more preventative approach, in which AI systems are purposefully designed to exhibit minimal bias. While appealing in theory, we believe this is impractical at the present moment for several reasons. First, it is difficult to determine, during the design stage of an AI system, whether it will exhibit any algorithmic bias upon deployment on data, and if so, against whom. Mitigating bias is challenging when it cannot first be quantified. Second, the future environment in which an AI system will be deployed is often unknown. This ambiguity makes it difficult to design an AI system specialized to the data in that environment ahead of time. In some cases, it may even be undesirable to do so as a specialized system might be unlikely to generalize to novel data.

From a practical standpoint, we believe TWIX confers several benefits. Primarily, TWIX is a simple add-on to almost any AI system that processes temporal information and does not require any amendments to the latter’s underlying architecture. This is appealing particularly in light of the broad availability and common practice of adapting open-source AI systems. In terms of resources, TWIX only requires the availability of ground-truth importance labels (e.g., the importance of frames in a video), which we have demonstrated can be acquired with relative ease in this study. Furthermore, TWIX’s benefits can extend beyond just mitigating algorithmic bias. Most notably, when performing inference on an unseen video sample, an AI system equipped with TWIX can be viewed as explainable, as it highlights the relative importance of video frames, thereby instilling trust in domain experts. It can also be leveraged as a personalized educational tool for medical students, directing them towards surgical activity in the video that can be improved upon. These additional capabilities would be missing from other bias mitigation strategies.

We demonstrated that, to prevent misleading bias findings, it is crucial to examine and mitigate the bias of multiple AI systems across multiple hospitals. Without such an analysis, stakeholders within medicine would be left with an incomplete and potentially incorrect understanding of algorithmic bias. For example, at the national level, medical boards augmenting their decision-making with AI systems akin to those introduced here may introduce unintended disparities in how surgeons are credentialed. At the local hospital level, medical students subjected to AI-augmented surgical training, a likely first application of such AI systems, may receive unreliable learning signals. This would hinder their professional development and perpetuate existing biases in the education of medical students44,45,46,47. Furthermore, the alleviation of bias across multiple hospitals implies that surgeons looking to deploy an AI system in their own operating room are now less reticent to do so. As such, we recommend that algorithmic bias, akin to AI system performance, is also examined across multiple hospitals and multiple AI systems deployed on the same group of stakeholders. Doing so increases the transparency of AI systems, leading to more informed decision-making at various levels of operation within healthcare and contributing to the ethical deployment of surgical AI systems.

There are important challenges that our work does not yet address. A topic that is seldom discussed, and which we do not claim to have an answer for, is that of identifying an acceptable level of algorithmic bias. Akin to the ambiguity of selecting a performance threshold that AI systems should surpass before being deployed, it is equally unclear whether a discrepancy in performance across groups (i.e., bias) of 10 percentage points is significantly worse than that of 5 percentage points. As with model performance, this is likely to be context-specific and dependent on how costly a particular type of bias is. In our work, we have suggested that any performance discrepancy is indicative of algorithmic bias, an assumption that the majority of previous work also makes. In a similar vein, we have only considered algorithmic bias at a single snapshot in time, when the model is trained and deployed on a static and retrospectively-collected dataset. However, as AI systems are likely to be deployed over extended periods of time, where the distribution of data is likely to change, it is critical to continuously monitor and mitigate the bias exhibited by such systems over time. Analogous to continual learning approaches that allow models to perform well on new unseen data while maintaining strong performance on data observed in the past48, we believe continual bias mitigation is an avenue worth exploring.

Our study has been limited to examining the bias of AI systems which only assess the quality of two surgical skills (needle handling and needle driving). Although these skills form the backbone of suturing, an essential activity that almost all surgeons must master, they are only a subset of all skills required of a surgeon. It is imperative for us to proactively assess the algorithmic bias of surgical AI systems once they become capable of reliably assessing a more exhaustive set of surgical skills. Another limitation is that we examine and mitigate algorithmic bias exclusively through a technical lens. However, we acknowledge that the presence and perpetuation of bias is dependent on a multitude of additional factors, ranging from the social context in which an AI system is deployed to the decisions that it will inform and the incentives surrounding its use. In this study, and for illustration purposes, we assumed that an AI system would be used to either provide feedback to surgeons about their performance or to inform decisions such as surgeon credentialing. To truly determine whether algorithmic biases, as we have defined them, translate into tangible biases that negatively affect surgeons and their clinical workflow, a prospective deployment of an AI system would be required.

Although we leveraged a bias mitigation strategy (TWIX), our work does not claim to address the key open question of how much bias mitigation is sufficient. Indeed, the presence of a performance discrepancy across groups is not always indicative of algorithmic bias. Some have claimed that this is the case only if the discrepancy is unjustified and harmful to stakeholders49. Therefore, to address this open question, which is beyond the scope of our work, researchers must appreciate the entire ecosystem in which an AI system is deployed. Moving forward, and once data become available, we look to examine (a) bias against surgeon groups which we had excluded in this study due to sample size constraints (e.g., those belonging to a particular race, sex, and ethnicity) and (b) intersectional bias50: that which is exhibited against surgeons who belong to multiple groups at the same time (e.g., expert surgeons who are female). Doing so could help outline whether a variant of Simpson’s paradox51 is at play; bias, although absent at the individual group level, may be present when simultaneously considering multiple groups. We leave this to future work as the analysis would require a sufficient number of samples from each intersectional group. We must also emphasize that a single bias mitigation strategy is unlikely to be a panacea. As a result, we encourage the community to develop bias mitigation strategies that achieve the desired effect across multiple hospitals, AI systems, and surgeon groups. Exploring the interplay of these elements, although rarely attempted in the context of algorithmic bias in medicine, is critical to ensure that AI systems deployed in clinical settings have the intended positive effect on stakeholders.

The credentialing of a surgeon is often considered a rite of passage. With time, such a decision is likely to be supported by AI-based skill assessments. In preparation for this future, our study introduces safeguards to enable fair decision-making.

Methods

Ethics approval

All datasets (data from USC, SAH, and HMH) were collected under Institutional Review Board (IRB) approval from the University of Southern California, in which written informed consent was obtained from all participants (HS-17-00113). Moreover, the datasets were de-identified prior to model development.

Description of surgical procedure and activities

In this study, we focused on robot-assisted radical prostatectomy (RARP), a surgical procedure in which the prostate gland is removed from a patient’s body in order to treat cancer. With a surgical procedure often composed of sequential steps that must be executed by a surgeon, we observed the intraoperative activity of surgeons during one particular step of the RARP procedure: the vesicoureteral anastomosis (VUA). In short, the VUA is a reconstructive suturing step in which the bladder and urethra, separated by the removal of the prostate, must now be connected to one another through a series of stitches. This connection creates a tight link that should allow for the normal flow of urine postoperatively. To perform a single stitch in the VUA step, a surgeon must first grab the needle with one of the robotic arms (needle handling), push that needle through the tissue (needle driving), and then withdraw that needle on the other side of the tissue in preparation for the next stitch (needle withdrawal).

Surgical video samples and annotations

In assessing the skill-level of suturing activity, SAIS was trained and evaluated on video samples associated with ground-truth skill assessment annotations. We now outline how these video samples and annotations were generated, and defer a description of SAIS to the next section.

Video samples

We collected videos of entire robotic surgical procedures from three geographically-diverse hospitals in addition to videos of medical students performing suturing activities in a laboratory environment.

Live robotic surgical procedures

An entire video of the VUA step (on the order of 20 min) from one surgical case was split into video samples depicting either one of the two suturing activities: needle handling and needle driving. With each VUA step consisting of around 24 stitches, this resulted in approximately 24 video samples depicting needle handling and another 24 samples depicting needle driving. To obtain these video samples, a trained medical fellow identified the start and end time of the respective suturing sub-phases. Each video sample can span 5−30 s in duration. Please refer to Table 2 for a summary of the number of video samples.

Table 2 Total number of videos and video samples associated with each of the hospitals and tasks.

Training environment

To mimic the VUA step in a laboratory environment, we presented medical students with a realistic gel-like model of the bladder and urethra, and asked them to perform a total of 16 stitches while using a robot otherwise used in live surgical procedures. To obtain video samples, we followed the same strategy described above. As such, each participant’s video resulted in 16 video samples for each of the activities (needle handling, needle driving, etc.). For this dataset, we only focused on needle handling (see Table 2 for the number of video samples). Note that since these video samples depict suturing activities, we adopted the same annotation strategy (described next) for these video samples and those of live surgical procedures.

Skill assessment annotations

A team of trained human raters (TFH, MO, and others) were tasked with viewing each video sample and annotating it with either a binary low-skill or high-skill assessment. It is worthwhile to note that, to minimize potential bias in the annotations, these raters were not privy to the clinical meta-information (e.g., surgeon caseload) associated with their surgical videos. The raters followed the strict guidelines outlined in our team’s previously-developed skill assessment tool52, which we outline in brief below. To ensure the quality of the annotations, the raters first went through a training process in which they annotated the same set of video samples. Once their agreement level exceeded 80%, they were allowed to begin annotating the video samples for this study. In the event of disagreements in the annotations, we followed the same strategy adopted in the original study3 where the lowest of all scores is considered as the final annotation.

Needle handling skill assessment

The skill-level of needle handling is assessed by observing the number of times a surgeon had to reposition their grasp of the needle. Fewer repositions imply a high skill-level, as they are indicative of improved surgeon dexterity and intent.

Needle driving skill assessment

The skill-level of needle driving is assessed by observing the smoothness with which a surgeon pushes the needle through tissue. Smoother driving implies a high skill-level, as it is less likely to cause physical trauma to the tissue.

SAIS is an AI system for skill assessment

SAIS was recently developed to decode the intraoperative activity of surgeons based exclusively on surgical videos3. Specifically, it demonstrated state-of-the-art performance in assessing the skill-level of surgical activity, such as needle handling and driving, across multiple hospitals. In light of these capabilities, we used SAIS as the core AI system whose potential bias we attempted to examine and mitigate across hospitals.

Components of SAIS

We outline the basic components of SAIS here and refer readers to the original study for more details3. In short, SAIS takes two data modalities as input: RGB frames and optical flow, which measures motion in the field of view over time, and which is derived from neighboring RGB frames. Spatial information is extracted from each of these frames through a vision transformer pre-trained in a self-supervised manner on ImageNet. To capture the temporal information across frames, SAIS learns the relationship between subsequent frames through an attention mechanism. Greater attention, or importance, is placed on frames deemed more important for the ultimate skill assessment. Repeating this process for all data modalities, SAIS arrives at modality-specific video representations. SAIS aggregates these representations to arrive at a single video representation that summarizes the content of the video sample. This video representation is then used to output a probability distribution over the two skill categories (low vs. high skill).
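To make the above concrete, the following is a minimal, schematic sketch of attention-based temporal aggregation over pre-extracted frame features, assuming the features have already been produced by the pre-trained vision transformer. The module names, dimensions, and simple averaging of modalities are illustrative assumptions; this is not the released SAIS implementation.

```python
# Schematic sketch (not the released SAIS code): attention over per-frame
# features, aggregation across modalities, and a two-class skill prediction.
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    """Scores each frame and pools frame features into a single video vector."""
    def __init__(self, dim: int = 384):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar importance score per frame

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, dim), e.g., frozen ViT features
        attn = torch.softmax(self.score(frames).squeeze(-1), dim=-1)  # (batch, num_frames)
        video = torch.einsum("bt,btd->bd", attn, frames)              # attention-weighted sum
        return video, attn

class SkillClassifier(nn.Module):
    """Aggregates modality-specific video representations and predicts low vs. high skill."""
    def __init__(self, dim: int = 384, num_classes: int = 2):
        super().__init__()
        self.rgb_pool = FrameAttentionPool(dim)
        self.flow_pool = FrameAttentionPool(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb_feats, flow_feats):
        rgb_video, rgb_attn = self.rgb_pool(rgb_feats)
        flow_video, flow_attn = self.flow_pool(flow_feats)
        video = (rgb_video + flow_video) / 2   # aggregate the two modalities
        return self.head(video), (rgb_attn, flow_attn)

# Example: 8 video samples, 30 frames per sample, 384-dim frame features
model = SkillClassifier()
logits, _ = model(torch.randn(8, 30, 384), torch.randn(8, 30, 384))
probs = torch.softmax(logits, dim=-1)  # probability over (low skill, high skill)
```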

Training and evaluating SAIS

As in the original study3, SAIS is trained on data exclusively from USC using tenfold Monte Carlo cross validation (see Supplementary Note 1). Each fold consisted of a training, validation, and test set, ensuring that surgical videos were not shared across the sets. When evaluated on data from other hospitals, SAIS is deployed on all such video samples. This is repeated for all ten of the SAIS models. As such, we report evaluation metrics as an average and standard deviation across 10 Monte Carlo cross validation folds.
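As a minimal sketch of this evaluation protocol, the snippet below generates video-level Monte Carlo splits so that samples from one surgical video never appear in more than one of the training, validation, or test sets. The split fractions are illustrative assumptions, not the values used in the original study.

```python
# Video-level Monte Carlo cross-validation: each fold re-samples a
# train/validation/test split over whole surgical videos.
import random

def monte_carlo_folds(video_ids, num_folds=10, val_frac=0.1, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    folds = []
    for _ in range(num_folds):
        ids = list(video_ids)
        rng.shuffle(ids)
        n_test = int(len(ids) * test_frac)
        n_val = int(len(ids) * val_frac)
        test, val, train = ids[:n_test], ids[n_test:n_test + n_val], ids[n_test + n_val:]
        folds.append({"train": train, "val": val, "test": test})
    return folds

folds = monte_carlo_folds([f"video_{i:03d}" for i in range(78)])
```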

Skill assessment evaluation metrics

When SAIS decodes surgical skills, we report the positive predictive value (PPV), defined as the proportion of AI-based high-skill assessments which are correct, and the negative predictive value (NPV), defined as the proportion of AI-based low-skill assessments which are correct (see Fig. 6). The motivation for doing so stems from the expected use of these AI systems, where their low- or high-skill assessment predictions would inform decision-making (e.g., surgeon credentialing). As such, we were interested in what proportion of the AI-based assessments, $\hat{Y}$, matched the ground-truth assessment, $Y$, for a given set of $S$ surgeon sub-cohorts, $\{s_i\}_{i=1}^{S}$:

$$\mathrm{PPV}_{s_i} = \mathrm{P}(Y = \mathrm{high} \mid s_i, \hat{Y} = \mathrm{high}) \qquad (1)$$

$$\mathrm{NPV}_{s_i} = \mathrm{P}(Y = \mathrm{low} \mid s_i, \hat{Y} = \mathrm{low}) \qquad (2)$$
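Eqs. (1) and (2) can be computed directly from binary predictions stratified by sub-cohort. Below is a minimal sketch with placeholder arrays; the helper functions simply implement the two definitions above.

```python
# Computing Eqs. (1) and (2) per surgeon sub-cohort. Arrays are illustrative placeholders.
import numpy as np

def ppv(y_true, y_pred):
    """Proportion of AI-based high-skill (1) assessments that are correct."""
    mask = y_pred == 1
    return (y_true[mask] == 1).mean() if mask.any() else np.nan

def npv(y_true, y_pred):
    """Proportion of AI-based low-skill (0) assessments that are correct."""
    mask = y_pred == 0
    return (y_true[mask] == 0).mean() if mask.any() else np.nan

y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # ground-truth skill (1 = high, 0 = low)
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 1])   # AI-based skill assessment
cohort = np.array(["novice", "novice", "expert", "expert",
                   "novice", "expert", "expert", "novice"])

for s in np.unique(cohort):
    idx = cohort == s
    print(s, "PPV:", ppv(y_true[idx], y_pred[idx]), "NPV:", npv(y_true[idx], y_pred[idx]))
```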
Fig. 6: Visual definition of underskilling and overskilling bias in the context of binary skill assessment.

An underskilling bias is reflected by a discrepancy in the negative predictive value of AI-based predictions across sub-cohorts of surgeons (e.g., s1 = novice and s2 = expert), whereas an overskilling bias is reflected by a discrepancy in the positive predictive value.

Choosing threshold for evaluation metrics

SAIS outputs the probability, p ∈ [0, 1], that a video sample depicts a high-skill activity. As with any probabilistic output, to make a definitive prediction (skill assessment), we had to choose a threshold, τ, on this probability. Whereas p ≤ τ indicates a low-skill assessment, p > τ indicates a high-skill assessment. While this threshold is often informed by previously-established clinical evidence53 or a desired error rate, we did not have such prior information in this setting. We also balanced the number of video samples from each skill category during the training of SAIS. As such, we chose a threshold τ = 0.5 for our experiments. Changing this threshold did not affect the relative model performance values across surgeon sub-cohorts, and therefore left the bias findings unchanged.
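This robustness check can be sketched as a simple threshold sweep, reusing the npv helper and placeholder arrays from the previous snippet: if the ordering of sub-cohort NPVs is unchanged across thresholds, the bias finding is threshold-insensitive.

```python
# Sweep the decision threshold tau and check that the relative ordering of
# sub-cohort NPVs (i.e., the bias finding) does not change. Probabilities are
# illustrative; npv(), y_true, and cohort come from the previous snippet.
import numpy as np

probs = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.55])  # SAIS' P(high skill)

for tau in (0.3, 0.5, 0.7):
    y_pred = (probs > tau).astype(int)   # p > tau -> high skill (1), else low skill (0)
    npvs = {s: npv(y_true[cohort == s], y_pred[cohort == s]) for s in np.unique(cohort)}
    print(f"tau={tau}:", npvs)
```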

Quantifying the different types of bias

To examine and mitigate the bias exhibited by surgical AI systems, we first require a definition of bias. Although many exist in the literature, we adopt the definition, most commonly used in recent studies16,17,33, as a discrepancy in the performance of an AI system for different members, or sub-cohorts, of a group (e.g., surgeons with different experience levels). The choice of performance metric ultimately depends on the type of bias we are interested in examining. In this study, we focus on two types of bias: underskilling and overskilling.

Underskilling

In the context of skill assessment, underskilling occurs when an AI system erroneously downgrades surgical performance, predicting a skill to be of lower quality than it actually is. Using this logic with binary skill assessment (low vs. high skill), underskilling can be quantified by the proportion of AI-based low-skill predictions ($\hat{Y} = \mathrm{low}$) which should have been classified as high skill ($Y = \mathrm{high}$). This is equivalently reflected by the negative predictive value of the AI-based predictions (see Fig. 6). While it is also possible to examine the proportion of high-skill assessments which an AI system predicts to be low-skill (equivalently, one minus the true positive rate), we opt to focus on how AI-based low-skill predictions directly inform the decision-making of an end-user.

Overskilling

In the context of skill assessment, overskilling occurs when an AI system erroneously upgrades surgical performance, predicting a skill to be of higher quality than it actually is. Using this logic with binary skill assessment, overskilling can be quantified by the proportion of AI-based high-skill predictions ($\hat{Y} = \mathrm{high}$) which should have been classified as low skill ($Y = \mathrm{low}$). This is equivalently reflected by the positive predictive value of the AI-based predictions (see Fig. 6).

Underskilling and overskilling bias

Adopting the established definitions of bias16,17,33, and leveraging our descriptions of underskilling and overskilling, we define an underskilling bias as a discrepancy in the negative predictive value of AI-based predictions across sub-cohorts of surgeons (e.g., s1 and s2 when dealing with two sub-cohorts) (see Fig. 6). This concept naturally extends to the multi-class skill assessment setting (Supplementary Note 3). A larger discrepancy implies a larger bias. We similarly define an overskilling bias as a discrepancy in the positive predictive value of AI-based predictions across sub-cohorts of surgeons. Given our study’s focus on RARP surgical procedures, we examine bias exhibited against groups of (a) surgeons with different robotic caseloads (total number of robotic surgeries performed in their lifetime), and those operating on prostate glands of (b) different volumes, and (c) different cancer severity. We motivate this choice of groups in the next section.
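A minimal sketch of this quantification: bias is the gap in a sub-cohort-stratified metric. The NPV values below for needle handling at USC are taken from Fig. 2a; the PPV values are illustrative placeholders.

```python
# Bias quantified as the performance discrepancy across sub-cohorts.
def discrepancy(metric_by_cohort):
    vals = list(metric_by_cohort.values())
    return max(vals) - min(vals)

npv_by_cohort = {"novice": 0.71, "expert": 0.75}   # needle handling at USC (Fig. 2a)
ppv_by_cohort = {"novice": 0.82, "expert": 0.78}   # illustrative placeholder values

underskilling_bias = discrepancy(npv_by_cohort)    # ~0.04
overskilling_bias = discrepancy(ppv_by_cohort)     # ~0.04
```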

Motivation behind surgeon groups and sub-cohorts

We examined algorithmic bias against several surgeon groups. These included the volume of the prostate gland, the severity of the prostate cancer (Gleason Score), and the surgeon caseload. We chose these groups after consultation with a urologist (AH) about their relevance, and the completeness of the clinical meta-information associated with the surgical cases. It may seem counter-intuitive at first to identify surgeon groups based on, for example, the volume of the prostate gland on which they operate. After all, few surgeons make the decision to operate on patients based on such a factor. Although a single surgeon may not have a say over the volume of the prostate gland on which they operate, institution- or geography-specific patient demographics may naturally result in these groups. For example, we found that, in addition to differences in prostate volumes of patients within a hospital, there exists a difference in the distribution of such volumes across hospitals. Therefore, defining surgeon groups based on these factors still provides meaningful insight into algorithmic bias.

Defining surgeon sub-cohorts

In order to quantify bias as a discrepancy in model performance across sub-cohorts, we discretized continuous surgeon groups, where applicable, into two sub-cohorts. To define novice and expert surgeons, we built on previous literature which uses surgeon caseload, the total number of robotic surgeries performed by a surgeon during their lifetime, as a proxy54,55,56. As such, we define experts as having completed >100 robotic surgeries. As for prostate volume, we used the population median in the USC data to define prostate volume ≤49 ml and >49 ml. We used the population median (a) in order to have relatively balanced sub-cohorts, thus avoiding the undesirable setting where a sub-cohort has too few samples, and (b) based on previous clinical research, when available, that points to a meaningful threshold. For example, there is evidence which demonstrates that surgeons operating on large prostate glands experience longer operative times than their counterparts57,58. To facilitate the comparison of findings across hospitals, we used the same threshold and definition of the aforementioned sub-cohorts irrespective of the hospital.
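This discretization can be sketched as follows, assuming a table with one row per video sample; the column names are hypothetical, and the thresholds (>100 cases; the 49 ml USC median) are those stated above.

```python
# Sketch of the sub-cohort definitions. Column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "caseload": [12, 250, 90, 800, 101],            # lifetime robotic surgeries
    "prostate_volume_ml": [35, 60, 49, 72, 40],
})

df["caseload_cohort"] = np.where(df["caseload"] > 100, "expert", "novice")

usc_median_volume = 49  # population median of prostate volume in the USC data
df["volume_cohort"] = np.where(
    df["prostate_volume_ml"] > usc_median_volume, ">49 ml", "<=49 ml"
)
print(df)
```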

Implementation details to control for confounding factors

To determine the relative importance of other factors in automating skill assessment, and thus potentially acting as a source of the underskilling bias, we conducted several experiments. Without loss of generality, we conducted these experiments to check if surgeon caseload was a potential confounding factor of the observed underskilling bias. We chose this factor because we empirically observed an underskilling bias against surgeons with different caseloads and because it is plausible for a relationship to exist between caseload and the quality of surgical activity (e.g., in our dataset, 65% and 45% of assessments are considered low-skill for novice and expert surgeons, respectively).

First, after training SAIS on data from USC to assess the skill-level of needle handling, we deployed it on data from the test set (all 10 Monte Carlo cross-validation folds). Recall that SAIS returns a probabilistic value in [0, 1] indicating the probability of a high-skill surgical activity. Following the setup described in Pierson et al.59, we trained a logistic regression model in which the two independent variables are SAIS’ probabilistic output and the binary surgeon caseload (0 is ≤100 cases, 1 is >100 cases), and the dependent variable is the needle handling skill-level. Across the 10 folds, we found that the coefficient of SAIS’ probabilistic output is 0.82 (SD: 0.15). This suggests that caseload plays a minimal role in the assessment of surgeon skill, and is therefore an unlikely confounding factor.
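A sketch of this regression, assuming statsmodels, is shown below. The data are synthetic placeholders; in general, exponentiating a fitted logistic-regression coefficient yields the corresponding odds ratio.

```python
# Sketch of the confounding analysis: regress ground-truth skill on SAIS'
# probabilistic output and binarized caseload. Data are synthetic placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sais_prob = rng.uniform(0, 1, size=200)                        # SAIS' output in [0, 1]
caseload = rng.integers(0, 2, size=200)                        # 0: <=100 cases, 1: >100 cases
skill = (rng.uniform(0, 1, size=200) < sais_prob).astype(int)  # synthetic ground truth

X = sm.add_constant(np.column_stack([sais_prob, caseload]))
fit = sm.Logit(skill, X).fit(disp=0)
print(fit.params)           # coefficients: [intercept, sais_prob, caseload]
print(np.exp(fit.params))   # exponentiated coefficients (odds ratios)
```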

To provide further evidence in support of this claim, we conducted an additional study whereby we trained SAIS after having adopted a class and sub-cohort balancing strategy. Sub-cohort balancing implies having the same number of samples from each surgeon sub-cohort in each class. Class balancing implies having the same number of samples from each class. Given a training set of data, we first performed sub-cohort balancing by inspecting each class and under-sampling data (using a random seed = 0) from the surgeon sub-cohort with the larger number of samples. We then performed class balancing by under-sampling data (using a random seed = 0) from the class with the larger number of samples. The validation and test sets in each fold remained unchanged. The intuition is that if caseload were a confounding factor, and SAIS happened to be latching onto caseload-specific features in the videos, then balancing the training data as described above should eradicate, or at least alleviate, the observed underskilling bias. However, we found the underskilling bias persisted to the same degree as before, suggesting that caseload is unlikely to be a confounding factor.
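Below is a minimal sketch of this two-stage under-sampling, assuming a pandas DataFrame with one row per training video sample; the column names are hypothetical, and the random seed is 0, as stated above.

```python
# Two-stage balancing: (1) within each skill class, under-sample the larger
# sub-cohort; (2) under-sample the larger class. Column names are hypothetical.
import pandas as pd

def balance(df, class_col="skill", cohort_col="caseload_cohort", seed=0):
    # 1) sub-cohort balancing within each class
    parts = []
    for _, cls in df.groupby(class_col):
        n = cls[cohort_col].value_counts().min()
        parts.append(cls.groupby(cohort_col, group_keys=False)
                        .apply(lambda g: g.sample(n=n, random_state=seed)))
    balanced = pd.concat(parts)
    # 2) class balancing
    n = balanced[class_col].value_counts().min()
    return (balanced.groupby(class_col, group_keys=False)
                    .apply(lambda g: g.sample(n=n, random_state=seed)))

# Usage: balanced_train_df = balance(train_df); validation and test sets stay unchanged.
```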

TWIX as a bias mitigation strategy

To mitigate algorithmic bias, we adopted a strategy—TWIX28—which involves teaching an AI system to complement its predictions with an explanation that matches one provided by a human. Please refer to the section titled TWIX mitigates underskilling bias across hospitals for the motivation behind our adopting this approach. Next, we briefly outline the mechanics of TWIX and the process we followed for acquiring ground-truth explanation labels.

Mechanics of TWIX

Conceptually, TWIX is a module (i.e., a neural network), which receives a representation of data as input and returns the probability of the importance of that representation. This module is trained in a supervised manner to match the ground-truth importance label assigned to that representation. In the next section, we outline the form of these importance labels and how they are provided by human experts. As such, TWIX can be used whenever an AI system extracts representations of data and ground-truth importance labels are available.

In this study, we used TWIX alongside a video-based surgical AI system, which extracts representations of individual frames in a surgical video, where each frame was annotated as depicting either relevant or irrelevant information for the task of surgeon skill assessment. The parameters of TWIX are learned alongside those of the main AI system (SAIS) in an end-to-end manner by optimizing an objective function with two terms: one to assess surgeon skill, and another to identify important frames. This is akin to multi-task learning. To emulate a realistic deployment setting, in which an AI system is trained on data from one hospital and deployed on data from another hospital, we learn these parameters using ground-truth explanation annotations (outlined next) exclusively from USC.
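Conceptually, this corresponds to adding a per-frame importance head and a second loss term to the skill-assessment objective. The sketch below, with an illustrative loss weight and hypothetical module names, conveys the idea; it is not the released TWIX implementation.

```python
# Schematic sketch of TWIX as a multi-task add-on: a small head predicts the
# importance of each frame, and its loss is added to the skill-assessment loss.
import torch
import torch.nn as nn

class TWIXHead(nn.Module):
    def __init__(self, dim: int = 384):
        super().__init__()
        self.net = nn.Linear(dim, 1)   # per-frame importance logit

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, dim) -> (batch, num_frames)
        return self.net(frame_feats).squeeze(-1)

skill_loss_fn = nn.CrossEntropyLoss()        # low vs. high skill
twix_loss_fn = nn.BCEWithLogitsLoss()        # important vs. unimportant frame

def joint_loss(skill_logits, skill_labels, frame_logits, frame_importance, alpha=1.0):
    # frame_importance: 0/1 ground-truth relevance of each frame (from human raters)
    return skill_loss_fn(skill_logits, skill_labels) + alpha * twix_loss_fn(
        frame_logits, frame_importance.float()
    )

# Example with random tensors (4 samples, 30 frames, 384-dim features)
frame_logits = TWIXHead()(torch.randn(4, 30, 384))
loss = joint_loss(torch.randn(4, 2), torch.randint(0, 2, (4,)),
                  frame_logits, torch.randint(0, 2, (4, 30)))
```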

Acquisition of explanation labels

To train the TWIX module, as presented in this study, we require ground-truth labels reflecting the relevance of each individual frame in a surgical video. A frame is considered relevant if it depicts (or violates) the stringent set of skill assessment criteria laid out in a previously-established taxonomy52. Surgeon skill assessment is often based on both visual and motion cues in the surgical video. Therefore, in practice, a span of frames was identified as relevant or irrelevant, as opposed to a single frame. Specifically, a team of two trained human raters was asked to highlight the regions of time (equivalently, the span of video frames) deemed most relevant (e.g., 1 = important, 0 = not important) for the assigned skill assessment.
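As a minimal sketch under assumed values, the annotated spans can be converted into the per-frame 0/1 labels that the TWIX module is trained on, given the frame rate at which frames were sampled.

```python
# Convert annotated time spans (in seconds) into per-frame 0/1 importance labels.
# The span, number of frames, and frame rate below are illustrative values.
import numpy as np

def spans_to_frame_labels(spans, num_frames, fps):
    labels = np.zeros(num_frames, dtype=int)
    for start_s, end_s in spans:
        start_f = max(0, int(round(start_s * fps)))
        end_f = min(num_frames, int(round(end_s * fps)))
        labels[start_f:end_f] = 1
    return labels

labels = spans_to_frame_labels(spans=[(1.0, 5.0)], num_frames=300, fps=10)  # frames 10-49 -> 1
```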

Training the raters

Before embarking on the annotation process, a team of two human raters already familiar with the skill assessment taxonomy were trained to identify relevant aspects of video samples. As part of the training, each rater was then provided with the same set of training video samples to be independently annotated with explanations. Since these annotations were temporal (i.e., covering spans of frames), we quantified their agreement by calculating their intersection over union (IoU). This training process continued until the IoU >0.8, implying an 80% overlap in the frames annotated by the raters.
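The agreement criterion can be sketched as a temporal intersection-over-union between the two raters' spans (in seconds):

```python
# Temporal IoU between two annotated spans (start, end) in seconds; rater
# training continued until this exceeded 0.8.
def span_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

print(span_iou((0.0, 5.0), (1.0, 6.0)))  # 4/6 ≈ 0.67
```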

Annotating the video samples

After completing the training process, raters were asked to annotate video samples from USC. Specifically, they only annotated video samples associated with low-skill assessments because (a) it is often these assessments that require further intervention (e.g., for surgeon learning and improvement purposes), and (b) it is more straightforward, and less ambiguous, to annotate relevant frames for a low-skill assessment than for a high-skill assessment. The former, for example, often arises from violating one or more of the criteria identified in the skill assessment taxonomy. In the event of disagreement in the annotations of raters, we only considered the intersection of the annotations. The intuition is that we wanted to avoid annotating potentially superfluous frames as important. For example, if the first rater (R1) identified the span 0−5 s as important whereas the second rater (R2) identified the span 1−6 s as important, the final annotation became 1−5 s (the intersection). We experimented with other approaches, such as considering the union, which we found to have a minimal effect on our findings.
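This disagreement-resolution rule maps directly onto the worked example above:

```python
# Keep only the overlap of the two raters' spans (e.g., 0-5 s and 1-6 s -> 1-5 s).
def span_intersection(a, b):
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if end > start else None  # None if the spans do not overlap

print(span_intersection((0.0, 5.0), (1.0, 6.0)))  # (1.0, 5.0)
```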

Measuring the effectiveness of TWIX as a bias mitigation strategy

The mitigation of algorithmic bias in surgical AI systems is critical as it reduces the rate at which groups are disadvantaged, increases the likelihood of clinical adoption by surgeons, and improves the ethical deployment of AI systems. Recall that we first defined bias as a discrepancy in the performance of an AI system across different groups (e.g., surgeon experience levels). Traditionally, algorithmic bias has been considered mitigated if such a discrepancy is reduced. However, this reduction, which would take place via a bias mitigation strategy, can manifest in several ways, thus obfuscating unintended effects. For example, a reduction can occur through unchanged performance for the disadvantaged group and worsened performance for the remaining group. It is debatable whether this behavior is desirable and ethical. An even more problematic scenario is one where the reduction in the discrepancy comes at the expense of depressed performance for all groups. This implies that each group would be further disadvantaged by the AI system.

One way to overcome these limitations is by monitoring the improvement in the AI system’s performance for the disadvantaged group, known as the worst-case performance27. Specifically, an improvement in the worst-case performance is a desirable objective in the context of bias mitigation as it implies that an AI system is lowering the rate at which it mistreats the disadvantaged group (e.g., through a lower error rate). We do note, though, that it remains critical to observe the effect of a bias mitigation strategy on all groups (see Discussion section). To that end, we adopted this objective when evaluating whether or not a bias mitigation strategy achieved its desired effect.
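A minimal sketch of this evaluation criterion, with illustrative numbers: a mitigation strategy is considered effective if the worst-case (minimum) sub-cohort NPV improves.

```python
# Worst-case NPV before and after a mitigation strategy; an increase is taken
# as evidence of bias mitigation. Numbers are illustrative.
def worst_case(metric_by_cohort):
    return min(metric_by_cohort.values())

npv_before = {"novice": 0.71, "expert": 0.75}
npv_after = {"novice": 0.73, "expert": 0.75}    # e.g., after adopting TWIX

wc_before, wc_after = worst_case(npv_before), worst_case(npv_after)
percent_change = 100 * (wc_after - wc_before) / wc_before
print(f"worst-case NPV: {wc_before:.2f} -> {wc_after:.2f} ({percent_change:+.1f}%)")
```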

Description of other bias mitigation strategies

In an attempt to mitigate the bias exhibited by SAIS, we also experimented with two additional strategies, which we outline below.

Strategy 1—training with additional data (TWAD)

Data-centric approaches, those which focus on curating a fit-for-purpose dataset with minimal annotation noise on which a model is trained, hold promise for improving the fairness of AI systems33. To that end, we explored the degree to which adding data to the training set would mitigate bias. Specifically, we retrieved video samples annotated (by the same team of human raters) as depicting medium skill-level and added them to the video samples from the low-skill category. The intuition was that exposing SAIS to more representative training data could help alleviate its bias. After adding these video samples to the low-skill category, we continued to balance the number of video samples from each category (low vs. high-skill). Concretely, before adding data, SAIS was trained on 742 video samples, 50% of which belonged to the low-skill category. After adding data, SAIS was trained on 1522 samples with the same distribution of skill categories.

Strategy 2—surgical video pre-training (VPT)

Some evidence has demonstrated that exposing an AI system to vast amounts of unlabeled data before training can help mitigate bias34. To that end, we pre-trained a vision transformer in a self-supervised manner with images of the VUA step from all three hospitals: USC, SAH, and HMH. This amounted to ≈8 million images in total. We followed the same pre-training setup as that described in DINO60. In short, DINO leverages a contrastive objective function in which representations of augmented versions of an image (e.g., randomly cropped, rotated, etc.) are encouraged to be similar to one another. We conducted pre-training for two full epochs on an NVIDIA Titan RTX, lasting 48 h.
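The core self-supervised idea can be sketched as follows: two augmented views of the same surgical frame are encouraged to produce similar representations. This is a deliberately simplified illustration, not the DINO recipe (which uses a student-teacher pair with momentum updates and output centering/sharpening), and the encoder here is a toy stand-in for the vision transformer.

```python
# Highly simplified illustration of self-supervised pre-training on surgical frames:
# two augmentations of the same image should map to similar representations.
# This is NOT the DINO implementation; the encoder below is a toy stand-in for a ViT.
import torch
import torch.nn as nn
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))  # toy encoder

def two_view_loss(images):
    # images: (batch, 3, H, W) float tensors in [0, 1]
    v1 = torch.stack([augment(img) for img in images])
    v2 = torch.stack([augment(img) for img in images])
    z1, z2 = encoder(v1), encoder(v2)
    return 1 - nn.functional.cosine_similarity(z1, z2, dim=-1).mean()

loss = two_view_loss(torch.rand(4, 3, 256, 256))
```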

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.