Axel Wismüller, Larry Stockmaster
The quantitative evaluation of Artificial Intelligence (AI) systems in a clinical context is challenging, and the development and implementation of meaningful performance metrics are still in their infancy. Here, we propose a scientific concept, Artificial Intelligence Prospective Randomized Observer Blinding Evaluation (AI-PROBE), for the quantitative clinical performance evaluation of radiology AI systems within prospective randomized clinical trials. Our evaluation workflow encompasses a study design and a corresponding radiology Information Technology (IT) infrastructure that randomly blinds radiologists with regard to the presence of positive reads as provided by AI-based image analysis systems. To demonstrate the applicability of our AI evaluation framework, we present a first prospective randomized clinical trial investigating the effect of automatic identification of Intra-Cranial Hemorrhage (ICH) in emergent care head CT scans on radiology study Turn-Around Time (TAT) in a clinical environment. We acquired 620 consecutive non-contrast head CT scans from CT scanners used for inpatient and emergency room patients at a large academic hospital over a period of 14 consecutive days. Immediately following image acquisition, scans were automatically analyzed for the presence of ICH using commercially available software (Aidoc, Tel Aviv, Israel).
Cases identified as positive for ICH by AI (ICH-AI+) were automatically flagged in the radiologists’ reading worklists, with flagging randomly switched off with a probability of 50%. Study TAT was measured automatically as the time difference between study completion and the first clinically communicated study report, with time stamps for these events automatically retrieved from various radiology IT systems. TATs for flagged cases (73 ± 143 min) were significantly shorter than TATs for non-flagged cases (132 ± 193 min; p<0.05, one-sided t-test). Of the 122 ICH-AI+ cases, 105 were true positive reads. Overall sensitivity, specificity, and accuracy over all analyzed cases were 95.0%, 96.7%, and 96.4%, respectively. We conclude that automatic identification of ICH reduces study TAT for ICH in emergent care head CT settings, which carries the potential to improve clinical management of ICH by accelerating clinically indicated therapeutic interventions. In a broader context, our results suggest that our AI-PROBE framework can contribute to a systematic quantitative evaluation of AI systems in a clinical workflow environment with regard to clinically meaningful performance measures, such as TAT or diagnostic accuracy metrics.
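The randomized flagging and TAT measurement described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names (`assign_flag`, `tat_minutes`, `welch_t`), the fixed-seed random generator, and the choice of a Welch-type statistic are assumptions made for illustration only, since the abstract specifies just a 50% switch-off probability and a one-sided t-test.

```python
import random
from datetime import datetime
from statistics import mean, stdev


def assign_flag(ai_positive: bool, rng: random.Random, p_off: float = 0.5) -> bool:
    """Decide whether a case is flagged in the reading worklist.

    Only AI-positive cases can be flagged; for those, flagging is
    switched off at random with probability p_off (50% in the abstract).
    """
    if not ai_positive:
        return False
    return rng.random() >= p_off


def tat_minutes(study_completed: datetime, first_report: datetime) -> float:
    """Turn-around time: study completion to first communicated report, in minutes."""
    return (first_report - study_completed).total_seconds() / 60.0


def welch_t(sample_a, sample_b) -> float:
    """Welch t statistic for mean(sample_a) - mean(sample_b).

    Assumption: the abstract does not say which t-test variant was used.
    A negative value means sample_a has the smaller mean.
    """
    var_a = stdev(sample_a) ** 2 / len(sample_a)
    var_b = stdev(sample_b) ** 2 / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / (var_a + var_b) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the sketch is reproducible
    print(assign_flag(True, rng))
    print(tat_minutes(datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 11, 13)))
```

Turning the statistic into a one-sided p-value requires the t-distribution; with SciPy available, `scipy.stats.ttest_ind(flagged, unflagged, equal_var=False, alternative="less")` would test whether flagged TATs are shorter.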