Neural networks are now practically doing most of the coding instead of us (and better than us) in many fields, but we don’t have a satisfactory methodology for testing their behavior.
A very interesting recent paper that was covered on “The Morning Paper” blog, shows a simple and elegant approach for testing neural networks.
Basically, they say that current DL testing methods, ‘depend heavily on manually labeled data and therefore often fail to expose erroneous behaviors for rare inputs’.
Even if your test set is relatively large, it still might activate only a subset of the network’s neurons, and the rest will not be tested. The untested part could still be activated in the real world by rare examples, and cause extreme and unexpected behavior.
They learn realistic augmentations to the test images, to maximize two objectives:
1) Similar to code coverage, which is a well-known software testing technique, they try to maximize “neuron coverage” during their tests. They do this in order to maximize the probability they uncover as much of the possible (even if rare) wrong behaviors that were learned by the neural net.
2) They take an ensemble of at least two comparable but independently trained models. They try to find samples which maximize the different models disagreement. To the best of my understanding, they use this approach since the images they’re using are learned augmentations, so they have no verified ground-truth for them.
Maximizing both of these objectives in parallel helps uncover rare and potentially dangerous behaviors.
They limit their learned augmentations by pre-defining the different transformations. For example – changing the lighting by adding a constant to all pixel values, and they learn the correct value that maximizes both objectives above. They don’t allow unrealistic adversarial examples which contain only tiny perturbations that will be undetected by the human eye.
In the image, you can see an example of a synthetic input learned by using their approach (changing the lighting only), that induced extreme and dangerous behavior in a self-driving car network (crashing into the rails).
We are aware of these behaviors and use the most advanced methods, such as this one, to mitigate them. From what we’ve seen, this kind of phenomena is less prevalent in medical imaging, since the image acquisition constellation in medical imaging is relatively fixed (parameters like lighting power, sensor distance, etc. are relatively constant between different studies). In parallel, to us, this kind of behaviors supports our mission of building A.I. that augments the Radiologist, while still keeping the radiologist at the center of decision making.