Visual attention mechanisms are known to be important components of modern computer vision systems and are an inherent part of state-of-the-art achievements in almost all fields: object detection, image-captioning, and more. Most conventional visual attention mechanisms use medical image captioning and VQA (visual question answering) from a Top-Down approach, a task-specific method that assigns captions based on selectively determined weightings of image features. The opposing Bottom-Up approach is a purely visual feed-forward attention mechanism that first determines targeted image regions and then assigns feature vectors to those regions.
A recent paper takes the concept of visual attention and image captioning one step further by combining both a top-down and bottom-up approach to assigning captions to images. The mechanisms they used can be divided into two categories: detection proposal and global attention mechanisms.
1.) Detection proposals, such as the Faster R-CNN (RPN) proposals. The ROI-Pooling operation is an attention mechanism that enables the second stage of the detector to attend only to the relevant features. The disadvantage of this approach is that it doesn’t use information outside of that proposal that can be very important for classifying it correctly in the second stage.
2.) Global attention mechanisms, which re-weight the entire feature map according to a learned attention “heat map”. The disadvantage is this approach does not use information about objects in the image to generate the attention map.
The authors of the paper combine the two approaches into one to mitigate their individual disadvantages. They generate the attention map over the proposals created by the RPN, rather than an attention map over the global feature map. This is a strong mechanism that is illustrated in the images below.
Figure 1. Qualitative differences between attention methodologies in caption generation. A red box outlines a weighted and attended image region for which a generated word is attached to. The Resnet model (top) hallucinates a toilet in the unusual picture of a bathroom containing a couch. This model generates a poor and incorrectly labeled caption of “toilet” when no toilet exists. The Up-Down model (bottom) clearly identifies the out-of-context couch, generating a correct caption while also providing more interpretable attention weights.
To implement this approach, the authors used Faster R-CNN to generate the 36 top proposals and ROI-Pool each proposal to a 2048-d feature map (with average pooling).
These pooled feature maps were averaged into a single feature map and fed into the attention LSTM. The output of the attention LSTM is a weight vector of size 36 (one weight for each proposal).
The next stage of the process is to calculate the attended feature map, by summing all of the pooled feature maps according to their predicted weights. These attended feature maps can be used as an input for a second network that performs the actual task. In the paper, it was served as an input to another LSTM which generated a single word for the image captioning task at each timestep.
This attention mechanism can be very valuable in many tough domains. For example, in the world of deep learning medical imaging, there are many possible use cases for this mechanism. In brain CT-scan analysis, if there is a proposal for a brain hemorrhage in the right hemisphere of the brain, and on the other side there is no such proposal – it significantly increases the probability that this proposal is an important abnormality. If a proposal exists on both hemispheres – it significantly decreases the probability of it being a hemorrhage.
Since the brain is far from being a perfectly symmetric structure – it is not possible to solve this kind of problem using actual image mirroring. This kind of combined Top-Bottom and Bottom-Up visual attention mechanism can attend to each proposal and use information about the objects that are relevant as context for making these kinds of deep learning decisions.