Review Article

The Understanding of Deep Learning: A Comprehensive Review

Figure 3

From image to text. Captions generated by a recurrent neural network (RNN) that takes, as extra input, the representation extracted from a test image by a deep convolutional neural network (CNN); the RNN is trained to “translate” high-level image representations into captions. When the RNN is allowed to focus its attention on a different location in the input image (middle and bottom; lighter patches received more attention) as it generates each word (underlined), it exploits this ability to produce better “translations” of images into captions (https://casmls.github.io/general/2016/10/16/attention_model.html).
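The attention step the caption describes, where the decoder weighs image locations differently as it emits each word, can be sketched as additive soft attention over CNN feature vectors. The sketch below is a minimal NumPy illustration under assumed dimensions; the function name `soft_attention` and all weight matrices are hypothetical, not the figure's actual model.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    # features: (L, D) CNN feature vectors, one per image location
    # hidden:   (H,)   current RNN decoder state
    # Additive attention: score each location against the decoder state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over locations
    context = weights @ features                          # (D,) attended image summary
    return context, weights

rng = np.random.default_rng(0)
L, D, H, A = 6, 4, 5, 3   # locations, feature dim, hidden dim, attention dim
features = rng.normal(size=(L, D))
hidden = rng.normal(size=H)
W_f = rng.normal(size=(D, A))
W_h = rng.normal(size=(H, A))
v = rng.normal(size=A)

context, weights = soft_attention(features, hidden, W_f, W_h, v)
print(weights)  # the "lighter patches" in the figure correspond to larger weights
```

At each decoding step the context vector is fed to the RNN alongside the previous word, so the weights shift from location to location as the sentence is generated.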