Obtaining Cross Modal Similarity Metric with Deep Neural Architecture

<div>The MLP system used in our experiments. In the lower right corner, an image is represented using MPEG-7 and gist descriptors, forming an input vector of 1,704 neurons. In the upper right corner, the corresponding tag words are first represented using the bag-of-words (BOW) model; a replicated softmax RBM with 4,000 visible neurons and 1,024 hidden neurons is then used to learn a text representation. Finally, in the dashed box, from bottom to top, an MLP with two hidden layers learns the mapping from the image modality to the text modality.</div>
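
The image-to-text mapping described in the caption can be illustrated with a minimal NumPy sketch. Only the input size (1,704) and the text representation size (1,024) come from the caption. The hidden layer widths, the sigmoid activations, the random initialization, and the use of Euclidean distance as the cross-modal measure are all assumptions made for illustration, not the authors' stated configuration, and the names `init_mlp`, `forward`, and `cross_modal_distance` are hypothetical.

```python
import numpy as np

IMAGE_DIM = 1704  # from the caption: MPEG-7 + gist descriptor vector
TEXT_DIM = 1024   # from the caption: hidden units of the replicated softmax RBM
HIDDEN_1 = 1024   # assumption: hidden layer widths are not given in the caption
HIDDEN_2 = 1024   # assumption


def init_mlp(rng, sizes):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        weights = rng.normal(0.0, 0.01, size=(fan_in, fan_out))
        bias = np.zeros(fan_out)
        params.append((weights, bias))
    return params


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(params, image_features):
    """Map image features to the text representation space.

    Sigmoid is applied at every layer (an assumption), so outputs lie
    in (0, 1), like RBM hidden-unit activation probabilities.
    """
    h = image_features
    for weights, bias in params:
        h = sigmoid(h @ weights + bias)
    return h


def cross_modal_distance(params, image_features, text_codes):
    """Euclidean distance between mapped images and text codes (assumed metric)."""
    predicted = forward(params, image_features)
    return np.linalg.norm(predicted - text_codes, axis=-1)


rng = np.random.default_rng(0)
# Two hidden layers between the image input and the text-space output.
params = init_mlp(rng, [IMAGE_DIM, HIDDEN_1, HIDDEN_2, TEXT_DIM])

images = rng.random((4, IMAGE_DIM))     # placeholder image descriptors
text_codes = rng.random((4, TEXT_DIM))  # placeholder RBM text representations

predicted = forward(params, images)
distances = cross_modal_distance(params, images, text_codes)
```

Here `predicted` has one 1,024-dimensional row per image, and `distances` holds one value per image-text pair. In practice the weights would be trained so that the mapped image features approach the RBM text codes; the sketch shows only the forward pass.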

Mathematical Problems in Engineering

Figure 4: Obtaining Cross Modal Similarity Metric with Deep Neural Architecture