Research Article

Obtaining Cross Modal Similarity Metric with Deep Neural Architecture

Figure 1

A deep framework is used for measuring the similarity of cross-modal data such as images and text. From left to right: first, classical methods for each modality are used to extract basic modality-specific features. For example, we use MPEG-7, Gist, and other well-known feature descriptors for images, and the bag-of-words model for tags. Second, for each modality two RBMs are stacked to extract intermediate modality-specific features: for images, we stack a binary RBM over a Gaussian RBM; for text, we stack a binary RBM over a replicated softmax. Third, an autoencoder with a similarity constraint is used to extract similar representations. The number in each box is the number of neurons in that layer.
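The three-stage pipeline in the caption can be sketched as a forward pass: modality-specific features are fed through two stacked RBM layers per modality, then through modality-specific encoders into a shared code space where similarity is measured. The sketch below is illustrative only; the layer sizes, random weights, and cosine similarity are assumptions for demonstration, not the paper's actual parameters, and RBM pre-training itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer sizes (the real sizes are given in the figure boxes).
img_dim, txt_dim, h1, h2, code_dim = 512, 2000, 1024, 1024, 128

# Stage 2: two stacked RBMs per modality (only the recognition/forward
# pass is shown; contrastive-divergence training is omitted).
W_img1 = rng.normal(0, 0.01, (img_dim, h1))  # Gaussian RBM weights
W_img2 = rng.normal(0, 0.01, (h1, h2))       # binary RBM weights
W_txt1 = rng.normal(0, 0.01, (txt_dim, h1))  # replicated-softmax weights
W_txt2 = rng.normal(0, 0.01, (h1, h2))       # binary RBM weights

# Stage 3: autoencoder encoders mapping both modalities into a shared
# code space, where cross-modal similarity can be measured.
W_img_code = rng.normal(0, 0.01, (h2, code_dim))
W_txt_code = rng.normal(0, 0.01, (h2, code_dim))

def encode_image(x):
    h = sigmoid(x @ W_img1)        # Gaussian RBM hidden units
    h = sigmoid(h @ W_img2)        # binary RBM hidden units
    return sigmoid(h @ W_img_code) # shared-space code

def encode_text(x):
    h = sigmoid(x @ W_txt1)        # replicated-softmax hidden units
    h = sigmoid(h @ W_txt2)        # binary RBM hidden units
    return sigmoid(h @ W_txt_code) # shared-space code

def similarity(img_feat, txt_feat):
    # Cosine similarity between the two codes (one possible metric).
    a, b = encode_image(img_feat), encode_text(txt_feat)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img = rng.normal(size=img_dim)                       # e.g. a Gist feature vector
txt = rng.poisson(0.1, size=txt_dim).astype(float)   # bag-of-words counts
s = similarity(img, txt)
```

Because the weights here are untrained, the similarity value is meaningless; in the actual framework the RBMs are pre-trained layer by layer and the autoencoder is fine-tuned with the similarity constraint so that matching image-text pairs map to nearby codes.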