Mathematical Problems in Engineering

Volume 2015, Article ID 293176, 9 pages

http://dx.doi.org/10.1155/2015/293176

## Obtaining Cross Modal Similarity Metric with Deep Neural Architecture

^{1}School of Computers, Beijing University of Posts and Telecommunications, Beijing 100876, China

^{2}Engineering Research Center of Information Networks, Ministry of Education, Beijing 100876, China

Received 14 September 2014; Accepted 24 December 2014

Academic Editor: Florin Pop

Copyright © 2015 Ruifan Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Analyzing complex systems with multimodal data, such as images and text, has recently received tremendous attention. Modeling the relationship between different modalities is the key to addressing this problem. Motivated by recent successful applications of deep neural learning to unimodal data, in this paper we propose a computational deep neural architecture, the bimodal deep architecture (BDA), for measuring the similarity between different modalities. Our proposed BDA has three closely related consecutive components. For the image and text modalities, the first component can be constructed using popular feature extraction methods within each individual modality. The second component consists of two types of stacked restricted Boltzmann machines (RBMs). Specifically, for the image modality a binary-binary RBM is stacked over a Gaussian-binary RBM; for the text modality a binary-binary RBM is stacked over a replicated softmax RBM. In the third component, we introduce a variant autoencoder with a predefined loss function for discriminatively learning the regularity between different modalities. We experimentally show the effectiveness of our approach on the task of classifying image tags on publicly available datasets.

#### 1. Introduction

Recently, there has been a growing demand for analyzing complex systems with a great number of variables [1, 2], such as multimodal data comprising images and text, owing to the availability of computational power and massive storage. For one thing, information often naturally comes in multiple modalities with a large number of variables. For example, a travel photo shared on a website is usually tagged with some meaningful words. For another, analyzing such heterogeneous data from multiple sources could benefit each individual modality. For instance, a speaker's articulation and muscle movement can often help disambiguate between speech sounds with similar phones.

Over the past few years, motivated by the biological propagation phenomena in the distributed structure of the human brain, deep neural learning has received considerable attention since 2006. These deep neural learning methods are designed to learn hierarchical and effective representations that facilitate various recognition and analysis tasks in complex artificial systems. Even within this short period of development, deep neural learning has achieved great success in tasks of modeling unimodal data, such as speech recognition systems [3–6] and computer vision systems [7–12], to name a few.

Motivated by this progress in deep neural learning, in this paper we endeavor to construct a computational deep architecture for measuring the similarity between modalities in a complex multimodal system with a large number of variables. Our proposed framework, the bimodal deep architecture (BDA), has three closely related consecutive components. For the image and text modalities, the first component can be constructed by popular feature extraction methods within each individual modality. The second component has two types of stacked restricted Boltzmann machines (RBMs). Specifically, for the image modality a Bernoulli-Bernoulli RBM (BB-RBM) is stacked over a Gaussian-Bernoulli RBM (GB-RBM); for the text modality a BB-RBM is stacked over a replicated softmax RBM (RS-RBM). In the third component, we introduce a variant autoencoder with a predefined loss function for discriminatively learning the regularity hidden between the modalities.
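The greedy layer-wise stacking in the second component can be illustrated with a minimal sketch. The class below is a toy Bernoulli-Bernoulli RBM trained with one-step contrastive divergence (CD-1); all names, sizes, and hyperparameters are illustrative assumptions, not the paper's actual implementation, and the Gaussian-Bernoulli and replicated-softmax variants are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BBRBM:
    """Toy Bernoulli-Bernoulli RBM trained with one-step contrastive divergence."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def cd1_step(self, v0):
        # Positive phase, one Gibbs step, then the CD-1 parameter update.
        h0 = self.hidden_probs(v0)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ self.W.T + self.b)   # mean-field reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)

# Greedy layer-wise stacking for one modality: train the first RBM on the
# low-level features, then train the second RBM on the first layer's outputs.
X = (rng.random((32, 64)) > 0.5).astype(float)   # toy binary feature vectors
rbm1, rbm2 = BBRBM(64, 32), BBRBM(32, 16)
for _ in range(20):
    rbm1.cd1_step(X)
H1 = rbm1.hidden_probs(X)
for _ in range(20):
    rbm2.cd1_step(H1)
codes = rbm2.hidden_probs(H1)    # top-layer representation, shape (32, 16)
```

In the full architecture, one such stack per modality would feed its top-layer code into the third component.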

It is worthwhile to highlight several aspects of the BDA proposed in this paper.

(i) In the first component of the BDA, for the image modality, three methods are utilized in our setting. However, more feature extraction methods could be explored.

(ii) In the second component of the BDA, we stack two RBMs for each modality. In principle, more RBMs could be stacked to obtain a more effective representation.

(iii) In the third component of the BDA, motivated by deep neural architectures, we design a loss function that keeps the distance small for semantically similar bimodal data and makes it large for semantically dissimilar data.

(iv) The work in this paper primarily focuses on image and text bimodal data. However, the BDA presented here can be naturally extended to other bimodal data.
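A loss of the kind described in point (iii) can be sketched as a standard margin-based contrastive loss; the function below is an illustrative assumption in the spirit of the third component, not the paper's exact formulation, and `margin` is a hypothetical hyperparameter.

```python
import numpy as np

def contrastive_loss(img_code, txt_code, similar, margin=1.0):
    """Pull matched image/text codes together; push mismatched pairs
    at least `margin` apart (illustrative margin-based contrastive loss)."""
    d = np.linalg.norm(img_code - txt_code)
    if similar:
        return 0.5 * d ** 2                      # penalize any distance
    return 0.5 * max(0.0, margin - d) ** 2       # penalize only close impostors

# A matched pair far apart is penalized; a mismatched pair already
# beyond the margin incurs no loss.
contrastive_loss(np.array([0., 0.]), np.array([3., 4.]), similar=True)   # → 12.5
contrastive_loss(np.array([0., 0.]), np.array([3., 4.]), similar=False)  # → 0.0
```

Minimizing this over matched and mismatched image/text pairs yields the desired small-distance/large-distance behavior.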

The remainder of this paper is organized as follows. Section 2 describes and discusses the related work. Section 3 presents our deep architecture and its learning algorithm. Section 4 introduces the datasets, describes the other two methods for comparisons, and reports the experimental results. Finally Section 5 draws the conclusion and discusses the future work.

#### 2. Related Work

There have been several approaches to learning from cross modal data with many variables. In particular, Blei and Jordan [13] extend latent Dirichlet allocation (LDA) by mining the topic-level relationship between images and text annotations. Xing et al. [14] build a joint model to integrate images and text, which can be viewed as an undirected extension of LDA. Jia et al. [15] propose a combination of the undirected Markov random field and the directed LDA. However, this type of model, with a single hidden layer, is unable to obtain efficient representations because of the complexity of images and text.

Recently, motivated by deep neural learning, Chopra et al. [16] propose to learn a function such that the norm in the embedded space approximates the semantic distance. This learned network, however, keeps only half of the structure and only fits for unimodal data. Ngiam et al. [17] use a deep autoencoder for vision and speech fusion. Srivastava and Salakhutdinov [18] develop a deep Boltzmann machine as a generative model for images and text. However, these two works focus on cross modal retrieval but not the similarity metric.

Another line of research focuses on bimodal semantic hashing, which tries to represent data as binary codes. The Hamming metric is then applied to the learned codes as the measure of similarity. McFee and Lanckriet [19] propose a framework based on multimodal kernel learning approaches. However, this framework is limited to linear projections. Most similar to our work, Masci et al. [20] propose a framework based on the neural autoencoder to merge multiple modalities into a single representation space. However, this framework can only be used for labeled bimodal data.
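For concreteness, the Hamming metric used in the hashing line of work simply counts the bit positions where two binary codes differ; a minimal sketch:

```python
import numpy as np

def hamming(code_a, code_b):
    """Hamming distance: number of bit positions where two binary codes differ."""
    a, b = np.asarray(code_a), np.asarray(code_b)
    return int(np.count_nonzero(a != b))

hamming([1, 0, 1, 1], [1, 1, 1, 0])  # → 2
```

Its appeal for retrieval is that it reduces to cheap bitwise operations on packed codes.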

#### 3. BDA for Cross Modal Similarity

The main idea of our deep framework is to construct hierarchical representations of bimodal data. This framework, as shown in Figure 1, has three closely related consecutive components. In the first component, low-level representations for the two types of data are obtained by classical single-modal methods. For images, features extracted by four descriptors in MPEG-7 are combined with gist features to form the low-level representations. For tag words, the typical bag-of-words (BOW) model is used to build the low-level representations.
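The BOW encoding of the tag words can be sketched as follows; the vocabulary here is a toy example, whereas the paper's actual vocabulary would be built from the dataset's tag set.

```python
from collections import Counter

def bow_vector(tags, vocabulary):
    """Count occurrences of each vocabulary word in a tag list."""
    counts = Counter(tags)
    return [counts.get(w, 0) for w in vocabulary]

vocab = ["beach", "sunset", "dog", "city"]       # toy vocabulary
bow_vector(["beach", "sunset", "beach"], vocab)  # → [2, 1, 0, 0]
```

Each tagged image thus yields a fixed-length count vector that serves as the text modality's low-level representation.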