Computational Intelligence and Neuroscience

Volume 2015, Article ID 650527, 9 pages

http://dx.doi.org/10.1155/2015/650527

## Learning Document Semantic Representation with Hybrid Deep Belief Network

^{1}Department of Computer Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China

^{2}Key Laboratory of Computational Linguistics, Peking University, Ministry of Education, Beijing 100871, China

^{3}Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Received 26 September 2014; Revised 2 March 2015; Accepted 9 March 2015

Academic Editor: Pasi A. Karjalainen

Copyright © 2015 Yan Yan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

High-level abstraction, for example, semantic representation, is vital for document classification and retrieval. However, how to learn document semantic representation is still an open question in information retrieval and natural language processing. In this paper, we propose a new Hybrid Deep Belief Network (HDBN) which uses a Deep Boltzmann Machine (DBM) on the lower layers together with a Deep Belief Network (DBN) on the upper layers. The advantage of the DBM is that it uses undirected connections when training the weight parameters, so the states of the nodes on each layer can be sampled more accurately, and it is also an effective way to remove noise from different document representation types; the DBN then extracts a deeper abstraction of the document, so that the model learns a sufficient semantic representation. We also explore different input strategies for distributed semantic representation. Experimental results show that our model performs better when it uses word embeddings rather than single words as input.

#### 1. Introduction

Semantic representation [1–3] is very important in document classification and document retrieval tasks. Currently the main representation method is bag-of-words, but it contains only word frequency information, which is too shallow to represent a document adequately. Therefore, many researchers have begun to explore deeper representations. LSI [4] and pLSI [5] are two dimension reduction methods: LSI applies SVD (Singular Value Decomposition) to a document vector matrix and remaps it into a semantic space smaller than the original one, and pLSI is its probabilistic counterpart. But these methods can still capture only very limited relations between words. Blei et al. [6] proposed Latent Dirichlet Allocation (LDA), which extracts document topics and has shown superior performance over LSI and pLSI. This method is popular in the field of topic modeling and is also considered a good method for reducing dimensions. But it has some disadvantages: the semantic features it learns are not sufficient to represent the documents, exact inference in the directed model is intractable [7, 8], and it cannot properly deal with documents of different lengths.
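As an illustration of the SVD-based dimension reduction described above, the following is a minimal sketch of LSI on a bag-of-words matrix. The function name `lsi_embed` and the toy count matrix are assumptions made for this example; they do not come from the cited work.

```python
import numpy as np

def lsi_embed(doc_term, k):
    """Project documents onto the k strongest latent dimensions (hypothetical helper)."""
    # Thin SVD: doc_term = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    # Each row of the result is one document in the reduced semantic space.
    return U[:, :k] * s[:k]

# Toy bag-of-words counts: 4 documents x 5 vocabulary terms (made-up data).
counts = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

docs_2d = lsi_embed(counts, k=2)
print(docs_2d.shape)  # (4, 2)
```

In this toy matrix the first two documents share one set of terms and the last two share another, so documents within a pair end up close together in the reduced space while documents from different pairs are nearly orthogonal.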

More recently, deep learning [9] has emerged as a new branch of machine learning research, and it greatly enhances the semantic representation of documents. Some researchers have started work in this direction. For example, Hinton and Salakhutdinov proposed a two-layer undirected graphical model [10] called the “Replicated Softmax model” (RSM) to explore the use of basic deep learning methods for representing document information, and it obtained better results than LDA. Larochelle proposed “Doc Neural Autoregressive Distribution Estimation” (DocNADE), which was inspired by RSM and is similar to an autoencoder neural network [8, 11]. In this model the input vector of observations has the same size as the output vector. This work showed that DocNADE is competitive not only as a generative model of documents but also as a learning algorithm for extracting meaningful document representations. The high-level abstractions learned by these deep network models generalize better to unseen data than probabilistic topic models [12]. However, these methods use weight sharing and have only two layers, which is not enough to learn a deeper representation. Because a large quantity of document information is lost in the dimension reduction process, the high-level representations that these models learn for different documents differ little from one another, which did not lead to very good results in document classification and retrieval. In these tasks, the most important factor, and the only one actually used, is the high-level distributed vector representation of the document produced by the model, so we must find a vector that captures most of the information in the current document.

To address the above disadvantages, in this paper we propose a new model called Hybrid Deep Belief Network (HDBN), which improves on the DBN (Deep Belief Network) [13]. First, we use a two-layer DBM (Deep Boltzmann Machine) [14] in the lower layers of the HDBN model to reduce the dimension and remove noise, extracting an abstraction of the document while preserving most of the document’s information, which is an effective way to improve performance. Second, we apply the dimension reduction of the DBN again to obtain a deeper document representation at the dimension of the output nodes.

Then, we perform a variety of experimental studies on learning document semantic representation with HDBN. In the first part of our experiments, we compare our HDBN model with the RSM, DocNADE, and regular DBN models, and the results show that HDBN achieves better document classification and retrieval performance on two datasets. In order to get an even better semantic representation of a document based on deep learning, we also explore the effects of different inputs on the model. In the second part of our experiments, we conduct several experiments on semantic representation, and the results are elaborated in Section 4.3.

#### 2. Hybrid Deep Belief Network

Deep learning is based on distributed representations, and different numbers and sizes of layers can be used to provide different amounts of abstraction. Higher levels are built on lower levels, and these models are often trained with a greedy layer-by-layer method. Training as a whole consists of pretraining and fine-tuning processes, which help the model discover high-level abstractions. In this section, we first describe two notable deep learning methods, the Deep Belief Network (DBN) and the Deep Boltzmann Machine (DBM). Then we introduce our improved deep learning model, HDBN, and its training method.

##### 2.1. Deep Belief Network

Hinton and Salakhutdinov [9] introduced a moderately fast, unsupervised learning algorithm for deep models called Deep Belief Networks (DBN). A DBN can be viewed as a stack of Restricted Boltzmann Machines (RBMs), each containing visible units and hidden units. The visible units represent the document data and the hidden units represent features learned from the visible units. A Restricted Boltzmann Machine [15] is a generative neural network that can learn a probability distribution over its set of inputs. An RBM is a kind of Boltzmann Machine in which every visible unit is connected to every hidden unit, while there are no connections within the visible layer or within the hidden layer.
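For reference, the standard binary RBM can be written compactly as below. The notation is an assumption introduced here rather than taken from the text: visible vector $\mathbf{v}$, hidden vector $\mathbf{h}$, weight matrix $W$, visible biases $\mathbf{a}$, hidden biases $\mathbf{b}$, and logistic function $\sigma(x) = 1/(1 + e^{-x})$.

```latex
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h},
\qquad
p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big),
\qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big)
```

Because there are no connections within a layer, each conditional factorizes over units, so a whole layer can be sampled in one step given the other layer.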

Each RBM layer can capture strong correlations between its hidden features and the layer below. An RBM can be used as a feature extractor. After successful learning, an RBM gets a closed-form representation from the training data. In the training process, Gibbs samples are useful for obtaining an estimator of the log-likelihood gradient. An RBM is composed of both visible units and hidden units. When the visible units are clamped to the observed input vector, we first sample the hidden units from the visible units and then sample new visible units from those hidden units by Gibbs sampling. Although Gibbs sampling yields an approximation of the gradient of the log-likelihood function with respect to the unknown parameters, it typically needs a large number of sampling steps, which makes RBM training inefficient, especially when the observed data is high-dimensional. Hinton proposed k-step Contrastive Divergence (CD-k), which has become a fast algorithm for training RBMs [12, 13]. The surprising empirical result is that even when k = 1 (CD-1), it still gives good results. Contrastive Divergence has been used as a successful update rule to approximate the log-likelihood gradient in training RBMs [16, 17]. With the Contrastive Divergence algorithm, we can improve the efficiency of model training.
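The Contrastive Divergence update can be sketched as follows for a binary RBM. This is a minimal illustration under stated assumptions, not the training code used in this paper: the function name `cd_k_update`, the learning rate, the layer sizes, and the random toy data are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, a, b, k=1, lr=0.1, rng=None):
    """One CD-k update for a binary RBM; v0 has shape (n_samples, n_visible)."""
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities with the visible units clamped to the data.
    ph0 = sigmoid(v0 @ W + b)
    vk, phk = v0, ph0
    for _ in range(k):
        # One Gibbs step: sample hidden states, then resample the visible units.
        h = (rng.random(phk.shape) < phk).astype(float)
        pv = sigmoid(h @ W.T + a)
        vk = (rng.random(pv.shape) < pv).astype(float)
        phk = sigmoid(vk @ W + b)
    n = v0.shape[0]
    # Gradient estimate: data statistics minus statistics after k Gibbs steps.
    W = W + lr * (v0.T @ ph0 - vk.T @ phk) / n
    a = a + lr * (v0 - vk).mean(axis=0)
    b = b + lr * (ph0 - phk).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
v = (rng.random((8, 6)) < 0.5).astype(float)   # made-up binary inputs
W = 0.01 * rng.standard_normal((6, 3))         # 6 visible units, 3 hidden units
a, b = np.zeros(6), np.zeros(3)
W, a, b = cd_k_update(v, W, a, b, k=1, rng=rng)
print(W.shape, a.shape, b.shape)  # (6, 3) (6,) (3,)
```

Setting `k=1` gives CD-1; larger values run a longer Gibbs chain per update at a proportionally higher cost.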

##### 2.2. Deep Boltzmann Machine

Deep Boltzmann Machine [14, 18, 19] is a network of symmetrically coupled stochastic binary units, and it is also composed of RBMs. It contains a set of visible units and hidden units. Unlike the DBN model, all connections between layers in the DBM model are undirected. DBM has several advantages: it discovers and retains layered representations of the input with an efficient pretraining procedure; it can be trained on unlabeled data; and the parameters of all layers can be optimized jointly under the likelihood function. However, DBM has the disadvantage that training time grows exponentially with the size of the machine and the number of connected layers, which makes large-scale learning of DBM models difficult. We therefore use the DBM only in the lower layers, to reduce the document dimension and remove noise, and then continue training with the DBN model. This ensures good feature extraction from the document while reducing the training time when the model needs several layers.

##### 2.3. HDBN Model

###### 2.3.1. Principle Analysis

A DBM composed of two RBM layers can learn a better representation because, during parameter training, the state of each hidden-layer node is determined jointly by the lower and higher layers that are directly connected to its layer. This is the defining characteristic of an undirected graphical model and the motivation for using DBM training in our model. In addition, we analyzed the data structure of the documents and found that using a DBM can remove noise introduced by the document input. In terms of overall effect, however, a DBM with more than two layers does not perform as well as a DBN with more than two layers. One reason is that when using DBN to train, the parameters are prone to overfitting; another reason is that DBM training has much higher complexity than DBN training, and its training time is three times that of DBN [14].

In summary, the directed graph model (DBN) has some limitations, since it uses only the nodes of the previous layer. The DBM model proposed by Salakhutdinov and Hinton [19] instead uses an undirected graph model. A DBM uses the nodes of both the previous layer and the next layer, so the sampling of hidden nodes is more accurate. Figure 1 shows the construction and learning of hidden layer nodes in DBM and DBN, together with the approximate posterior distribution used there. For DBN, nodes are constructed as in directed graphs and depend only on the previous layer. For DBM, however, nodes are constructed and learned as in undirected graphs. We hope our proposed hybrid model (HDBN) can combine the advantages of both directed and undirected graph models.
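The difference between the two ways of computing a hidden layer can be sketched as follows. This is a simplified illustration that assumes binary units and the standard sigmoid conditionals; the function names `h1_prob_dbn` and `h1_prob_dbm`, the layer sizes, and the random weights are hypothetical and are not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h1_prob_dbn(v, W1, b1):
    """Bottom-up pass: the first hidden layer sees only the layer below."""
    return sigmoid(v @ W1 + b1)

def h1_prob_dbm(v, h2, W1, W2, b1):
    """Two-layer DBM conditional: the first hidden layer sees both neighbouring layers."""
    return sigmoid(v @ W1 + h2 @ W2.T + b1)

rng = np.random.default_rng(0)
v = (rng.random(6) < 0.5).astype(float)    # visible units (made-up data)
h2 = (rng.random(3) < 0.5).astype(float)   # second hidden layer states
W1 = rng.standard_normal((6, 4))           # visible -> first hidden layer
W2 = rng.standard_normal((4, 3))           # first -> second hidden layer
b1 = np.zeros(4)

p_dbn = h1_prob_dbn(v, W1, b1)
p_dbm = h1_prob_dbm(v, h2, W1, W2, b1)
print(p_dbn.shape, p_dbm.shape)  # (4,) (4,)
```

The only difference is the top-down term `h2 @ W2.T`: when the layer above contributes nothing (all zeros), the DBM conditional reduces to the bottom-up one.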