Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 267478, 12 pages

http://dx.doi.org/10.1155/2015/267478

## Topologically Ordered Feature Extraction Based on Sparse Group Restricted Boltzmann Machines

^{1}School of Computer Science and Technology, Wuhan University of Technology, 122 Luoshi Road, Wuhan 430070, China

^{2}State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 129 Luoyu Road, Wuhan 430079, China

^{3}Engineering Research Center for Spatio-Temporal Data Smart Acquisition and Application, Ministry of Education of China, Wuhan University, 129 Luoyu Road, Wuhan 430079, China

^{4}Institute of Information Technology, Luoyang Normal University, 71 Luolong Road, Luoyang 471022, China

Received 20 March 2015; Revised 28 July 2015; Accepted 9 September 2015

Academic Editor: Panos Liatsis

Copyright © 2015 Zhong Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Extracting topologically ordered features efficiently from high-dimensional data is an important problem in unsupervised feature learning for deep learning. To address this problem, we propose a new type of regularization for Restricted Boltzmann Machines (RBMs). By adding two extra terms to the log-likelihood function, which penalize the group weights and the topologically ordered factors, this regularization extracts topologically ordered features based on sparse group Restricted Boltzmann Machines (SGRBMs). Because its formulation combines group weight-decay and topologically ordered factor regularization, it encourages an RBM to learn a much smoother probability distribution. We apply the proposed regularization scheme to datasets of natural images and of Flying Apsara images in the Dunhuang Grotto Murals from four different historical periods. The experimental results demonstrate that combining these two extra terms in the log-likelihood function helps to extract more discriminative features with much sparser and more aggregative hidden activation probabilities.

#### 1. Introduction

Restricted Boltzmann Machines (RBMs) [1] are a type of product of experts model [2] based on Boltzmann Machines [3] but with a complete bipartite interaction graph. In general, RBMs, which are used as generative models to simulate input distributions of binary data [4], are viewed as an effective feature-representation approach for extracting structured information from input data. They have received much attention recently and have been successfully applied in various application domains, such as dimensionality reduction [5], object recognition [6], topic modeling [7], and feature learning [8]. In addition, RBMs have attracted much attention as building blocks for the multilayer learning systems (e.g., Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs)), and variants and extensions of RBMs have a great many applications in a wide range of feature learning and pattern recognition tasks.

Because of their arbitrary connectivity, general Boltzmann Machines are too slow to be practical. To obtain efficient and exact inference, RBMs impose the restriction that there are no visible-visible or hidden-hidden connections, which makes inference in RBMs much easier than in Boltzmann Machines [9]. As a result, the hidden units are conditionally independent given the visible units, and a more powerful learning model can be built [10]. Lee et al. [11] proposed sparse RBMs (SRBMs) after pointing out that RBMs tend to learn distributed, nonsparse representations as the number of hidden units increases; accordingly, they added a regularization term that penalizes the deviation of the expected activation from a low target level, so that the hidden units are sparsely activated. Moreover, in order to group similar activations of the hidden units and capture their local dependencies, Luo et al. [12] proposed sparse group RBMs (SGRBMs), which apply a novel regularization to the activation probabilities of the hidden units. SRBMs and SGRBMs both use sparsity as a form of regularization, which makes them powerful enough to represent complicated distributions.

By introducing the regularizer into the activation probabilities of the hidden units, SGRBMs have two properties: first, the model encourages only a few groups to be active for a given observation (sparsity at the group level), and second, only a few hidden units are active within a group (sparsity within the group). However, SGRBMs do not address overfitting, because they lack a strategy for controlling the reconstruction complexity of the weight matrix. In addition, the features extracted by the hidden units are not topologically ordered (i.e., similar features are not grouped together while group sparsity is retained), although it is essential for a learning machine to obtain structured information from the input data. In 2002, Welling et al. [13] proposed learning sparse topographic representations with products of Student-t distributions and found that if the Student-t distribution is used to model the combined outputs of sets of neurally adjacent filters, then the orientation, spatial frequency, and location of the filters change smoothly across the topographic map. Later, Goh et al. [14] proposed a method for regularizing RBMs during training to obtain features that are sparse and topographically organized. The learned features are Gabor-like and demonstrate a coding for orientation, spatial position, frequency, and color that varies smoothly with the topography of the feature map. To extract invariant features with group sparsity efficiently from high-dimensional data, in this paper we first adopt a weight-decay strategy [15, 16] at the group level based on SGRBMs; second, by adding an extra term to the log-likelihood function that penalizes the topologically ordered factors, we obtain topologically ordered features at the group level.

The remaining sections of this paper are organized as follows. In Section 2, RBMs and Contrastive Divergence algorithms for RBM training are described in brief. In Section 3, a nontopologically ordered feature extraction approach is proposed to obtain sparse but not topologically ordered features between groups from the input data. In Section 4, a topologically ordered feature extraction approach is proposed to obtain structured information (i.e., sparse and topologically ordered features between the overlapping groups) from the input data. In Section 5, experimental results with two different datasets (namely, natural images and Flying Apsara images in the Dunhuang Grotto Murals) are shown to validate the proposed approach. Finally, the conclusions are in Section 6.

#### 2. Restricted Boltzmann Machines and Contrastive Divergence

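RBMs are a particular form of the Markov Random Field (MRF) model and are regarded as an undirected generative model that uses a layer of binary hidden units to model a distribution over binary visible units [17]. Suppose an RBM consists of $n$ visible units $\mathbf{v} = (v_1, \ldots, v_n)$ representing the input data and $m$ hidden units $\mathbf{h} = (h_1, \ldots, h_m)$ that capture the features of the input data. The joint probability distribution is given by the Gibbs distribution $P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$ with the energy function [17, 18]:

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^{T} W \mathbf{h} - \mathbf{b}^{T} \mathbf{v} - \mathbf{c}^{T} \mathbf{h}, \tag{1}$$

where $W$ is the $n \times m$ matrix of weights and $\mathbf{b}$ and $\mathbf{c}$ are vectors that represent the visible and hidden biases, respectively. All these are referred to as the RBM parameters $\theta = \{W, \mathbf{b}, \mathbf{c}\}$; $E(\mathbf{v}, \mathbf{h})$ is the energy function and $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ is the corresponding normalization constant. Therefore, the marginal distribution of the visible variables becomes

$$P(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \tag{2}$$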
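The energy-based definition above can be checked numerically on a very small RBM, where the normalization constant can still be computed by enumerating every binary state. The following is a minimal sketch in standard-library Python, assuming the usual bilinear RBM energy with a weight matrix stored as nested lists (rows index visible units, columns index hidden units). The function names and the toy parameter values are hypothetical and chosen only for illustration.

```python
import itertools
import math


def energy(v, h, W, b, c):
    """Energy E(v, h) = -v^T W h - b^T v - c^T h for binary vectors v and h."""
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(b[i] * v[i] for i in range(len(v)))
    hidden_bias = sum(c[j] * h[j] for j in range(len(h)))
    return -interaction - visible_bias - hidden_bias


def binary_states(size):
    """All binary vectors of the given length (2**size of them)."""
    return list(itertools.product([0, 1], repeat=size))


def partition_function(W, b, c):
    """Normalization constant Z, summed over every joint state (tiny models only)."""
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in binary_states(len(b))
               for h in binary_states(len(c)))


def marginal_visible(v, W, b, c):
    """Marginal P(v), obtained by summing the joint distribution over hidden states."""
    z = partition_function(W, b, c)
    unnormalized = sum(math.exp(-energy(v, h, W, b, c))
                       for h in binary_states(len(c)))
    return unnormalized / z


# Hypothetical toy model: 3 visible units, 2 hidden units.
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
b = [0.1, -0.2, 0.0]
c = [0.3, -0.1]

# The marginal probabilities of all visible states should sum to one.
total_probability = sum(marginal_visible(v, W, b, c) for v in binary_states(3))
```

Enumeration costs grow exponentially with the number of units, so this check is only feasible for toy models; it is meant to make the definitions concrete, not to suggest a training procedure.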

As the hidden units are independent given the states of the visible units and vice versa, when given the observed data, the conditional probabilities and conditional distributions of the hidden units are

$$P(h_j = 1 \mid \mathbf{v}) = \sigma\left(\mathbf{w}_j^{T} \mathbf{v} + c_j\right), \qquad P(\mathbf{h} \mid \mathbf{v}) = \prod_{j=1}^{m} P(h_j \mid \mathbf{v}), \tag{3}$$

where $\mathbf{w}_j$ is the $j$th column of $W$, which is a vector that represents the connection weights between the $j$th hidden unit and all visible units, and $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid activation function. Thus, the marginal distribution of the visible variables is in fact a product of experts model [2, 12]:

$$P(\mathbf{v}) = \frac{1}{Z} e^{\mathbf{b}^{T} \mathbf{v}} \prod_{j=1}^{m} \left(1 + e^{c_j + \mathbf{w}_j^{T} \mathbf{v}}\right). \tag{4}$$

Equation (4) shows that the hidden units combine multiplicatively over the components of a given data vector, each contributing a factor determined by its activation. For a given data sample, a hidden unit that is activated with high probability is responsible for representing that sample. The more samples in the training set that activate a hidden unit with high probability, the less discriminative that unit's feature becomes. It is therefore sometimes necessary to introduce sparsity at the hidden layer of an RBM [11, 19, 20].
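To make the product-of-experts form concrete, the sketch below computes the hidden activation probabilities with a sigmoid and compares two ways of obtaining the unnormalized marginal of a visible vector: the closed-form product over hidden units, and a brute-force sum over all hidden states. It assumes the standard RBM energy with weights stored as nested lists (column $j$ holds the weights of hidden unit $j$); the function names and toy values are hypothetical.

```python
import itertools
import math


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_input(v, W, c, j):
    """Total input c_j + w_j^T v to hidden unit j (w_j is column j of W)."""
    return c[j] + sum(W[i][j] * v[i] for i in range(len(v)))


def hidden_probs(v, W, c):
    """Activation probabilities P(h_j = 1 | v) for all hidden units."""
    return [sigmoid(hidden_input(v, W, c, j)) for j in range(len(c))]


def unnormalized_marginal_product(v, W, b, c):
    """exp(b^T v) * prod_j (1 + exp(c_j + w_j^T v)): the product-of-experts form."""
    result = math.exp(sum(b[i] * v[i] for i in range(len(v))))
    for j in range(len(c)):
        result *= 1.0 + math.exp(hidden_input(v, W, c, j))
    return result


def unnormalized_marginal_sum(v, W, b, c):
    """sum_h exp(-E(v, h)), computed by enumerating every hidden state."""
    total = 0.0
    for h in itertools.product([0, 1], repeat=len(c)):
        negative_energy = (
            sum(b[i] * v[i] for i in range(len(v)))
            + sum(h[j] * hidden_input(v, W, c, j) for j in range(len(c)))
        )
        total += math.exp(negative_energy)
    return total


# Hypothetical toy model: 3 visible units, 2 hidden units.
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
b = [0.1, -0.2, 0.0]
c = [0.3, -0.1]
v = [1, 0, 1]

probs = hidden_probs(v, W, c)
by_product = unnormalized_marginal_product(v, W, b, c)
by_sum = unnormalized_marginal_sum(v, W, b, c)
```

The two unnormalized marginals agree because the sum over binary hidden states factorizes unit by unit, which is exactly why each hidden unit acts as one multiplicative expert.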

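For a training example $\mathbf{v}^{(l)}$, training an RBM is the same as modeling the marginal distribution $P(\mathbf{v})$ of the visible units. A common practice is to adopt the log-likelihood gradient approach [16, 18] to maximize the marginal probability $P(\mathbf{v}^{(l)})$, which aims to generate $\mathbf{v}^{(l)}$ with the largest probability. This problem can be solved by gradient ascent on the log-likelihood:

$$\frac{\partial \log P(\mathbf{v}^{(l)})}{\partial \theta} = -\sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v}^{(l)}) \frac{\partial E(\mathbf{v}^{(l)}, \mathbf{h})}{\partial \theta} + \sum_{\mathbf{v}, \mathbf{h}} P(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}. \tag{5}$$

The second term of (5) is intractable because it is an expectation under the model distribution, and the marginal distribution $P(\mathbf{v})$ cannot be computed efficiently. In order to solve this problem, Hinton et al. [21] proposed Contrastive Divergence (CD) learning, which has become a standard way to train RBMs. The $k$-step contrastive divergence learning (CD-$k$) proceeds in two steps: first, the Gibbs chain is initialized with a training example $\mathbf{v}^{(0)}$ of the training set. Second, the sample $\mathbf{v}^{(k)}$ is obtained after $k$ steps, where each step $t$ consists of sampling $\mathbf{h}^{(t)}$ from $P(\mathbf{h} \mid \mathbf{v}^{(t)})$ and subsequently sampling $\mathbf{v}^{(t+1)}$ from $P(\mathbf{v} \mid \mathbf{h}^{(t)})$. According to general Markov Chain Monte Carlo (MCMC) theory, when $k \to \infty$ the $k$-step contrastive divergence estimate converges to the second term of (5), as analyzed by Bengio and Delalleau [22]. However, Hinton [16] pointed out that when the chain is initialized with a training example, running one-step ($k = 1$) Gibbs sampling already approximates this term in the log-likelihood gradient relatively well. Therefore, (5) can be approximated as

$$\frac{\partial \log P(\mathbf{v}^{(0)})}{\partial \theta} \approx -\sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v}^{(0)}) \frac{\partial E(\mathbf{v}^{(0)}, \mathbf{h})}{\partial \theta} + \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v}^{(k)}) \frac{\partial E(\mathbf{v}^{(k)}, \mathbf{h})}{\partial \theta}. \tag{6}$$

Thus, the iterative update of the $j$th column of the weight matrix for the training example $\mathbf{v}^{(0)}$ is

$$\mathbf{w}_j \leftarrow \mathbf{w}_j + \varepsilon \left( P(h_j = 1 \mid \mathbf{v}^{(0)})\, \mathbf{v}^{(0)} - P(h_j = 1 \mid \mathbf{v}^{(k)})\, \mathbf{v}^{(k)} \right), \tag{7}$$

where $\varepsilon$ is the learning rate. The first term of (7) decreases the energy of $\mathbf{v}^{(0)}$ [23]; at the same time, it makes hidden unit $j$ more likely to be activated when it observes $\mathbf{v}^{(0)}$ again, which means that the hidden units are learning to represent $\mathbf{v}^{(0)}$ [12]. In the next section, we use a weight-decay strategy at the group level for SGRBMs to capture features with group sparsity from the input data.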
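The one-step procedure described above can be sketched as follows. This is a minimal illustration in standard-library Python, not the authors' implementation: it assumes binary units, the standard visible conditional $P(v_i = 1 \mid \mathbf{h}) = \sigma(b_i + \sum_j W_{ij} h_j)$, and it updates only the weights (bias updates are omitted). The function names, the toy dimensions, and the learning rate are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j (column j of W)."""
    return [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i (row i of W)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample_binary(probs, rng):
    """Draw independent Bernoulli samples with the given probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_weight_update(v0, W, b, c, learning_rate, rng):
    """Return new weights after one CD-1 step started at training example v0."""
    # Positive phase: hidden activation probabilities given the data.
    p_h0 = hidden_probs(v0, W, c)
    # One Gibbs step: sample h from P(h | v0), then v1 from P(v | h).
    h0 = sample_binary(p_h0, rng)
    v1 = sample_binary(visible_probs(h0, W, b), rng)
    # Negative phase: hidden activation probabilities given the reconstruction.
    p_h1 = hidden_probs(v1, W, c)
    # Column-wise update: w_j += lr * (P(h_j=1|v0) * v0 - P(h_j=1|v1) * v1).
    return [[W[i][j] + learning_rate * (p_h0[j] * v0[i] - p_h1[j] * v1[i])
             for j in range(len(c))]
            for i in range(len(b))]


# Hypothetical toy setup: 3 visible units, 2 hidden units, zero-initialized.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
v0 = [1, 0, 1]

W_new = cd1_weight_update(v0, W, b, c, learning_rate=0.1, rng=rng)
```

Using the activation probabilities rather than sampled hidden states in the update terms is a common variance-reduction choice; sampled states are still needed to drive the Gibbs step itself.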

#### 3. Nontopologically Ordered Feature Extraction Based on Sparse Group RBMs

In the unsupervised learning process, some hidden units may extract similar features if there is little difference between their corresponding weight vectors. This homogenization problem becomes more pronounced as the number of hidden units increases. To alleviate it, Lee et al. [11] introduced SRBMs and Luo et al. [12] introduced SGRBMs, both of which add a penalty term to reduce the statistical dependencies among the hidden units. SRBMs have been popular because an RBM with a low average hidden activation probability is better at extracting discriminative features than a nonregularized RBM [8, 24]. Luo et al. [12] went further: they divided the hidden units equally into nonoverlapping groups to restrain the dependencies within these groups and penalized the overall activation level of each group. More discriminative features are learned when SGRBMs are applied to deep learning systems for classification tasks. However, Luo et al. [12] did not consider overfitting and did not propose a strategy for controlling the reconstruction complexity of the weight matrix. Thus, to balance the reconstruction error (i.e., the learning accuracy on specific training samples) against the reconstruction complexity (i.e., the generalization ability), we use a weight-decay strategy at the group level based on SGRBMs to capture features with sparse grouping of the input data.

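For an RBM with $m$ hidden units, let $H = \{1, 2, \ldots, m\}$ denote the set of all indices of the hidden units. The $k$th group is denoted by $G_k$, where $G_k \subset H$, $k = 1, \ldots, K$. Suppose all groups are nonoverlapping and of equal size [12] (see Figure 1). Given a grouping $G$ and a sample $\mathbf{v}^{(l)}$, the $k$th group norm $N_k$ is given by

$$N_k = \left\| P\left(\mathbf{h}_{G_k} = 1 \mid \mathbf{v}^{(l)}\right) \right\|_2 = \sqrt{\sum_{j \in G_k} P\left(h_j = 1 \mid \mathbf{v}^{(l)}\right)^2}, \tag{8}$$

where $\|\cdot\|_2$ is the (Euclidean) $L_2$ norm of the vector comprising the activation probabilities $\{P(h_j = 1 \mid \mathbf{v}^{(l)})\}_{j \in G_k}$, which is considered as the overall activation level of the $k$th group. Given all the group norms $N_k$, the $L_1/L_2$ mixed norm is

$$\sum_{k=1}^{K} |N_k|. \tag{9}$$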
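As a sketch of the group-level quantities discussed in this section, the following standard-library Python computes the Euclidean norm of the hidden activation probabilities within each nonoverlapping, equal-size group. As an assumption following the sparse group formulation of [12], it then sums those group norms to form a mixed norm. The function names and the example activation probabilities are hypothetical.

```python
import math


def make_groups(num_hidden, group_size):
    """Split hidden indices 0..num_hidden-1 into nonoverlapping, equal-size groups."""
    if group_size <= 0 or num_hidden % group_size != 0:
        raise ValueError("num_hidden must be a positive multiple of group_size")
    return [list(range(start, start + group_size))
            for start in range(0, num_hidden, group_size)]


def group_norms(activation_probs, groups):
    """Euclidean norm of the activation probabilities inside each group."""
    return [math.sqrt(sum(activation_probs[j] ** 2 for j in group))
            for group in groups]


def mixed_norm(activation_probs, groups):
    """Sum of the group norms (assumed L1-over-L2 mixed norm, following [12])."""
    return sum(group_norms(activation_probs, groups))


# Hypothetical activation probabilities P(h_j = 1 | v) for 6 hidden units.
activation_probs = [0.6, 0.8, 0.0, 0.0, 0.3, 0.4]
groups = make_groups(num_hidden=6, group_size=2)

norms = group_norms(activation_probs, groups)
penalty = mixed_norm(activation_probs, groups)
```

In this toy example the middle group is entirely inactive and contributes nothing to the sum, which illustrates why penalizing the sum of group norms pushes whole groups toward zero activation rather than only individual units.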