Computational Intelligence and Neuroscience

Volume 2018, Article ID 6401645, 9 pages

https://doi.org/10.1155/2018/6401645

## A Reweighted Scheme to Improve the Representation of the Neural Autoregressive Distribution Estimator

School of Mathematical Sciences, Zhejiang University, Hangzhou, Zhejiang, China

Correspondence should be addressed to Qingbiao Wu; qbwu@zju.edu.cn

Received 21 April 2018; Revised 29 October 2018; Accepted 21 November 2018; Published 23 December 2018

Academic Editor: Saeid Sanei

Copyright © 2018 Zheng Wang and Qingbiao Wu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The neural autoregressive distribution estimator (NADE) is a competitive model for the task of density estimation in the field of machine learning. While NADE mainly focuses on the problem of estimating density, its ability to handle other tasks leaves room for improvement. In this paper, we introduce a simple and efficient reweighted scheme to modify the parameters of a learned NADE. We make use of the structure of NADE, and the weights are derived from the activations in the corresponding hidden layers. The experiments show that the features obtained from unsupervised learning with our reweighted scheme are more meaningful, and that using them to initialize neural networks yields a significant improvement as well.

#### 1. Introduction

Feature learning is one of the most important tasks in the field of machine learning. A meaningful feature representation can be the foundation of subsequent procedures. Among the various methods, the restricted Boltzmann machine (RBM), a powerful generative model, has shown its ability to learn useful representations from many different types of data [1, 2].

RBM models the higher-order correlations between dimensions of the input. It is often used as a feature extractor or as the building block of various deep models, for instance, deep belief nets. In the latter case, the learned representations are fed to another RBM in a higher layer, and the deep architecture often leads to better performance in many fields [3–5]. Its variants [6–8] also have the capability to deal with various kinds of tasks.

While RBM has many advantages, it is not well suited to the problem of estimating distributions, in other words, estimating the joint probability of an observation. To estimate the joint probability of a given observation, a normalization constant must be computed, which is intractable even for a moderately sized input. To deal with this problem, other methods must be used to approximate the normalization constant, for example, annealed importance sampling [9, 10], which is complex and computationally costly.

The neural autoregressive distribution estimator (NADE) [11] is a powerful model for estimating the distribution of data, which is inspired by the mean-field procedure of RBM. Computing the joint probability under NADE can be done exactly and efficiently. NADE and its variants [12–17] have been shown to be state-of-the-art joint density models for a variety of datasets.

While NADE mainly focuses on the distribution of the data, it also can be regarded as an alternative model to extract features from data.

Reweighting approaches have achieved a lot in the field of machine learning. In some ensemble learning models, such as AdaBoost [18], the importance of each sample in the dataset is reweighted to achieve better results. In some deep generative models, reweighting approaches have been proposed to adjust the importance weights in the importance sampling procedure [19, 20]. With these reweighting approaches, the estimation of the gradients becomes more accurate.

In this paper, we deal with the features learned by NADE and propose a novel method to improve the quality of the representation via a simple reweighted scheme applied to the weights learned by NADE. The proposed method preserves the structure of the model, and the computation remains simple and tractable.

The remainder of the paper is structured as follows. In Section 2, we review the architectures of RBM and NADE, which are the foundation of our method and experiments. In Section 3, we introduce and analyze the reweighted scheme to improve the quality of features learned by NADE. In Section 4, we present a similar method for the case of initialization. We provide the experimental evaluation and demonstrate the results in Section 5. Finally, we conclude in Section 6.

#### 2. Review of RBM and NADE

In this section, we review the basic RBM model and emphasize the relationship between RBM and NADE.

A restricted Boltzmann machine is a kind of Markov random field that contains one layer of visible units $\mathbf{v}$ and one layer of hidden units $\mathbf{h}$. The two layers are fully connected to each other, and there are no connections within a layer.

The energy of the state $(\mathbf{v}, \mathbf{h})$ is defined as
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^{\top} W \mathbf{v} - \mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h},$$
where $W$ is the matrix of connecting weights between the layers and $\mathbf{b}$, $\mathbf{c}$ are the biases of the visible and hidden layers, respectively.

The probability of a visible state $\mathbf{v}$ is
$$p(\mathbf{v}) = \frac{1}{Z}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})},$$
where $Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$ is the normalization constant.

Due to the intractability of the normalization constant, RBM is less competitive in the task of estimating distribution.
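To make this intractability concrete, the toy sketch below (sizes and parameters are illustrative) computes $Z$ exactly by summing out the hidden units analytically and then enumerating all $2^D$ visible states; the enumeration is only feasible because $D = 8$ here, and it doubles with every added input dimension.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy RBM with D visible and H hidden binary units (illustrative sizes).
D, H = 8, 5
W = rng.normal(scale=0.1, size=(H, D))  # connecting weights
b = np.zeros(D)                         # visible biases
c = np.zeros(H)                         # hidden biases

def free_energy(v):
    """-log sum_h exp(-E(v, h)): the sum over h factorizes analytically."""
    return -b @ v - np.sum(np.logaddexp(0.0, c + W @ v))

# Exact normalization constant: enumerate all 2^D visible states.
# This brute-force loop is exactly what becomes intractable for realistic D.
log_Z = np.logaddexp.reduce(
    [-free_energy(np.array(u)) for u in itertools.product([0, 1], repeat=D)]
)

v = rng.integers(0, 2, size=D)
log_p = -free_energy(v) - log_Z   # exact log-probability of one observation
print(f"log p(v) = {log_p:.4f}")
```

The analytic sum over $\mathbf{h}$ removes one exponential factor, but the remaining sum over visible states still grows as $2^D$, which is why RBM log-likelihoods are usually approximated, e.g., with annealed importance sampling.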

For a given observation, the distribution can be written as
$$p(\mathbf{v}) = \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i}),$$
where $\mathbf{v}_{<i}$ denotes the subvector of the observation before the *i*-th dimension. To evaluate the conditional distribution $p(v_i \mid \mathbf{v}_{<i})$, a factorial distribution $q(\mathbf{v}_{\geq i}, \mathbf{h} \mid \mathbf{v}_{<i})$ is used to approximate $p(\mathbf{v}_{\geq i}, \mathbf{h} \mid \mathbf{v}_{<i})$:
$$q(\mathbf{v}_{\geq i}, \mathbf{h} \mid \mathbf{v}_{<i}) = \prod_{k \geq i} \tau_k(i)^{v_k}\bigl(1 - \tau_k(i)\bigr)^{1 - v_k} \prod_{j} \mu_j(i)^{h_j}\bigl(1 - \mu_j(i)\bigr)^{1 - h_j}.$$

The minimization of the KL divergence between these two distributions leads to two important equations:
$$\mu_j(i) = \sigma\Bigl(c_j + \sum_{k < i} W_{jk}\, v_k + \sum_{k \geq i} W_{jk}\, \tau_k(i)\Bigr),$$
$$\tau_k(i) = \sigma\Bigl(b_k + \sum_{j} W_{jk}\, \mu_j(i)\Bigr),$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.

The main structure of NADE is inspired by this mean-field procedure [21] and results in the following equations:
$$p(v_i = 1 \mid \mathbf{v}_{<i}) = \sigma\bigl(b_i + (W^{\top})_{i,\cdot}\, \mathbf{h}_i\bigr), \qquad \mathbf{h}_i = \sigma\bigl(\mathbf{c} + W_{\cdot,<i}\, \mathbf{v}_{<i}\bigr),$$
where $(W^{\top})_{i,\cdot}$ represents the *i*-th row of the transpose of the matrix $W$ and $W_{\cdot,<i}$ represents the first *i*−1 columns of the matrix $W$, which connects the input with the corresponding hidden layers.

These two equations indicate that NADE acts like a feed-forward neural network, and the training of NADE can be cast into the same framework as a common neural network, with the average negative log-likelihood of the training set as the cost function. The gradient of the cost function with respect to each parameter can be derived exactly by backpropagation, and the cost function can be minimized with simple stochastic gradient descent. In contrast, the gradient with respect to each parameter in RBM must be approximated by sampling from Markov chains [22–27]. Experiments have shown that NADE often outperforms other models in the task of estimating distributions, while its performance in some other tasks, such as the unsupervised learning of features and the initialization of neural networks, is less satisfactory. In this paper, we mainly deal with these two problems.

#### 3. A Reweighted Scheme for Features

The features are entirely determined by the learned weights $W$ and biases $\mathbf{c}$, whether in RBM or NADE. To improve the features, we modify the corresponding parameters learned by the model while keeping the structure of NADE.

A direct idea is to take advantage of the conditional probabilities computed by NADE. Consider the probability of one dimension of the input conditioned on all the other dimensions; to measure the importance of the specified dimension, we clamp the states of the other dimensions and simply compare the probabilities of the two cases as follows:
$$s_i = \frac{p(v_i = 1 \mid \mathbf{v}_{-i})}{p(v_i = 0 \mid \mathbf{v}_{-i})},$$
where $\mathbf{v}_{-i}$ denotes all dimensions of the observation except the *i*-th.

In this case, we define $s_i$ as the weight score for the *i*-th dimension of the input. A large or small value of $s_i$ indicates that the probabilities of the two cases vary drastically, and we should pay more attention to this specific dimension. However, this reweighted scheme never works in practice because of the huge amount of computation: for each dimension of every observation, two feed-forward procedures must be computed, which is impractical.

To deal with the problem, we approximate the conditional probabilities $p(v_i = 1 \mid \mathbf{v}_{-i})$ and $p(v_i = 0 \mid \mathbf{v}_{-i})$ by the fixed-order conditional probabilities $p(v_i = 1 \mid \mathbf{v}_{<i})$ and $p(v_i = 0 \mid \mathbf{v}_{<i})$, which are compatible with the original structure of NADE. This approximation drastically reduces the cost of computation, roughly by a factor of *D*, the size of the input, since all the fixed-order conditionals are obtained from a single feed-forward pass.
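Under this approximation, all *D* weight scores fall out of one forward pass. The helper below is a hypothetical sketch (the function name and the odds-ratio form of $s_i$ are our assumptions), reusing NADE's running pre-activation so the whole computation stays $O(DH)$:

```python
import numpy as np

def fixed_order_scores(v, W, b, c):
    """Hypothetical sketch: s_i = p(v_i=1|v_<i) / p(v_i=0|v_<i) for all i,
    computed in a single O(DH) forward pass of a tied-weight NADE."""
    D = len(v)
    a = c.astype(float).copy()   # running pre-activation c + W[:, :i] @ v[:i]
    s = np.empty(D)
    for i in range(D):
        h_i = 1.0 / (1.0 + np.exp(-a))   # hidden layer for dimension i
        logit = b[i] + W[:, i] @ h_i     # log-odds of p(v_i = 1 | v_<i)
        s[i] = np.exp(logit)             # odds ratio p1 / (1 - p1)
        a += W[:, i] * v[i]
    return s

rng = np.random.default_rng(0)
D, H = 8, 5
W = rng.normal(scale=0.1, size=(H, D))
s = fixed_order_scores(rng.integers(0, 2, size=D), W, np.zeros(D), np.zeros(H))
print(s)
```

Note that the odds ratio equals $e^{\text{logit}}$ exactly, so no explicit division by $1 - p$ is needed.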

We further replace $s_i$ by $|\log s_i|$ to control the instability of $s_i$. Thus, we use each score to modify the corresponding weights in the matrix $W$, in other words, the *i*-th column of $W$. We thereby pay more attention to the dimensions in which the probabilities change intensely between the two cases; these dimensions should receive larger weights when generating the feature representation.

The final reweighted scheme is represented as
$$s_i = \frac{p(v_i = 1 \mid \mathbf{v}_{<i})}{p(v_i = 0 \mid \mathbf{v}_{<i})}, \tag{8}$$
$$\tilde{s}_i = \min\bigl(|\log s_i|,\; \lambda\bigr), \tag{9}$$
$$w_i = \frac{\tilde{s}_i}{\sum_{j=1}^{D} \tilde{s}_j}, \tag{10}$$
$$\hat{w}_i = D \cdot w_i, \qquad \hat{\mathbf{h}} = \sigma\Bigl(\mathbf{c} + \sum_{i=1}^{D} \hat{w}_i\, W_{\cdot,i}\, v_i\Bigr), \tag{11}$$
where $\lambda$ is the threshold to control the difference, *D* is the size of the input, $\hat{w}_i$ is the weight score of the *i*-th dimension, $W_{\cdot,i}$ is the *i*-th column of $W$, and $\hat{\mathbf{h}}$ is the final reweighted feature.

In the reweighted procedure, equation (8) computes the difference between the probabilities of the two cases for each dimension. Equation (9) controls the scale of the weight, and equations (10) and (11) normalize the weights.

While this reweighted scheme seems plausible, it seldom improves the features. A possible explanation is that the reweighted score does change the activation in each dimension of the feature $\hat{\mathbf{h}}$, but it does not change the relative magnitudes of the activations, which may matter more for a better representation.

In order to resolve this problem, we prefer to deal with the rows of $W$ rather than its columns, again utilizing the structure provided by NADE. For each dimension of the input, NADE provides a corresponding hidden layer, which can be used to modify the learned features. In this case, we pay more attention to the dimensions of the hidden layers that are oversaturated or inactivated. These ideas lead to the following new reweighted scheme:
$$\bar{\mathbf{h}} = \frac{1}{D}\sum_{i=1}^{D} \mathbf{h}_i,$$
$$g_j = \begin{cases} \alpha, & \bar{h}_j < \lambda_1 \ \text{or}\ \bar{h}_j > \lambda_2, \\ \beta, & \text{otherwise}, \end{cases}$$
$$\hat{W}_{j,\cdot} = g_j\, W_{j,\cdot},$$
where *D* is the size of the input, $\mathbf{h}_i$ is the *i*-th hidden layer in NADE, $\alpha$ and $\beta$ are the relative weights, $\lambda_1$ and $\lambda_2$ are thresholds which control the value of the activation, $\bar{h}_j$ is the *j*-th unit in the normalized hidden layer $\bar{\mathbf{h}}$, and $W_{j,\cdot}$ and $\hat{W}_{j,\cdot}$ represent the *j*-th row of each matrix.

We conclude the reweighted procedure in Algorithm 1.
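As a rough illustration of the row-wise procedure, the sketch below averages each hidden unit's activation over the *D* per-dimension hidden layers and rescales the corresponding row of $W$ when the unit looks oversaturated or inactivated. The thresholds `lam_lo`, `lam_hi` and the relative weights `alpha`, `beta` are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def reweight_rows(v, W, c, lam_lo=0.1, lam_hi=0.9, alpha=0.5, beta=1.0):
    """Hedged sketch of the row-wise scheme: down-weight rows of W whose
    hidden unit is inactivated (< lam_lo) or saturated (> lam_hi) on average."""
    D, H = len(v), len(c)
    a = c.astype(float).copy()
    mean_h = np.zeros(H)
    for i in range(D):
        mean_h += 1.0 / (1.0 + np.exp(-a))   # accumulate hidden layer h_i
        a += W[:, i] * v[i]
    mean_h /= D                              # average activation per hidden unit
    g = np.where((mean_h < lam_lo) | (mean_h > lam_hi), alpha, beta)
    W_hat = W * g[:, None]                   # scale the j-th row by g_j
    return 1.0 / (1.0 + np.exp(-(c + W_hat @ v)))  # reweighted feature

rng = np.random.default_rng(0)
D, H = 8, 5
W = rng.normal(scale=0.1, size=(H, D))
h_hat = reweight_rows(rng.integers(0, 2, size=D), W, np.zeros(H))
print(h_hat)
```

Unlike the column-wise variant, scaling whole rows changes the relative magnitudes of the hidden activations, which is precisely the property the column scheme was missing.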