Zheng Wang, Qingbiao Wu, "A Reweighted Scheme to Improve the Representation of the Neural Autoregressive Distribution Estimator", Computational Intelligence and Neuroscience, vol. 2018, Article ID 6401645, 9 pages, 2018. https://doi.org/10.1155/2018/6401645
A Reweighted Scheme to Improve the Representation of the Neural Autoregressive Distribution Estimator
The neural autoregressive distribution estimator (NADE) is a competitive model for the task of density estimation in the field of machine learning. While NADE mainly focuses on the problem of estimating density, its ability to deal with other tasks remains to be improved. In this paper, we introduce a simple and efficient reweighted scheme to modify the parameters of a learned NADE. We make use of the structure of NADE, and the weights are derived from the activations in the corresponding hidden layers. The experiments show that the features from unsupervised learning with our reweighted scheme are more meaningful, and that the performance of the initialization for neural networks improves significantly as well.
1. Introduction
Feature learning is one of the most important tasks in the field of machine learning. A meaningful feature representation can serve as the foundation for subsequent procedures. Among the various methods, the restricted Boltzmann machine (RBM), which is a powerful generative model, has shown its ability to learn useful representations from many different types of data [1, 2].
RBM models the higher-order correlations between dimensions of the input. It is often used as a feature extractor or as a building block of various deep models, for instance, deep belief nets. In the latter case, the learned representations are fed to another RBM in the higher layer, and the deep architecture often leads to better performance in many fields [3–5]. Its variants [6–8] also have the capability to deal with various kinds of tasks.
While RBM has many advantages, it is not suited to the problem of estimating a distribution, in other words, estimating the joint probability of an observation. To estimate the joint probability of a given observation, a normalization constant must be computed, which is intractable even for inputs of moderate size. To deal with this problem, other methods must be used to approximate the normalization constant, for example, annealed importance sampling [9, 10], which is complex and computationally costly.
The neural autoregressive distribution estimator (NADE) [11] is a powerful model for estimating the distribution of data, which is inspired by the mean-field procedure of RBM. Computing the joint probability under NADE can be done exactly and efficiently. NADE and its variants [12–17] have been shown to be state-of-the-art joint density models for a variety of datasets.
While NADE mainly focuses on the distribution of the data, it also can be regarded as an alternative model to extract features from data.
Reweighting approaches have achieved considerable success in the field of machine learning. In some ensemble learning models, such as AdaBoost [18], the importance of each sample in the dataset is reweighted to achieve better results. In some deep generative models, reweighting approaches have been proposed to adjust the importance weights in the importance sampling procedure [19, 20]. With these approaches, the estimation of the gradients becomes more accurate.
In this paper, we deal with the features learned by NADE and propose a novel method to improve the quality of the representation via a simple reweighted scheme applied to the weights learned by NADE. The proposed method preserves the structure of the model, and the computation remains simple and tractable.
The remainder of the paper is structured as follows. In Section 2, we review the architecture of RBM and NADE, which is the foundation of our method and experiments. In Section 3, we introduce and analyze the reweighted scheme to improve the quality of features learned by NADE. In Section 4, we present a similar method for the case of initialization. We provide the experimental evaluation and present the results in Section 5. Finally, we conclude in Section 6.
2. Review of RBM and NADE
In this section, we review the basic RBM model and emphasize the relationship between RBM and NADE.
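A restricted Boltzmann machine is a kind of Markov random field that contains one layer of visible units $\mathbf{v}$ and one layer of hidden units $\mathbf{h}$. The two layers are connected with each other, and there are no intralayer connections.
The energy of the state $(\mathbf{v}, \mathbf{h})$ is defined as
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^{\top} W \mathbf{v} - \mathbf{b}^{\top} \mathbf{v} - \mathbf{c}^{\top} \mathbf{h}, \tag{1}$$
where $W$ are the connecting weights between layers and $\mathbf{b}$ and $\mathbf{c}$ are the biases of the visible and hidden layers, respectively.
The probability of a visible state is
$$p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h})\right), \tag{2}$$
where $Z = \sum_{\mathbf{v}, \mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h})\right)$ is the normalization constant.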
Due to the intractability of the normalization constant, RBM is less competitive in the task of estimating distribution.
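To make the intractability concrete, the following is a minimal sketch rather than code from the paper. It assumes binary units and the standard RBM energy $E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^{\top} W \mathbf{v} - \mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h}$; the function names and the toy parameters are hypothetical. The hidden units are summed out analytically, so the normalization constant still requires a sum over all $2^D$ visible states.

```python
import itertools
import math
import random


def free_energy(v, W, b, c):
    """Free energy F(v) = -b.v - sum_k log(1 + exp(c_k + W_k . v)) of a binary RBM."""
    visible_term = sum(b_j * v_j for b_j, v_j in zip(b, v))
    hidden_term = 0.0
    for W_k, c_k in zip(W, c):
        pre = c_k + sum(w * v_j for w, v_j in zip(W_k, v))
        hidden_term += math.log1p(math.exp(pre))  # softplus of the pre-activation
    return -visible_term - hidden_term


def partition_function(W, b, c):
    """Exact Z by enumerating all 2^D visible states (cost grows exponentially in D)."""
    D = len(b)
    return sum(math.exp(-free_energy(v, W, b, c))
               for v in itertools.product((0, 1), repeat=D))


def rbm_probability(v, W, b, c):
    """p(v) = exp(-F(v)) / Z; only feasible here because D is tiny."""
    return math.exp(-free_energy(v, W, b, c)) / partition_function(W, b, c)


# Toy parameters (hypothetical): W is H x D, b is the visible bias, c the hidden bias.
random.seed(0)
D, H = 4, 3
W = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
b = [random.gauss(0, 0.5) for _ in range(D)]
c = [random.gauss(0, 0.5) for _ in range(H)]

total = sum(rbm_probability(v, W, b, c)
            for v in itertools.product((0, 1), repeat=D))
print(total)  # close to 1 (up to floating-point error)
```

With $D = 4$ the enumeration has 16 terms; for an input with several hundred dimensions the same sum is out of reach, which is why approximations such as annealed importance sampling are needed for RBM.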
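For a given observation $\mathbf{v}$, the distribution can be written as
$$p(\mathbf{v}) = \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i}), \tag{3}$$
where $\mathbf{v}_{<i}$ denotes the subvector of the observation before the $i$-th dimension. To evaluate the conditional distribution $p(v_i \mid \mathbf{v}_{<i})$, a factorial distribution $q(v_i, \mathbf{v}_{>i}, \mathbf{h} \mid \mathbf{v}_{<i})$ is used to approximate $p(v_i, \mathbf{v}_{>i}, \mathbf{h} \mid \mathbf{v}_{<i})$:
$$q(v_i, \mathbf{v}_{>i}, \mathbf{h} \mid \mathbf{v}_{<i}) = \prod_{j \ge i} \mu_j(i)^{v_j} \left(1 - \mu_j(i)\right)^{1 - v_j} \prod_{k} \tau_k(i)^{h_k} \left(1 - \tau_k(i)\right)^{1 - h_k}. \tag{4}$$
The minimization of the KL divergence between these two distributions leads to two important equations:
$$\tau_k(i) = \sigma\Big(c_k + \sum_{j \ge i} W_{kj}\, \mu_j(i) + \sum_{j < i} W_{kj}\, v_j\Big), \qquad \mu_j(i) = \sigma\Big(b_j + \sum_{k} W_{kj}\, \tau_k(i)\Big) \quad (j \ge i), \tag{5}$$
where $\sigma$ is the sigmoid function, $W$ is the weight matrix of the RBM, and $\mathbf{b}$ and $\mathbf{c}$ are its visible and hidden biases.
The main structure of NADE is inspired by the mean-field procedure [21] and results in the following equations:
$$p(v_i = 1 \mid \mathbf{v}_{<i}) = \sigma\left(b_i + (W^{\top})_{i,\cdot}\, \mathbf{h}_i\right), \qquad \mathbf{h}_i = \sigma\left(\mathbf{c} + W_{\cdot,<i}\, \mathbf{v}_{<i}\right), \tag{6}$$
where $(W^{\top})_{i,\cdot}$ represents the $i$-th row in the transpose of matrix $W$ and $W_{\cdot,<i}$ represents the first $i-1$ columns of matrix $W$, which connects the input with the corresponding hidden layers.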
These two equations indicate that NADE acts like a feed-forward neural network, and the training procedure of NADE can be cast into the same framework as a common neural network, with the average negative log-likelihood of the training set as the cost function. The gradient of the cost function with respect to each parameter can be derived exactly by backpropagation, and the cost function can be minimized using simple stochastic gradient descent. In contrast, the gradient with respect to each parameter in RBM must be approximated by sampling from Markov chains [22–27]. Experiments have shown that NADE often outperforms other models in the task of estimating distribution, while its performance in some other tasks, such as the unsupervised learning of features and the initialization of neural networks, is less impressive. In this paper, we mainly deal with these two problems.
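The feed-forward view can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it assumes binary inputs and a tied-weight NADE with an $H \times D$ matrix `W`, visible bias `b`, and hidden bias `c`; all names are hypothetical. The pre-activation of the hidden layer is updated incrementally, so one pass over the $D$ dimensions costs $O(DH)$.

```python
import itertools
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nade_log_likelihood(v, W, b, c):
    """Return (log p(v), [h_1, ..., h_D]) for a tied-weight NADE on binary input v."""
    H, D = len(c), len(b)
    a = list(c)  # a = c + W[:, :i] v[:i], updated in place as i grows
    log_p = 0.0
    hidden_layers = []
    for i in range(D):
        h_i = [sigmoid(a_k) for a_k in a]          # hidden layer for dimension i
        hidden_layers.append(h_i)
        p_i = sigmoid(b[i] + sum(W[k][i] * h_i[k] for k in range(H)))
        log_p += math.log(p_i) if v[i] == 1 else math.log(1.0 - p_i)
        for k in range(H):                          # add column i once v_i is known
            a[k] += W[k][i] * v[i]
    return log_p, hidden_layers


# Toy parameters (hypothetical) to check that the model is normalized.
random.seed(0)
D, H = 5, 3
W = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
b = [random.gauss(0, 0.5) for _ in range(D)]
c = [random.gauss(0, 0.5) for _ in range(H)]

total = sum(math.exp(nade_log_likelihood(v, W, b, c)[0])
            for v in itertools.product((0, 1), repeat=D))
print(total)  # close to 1 (up to floating-point error)
```

Unlike the RBM case, no normalization constant appears: each conditional is a properly normalized Bernoulli distribution, so the product over dimensions is an exact joint probability.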
3. A Reweighted Scheme for Features
The features are totally determined by the learned weight $W$ and the bias $\mathbf{c}$, whether in RBM or NADE. To improve the features, we try to modify the corresponding parameters learned by the model while keeping the structure of NADE.
A direct idea is to take advantage of the conditional probability computed by NADE. Consider the probability of one dimension of the input conditioned on the other dimensions; to measure the importance of the specified dimension, we clamp the states of the other dimensions and simply compare the probabilities of two cases as follows:
We define the resulting quantity as the weight score for the i-th dimension of the input. A large or small value of the score indicates that the probabilities of these two cases vary drastically, and we should pay more attention to this specified dimension. This reweighted scheme is impractical because of the huge amount of computation: for each dimension of every observation, two feed-forward passes must be computed.
To deal with the problem, we approximate the conditional probabilities $p(v_i = 1 \mid \mathbf{v}_{-i})$ and $p(v_i = 0 \mid \mathbf{v}_{-i})$, where $\mathbf{v}_{-i}$ denotes all dimensions other than the i-th, by the fixed-order conditional probabilities $p(v_i = 1 \mid \mathbf{v}_{<i})$ and $p(v_i = 0 \mid \mathbf{v}_{<i})$, which is compatible with the original structure of NADE. This approximation drastically reduces the cost of computation by a factor of H, which is the size of each hidden layer.
We further replace the weight score by the difference between the two probabilities to control its instability. Thus, we use the score of each dimension to modify the corresponding weights in the matrix $W$, in other words, the $i$-th column of $W$. We pay more attention to the dimensions in which the probability changes sharply between the two cases; these dimensions should have larger weights when generating the feature representation.
The final reweighted scheme is represented aswhere is the threshold to control the difference, D is the size of input, is the weight score of the i-th dimension, is the i-th column of , and is the final reweighted feature.
In the reweighted procedure, equation (8) computes the difference between the probabilities in two cases for each dimension. Equation (9) controls the scale of the weight. Equations (10) and (11) normalize the weight.
While this reweighted scheme seems plausible, it seldom improves the features. A possible explanation is that the reweighted score does change the activation in each dimension of the feature, but it does not change the relative magnitude of the activations, which may be more important for a better representation.
In order to resolve this problem, we prefer to deal with the rows of rather than columns, and we would again utilize the structure provided by NADE. For each dimension of the input, the NADE provides a corresponding hidden layer, which could be used to modify the learned features. In this case, we pay more attention to the dimensions of the hidden layers which are over saturated or inactivated. These ideas lead to the following new reweighted scheme:where is the size of input, is the i-th hidden layer in NADE, are the relative weights, are thresholds which control the value of activation, is the j-th unit in the normalized hidden layer , and represent the j-th row for each matrix.
We summarize the reweighted procedure in Algorithm 1.
This reweighted scheme for features deserves a bit more explanation. As the activations of each hidden layer form a vector of the same size, in step one, we sum these vectors and normalize the sum to obtain the normalized hidden layer. Each of its entries is thus the average value of the activations for the corresponding dimension of the hidden layers, and it measures how activated that dimension is during the feed-forward procedure in NADE.
We then introduce two thresholds to control the activations. A unit is considered to be over saturated if its activation is larger than the upper threshold, and the corresponding dimension is given the upper weight. Similarly, we give the lower weight to the dimensions whose activations are smaller than the lower threshold. It should be noted that both weights are relative values compared with the standard value 1. In practice, the upper weight should be smaller than 1 while the lower weight should be larger than 1. We emphasize the importance of this procedure. An over saturated unit often hurts the performance, and via this step, we assign it a smaller weight to alleviate the problem. Units with activation values close to zero are considered to be inactivated; these units should be kept inactivated, and some other units may even become inactivated after reweighting. In this case, the pre-activation of the unit is negative, and a large weight for this dimension reinforces the situation. In our view, this procedure enforces the sparsity of the representation, which often leads to better performance.
We assume that the original reweighted score for each dimension is just one and normalize the reweighted scores to keep them reasonable.
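The steps described above can be sketched in code. This is a hedged illustration, not the authors' Algorithm 1: the exact equations are not reproduced here, so the sketch assumes that (a) the score of hidden unit j scales both the j-th row of the weight matrix and the j-th hidden bias, (b) normalization rescales the scores so that they average to 1, and (c) the feature is the sigmoid of the rescaled pre-activation on the full input. All function and parameter names are hypothetical; the default thresholds and weights are values reported in the experiments section.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layers(v, W, c):
    """Hidden layers h_1..h_D of a NADE, where h_i only sees v_1..v_{i-1}."""
    a = list(c)
    layers = []
    for i in range(len(v)):
        layers.append([sigmoid(a_k) for a_k in a])
        for k in range(len(c)):
            a[k] += W[k][i] * v[i]
    return layers


def reweight_scores(layers, t_low, t_up, w_low, w_up):
    """One score per hidden unit, from its average activation over the D layers."""
    D, H = len(layers), len(layers[0])
    mean_act = [sum(layer[j] for layer in layers) / D for j in range(H)]
    scores = []
    for m in mean_act:
        if m > t_up:
            scores.append(w_up)    # over saturated unit: weight below 1
        elif m < t_low:
            scores.append(w_low)   # inactivated unit: weight above 1
        else:
            scores.append(1.0)     # standard weight
    mean_score = sum(scores) / H   # assumed normalization: scores average to 1
    return [s / mean_score for s in scores]


def reweighted_feature(v, W, c, t_low=0.33, t_up=0.67, w_low=1.4, w_up=0.6):
    """Feature computed with row j of W and bias c_j both scaled by score j."""
    scores = reweight_scores(hidden_layers(v, W, c), t_low, t_up, w_low, w_up)
    feature = []
    for j, s in enumerate(scores):
        pre = c[j] + sum(W[j][i] * v[i] for i in range(len(v)))
        feature.append(sigmoid(s * pre))
    return feature, scores


# Toy example (hypothetical): unit 0 saturates, unit 1 is inactive, unit 2 is neutral.
W = [[4.0, 4.0], [-4.0, -4.0], [0.0, 0.0]]
c = [2.0, -2.0, 0.0]
feature, scores = reweighted_feature([1, 1], W, c)
print(scores)
print(feature)
```

In this toy case the saturated unit moves away from 1 and the inactive unit moves closer to 0, which matches the intended effect: a positive pre-activation is shrunk by a weight below 1, and a negative pre-activation is amplified by a weight above 1.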
Here, we emphasize that our aim is to improve the features. When we meet the problem of estimating distribution, the original weight of NADE should be utilized since the original weight is the optimal result for the maximum likelihood cost function. Our scheme is unsuitable for density estimation.
4. A Reweighted Scheme for Initialization
The weight learned by RBM or NADE can be used to initialize the weight of another neural network, which is one of the advantages of this kind of model. The resulting network may then be used in other tasks such as classification.
The reweighted scheme proposed for features is computed separately for each observation, so it cannot be applied directly to initialization, which requires a single set of parameters.
To solve this problem, we compute the reweighted score for each sample in the training set and take the average to obtain a new reweighted score for the weight matrix and bias. This procedure can be represented as
$$\bar{\mathbf{s}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{s}^{(n)},$$
where $N$ is the number of samples in the training set and $\mathbf{s}^{(n)}$ is the reweighted score vector corresponding to the $n$-th training sample.
The complete process is summarized in Algorithm 2.
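The averaging step can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 2: it assumes the per-sample score vectors have already been computed, and that the averaged score of hidden unit j scales the j-th row of the weight matrix and the j-th hidden bias. The function names and toy values are hypothetical.

```python
def average_scores(score_vectors):
    """Average the per-sample score vectors over the training set."""
    n = len(score_vectors)
    H = len(score_vectors[0])
    return [sum(s[j] for s in score_vectors) / n for j in range(H)]


def reweight_parameters(W, c, scores):
    """Return copies of W (H x D) and c (length H) with row j scaled by scores[j]."""
    new_W = [[s * w for w in row] for row, s in zip(W, scores)]
    new_c = [s * c_j for c_j, s in zip(c, scores)]
    return new_W, new_c


# Toy example (hypothetical): two training samples, two hidden units.
per_sample_scores = [[0.6, 1.4], [1.0, 1.0]]
avg = average_scores(per_sample_scores)

W = [[1.0, 2.0], [3.0, 4.0]]
c = [1.0, -1.0]
init_W, init_c = reweight_parameters(W, c, avg)
print(avg, init_W, init_c)
```

The returned `init_W` and `init_c` would then serve as the initial first-layer parameters of the classifier network, in place of the unmodified NADE parameters.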
5. Experiments
In this section, we show the experimental results on several binary datasets with the reweighted scheme for both features and initialization. For the training procedure of NADE, a fixed order of the input dimensions must be chosen at the beginning. Since experiments have shown that the ordering does not have a significant impact on the performance of NADE [11], for each dataset, the ordering is chosen independently and is kept the same during all the experiments on it. Furthermore, the hyperparameters of NADE are kept fixed while the hyperparameters of the reweighted scheme are selected. Our implementation of the NADE model is based on the code provided by Larochelle and Murray [11].
5.1. Results on Learned Features
To test whether the reweighted scheme has improved the learned features, we perform some experiments on classification.
We note that our main purpose is to evaluate the proposed reweighted scheme rather than to pursue the best classification performance, so we only use a model of moderate size to reduce the cost of computation. For each dataset, we first train a NADE and use Algorithm 1 to obtain the improved features. This procedure is applied to all the samples in the training, validation, and test sets, which yields three corresponding new sets. We then train a neural network with a single hidden layer as the classifier on the learned features. The performance is measured by the classification error rate on the test set. We further experiment on the features without the reweighted scheme to obtain a baseline for comparison. An RBM of the same size as the NADE is also trained, and its classification result is used as a reference.
We experiment on twelve different datasets from the UCI repository: Adult, Binarized-MNIST, Connect-4, Convex, DNA, Mushrooms, Newsgroups, OCR-letters, RCV1, Rectangles, SVHN, and Web. We list the details of these datasets in Table 1. The experimental results are shown in Table 2. For the reweighted scheme, we report the best result among those obtained with different hyperparameters. We find that the classification error for features with the reweighted scheme is lower than the error without reweighting, which indicates an improvement over the original features. Features from the reweighted scheme may be more meaningful.
To further verify our method, we replace the neural network classifier with SVM, RandomForest, and AdaBoost and perform additional experiments. These experiments are implemented via LIBSVM [28] and scikit-learn. The results are shown in Tables 3–5. In all the experiments, the parameters of the classifiers have been optimized by grid search on the validation set to give the best performance. The features with our reweighted scheme again outperform the original ones, which supports the effectiveness of our method.
Experimental results for different weights on the OCR-letters dataset are shown in Table 6. In this series of experiments, we first train a NADE on the dataset: the learning rate is set to 0.001, the decrease constant is set to 0, the size of the hidden layer is 100, and we use tied weights in NADE. That is, the same weight matrix is shared by the input-to-hidden and hidden-to-output connections in equation (6). Next, we keep the thresholds fixed and only modify the reweighted parameters (the upper and lower weights) to explore the performance of the reweighted scheme. The results demonstrate that the choice of weights plays a decisive role in improving the features. An unreasonable reweighted scheme often leads to a worse result than the one without the reweighted scheme. We have found that setting the lower weight larger than 1 and the upper weight smaller than 1 is a reasonable choice. As explained in the previous sections, a smaller value for the upper weight makes an over saturated unit less saturated, which is beneficial for the representation, while a larger value for the lower weight preserves the inactivated units and forces the feature to be sparse.
It should also be noted that the weights are relative to the standard weight 1. Thus, the weights must be controlled, and a weight that is too large or too small leads to poor results.
Another factor that influences the performance of the reweighted scheme is the pair of thresholds that explicitly determine whether a unit is considered saturated or inactivated. Results for different thresholds on the OCR-letters dataset are shown in Table 7. As before, we only modify these two thresholds during this series of experiments, and we set the upper weight to 0.6 and the lower weight to 1.4. The results also demonstrate the importance of the thresholds. On the one hand, the upper threshold controls the proportion of units that are regarded as over saturated, and a larger value of the upper threshold leads to a smaller proportion of these units. On the other hand, the lower threshold controls the proportion of units that are regarded as inactivated, and a smaller value of the lower threshold leads to a smaller proportion of these units. These units become even more inactivated after the reweighted scheme.
From our point of view, these two thresholds depend more on the dataset than on the specification of the model. Still, as a conservative strategy, we prefer to set the upper threshold in the range from 0.5 to 0.8 and the lower threshold in the range from 0.2 to 0.5.
To further investigate the features, we examine the activation values of all the features in the test set of OCR-letters. Figure 1 shows the number of units corresponding to each activation value from 0 to 1 with a step of 0.01. Units whose activation is below 0.01 are omitted to keep the figure balanced, since these units make up a large majority of all units. The number of these units is 562453 before reweighting and 575928 after reweighting, which indicates that the proposed policy keeps the inactivated units and even makes the features sparser. The figure also shows a significant decrease in the number of over saturated units, which accords with our purpose.
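The counting behind such a figure can be sketched as below. This is an illustrative helper, not the authors' analysis code: it assumes the features are given as lists of activations in [0, 1], and the function name, the bin handling, and the toy data are hypothetical.

```python
def activation_histogram(features, step=0.01, ignore_below=0.01):
    """Count activations per bin of width `step`; activations below
    `ignore_below` are counted separately instead of being binned."""
    n_bins = int(round(1.0 / step))
    counts = [0] * n_bins
    ignored = 0
    for feature in features:
        for a in feature:
            if a < ignore_below:
                ignored += 1
                continue
            idx = min(int(a / step), n_bins - 1)  # an activation of exactly 1.0 goes to the last bin
            counts[idx] += 1
    return counts, ignored


# Toy features (hypothetical): two samples with two hidden units each.
toy_features = [[0.005, 0.255], [0.995, 1.0]]
counts, ignored = activation_histogram(toy_features)
print(ignored, counts[25], counts[99])
```

Running the same helper on the features before and after reweighting would give the two curves being compared, with `ignored` playing the role of the near-zero units that are left out of the plot.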
We also investigate the average value of activation for each dimension in the feature. The results are shown in Figure 2. We find that in the NADE features after reweighting, the over saturated dimensions are restrained, while the inactivated dimensions are kept or become even more inactivated. The average features before and after reweighting are similar, while the NADE features and RBM features differ dramatically. The difference between NADE features and RBM features is due to the intrinsic difference between the two models.
5.2. Results on Initialization
The reweighted scheme we have proposed also improves the performance of a neural network through initialization, as we show here.
To test the performance, we train a NADE for each dataset, and an RBM of the same size is also trained. We then use the learned weight matrix and the bias to initialize the parameters of the neural network classifier. Then, the neural network classifier is trained on the corresponding dataset. To evaluate our reweighted scheme, the parameters after reweighting are also used as the initialization of another neural network classifier of the same size. Finally, the performance is measured by the classification error.
The results are shown in Table 8. As before, we perform experiments on the same twelve datasets and with the same hyperparameters. This time, the results show that the reweighted scheme for initialization yields a more significant improvement in classification performance compared with the original NADE parameters. In most of the datasets, the difference between the errors of the reweighted-NADE and NADE is much larger than the one between NADE and RBM, which demonstrates the effectiveness of the proposed reweighted scheme. In OCR-letters, the classification performance for reweighted-NADE is not as good as RBM, which can be explained by the inherent difference between the parameters learned by NADE and RBM, a difference that is hard to eliminate via the reweighted scheme alone. In any case, the proposed scheme always surpasses the one without reweighting.
In order to make the experiments more complete, we experiment with various weights on the Web dataset, and the results are shown in Table 9. For NADE, on this dataset, the learning rate is set to 0.005, the decrease constant is set to 0, the size of the hidden layer is 150, and the weight is untied. The upper threshold and lower threshold are kept at 0.67 and 0.33. The heuristic reweighted method, which sets the lower weight larger than 1 and the upper weight smaller than 1, once again proves to be effective. This time, however, the proper weights are farther from the standard weight 1. This can be explained by the effect of the average. Since we compute the average of the reweighted scores of all the training samples, a more discriminative reweighted scheme maintains the differences among the dimensions in the final reweighted score vector. In other words, we prefer a larger value for the lower weight and a smaller value for the upper weight when dealing with the problem of initialization.
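The effect of the average can be illustrated with a small worked example. This is a sketch under simplifying assumptions that are not taken from the paper: normalization is ignored, and a hidden unit is assumed to be over saturated on a fraction $\rho$ of the training samples and to receive the standard score 1 on the rest. The symbols $\rho$, $w_u$ (upper weight), and $\bar{s}$ (averaged score) are illustrative.

```latex
\begin{align*}
\bar{s} &= \rho\, w_u + (1 - \rho) \cdot 1, \\
\rho = 0.3,\ w_u = 0.6 &: \quad \bar{s} = 0.3 \times 0.6 + 0.7 = 0.88, \\
\rho = 0.3,\ w_u = 0.2 &: \quad \bar{s} = 0.3 \times 0.2 + 0.7 = 0.76.
\end{align*}
```

Under these assumptions, the averaged score stays much closer to 1 than the per-sample weight, which is consistent with preferring weights farther from 1 in the initialization setting.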
Results for various thresholds on the Web dataset are shown in Table 10. The upper weight and lower weight are set to 0.6 and 1.4, respectively. We reach a conclusion similar to the one in the previous section: the thresholds depend more on the dataset, and we prefer a conservative strategy.
6. Conclusion
In this paper, we have proposed a simple and novel reweighted scheme to modify the learned parameters of NADE. We make use of the activations in the hidden layers of the learned NADE model and set appropriate thresholds to control the proportions of the over saturated and inactivated units. In order to achieve better results, a heuristic reweighted method is proposed. The original parameters are modified and normalized. The reweighted parameters are used to generate better features or to improve the performance of the initialization for a neural network. The experiments have shown the effectiveness of the reweighted scheme, with evident improvements in both of these important machine learning tasks.
Data Availability
All the datasets used in this paper are publicly available and could be obtained from http://archive.ics.uci.edu/ml/datasets.html.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (Grant nos. 11771393 and 11632015) and Zhejiang Natural Science Foundation (Grant no. LZ14A010002).
References
- Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
- Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
- R. Salakhutdinov and I. Murray, “On the quantitative analysis of deep belief networks,” in Proceedings of 25th International Conference on Machine Learning, pp. 872–879, ACM, Helsinki, Finland, July 2008.
- J. Schmidhuber, “Deep learning in neural networks: an overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
- M.-A. Côté and H. Larochelle, “An infinite restricted Boltzmann machine,” Neural Computation, vol. 28, no. 7, pp. 1265–1288, 2016.
- A. Courville, J. Bergstra, and Y. Bengio, “A spike and slab restricted Boltzmann machine,” in Proceedings of AISTATS, vol. 1, p. 5, Fort Lauderdale, FL, USA, October 2011.
- G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an undirected topic model,” in Proceedings of 22nd International Conference on Neural Information Processing Systems, pp. 1607–1614, Vancouver, BC, Canada, December 2009.
- Y. Burda, R. B. Grosse, and R. Salakhutdinov, “Accurate and conservative estimates of MRF log-likelihood using reverse annealing,” 2015, https://arxiv.org/abs/1412.8566.
- R. M. Neal, “Annealed importance sampling,” Statistics and Computing, vol. 11, no. 2, pp. 125–139, 2001.
- H. Larochelle and I. Murray, “The neural autoregressive distribution estimator,” in Proceedings of AISTATS, vol. 1, p. 2, Fort Lauderdale, FL, USA, October 2011.
- H. Larochelle and S. Lauly, “A neural autoregressive topic model,” in Proceedings of 22nd International Conference on Neural Information Processing Systems, pp. 2708–2716, Lake Tahoe, Nevada, December 2012.
- I. Murray and R. R. Salakhutdinov, “Evaluating probabilities under high-dimensional latent variable models,” in Proceedings of Advances in Neural Information Processing Systems, pp. 1137–1144, Vancouver, BC, Canada, December 2009.
- T. Raiko, L. Yao, K. Cho, and Y. Bengio, “Iterative neural autoregressive distribution estimator NADE-k,” in Proceedings of Advances in Neural Information Processing Systems, pp. 325–333, Montreal, QC, Canada, 2014.
- B. Uria, I. Murray, and H. Larochelle, “RNADE: the real-valued neural autoregressive density-estimator,” in Proceedings of Advances in Neural Information Processing Systems, pp. 2175–2183, Lake Tahoe, NV, USA, December 2013.
- Y. Zheng, R. S. Zemel, Y.-J. Zhang, and H. Larochelle, “A neural autoregressive approach to attention-based recognition,” International Journal of Computer Vision, vol. 113, no. 1, pp. 67–79, 2014.
- Y. Zheng, Y.-J. Zhang, and H. Larochelle, “Topic modeling of multimodal data: an autoregressive approach,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1370–1377, Columbus, OH, USA, June 2014.
- Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proceedings of 13th International Conference on Machine Learning, vol. 96, pp. 148–156, Bari, Italy, July 1996.
- J. Bornschein and Y. Bengio, “Reweighted wake-sleep,” 2014, https://arxiv.org/abs/1406.2751.
- Y. Burda, R. Grosse, and R. Salakhutdinov, “Importance weighted autoencoders,” 2015, https://arxiv.org/abs/1509.00519.
- L. K. Saul, T. Jaakkola, and M. I. Jordan, “Mean field theory for sigmoid belief networks,” Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 61–76, 1996.
- K. H. Cho, T. Raiko, and A. Ilin, “Parallel tempering is efficient for learning restricted Boltzmann machines,” in Proceedings of 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, Barcelona, Spain, July 2010.
- G. Hinton, “A practical guide to training restricted Boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.
- G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
- J. Martens and I. Sutskever, “Parallelizable sampling of Markov random fields,” in Proceedings of AISTATS, pp. 517–524, Sardinia, Italy, May 2010.
- R. R. Salakhutdinov, “Learning in Markov random fields using tempered transitions,” in Proceedings of Advances in Neural Information Processing Systems, pp. 1598–1606, Vancouver, BC, Canada, December 2009.
- T. Tieleman, “Training restricted Boltzmann machines using approximations to the likelihood gradient,” in Proceedings of 25th International Conference on Machine Learning, pp. 1064–1071, ACM, Helsinki, Finland, July 2008.
- C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, 2001.
Copyright © 2018 Zheng Wang and Qingbiao Wu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.