Convergence Analysis of Contrastive Divergence Algorithm Based on Gradient Method with Errors

Ma, Xuesi; Wang, Xiaojie

doi:https://doi.org/10.1155/2015/350102

Mathematical Problems in Engineering

Research Article | Open Access

Volume 2015 | Article ID 350102 | https://doi.org/10.1155/2015/350102

Xuesi Ma¹,² and Xiaojie Wang¹

Academic Editor: Julien Bruchon

Received 12 May 2015

Accepted 01 Jul 2015

Published 12 Jul 2015

Abstract

Contrastive Divergence has become a common way to train Restricted Boltzmann Machines; however, its convergence has not yet been made clear. This paper studies the convergence of the Contrastive Divergence algorithm. We relate the Contrastive Divergence algorithm to the gradient method with errors and derive convergence conditions for the Contrastive Divergence algorithm using the convergence theorem of the gradient method with errors. We give specific convergence conditions of the Contrastive Divergence learning algorithm for Restricted Boltzmann Machines in which both visible units and hidden units can only take a finite number of values. Two new convergence conditions are obtained by specifying the learning rate. Finally, we give specific conditions that the step number of Gibbs sampling must satisfy in order to guarantee the convergence of the Contrastive Divergence algorithm.

1. Introduction

Deep belief networks have recently been applied successfully to many problems [1–5]. Restricted Boltzmann Machines (RBMs), one of the important building blocks of deep belief networks, have also been widely applied in many fields [2, 6–11]. The learning of RBMs and deep belief networks has been an important and active topic in machine learning research. The learning process is a parameter estimation problem. Because general parameter estimation methods are challenging for these models, Hinton proposed the Contrastive Divergence (CD) learning algorithm [12]. Although it has been widely used for training deep belief networks, its convergence is still not clear. Recently, more and more researchers have studied the theoretical properties of CD. Bengio and Delalleau [13] justified the use of a short Gibbs chain of length $k$ to obtain a biased estimator of the log-likelihood gradient. Akoho and Takabatake [14] gave an information geometrical interpretation of the CD learning algorithm. Sutskever and Tieleman [15] gave proofs showing that CD is not the gradient of any function and that it is possible to construct regularization functions that cause it to fail to converge. Yuille [16] related CD to the stochastic approximation literature and derived elementary conditions which ensure convergence (with probability 1). However, these convergence conditions are relatively strict; in particular, they depend on the model parameter that minimizes the Kullback-Leibler divergence between the empirical distribution function of the observed data and the model distribution.

In this paper, we study the convergence of the CD learning algorithm. By exploring the relation between the CD algorithm and the gradient method with errors, we obtain convergence conditions of CD using the convergence theorem of the gradient method with errors. Our convergence conditions are more practical than those given by Yuille [16]. We also analyze the convergence of the CD algorithm for RBMs, in particular for RBMs in which both visible units and hidden units only take a finite number of values. We give two new convergence conditions by specifying the learning rate. Finally, we give a theoretical analysis of the convergence conditions of the CD algorithm for RBMs and of the relationship that the learning rate and the step number of Gibbs sampling must satisfy in order to guarantee the convergence of the CD algorithm.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of the CD algorithm. In Section 3, we firstly propose the gradient method with errors and convergence theorem of the gradient method with errors and then relate the CD algorithm to the gradient method with errors. Convergence conditions of the CD algorithm are derived. In Section 4, we give an analysis of convergence conditions of the CD algorithm for RBMs. We draw some conclusions in Section 5.

2. Contrastive Divergence Learning Algorithm

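Consider a probability distribution over a vector $v$ with hidden variable $h$, defined through
$$p(v, h; \theta) = \frac{1}{Z(\theta)}\, e^{-E(v, h; \theta)}, \tag{1}$$
where $Z(\theta) = \sum_{v, h} e^{-E(v, h; \theta)}$ is a normalization constant or partition function, $h$ is a hidden variable, and $E(v, h; \theta)$ is an energy function. This class of random-field distributions has been used in many fields.

The marginal likelihood is
$$p(v; \theta) = \sum_{h} p(v, h; \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v, h; \theta)}. \tag{2}$$

The gradient of the marginal log-likelihood with respect to the model parameter $\theta$ is
$$\frac{\partial \log p(v; \theta)}{\partial \theta} = -\sum_{h} p(h \mid v; \theta)\, \frac{\partial E(v, h; \theta)}{\partial \theta} + \sum_{v, h} p(v, h; \theta)\, \frac{\partial E(v, h; \theta)}{\partial \theta}. \tag{3}$$

The log-likelihood gradient algorithm can be expressed as
$$\theta_{t+1} = \theta_t + \eta_t \left[ -\sum_{h} p(h \mid v; \theta_t)\, \frac{\partial E(v, h; \theta_t)}{\partial \theta} + \sum_{v, h} p(v, h; \theta_t)\, \frac{\partial E(v, h; \theta_t)}{\partial \theta} \right], \tag{4}$$
where $\eta_t$ denotes the learning rate at the $t$th update.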
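As a brief aside, the gradient of the marginal log-likelihood can be derived in a few lines. The following is a sketch under the assumption that $v$ and $h$ range over finite sets, so that sums and derivatives can be exchanged; the symbols $Z(\theta)$, $E(v, h; \theta)$, and $p(h \mid v; \theta)$ are the standard ones for energy-based models and are assumed here rather than taken verbatim from the text.

```latex
\begin{aligned}
\log p(v;\theta) &= \log \sum_{h} e^{-E(v,h;\theta)} - \log Z(\theta), \\
\frac{\partial}{\partial\theta} \log \sum_{h} e^{-E(v,h;\theta)}
  &= -\sum_{h} \frac{e^{-E(v,h;\theta)}}{\sum_{h'} e^{-E(v,h';\theta)}}
     \frac{\partial E(v,h;\theta)}{\partial\theta}
   = -\sum_{h} p(h \mid v;\theta)\, \frac{\partial E(v,h;\theta)}{\partial\theta}, \\
\frac{\partial}{\partial\theta} \log Z(\theta)
  &= -\sum_{v,h} \frac{e^{-E(v,h;\theta)}}{Z(\theta)}
     \frac{\partial E(v,h;\theta)}{\partial\theta}
   = -\sum_{v,h} p(v,h;\theta)\, \frac{\partial E(v,h;\theta)}{\partial\theta}.
\end{aligned}
```

Subtracting the third line from the second gives the two-term form of the gradient: a data-dependent expectation under $p(h \mid v; \theta)$ and an expectation under the model distribution.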

The first term in the bracket on the right-hand side of (4) can be computed exactly; however, the second term (also called the expectation under the model distribution) is intractable because the calculation of the partition function $Z(\theta)$ is extremely difficult. In order to apply the log-likelihood gradient algorithm, we have to do alternating blocked-Gibbs sampling from the conditionals $p(h \mid v)$ and $p(v \mid h)$. This requires an infinite number of Gibbs transitions per update to fully characterize the expectation. Hinton [12] proposed a modification of the log-likelihood gradient algorithm known as Contrastive Divergence.

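The idea of $k$-step Contrastive Divergence learning (CD-$k$) is simple: instead of approximating the second term in the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for only $k$ steps. The Gibbs chain is initialized with a training example $v^{(0)}$ of the training set and yields the sample $v^{(k)}$ after $k$ steps. Each step $t$ consists of sampling $h^{(t)}$ from $p(h \mid v^{(t)})$ and subsequently sampling $v^{(t+1)}$ from $p(v \mid h^{(t)})$. The gradient (3) with respect to $\theta$ of the log-likelihood for one training example $v^{(0)}$ is approximated by
$$\mathrm{CD}_k(\theta, v^{(0)}) = -\sum_{h} p(h \mid v^{(0)})\, \frac{\partial E(v^{(0)}, h)}{\partial \theta} + \sum_{h} p(h \mid v^{(k)})\, \frac{\partial E(v^{(k)}, h)}{\partial \theta}. \tag{5}$$
The expectation of the CD update can be described by an expression of the same form as the log-likelihood gradient, in which the model distribution is replaced by the empirical distribution function on the samples obtained by starting from the data and running the Markov chain forward for $k$ steps.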
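The CD-$k$ procedure described above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes a binary RBM (a special case of units taking finitely many values) with the usual energy $E(v, h) = -\sum_{i,j} W_{ij} h_i v_j - \sum_j b_j v_j - \sum_i c_i h_i$, and the function names (`hidden_probs`, `visible_probs`, `cd_k_weight_update`) are hypothetical. Only the weight component of the update is shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    # p(h_i = 1 | v) for a binary RBM; W has one row per hidden unit.
    return [sigmoid(c[i] + sum(W[i][j] * v[j] for j in range(len(v))))
            for i in range(len(c))]

def visible_probs(h, W, b):
    # p(v_j = 1 | h) for a binary RBM.
    return [sigmoid(b[j] + sum(W[i][j] * h[i] for i in range(len(h))))
            for j in range(len(b))]

def sample(probs, rng):
    # Draw independent Bernoulli units with the given probabilities.
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_weight_update(v0, W, b, c, k, rng):
    """CD-k estimate of the log-likelihood gradient with respect to W."""
    vk = list(v0)
    for _ in range(k):
        # One Gibbs step: sample h from p(h | v), then v from p(v | h).
        h = sample(hidden_probs(vk, W, c), rng)
        vk = sample(visible_probs(h, W, b), rng)
    ph0 = hidden_probs(v0, W, c)   # positive phase, computed exactly
    phk = hidden_probs(vk, W, c)   # negative phase, at the k-step sample
    # Since -dE/dW_ij = h_i v_j, the two sums over h reduce to these products.
    return [[ph0[i] * v0[j] - phk[i] * vk[j] for j in range(len(v0))]
            for i in range(len(c))]
```

Calling `cd_k_weight_update([1, 0, 1], W, b, c, k=1, rng=random.Random(0))` with a 2-by-3 weight matrix returns a 2-by-3 matrix of update directions; with `k=0` the chain never leaves the data and the estimate is identically zero.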

An asymptotically unbiased estimator of the parameters can be obtained by using the log-likelihood gradient algorithm; the asymptotic property of the estimator of the parameters in CD-$k$ learning is discussed in the next section.

3. Convergence of Contrastive Divergence Algorithm

In this section, we study the convergence of the CD learning algorithm. The CD algorithm has a form similar to that of the gradient method with errors. We relate the CD algorithm to the gradient method with errors and derive the convergence conditions of CD. To this end, we first present the gradient method with errors and give its convergence theorem.

3.1. Gradient Methods with Errors and Convergence Theorem

Consider the optimization problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $\mathbb{R}^n$ denotes the $n$-dimensional Euclidean space and $f$ is a continuously differentiable function, such that for a positive constant $L$ we have
$$\|\nabla f(x) - \nabla f(\bar{x})\| \le L \|x - \bar{x}\|,$$
where $x, \bar{x} \in \mathbb{R}^n$, and $\|\cdot\|$ stands for the Euclidean norm in $\mathbb{R}^n$.

The gradient method with errors is of the following form:
$$x_{t+1} = x_t + \gamma_t (s_t + w_t),$$
where $\gamma_t$ is a positive step-size sequence, $s_t$ is a descent direction, and $w_t$ is an error.

The error $w_t$ could be deterministic or stochastic. In both cases, the gradient method has been studied in the literature [17–20]. We consider $w_t$ to be stochastic because of the CD algorithm in this paper. The gradient method with stochastic errors can be considered as a stochastic approximation algorithm or stochastic approximation procedure [21, 22]. Younes [22] has analyzed the convergence of the stochastic approximation procedure (SAP) and has given almost sure convergence conditions of SAP using the ODE (Ordinary Differential Equations) approach. He generated a persistent Markov chain and studied the recursion algorithm in which several iterations of the simulation procedures are performed before updating the current parameter, with the updating being done using the average of the obtained values. Bertsekas and Tsitsiklis [18] have studied the convergence of the gradient method in which the expectation of the stochastic error is zero with probability 1. We present a gradient method with different stochastic errors; convergence of the gradient method is guaranteed by the following theorem. We will need a known lemma, which has been proved by Grimmett and Stirzaker [23].

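Lemma 1. Let $Y_t$, $Z_t$, and $W_t$ be three sequences of nonnegative random variables such that $\sum_{t=0}^{\infty} W_t < \infty$ with probability 1 and $E[Y_{t+1} \mid \mathcal{F}_t] \le Y_t - Z_t + W_t$ for all $t$, where $\mathcal{F}_t$ denotes the history up to time $t$; then, $Y_t$ converges with probability 1 and $\sum_{t=0}^{\infty} Z_t < \infty$.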

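Theorem 2. Let $x_t$ be a sequence generated by the method
$$x_{t+1} = x_t + \gamma_t (s_t + w_t),$$
where $\gamma_t$ is a positive step size, $s_t$ is a descent direction, and $w_t$ is a stochastic error. Let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields ($\mathcal{F}_t$ should be interpreted as the history of the algorithm up to time $t$). We assume the following:
(1) $x_t$, $\gamma_t$, and $s_t$ are $\mathcal{F}_t$-measurable.
(2) There exist positive scalars $c_1$ and $c_2$ such that
$$c_1 \|\nabla f(x_t)\|^2 \le -\nabla f(x_t)^{\top} s_t, \qquad \|s_t\| \le c_2 \left(1 + \|\nabla f(x_t)\|\right).$$
(3) For all $t$ and with probability 1, the stochastic error $w_t$ satisfies bounds involving constants.
(4) We have
$$\sum_{t=0}^{\infty} \gamma_t = \infty, \qquad \sum_{t=0}^{\infty} \gamma_t^2 < \infty.$$
Then, $f(x_t)$ converges.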

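Proof. Fix two vectors $x$ and $z$, let $\xi$ be a scalar parameter, and let $g(\xi) = f(x + \xi z)$. The chain rule yields $(dg/d\xi)(\xi) = z^{\top} \nabla f(x + \xi z)$. We have
$$f(x + z) - f(x) = g(1) - g(0) = \int_0^1 z^{\top} \nabla f(x + \xi z)\, d\xi \le z^{\top} \nabla f(x) + \int_0^1 \|z\| \, \|\nabla f(x + \xi z) - \nabla f(x)\| \, d\xi \le z^{\top} \nabla f(x) + \frac{L}{2} \|z\|^2. \tag{14}$$
We apply (14) with $x = x_t$ and $z = \gamma_t (s_t + w_t)$. We obtain
$$f(x_{t+1}) \le f(x_t) + \gamma_t \nabla f(x_t)^{\top} (s_t + w_t) + \frac{\gamma_t^2 L}{2} \|s_t + w_t\|^2.$$
Using our assumptions, the relations and , we have Taking conditional expectations with respect to $\mathcal{F}_t$ and using conditions and the relation , we have which for sufficiently large $t$ can be written as where and are positive scalars.
Using the assumption that is $\mathcal{F}_t$-measurable, we have By using Lemma 1 and the assumption , we see that $f(x_t)$ converges.
The theorem is proved.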

Strictly speaking, the conclusion of the theorem only holds with probability 1. For simplicity, an explicit statement of this qualification often will be omitted. We will use the theorem to derive convergence conditions of CD based on the similarity between the CD algorithm and the gradient method with errors.
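To make the iteration concrete, the following sketch runs a gradient method with stochastic errors on a toy problem. Everything specific here is an assumption chosen for illustration and does not come from the paper: the objective $f(x) = \tfrac{1}{2}\|x\|^2$ (whose gradient is Lipschitz with constant 1), the descent direction $s_t = -\nabla f(x_t)$, the step size $\gamma_t = 1/t$, and a zero-mean Gaussian error $w_t$, which is a simpler error model than the one the theorem allows.

```python
import random

def f(x):
    # Toy objective f(x) = 0.5 * ||x||^2 (an assumption for illustration).
    return 0.5 * sum(xi * xi for xi in x)

def grad_f(x):
    # Gradient of the toy objective; Lipschitz with constant L = 1.
    return list(x)

def gradient_method_with_errors(x0, steps, noise_scale, seed=0):
    """Iterate x_{t+1} = x_t + gamma_t * (s_t + w_t) and record f(x_t)."""
    rng = random.Random(seed)
    x = list(x0)
    values = [f(x)]
    for t in range(1, steps + 1):
        gamma = 1.0 / t                                  # sum diverges, sum of squares converges
        s = [-g for g in grad_f(x)]                      # descent direction s_t = -grad f(x_t)
        w = [rng.gauss(0.0, noise_scale) for _ in x]     # zero-mean stochastic error w_t
        x = [xi + gamma * (si + wi) for xi, si, wi in zip(x, s, w)]
        values.append(f(x))
    return x, values
```

With this particular step size the iterate after $t$ updates is the running average of the first $t$ error samples, so `values` settles near zero as `steps` grows even though every individual update is noisy.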

3.2. Convergence of CD

In order to derive convergence conditions of the CD learning algorithm using the convergence theorem of the gradient method with errors, we have to explore the relation between the CD algorithm and the gradient method with errors. We can reconstruct the CD algorithm in the form of gradient optimization problem.

The theorem of the gradient method with errors involves four basic concepts. The first is an optimization function $f$, which must be a continuously differentiable function, such that, for some constant $L$, we have $\|\nabla f(x) - \nabla f(\bar{x})\| \le L \|x - \bar{x}\|$. The second is the descent direction $s_t$. The third is the error vector $w_t$. The last concept is the step size $\gamma_t$; $\gamma_t$ can be considered as the learning rate in the CD learning algorithm. The gradient method with errors will converge provided the conditions of Theorem 2 are satisfied.

We firstly design an optimal function for CD. Let We can derive convergence theorem for the CD learning algorithm through selecting appropriate and . Next, we give the convergence theorem of the CD learning algorithm using the convergence theorem of the gradient method with errors.

Theorem 3. The CD learning algorithm will converge provided that (1) , where is a positive constant, (2) , (3) , where is a positive constant and is bounded.

Proof. The CD algorithm can be written in the form of a gradient optimization problem. The CD algorithm is In (21), let The CD algorithm in the form of a gradient optimization problem can be described as follows: In (23), let Then, (23) can be considered as the gradient method with errors.
Since , then we have Therefore, satisfies for positive scalars and : Using (25), And using the condition of Theorem 3, we assume the upper bound of is ; then, we have So, satisfies for positive scalars and : Using the assumption that the upper bound of is , we have Then, . We let ; then, we have By using Theorem 2, we have converging; then, the CD learning algorithm will converge.
The theorem is proved.

We derive convergence conditions of the CD algorithm in the above theorem. The conditions cover three aspects. The first concerns the function of the parameter $\theta$. The second concerns the learning rate of the CD learning algorithm. The third includes two terms: the first term is the error between the empirical distribution function and the model distribution function, which can be controlled by the number of Gibbs sampling steps; the second term is a value related to the energy function.

These convergence conditions differ from those obtained by Yuille [16]. Yuille's conditions depend on the model parameter, whereas ours do not. Because the task of learning is to estimate the model parameter, which is generally unknown, the convergence conditions in this paper have more practical significance than those obtained by Yuille.

3.3. The Learning Rate and Convergence Conditions

Among the convergence conditions of the CD algorithm, the condition that the learning rate must satisfy is a necessary condition; it is Based on the fact that and , we assume and ; then, we have the following new convergence conditions derived from Theorem 3.

Corollary 4. The CD learning algorithm will converge provided that (1) , where is a positive constant, (2) , , (3) , where is a positive constant and is bounded.
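As a numerical illustration of a learning-rate schedule of the kind discussed in this subsection, the sketch below uses the harmonic schedule $\eta_t = 1/t$. This specific choice is an assumption made for the example (it is a standard schedule whose sum diverges while the sum of squares stays finite) and is not confirmed as the paper's exact specification; `partial_sums` is a hypothetical helper name.

```python
import math

def partial_sums(n):
    """Return (sum of 1/t, sum of 1/t^2) for t = 1, ..., n."""
    s1 = sum(1.0 / t for t in range(1, n + 1))
    s2 = sum(1.0 / (t * t) for t in range(1, n + 1))
    return s1, s2
```

The first partial sum grows like $\log n$ without bound, while the second stays below $\pi^2/6$ for every $n$, which is why a $1/t$ learning rate keeps making progress while the accumulated squared step sizes remain controlled.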

3.4. Consistency of CD

It is clear that CD is equivalent to the Monte Carlo version of the log-likelihood gradient algorithm as the number of MCMC steps $k$ goes to infinity, because the empirical distribution converges to the model distribution. It is known that CD gives a good solution even when $k$ is relatively small. Akoho and Takabatake [14] give an intuitive interpretation, by means of information geometry, of why CD can approximate well. In the above sections, we studied the convergence of the CD algorithm; now, we consider the consistency of the CD algorithm. If a point is a limit point of the parameter sequence, then the value of the optimal function along the sequence converges to a finite value by Theorem 2, and the limit point is a stationary point of the optimal function; furthermore, every limit point of the parameter sequence is a stationary point of the optimal function. CD is an approximation of the log-likelihood gradient, and the convergence conditions of Theorem 3 ensure that the approximation error is small enough for CD to converge. The conclusions of Theorems 2 and 3 hold with probability 1. We therefore obtain the following conclusion: if the CD algorithm converges with probability 1, the convergence point is a stationary point of the optimal function, which is a local optimum in general.
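The statement that the $k$-step distribution approaches the model distribution as $k$ grows can be checked exactly on a very small model. The sketch below assumes a binary RBM with the usual energy $E(v, h) = -\sum_{i,j} W_{ij} h_i v_j - \sum_j b_j v_j - \sum_i c_i h_i$ and enumerates all states, which is only feasible for a handful of units. All function names are hypothetical, and the total variation distance is used here simply as one convenient way to measure the gap; it is not claimed to be the error measure used in the theorems.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_prob(state, probs):
    # Probability of a binary vector under independent Bernoulli units.
    out = 1.0
    for s, p in zip(state, probs):
        out *= p if s == 1 else 1.0 - p
    return out

def gibbs_kernel(W, b, c):
    # Exact one-step transition T[v][v2] = sum_h p(h | v) p(v2 | h).
    m, n = len(b), len(c)
    vs = list(itertools.product((0, 1), repeat=m))
    hs = list(itertools.product((0, 1), repeat=n))

    def p_h_given_v(h, v):
        probs = [sigmoid(c[i] + sum(W[i][j] * v[j] for j in range(m))) for i in range(n)]
        return bernoulli_prob(h, probs)

    def p_v_given_h(v, h):
        probs = [sigmoid(b[j] + sum(W[i][j] * h[i] for i in range(n))) for j in range(m)]
        return bernoulli_prob(v, probs)

    T = {v: {v2: sum(p_h_given_v(h, v) * p_v_given_h(v2, h) for h in hs) for v2 in vs}
         for v in vs}
    return vs, hs, T

def model_marginal(W, b, c, vs, hs):
    # Exact marginal p(v) of the RBM, by brute-force enumeration.
    m, n = len(b), len(c)

    def unnorm(v):
        return sum(
            math.exp(
                sum(W[i][j] * h[i] * v[j] for i in range(n) for j in range(m))
                + sum(b[j] * v[j] for j in range(m))
                + sum(c[i] * h[i] for i in range(n))
            )
            for h in hs
        )

    Z = sum(unnorm(v) for v in vs)
    return {v: unnorm(v) / Z for v in vs}

def tv_distances(v0, W, b, c, k_max):
    # Total variation distance between the k-step distribution started at v0
    # and the model marginal, for k = 0, ..., k_max.
    vs, hs, T = gibbs_kernel(W, b, c)
    target = model_marginal(W, b, c, vs, hs)
    pk = {v: (1.0 if v == tuple(v0) else 0.0) for v in vs}
    dists = []
    for _ in range(k_max + 1):
        dists.append(0.5 * sum(abs(pk[v] - target[v]) for v in vs))
        pk = {v2: sum(pk[v] * T[v][v2] for v in vs) for v2 in vs}
    return dists
```

For a model with two visible and two hidden units and small weights, `tv_distances((1, 0), W, b, c, k_max=50)` returns a non-increasing sequence that shrinks toward zero, in line with the qualitative claim that a longer Gibbs chain reduces the gap between the sampled distribution and the model distribution.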

4. Convergence of CD Algorithm for RBMs

In this section, we consider the convergence of the CD algorithm for RBMs. In the following, we consider the case where both visible units and hidden units only take a finite number of values.

4.1. Convergence Conditions of CD Algorithm for RBMs

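The RBM structure is a bipartite graph consisting of one layer of visible variables $v = (v_1, \dots, v_m)$ and one layer of hidden variables $h = (h_1, \dots, h_n)$. The model distribution is given by $p(v, h) = e^{-E(v, h)} / Z$, where $Z = \sum_{v, h} e^{-E(v, h)}$, and the energy function is given by
$$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i,$$
with $w_{ij}$, $b_j$, and $c_i$ being real-valued parameters, which are denoted by $\theta$.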
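The energy function and the normalization constant can be written out directly for a toy model. The sketch below assumes binary units and the standard RBM energy with weights `W` (one row per hidden unit), visible biases `b`, and hidden biases `c`; these names are illustrative. The partition function is computed by enumerating all $2^{m+n}$ joint states, which makes plain why it becomes intractable as the number of units grows.

```python
import itertools
import math

def energy(v, h, W, b, c):
    # E(v, h) = -sum_ij W[i][j] h_i v_j - sum_j b_j v_j - sum_i c_i h_i
    pair = sum(W[i][j] * h[i] * v[j] for i in range(len(h)) for j in range(len(v)))
    return (-pair
            - sum(bj * vj for bj, vj in zip(b, v))
            - sum(ci * hi for ci, hi in zip(c, h)))

def partition_function(W, b, c):
    # Brute-force sum over all 2^(m + n) binary configurations.
    m, n = len(b), len(c)
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in itertools.product((0, 1), repeat=m)
               for h in itertools.product((0, 1), repeat=n))

def joint_probability(v, h, W, b, c):
    # p(v, h) = exp(-E(v, h)) / Z
    return math.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)
```

With all parameters set to zero every configuration has energy zero, so the partition function equals the number of configurations; for two visible and two hidden units that is 16.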

In Section 3, we have already considered convergence of the CD algorithm and derived convergence theorem for the CD learning algorithm based on the convergence theorem of gradient method with errors. Now, we give the convergence theorem of the CD learning algorithm for RBMs.

Theorem 5. The CD learning algorithm for RBMs will converge provided that (1) , (2) , where is a positive constant.

Proof. To prove the convergence of the CD learning algorithm for RBMs, we have to show that the CD algorithm satisfies the three conditions of Theorem 3. Obviously, we can obtain the second condition of Theorem 3 from the first condition. Next, we show that the CD algorithm satisfies the other two conditions of Theorem 3.
First, we prove that the first condition of Theorem 3 is satisfied.
Since , then Using (34), is an independent variable with since is affine with . For convenience, let Then, where , and where
Since $v$ and $h$ only take a finite number of values, then has the upper bound; we assume the upper bound is .
Since , we have Let ; then, has the upper bound .
Since for a similar reason, has the upper bound . Using inequalities (35), (37), and (38), we have Let ; then, we have

Secondly, since has the upper bound, then the condition of Theorem 3 is satisfied.

By Theorem 3, we see that the CD learning algorithm converges.

The theorem is proved.

We obtain convergence conditions of the CD learning algorithm for RBMs. Next, we study the relationship between the learning rate and convergence conditions. Based on the fact that and , we also assume and ; then, we have the following new convergence conditions derived from Theorem 5.

Corollary 6. The CD learning algorithm will converge provided that (1) , , (2) , where is a positive constant.

The result of Corollary 6 shows that, provided the learning rate is deterministic, the convergence of the CD algorithm depends on the error between the empirical distribution function and the model distribution function; this error can be controlled by the number of Gibbs sampling steps.

4.2. Theoretical Analysis of Convergence Conditions

In Section 4.1, we have given the convergence conditions of the CD algorithm for RBMs. The most important term is the error between the empirical distribution function and the model distribution function. The former is the empirical distribution function on the samples obtained by starting from the data and running the Markov chain forward for $k$ steps; the latter is the limit distribution of the former. Fischer and Igel [24] have given a bound on the bias between these two distributions. The bound depends on the step number of Gibbs sampling in the $t$th update, on the numbers of visible and hidden variables, and on a quantity determined by the model parameters. We can then draw new convergence conditions of the CD algorithm from this result.

Theorem 7. The CD learning algorithm for RBMs will converge provided that (1) , (2) , where are the numbers of visible and hidden variables and is defined in equality (43), is the step number of Gibbs sampling in the $t$th update (), and is a positive constant.

Proof. In order to prove Theorem 7, we have to show that the condition of Theorem 5 can be derived from the condition of Theorem 7.
Using the inequality (43), we have By the condition of Theorem 7, we have ; using the above inequality, we have The proof of the theorem is completed.

Theorem 7 gives new convergence conditions; the second condition of the theorem is the relationship that the step number of Gibbs sampling and the learning rate must satisfy at every parameter update in order to guarantee the convergence of the CD algorithm.

5. Conclusions

In this paper, we have studied the convergence of the CD learning algorithm. First, we related the CD learning algorithm to the gradient method with errors and gave convergence conditions which ensure that the CD learning algorithm converges. These conditions cover three aspects: the function of the parameter, the learning rate, and the approximation error. Our convergence conditions are different from those obtained by Yuille [16] and have more practical value. Moreover, we have studied the convergence of the CD algorithm for RBMs; in particular, we gave convergence conditions of the CD algorithm for RBMs in which both visible units and hidden units only take a finite number of values. We also analyzed the consistency of the CD algorithm and gave two new convergence conditions by specifying the learning rate.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work is supported by the National Science Foundation of China (nos. 61273365, 11407776) and National High Technology Research and Development Program of China (no. 2012AA011103).

References

[1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[2] A.-R. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[3] A.-R. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton, and M. A. Picheny, “Deep belief networks using discriminative features for phone recognition,” in Proceedings of the 36th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '11), pp. 5060–5063, May 2011.
[4] V. Nair and G. E. Hinton, “3D object recognition with deep belief nets,” in Proceedings of the Neural Information Processing Systems Conference (NIPS '09), pp. 1339–1347, 2009.
[5] M. A. Salama, A. E. Hassanien, and A. A. Fahmy, “Deep Belief Network for clustering and classification of a continuous data,” in Proceedings of the 10th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT '10), pp. 473–477, December 2010.
[6] F. Feng, R. Li, and X. Wang, “Deep correspondence restricted boltzmann machine for cross-modal retrieval,” Neurocomputing, vol. 154, pp. 50–60, 2015.
[7] N. Jaitly and G. Hinton, “Learning a better representation of speech soundwaves using restricted boltzmann machines,” in Proceedings of the 36th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '11), pp. 5884–5887, May 2011.
[8] V. Mnih, H. Larochelle, and G. E. Hinton, “Conditional restricted Boltzmann machines for structured output prediction,” in Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI '11), F. G. Cozman and A. Pfeffer, Eds., p. 514, AUAI Press, 2011.
[9] A.-R. Mohamed and G. E. Hinton, “Phone recognition using restricted boltzmann machines,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '10), pp. 4354–4357, IEEE, Dallas, Tex, USA, March 2010.
[10] R. R. Salakhutdinov and G. E. Hinton, “Replicated soft-max: an undirected topic model,” in Advances in Neural Information Processing Systems (NIPS 2009), 2009.
[11] R. R. Salakhutdinov, A. Mnih, and G. E. Hinton, “Restricted Boltzmann machines for collaborative filtering,” in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 791–798, ACM, Corvallis, Ore, USA, June 2007.
[12] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[13] Y. Bengio and O. Delalleau, “Justifying and generalizing contrastive divergence,” Neural Computation, vol. 21, no. 6, pp. 1601–1621, 2009.
[14] S. Akoho and K. Takabatake, “Information geometry of contrastive divergence,” in Proceedings of the International Conference on Information Theory and Statistical Learning (ITSL '08), pp. 3–9, Las Vegas, Nev, USA, July 2008.
[15] I. Sutskever and T. Tieleman, “On the convergence properties of contrastive divergence,” in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS '10), pp. 473–477, 789–795, May 2010.
[16] A. Yuille, “The convergence of contrastive divergences,” in Proceedings of the 18th Annual Conference on Neural Information Processing Systems (NIPS '04), pp. 1593–1600, December 2004.
[17] D. P. Bertsekas, Ed., Nonlinear Programming, Athena Scientific, Belmont, Mass, USA, 1995.
[18] D. P. Bertsekas and J. N. Tsitsiklis, “Gradient convergence in gradient methods with errors,” SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000.
[19] V. S. Borkar, “Asynchronous stochastic approximations,” SIAM Journal on Control and Optimization, vol. 36, no. 3, pp. 840–851, 1998.
[20] G. Pflug, “Optimization of stochastic models,” in The Interface between Simulation and Optimization, Kluwer, Boston, Mass, USA, 1996.
[21] H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, vol. 22, pp. 400–407, 1951.
[22] L. Younes, “On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates,” Stochastics and Stochastics Reports, vol. 65, no. 3-4, pp. 177–228, 1999.
[23] G. R. Grimmett and D. R. Stirzaker, Probability and Random Process, Oxford University Press, 2001.
[24] A. Fischer and C. Igel, “Bounding the bias of contrastive divergence learning,” Neural Computation, vol. 23, no. 3, pp. 664–673, 2011.

Copyright

Copyright © 2015 Xuesi Ma and Xiaojie Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
