International Scholarly Research Notices


Research Article | Open Access

Volume 2012 | Article ID 486361 | 9 pages | https://doi.org/10.5402/2012/486361

An Advanced Conjugate Gradient Training Algorithm Based on a Modified Secant Equation

Ioannis E. Livieris and Panagiotis Pintelas

Academic Editor: T. Kurita
Received: 05 Aug 2011
Accepted: 04 Sep 2011
Published: 08 Dec 2011

Abstract

Conjugate gradient methods constitute excellent neural network training methods characterized by their simplicity, numerical efficiency, and very low memory requirements. In this paper, we propose a conjugate gradient neural network training algorithm which guarantees sufficient descent using any line search, thereby avoiding the usually inefficient restarts. Moreover, it achieves a high-order accuracy in approximating the second-order curvature information of the error surface by utilizing the modified secant condition proposed by Li et al. (2007). Under mild conditions, we establish that the proposed method is globally convergent for general functions under the strong Wolfe conditions. Experimental results provide evidence that our proposed method is in general superior to the classical conjugate gradient methods and has the potential to significantly enhance the computational efficiency and robustness of the training process.

1. Introduction

Learning systems, such as multilayer feedforward neural networks (FNN), are parallel computational models comprised of densely interconnected, adaptive processing units, characterized by an inherent propensity for learning from experience and also discovering new knowledge. Due to their excellent capability of self-learning and self-adapting, they have been successfully applied in many areas of artificial intelligence [1–5] and are often found to be more efficient and accurate than other classification techniques [6]. The operation of an FNN is usually based on the following equations:

$$\mathrm{net}_j^l = \sum_{i=1}^{N_{l-1}} w_{ij}^{l-1,l}\, y_i^{l-1} + b_j^l, \qquad y_j^l = f\big(\mathrm{net}_j^l\big), \tag{1}$$

where $\mathrm{net}_j^l$ is the sum of the weighted inputs of the $j$th node in the $l$th layer ($j = 1, \ldots, N_l$), $w_{ij}^{l-1,l}$ are the weights from the $i$th neuron at the $(l-1)$th layer to the $j$th neuron at the $l$th layer, $b_j^l$ is the bias of the $j$th neuron at the $l$th layer, $y_j^l$ is the output of the $j$th neuron of the $l$th layer, and $f(\mathrm{net}_j^l)$ is the activation function of the $j$th neuron.
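To make the forward computation in (1) concrete, the following minimal sketch (ours, not the authors' Matlab code) propagates an input through a fully connected network with a logistic sigmoid activation; the 4-5-3 layer sizes and all function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(net):
    """Logistic activation f(net), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-net))

def forward(weights, biases, x):
    """Forward pass of equation (1): net^l_j = sum_i w_{ij} y^{l-1}_i + b^l_j, y^l_j = f(net^l_j)."""
    y = x
    for W, b in zip(weights, biases):        # W has shape (N_{l-1}, N_l)
        net = y @ W + b                      # weighted inputs of layer l
        y = sigmoid(net)                     # layer outputs
    return y

# Illustrative 4-5-3 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 5)), rng.standard_normal((5, 3))]
biases = [np.zeros(5), np.zeros(3)]
print(forward(weights, biases, rng.standard_normal(4)))
```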

The problem of training a neural network is to iteratively adjust its weights in order to globally minimize a measure of the difference between the actual output of the network and the desired output over all examples of the training set [7]. More formally, the training process can be formulated as the minimization of the error function $E(w)$, defined as the sum of squared differences between the actual outputs of the FNN, denoted by $y_{j,p}^L$, and the desired outputs, denoted by $t_{j,p}$, namely,

$$E(w) = \sum_{p=1}^{P} \sum_{j=1}^{N_L} \big( y_{j,p}^L - t_{j,p} \big)^2, \tag{2}$$

where $w \in \mathbb{R}^n$ is the vector of network weights and $P$ is the number of patterns in the training set.
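The sketch below (an illustration of ours, not the paper's code) evaluates the error (2) for a one-hidden-layer network whose weights and biases are packed into a single vector w, which is the form the training iteration below operates on; the packing order and the 4-6-3 sizes are our own choices.

```python
import numpy as np

def unpack(w, n_in, n_hid, n_out):
    """Split the flat weight vector w into weight matrices and bias vectors (illustrative layout)."""
    i = 0
    W1 = w[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = w[i:i + n_hid]; i += n_hid
    W2 = w[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = w[i:i + n_out]
    return W1, b1, W2, b2

def error(w, X, T, n_hid):
    """Sum-of-squared-differences error E(w) of equation (2) over all P patterns."""
    W1, b1, W2, b2 = unpack(w, X.shape[1], n_hid, T.shape[1])
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # hidden layer outputs
    Y = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))    # network outputs y^L_{j,p}
    return np.sum((Y - T) ** 2)

# Tiny illustrative data set: 10 patterns, 4 inputs, 3 target outputs, 4-6-3 network.
rng = np.random.default_rng(1)
X, T = rng.standard_normal((10, 4)), rng.random((10, 3))
n_weights = 4 * 6 + 6 + 6 * 3 + 3
print(error(rng.standard_normal(n_weights), X, T, n_hid=6))
```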

Conjugate gradient methods are probably the best-known iterative methods for efficiently training neural networks due to their simplicity, numerical efficiency, and very low memory requirements. These methods generate a sequence of weights $\{w_k\}$ using the iterative formula

$$w_{k+1} = w_k + \eta_k d_k, \quad k = 0, 1, \ldots, \tag{3}$$

where $k$ is the current iteration (usually called an epoch), $w_0 \in \mathbb{R}^n$ is a given initial point, $\eta_k > 0$ is the learning rate, and $d_k$ is a descent search direction defined by

$$d_k = \begin{cases} -g_0, & \text{if } k = 0, \\ -g_k + \beta_k d_{k-1}, & \text{otherwise}, \end{cases} \tag{4}$$

where $g_k$ is the gradient of $E$ at $w_k$ and $\beta_k$ is a scalar. Several choices for $\beta_k$ have been proposed in the literature, giving rise to distinct conjugate gradient methods. The most well-known conjugate gradient methods include the Fletcher-Reeves (FR) method [8], the Hestenes-Stiefel (HS) method [9], and the Polak-Ribière (PR) method [10]. The update parameters of these methods are, respectively, specified as follows:

$$\beta_k^{\mathrm{HS}} = \frac{g_k^T y_{k-1}}{y_{k-1}^T d_{k-1}}, \qquad \beta_k^{\mathrm{FR}} = \frac{\|g_k\|^2}{\|g_{k-1}\|^2}, \qquad \beta_k^{\mathrm{PR}} = \frac{g_k^T y_{k-1}}{\|g_{k-1}\|^2}, \tag{5}$$

where $s_{k-1} = w_k - w_{k-1}$, $y_{k-1} = g_k - g_{k-1}$, and $\|\cdot\|$ denotes the Euclidean norm.
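As a plain transcription of (4)-(5) (our illustration, with a small division-by-zero guard added), the classical update parameters and the search direction can be computed as follows; the PR+ safeguard discussed next is noted in a comment.

```python
import numpy as np

def beta_classical(g, g_prev, d_prev, rule="PR", eps=1e-12):
    """Update parameters of equation (5): Hestenes-Stiefel, Fletcher-Reeves, Polak-Ribiere."""
    y_prev = g - g_prev                                # y_{k-1} = g_k - g_{k-1}
    if rule == "HS":
        return (g @ y_prev) / (y_prev @ d_prev + eps)
    if rule == "FR":
        return (g @ g) / (g_prev @ g_prev + eps)
    if rule == "PR":
        return (g @ y_prev) / (g_prev @ g_prev + eps)
    raise ValueError(f"unknown rule: {rule}")

def direction(g, d_prev=None, beta=0.0):
    """Search direction of equation (4): steepest descent at k = 0, conjugate direction afterwards."""
    return -g if d_prev is None else -g + beta * d_prev

# The PR+ method of Gilbert and Nocedal simply clamps the PR parameter: beta = max(beta_PR, 0).
```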

The PR method behaves like the HS method in practical computations and is generally believed to be one of the most efficient conjugate gradient methods. However, despite its practical advantages, this method has the major drawback of not being globally convergent for general functions; as a result, it may cycle infinitely without making any substantial progress [11]. To rectify the convergence failure of the PR method, Gilbert and Nocedal [12], motivated by Powell's work [13], proposed to restrict the update parameter $\beta_k$ to be nonnegative, namely, $\beta_k^{\mathrm{PR+}} = \max\{\beta_k^{\mathrm{PR}}, 0\}$. The authors conducted an elegant analysis of this conjugate gradient method (PR+) and established that it is globally convergent under strong assumptions. Moreover, although the PR and PR+ methods usually perform better than the other conjugate gradient methods, they cannot be guaranteed to generate descent directions; hence restarts are employed in order to guarantee convergence. Nevertheless, a concern with restart algorithms is that their restarts may be triggered too often, thus degrading the overall efficiency and robustness of the minimization process [14].

During the last decade, much effort has been devoted to developing new conjugate gradient methods which are not only globally convergent for general functions but also computationally superior to the classical methods; these methods fall into two classes. The first class utilizes second-order information to accelerate conjugate gradient methods by means of new secant equations (see [15–18]). Sample works include the nonlinear conjugate gradient methods proposed by Zhang et al. [19–21], which are based on the MBFGS secant equation [15]. Ford et al. [22] proposed a multistep conjugate gradient method based on the multistep quasi-Newton methods proposed in [16, 17]. More recently, Yabe and Takano [23] and Li et al. [18] proposed conjugate gradient methods based on modified secant equations that use both gradient and function values to approximate the curvature with a higher order of accuracy. Under proper conditions, these methods are globally convergent and their numerical performance is sometimes superior to that of classical conjugate gradient methods. However, these methods do not ensure the generation of descent directions; therefore, the descent condition is usually assumed in their analysis and implementations.

The second class aims at developing conjugate gradient methods which generate descent directions, in order to avoid the usually inefficient restarts. On the basis of this idea, Zhang et al. [20, 24–26] modified the search direction in order to ensure sufficient descent, that is, $d_k^T g_k = -\|g_k\|^2$, independent of the performed line search. Independently, Hager and Zhang [27] modified the parameter $\beta_k$ and proposed a new descent conjugate gradient method, called the CG-DESCENT method. More analytically, they proposed a modification of the Hestenes-Stiefel formula $\beta_k^{\mathrm{HS}}$ in the following way:

$$\beta_k^{\mathrm{HZ}} = \beta_k^{\mathrm{HS}} - 2 \frac{\|y_{k-1}\|^2}{\big(d_{k-1}^T y_{k-1}\big)^2}\, g_k^T d_{k-1}. \tag{6}$$

Along this line, Yuan [28], based on [12, 27, 29], proposed a modified PR method, that is,

$$\beta_k^{\mathrm{DPR+}} = \beta_k^{\mathrm{PR}} - \min\left\{ \beta_k^{\mathrm{PR}},\; C \frac{\|y_{k-1}\|^2}{\|g_{k-1}\|^4}\, g_k^T d_{k-1} \right\}, \tag{7}$$

where $C$ is a parameter which essentially controls the relative weight between conjugacy and descent; in case $C > 1/4$, the above formula satisfies $g_k^T d_k \le -(1 - 1/(4C)) \|g_k\|^2$. An important feature of this method is that it is globally convergent for general functions. Recently, Livieris et al. [30–32], motivated by the previous works, presented some descent conjugate gradient training algorithms with promising results. Based on their numerical experiments, the authors concluded that the sufficient descent property led to a significant improvement of the training process.
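The two descent-oriented parameters (6) and (7) can be transcribed directly, as in the sketch below (ours; the eps safeguard and the default C = 1 are illustrative assumptions, with C > 1/4 as required for the descent bound).

```python
import numpy as np

def beta_HZ(g, g_prev, d_prev, eps=1e-12):
    """CG-DESCENT parameter of equation (6)."""
    y = g - g_prev
    beta_HS = (g @ y) / (y @ d_prev + eps)
    return beta_HS - 2.0 * (y @ y) / ((y @ d_prev) ** 2 + eps) * (g @ d_prev)

def beta_DPR_plus(g, g_prev, d_prev, C=1.0, eps=1e-12):
    """Modified PR parameter of equation (7); C > 1/4 yields the sufficient descent bound."""
    y = g - g_prev
    gn2 = g_prev @ g_prev + eps
    beta_PR = (g @ y) / gn2
    return beta_PR - min(beta_PR, C * (y @ y) / gn2 ** 2 * (g @ d_prev))
```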

In this paper, we propose a new conjugate gradient training algorithm which combines the characteristics of the two classes presented above. Our method ensures sufficient descent independent of the accuracy of the line search, thereby avoiding the usually inefficient restarts. Moreover, it achieves a high-order accuracy in approximating the second-order curvature information of the error surface by utilizing the modified secant condition proposed in [18]. Under mild conditions, we establish the global convergence of our proposed method.

The remainder of this paper is organized as follows. In Section 2, we present our proposed conjugate gradient training algorithm, and in Section 3 we present its global convergence analysis. The experimental results are reported in Section 4 using the performance profiles of Dolan and Moré [33]. Finally, Section 5 presents our concluding remarks.

2. Modified Polak-Ribière+ Conjugate Gradient Algorithm

Firstly, we recall that in quasi-Newton methods, an approximation matrix $B_{k-1}$ to the Hessian $\nabla^2 E(w_{k-1})$ of a nonlinear function $E$ is updated so that the new matrix $B_k$ satisfies the following secant condition:

$$B_k s_{k-1} = y_{k-1}. \tag{8}$$

Obviously, only two gradients are exploited in the secant equation (8), while the available function values are neglected. Recently, Li et al. [18] proposed a conjugate gradient method based on the modified secant condition

$$B_k s_{k-1} = \tilde{y}_{k-1}, \qquad \tilde{y}_{k-1} = y_{k-1} + \frac{\max\{\theta_{k-1}, 0\}}{\|s_{k-1}\|^2}\, s_{k-1}, \tag{9}$$

where $\theta_{k-1}$ is defined by

$$\theta_{k-1} = 2\big(E_{k-1} - E_k\big) + \big(g_k + g_{k-1}\big)^T s_{k-1}, \tag{10}$$

and $E_k$ denotes $E(w_k)$. The authors proved that this new secant equation (9) is superior to the classical one (8) in the sense that $\tilde{y}_{k-1}$ better approximates $\nabla^2 E(w_k) s_{k-1}$ than $y_{k-1}$ does (see [18]).
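The modified secant quantities in (9)-(10) only require the two most recent iterates, gradients, and error values; the helper below (the names and the eps guard are our own) returns them.

```python
import numpy as np

def modified_secant(w, w_prev, g, g_prev, E, E_prev, eps=1e-12):
    """Return s_{k-1}, y_{k-1}, and the modified difference ytilde_{k-1} of equations (9)-(10)."""
    s = w - w_prev                                    # s_{k-1} = w_k - w_{k-1}
    y = g - g_prev                                    # y_{k-1} = g_k - g_{k-1}
    theta = 2.0 * (E_prev - E) + (g + g_prev) @ s     # theta_{k-1}, equation (10)
    ytilde = y + max(theta, 0.0) / (s @ s + eps) * s  # equation (9)
    return s, y, ytilde
```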

Motivated by the theoretical advantages of this modified secant condition (9), we propose a modification of formula (7) in the following way:

$$\beta_k^{\mathrm{MPR+}} = \frac{g_k^T \tilde{y}_{k-1}}{\|g_{k-1}\|^2} - \min\left\{ \frac{g_k^T \tilde{y}_{k-1}}{\|g_{k-1}\|^2},\; C \frac{\|\tilde{y}_{k-1}\|^2}{\|g_{k-1}\|^4}\, g_k^T d_{k-1} \right\}, \tag{11}$$

with $C > 1/4$. It is easy to see from (4) and (11) that our proposed formula $\beta_k^{\mathrm{MPR+}}$ satisfies the sufficient descent condition

$$g_k^T d_k \le -\left(1 - \frac{1}{4C}\right) \|g_k\|^2, \tag{12}$$

independent of the line search used.
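Combining the helper above with (11) gives the sketch below (ours, with an illustrative C = 1 and eps guard); the final assertion simply checks the sufficient descent inequality (12) numerically on random vectors, since it holds for any data and any line search.

```python
import numpy as np

def beta_MPR_plus(g, g_prev, d_prev, ytilde, C=1.0, eps=1e-12):
    """Proposed update parameter of equation (11); requires C > 1/4."""
    gn2 = g_prev @ g_prev + eps
    t = (g @ ytilde) / gn2
    return t - min(t, C * (ytilde @ ytilde) / gn2 ** 2 * (g @ d_prev))

# Numerical check of the sufficient descent condition (12) on random data.
rng = np.random.default_rng(2)
C = 1.0
g, g_prev, d_prev, ytilde = rng.standard_normal((4, 20))
d = -g + beta_MPR_plus(g, g_prev, d_prev, ytilde, C) * d_prev
assert g @ d <= -(1.0 - 1.0 / (4.0 * C)) * (g @ g) + 1e-8
```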

At this point, we present a high-level description of our proposed algorithm, called the modified Polak-Ribière+ conjugate gradient algorithm (MPR+-CG).

Algorithm 1 (modified Polak-Ribière+ conjugate gradient algorithm).
Step 1. Initiate $w_0$, $0 < \sigma_1 < \sigma_2 < 1$, the error goal $E_G$, and $k_{\mathrm{MAX}}$; set $k = 0$.
Step 2. Calculate the error function value $E_k$ and its gradient $g_k$.
Step 3. If $E_k < E_G$, return $w^* = w_k$ and $E^* = E_k$.
Step 4. If $g_k = 0$, return "Error goal not met".
Step 5. Compute the descent direction $d_k$ using (4) and (11).
Step 6. Compute the learning rate $\eta_k$ using the strong Wolfe line search conditions
$$E\big(w_k + \eta_k d_k\big) - E\big(w_k\big) \le \sigma_1 \eta_k\, g_k^T d_k, \tag{13}$$
$$\big| g\big(w_k + \eta_k d_k\big)^T d_k \big| \le \sigma_2 \big| g_k^T d_k \big|. \tag{14}$$
Step 7. Update the weights
$$w_{k+1} = w_k + \eta_k d_k \tag{15}$$
and set $k = k + 1$.
Step 8. If $k > k_{\mathrm{MAX}}$, return "Error goal not met"; else go to Step 2.
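For illustration, here is a compact end-to-end sketch of Algorithm 1 on a generic differentiable objective. It is our reconstruction, not the authors' Matlab implementation: a strictly convex quadratic stands in for the network error E, and SciPy's strong Wolfe line search stands in for the CONMIN search used in the paper, with its c1 and c2 parameters playing the roles of sigma1 and sigma2.

```python
import numpy as np
from scipy.optimize import line_search

def mpr_plus_cg(E, grad, w0, E_goal=1e-8, k_max=500, C=1.0, sigma1=1e-4, sigma2=0.5):
    """Sketch of Algorithm 1 (MPR+-CG); returns the final weights and error value."""
    w, g, Ek = w0.copy(), grad(w0), E(w0)             # Step 2 (initial evaluation)
    d = -g                                            # first direction of (4)
    for k in range(k_max):                            # Step 8 iteration limit
        if Ek < E_goal:                               # Step 3: error goal met
            break
        if np.linalg.norm(g) == 0.0:                  # Step 4: stationary point reached
            break
        # Step 6: strong Wolfe line search (13)-(14) with sigma1, sigma2.
        eta = line_search(E, grad, w, d, gfk=g, old_fval=Ek, c1=sigma1, c2=sigma2)[0]
        if eta is None:                               # line search failure; small fallback step
            eta = 1e-3
        w_new = w + eta * d                           # Step 7: update (15)
        g_new, E_new = grad(w_new), E(w_new)          # Step 2 for the next iteration
        # Step 5 for the next iteration: beta^{MPR+} of (11) with the modified secant (9)-(10).
        s, y = w_new - w, g_new - g
        theta = 2.0 * (Ek - E_new) + (g_new + g) @ s
        ytilde = y + max(theta, 0.0) / (s @ s + 1e-12) * s
        gn2 = (g @ g) + 1e-12
        t = (g_new @ ytilde) / gn2
        beta = t - min(t, C * (ytilde @ ytilde) / gn2 ** 2 * (g_new @ d))
        d = -g_new + beta * d
        w, g, Ek = w_new, g_new, E_new
    return w, Ek

# Toy stand-in objective: a strictly convex quadratic in place of the network error.
A = np.diag(np.arange(1.0, 21.0))
E = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w
w_final, E_final = mpr_plus_cg(E, grad, np.ones(20))
print(E_final)
```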

3. Global Convergence Analysis

In order to establish the global convergence result for our proposed method, we will impose the following assumptions on the error function šø.

Assumption 1. The level set $\mathcal{L} = \{ w \in \mathbb{R}^n \mid E(w) \le E(w_0) \}$ is bounded.

Assumption 2. In some neighborhood $\mathcal{N}$ of $\mathcal{L}$, $E$ is differentiable and its gradient $g$ is Lipschitz continuous, namely, there exists a positive constant $L > 0$ such that

$$\| g(w) - g(\tilde{w}) \| \le L \| w - \tilde{w} \|, \quad \forall w, \tilde{w} \in \mathcal{N}. \tag{16}$$

Since $\{E(w_k)\}$ is a decreasing sequence, it is clear that the sequence $\{w_k\}$ is contained in $\mathcal{L}$. In addition, it follows directly from Assumptions 1 and 2 that there exist positive constants $B$ and $M$ such that

$$\| w - \tilde{w} \| \le B, \quad \forall w, \tilde{w} \in \mathcal{L}, \tag{17}$$

$$\| g(w) \| \le M, \quad \forall w \in \mathcal{L}. \tag{18}$$

Furthermore, notice that since the error function $E$ is bounded below in $\mathbb{R}^n$ by zero, is differentiable, and has a Lipschitz continuous gradient [34], Assumptions 1 and 2 always hold.

The following lemma is very useful for the global convergence analysis.

Lemma 2 (see [18]). Suppose that Assumptions 1 and 2 hold and that the line search satisfies the strong Wolfe conditions (13) and (14). For $\theta_k$ and $\tilde{y}_k$ defined in (10) and (9), respectively, one has

$$|\theta_k| \le L \|s_k\|^2, \qquad \|\tilde{y}_k\| \le 2 L \|s_k\|. \tag{19}$$
Subsequently, we will establish the global convergence of Algorithm MPR+-CG for general functions. Firstly, we present a lemma showing that Algorithm MPR+-CG prevents the inefficient jamming phenomenon [35] from occurring. This property is similar to, but slightly different from, Property (*), which was derived by Gilbert and Nocedal [12].

Lemma 3. Suppose that Assumptions 1 and 2 hold. Let $\{w_k\}$ and $\{d_k\}$ be generated by Algorithm MPR+-CG. If there exists a positive constant $\mu > 0$ such that

$$\|g_k\| \ge \mu, \quad \forall k \ge 0, \tag{20}$$

then there exist constants $b > 1$ and $\lambda > 0$ such that

$$\big| \beta_k^{\mathrm{MPR+}} \big| \le b, \tag{21}$$

$$\|s_{k-1}\| \le \lambda \;\Longrightarrow\; \big| \beta_k^{\mathrm{MPR+}} \big| \le \frac{1}{b}. \tag{22}$$

Proof. Utilizing Lemma 2 together with Assumption 2 and relations (12), (14), (17), (18), and (20), we have

$$\big|\beta_k^{\mathrm{MPR+}}\big| \le \frac{\big|g_k^T \tilde{y}_{k-1}\big|}{\|g_{k-1}\|^2} + C\,\frac{\|\tilde{y}_{k-1}\|^2}{\|g_{k-1}\|^4}\,\big|g_k^T d_{k-1}\big| \le \frac{\|g_k\|\,\|\tilde{y}_{k-1}\|}{\|g_{k-1}\|^2} + C\,\frac{\|\tilde{y}_{k-1}\|^2\,\sigma_2\,\big|g_{k-1}^T d_{k-1}\big|}{\|g_{k-1}\|^4} \le \frac{2ML\|s_{k-1}\|}{\|g_{k-1}\|^2} + C\,\frac{4L^2\|s_{k-1}\|^2}{\|g_{k-1}\|^4}\,\sigma_2\left(1-\frac{1}{4C}\right)\|g_{k-1}\|^2 \le \left(\frac{2ML + 4L^2 B C \sigma_2 \left(1-\frac{1}{4C}\right)}{\mu^2}\right)\|s_{k-1}\| \triangleq D\,\|s_{k-1}\|. \tag{23}$$

Therefore, by setting $b := \max\{2,\, 2DB\}$ and $\lambda := 1/(Db)$, relations (21) and (22) hold. The proof is completed.

Subsequently, we present a lemma which shows that, asymptotically, the search directions $d_k$ change slowly. This lemma corresponds to Lemma 4.1 of Gilbert and Nocedal [12], and since its proof is exactly the same as that of Lemma 4.1 in [12], we omit it.

Lemma 4. Suppose that Assumptions 1 and 2 hold. Let $\{w_k\}$ and $\{d_k\}$ be generated by Algorithm MPR+-CG. If there exists a positive constant $\mu > 0$ such that (20) holds, then $d_k \ne 0$ and

$$\sum_{k \ge 1} \| u_k - u_{k-1} \|^2 < \infty, \tag{24}$$

where $u_k = d_k / \|d_k\|$.

Next, by making use of Lemmas 3 and 4, we establish the global convergence theorem for Algorithm MPR+-CG under the strong Wolfe line search.

Theorem 5. Suppose that Assumptions 1 and 2 hold. If $\{w_k\}$ is obtained by Algorithm MPR+-CG, where the line search satisfies the strong Wolfe conditions (13) and (14), then one has

$$\liminf_{k \to \infty} \|g_k\| = 0. \tag{25}$$

Proof. We proceed by contradiction. Suppose that there exists a positive constant $\mu > 0$ such that, for all $k \ge 0$,

$$\|g_k\| \ge \mu. \tag{26}$$

The proof is divided into the following two steps.
Step I (a bound on the steps $s_k$). Let $\Delta$ be a positive integer, chosen large enough that

$$\Delta \ge 4 B D, \tag{27}$$

where $B$ and $D$ are defined in (17) and (23), respectively. For any $l > k \ge k_0$ with $l - k \le \Delta$, following the same proof as case II of Theorem 3.2 in [27], we get

$$\sum_{j=k}^{l-1} \|s_j\| < 2 B. \tag{28}$$
Step II (a bound on the search directions $d_l$). From the definition of $d_k$ in (4), together with (18) and (23), we obtain

$$\|d_l\|^2 \le \|g_l\|^2 + \big| \beta_l^{\mathrm{MPR+}} \big|^2 \|d_{l-1}\|^2 \le M^2 + D^2 \|s_{l-1}\|^2 \|d_{l-1}\|^2. \tag{29}$$
Now, the remaining argument proceeds in the same way as case III of Theorem 3.2 in [27]; thus we omit it. This completes the proof.

4. Experimental Results

In this section, we present experimental results in order to evaluate the performance of our proposed conjugate gradient algorithm MPR+-CG on five well-known classification problems acquired from the UCI Repository of Machine Learning Databases [36]: the iris problem, the diabetes problem, the sonar problem, the yeast problem, and the Escherichia coli problem.

The implementation code was written in Matlab 6.5 and run on a Pentium IV computer (2.4 GHz, 512 MB RAM) under the Windows XP operating system, based on the SCG code of Birgin and Martínez [37]. All methods were implemented with the line search proposed in CONMIN [38], which employs various polynomial interpolation schemes and safeguards in satisfying the strong Wolfe line search conditions. The heuristic parameters were set to $\sigma_1 = 10^{-4}$ and $\sigma_2 = 0.5$ as in [30, 39]. All networks received the same sequence of input patterns, and the initial weights were generated using the Nguyen-Widrow method [40]. For evaluating classification accuracy, we have used the standard procedure called $k$-fold cross-validation [41]. The results have been averaged over 500 simulations.
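For readers who want to reproduce the evaluation protocol, the fragment below sketches k-fold cross-validation with scikit-learn's KFold splitter; train_fn and accuracy_fn are hypothetical placeholders for training a network with MPR+-CG and measuring its test accuracy, and the 10 folds match the setting used for most of the problems below.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_accuracy(train_fn, accuracy_fn, X, T, n_splits=10, seed=0):
    """Average test accuracy over a k-fold split; train_fn and accuracy_fn are placeholders."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = train_fn(X[train_idx], T[train_idx])                  # train on k-1 folds
        scores.append(accuracy_fn(model, X[test_idx], T[test_idx]))   # evaluate on the held-out fold
    return float(np.mean(scores))
```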

4.1. Training Performance

The cumulative total of a performance metric over all simulations does not seem to be very informative, since a small number of simulations can dominate these results. For this reason, we use the performance profiles proposed by Dolan and Moré [33] to present perhaps the most complete information in terms of robustness, efficiency, and solution quality. The performance profile plots the fraction $P$ of simulations for which any given method is within a factor $\tau$ of the best training method. The horizontal axis of each plot shows the percentage of the simulations for which a method is the fastest (efficiency), while the vertical axis gives the percentage of the simulations in which the neural networks were successfully trained by each method (robustness). The reported performance profiles have been created using the Libopt environment [42], measuring the efficiency and the robustness of our method in terms of computational time (CPU time) and function/gradient evaluations (FE/GE). The curves in the following figures have the following meaning:
(i) "PR" stands for the Polak-Ribière conjugate gradient method.
(ii) "PR+" stands for the Polak-Ribière+ conjugate gradient method.
(iii) "MPR+" stands for Algorithm MPR+-CG.
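A performance profile in the sense of Dolan and Moré can be computed from a matrix of per-simulation costs (e.g., CPU time or FE/GE counts), as in the sketch below; the data here is random and purely illustrative, and encoding failed runs as np.inf is our own convention.

```python
import numpy as np

def performance_profile(costs, taus):
    """Dolan-More profile: fraction of problems each solver solves within a factor tau of the best.

    costs has shape (n_solvers, n_problems); np.inf marks a failed run.
    """
    best = np.min(costs, axis=0)                      # best cost per problem over all solvers
    ratios = costs / best                             # performance ratios r_{p,s}
    # rho_s(tau) = fraction of problems with ratio <= tau.
    return np.array([[np.mean(r <= tau) for tau in taus] for r in ratios])

# Illustrative use: 3 solvers, 100 simulated problems, random costs with some failures.
rng = np.random.default_rng(3)
costs = rng.exponential(1.0, size=(3, 100)) + 0.1
costs[2, rng.random(100) < 0.05] = np.inf             # 5% failures for the third solver
profiles = performance_profile(costs, taus=np.linspace(1.0, 5.0, 50))
print(profiles[:, -1])                                # robustness: success rate at the largest tau
```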

4.1.1. Iris Classification Problem

This benchmark is perhaps the best known in the pattern-recognition literature [36]. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The network architecture consists of 1 hidden layer with 7 neurons and an output layer of 3 neurons. The training goal was set to $E_G \le 0.01$ within a limit of 1000 epochs, and all networks were tested using 10-fold cross-validation [30].

Figure 1 presents the performance profiles for the iris classification problem, regarding both performance metrics. MPR+ exhibits the best performance in terms of efficiency and robustness, significantly outperforming the classical training methods PR and PR+. Furthermore, the performance profiles show that MPR+ is the only method reporting an excellent (100%) probability of being the optimal training method.

4.1.2. Diabetes Classification Problem

The aim of this real-world classification task is to decide whether a Pima Indian female is diabetes positive or not. The data set of this benchmark consists of 768 different patterns, each of which has 8 real-valued features and a class label (diabetes positive or not). We have used neural networks with 2 hidden layers of 4 neurons each and an output layer of 2 neurons [43]. The training goal was set to $E_G < 0.14$ within a limit of 2000 epochs, and all networks were tested using 10-fold cross-validation [44].

Figure 2 illustrates the performance profiles for the diabetes classification problem, investigating the efficiency and robustness of each training method. Clearly, our proposed method MPR+ significantly outperforms the conjugate gradient methods PR and PR+, since the curves of the former lie above the curves of the latter, regarding both performance metrics. More analytically, the performance profiles show that the probability of MPR+ successfully training a neural network within a factor 3.41 of the best solver is 100%, in contrast with PR and PR+, which have probabilities of 84.3% and 85%, respectively.

4.1.3. Sonar Classification Problem

This is the data set used by Gorman and Sejnowski [45] in their study of the classification of sonar signals using a neural network. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. The network architecture for this problem consists of 1 hidden layer of 24 neurons and an output layer of 2 neurons [45]. The training goal was set to $E_G = 0.1$ within a limit of 1000 epochs, and all networks were tested using 3-fold cross-validation [45].

Figure 3 presents the performance profiles for the sonar classification problem, relative to both performance metrics. Our proposed conjugate gradient method MPR+ presents the highest probability of being the optimal training method. Furthermore, MPR+ significantly outperforms PR and is slightly more robust than PR+, regarding both performance metrics.

4.1.4. Yeast Classification Problem

This problem is based on a drastically imbalanced data set and concerns the determination of the cellular localization of yeast proteins into ten localization sites. Saccharomyces cerevisiae (yeast) is the simplest eukaryotic organism. The network architecture for this classification problem consists of 1 hidden layer of 16 neurons and an output layer of 10 neurons [46]. The training goal was set to $E_G < 0.05$ within a limit of 2000 epochs, and all networks were tested using 10-fold cross-validation [47].

Figure 4 presents the performance profiles for the yeast classification problem, regarding both performance metrics, and highlights that our proposed conjugate gradient method MPR+ is the only method exhibiting an excellent (100%) probability of successful training. Moreover, it is worth noticing that PR and PR+ report very poor performance, exhibiting 0% and 5% probabilities of successful training, respectively, in contrast with our proposed method MPR+, which successfully trained all neural networks.

4.1.5. Escherichia coli Classification Problem

This problem is based on a drastically imbalanced data set of 336 patterns and concerns the classification of E. coli protein localization patterns into eight localization sites. E. coli, being a prokaryotic gram-negative bacterium, is an important component of the biosphere. Three major and distinctive types of proteins are characterized in E. coli: enzymes, transporters, and regulators. The largest group of genes encodes enzymes (34%, including all cytoplasm proteins), followed by the genes for transport functions and the genes for regulatory processes (11.5%) [48]. The network architecture consists of 1 hidden layer with 16 neurons and an output layer of 8 neurons [46]. The training goal was set to $E_G \le 0.02$ within a limit of 2000 epochs, and all neural networks were tested using 4-fold cross-validation [47].

Figure 5 presents the performance profiles for the Escherichia coli classification problem. Similar observations can be made as with the previous benchmarks. More specifically, MPR+ significantly outperforms the classical training methods PR and PR+, since the curves of the former lie above the curves of the latter, regarding both performance metrics. Moreover, the performance profiles show that MPR+ is the only method reporting an excellent (100%) probability of being the optimal training method.

4.2. Generalization Performance

Table 1 summarizes the generalization results of the PR, PR+, and MPR+ conjugate gradient methods, measured by the percentage of testing patterns that were classified correctly for the presented classification problems. Each entry reports the average performance in percent, and the best conjugate gradient method for each problem is illustrated in boldface. Moreover, "—" means that the method reported 0% training success.


Table 1: Generalization performance on the classification problems.

Method   Iris    Diabetes   Sonar   Yeast   E. coli
PR       95.0%   76.1%      75.9%   —       95.8%
PR+      95.3%   76.1%      75.6%   92.0%   96.0%
MPR+     98.1%   76.4%      75.9%   92.5%   96.0%

Table 1 shows that MPR+ is an excellent generalizer, since it attains the highest generalization performance in every classification problem, matching or outperforming the classical training methods PR and PR+.

5. Conclusions

In this paper, we proposed a conjugate gradient method for efficiently training neural networks. An attractive property of our proposed method is that it ensures sufficient descent, thereby avoiding the usually inefficient restarts. Furthermore, it achieves a high-order accuracy in approximating the second-order curvature information of the error surface by utilizing the modified secant equation proposed in [18]. Under mild conditions, we established that our proposed method is globally convergent. Based on our numerical experiments, we concluded that our proposed method outperforms the classical conjugate gradient training methods and has the potential to significantly enhance the computational efficiency and robustness of the training process.

References

1. C. M. Bishop, Neural Networks for Pattern Recognition, Oxford, UK, 1995.
2. S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College, New York, NY, USA, 1994.
3. A. Hmich, A. Badri, and A. Sahel, "Automatic speaker identification by using the neural network," in Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS '11), pp. 1–5, April 2011.
4. H. Takeuchi, Y. Terabayashi, K. Yamauchi, and N. Ishii, "Improvement of robot sensor by integrating information using neural network," International Journal on Artificial Intelligence Tools, vol. 12, no. 2, pp. 139–152, 2003.
5. C. H. Wu, H. L. Chen, and S. C. Chen, "Gene classification artificial neural system," International Journal on Artificial Intelligence Tools, vol. 4, no. 4, pp. 501–510, 1995.
6. B. Lerner, H. Guterman, M. Aladjem, and I. Dinstein, "A comparative study of neural network based feature extraction paradigms," Pattern Recognition Letters, vol. 20, no. 1, pp. 7–14, 1999.
7. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. Rumelhart and J. McClelland, Eds., pp. 318–362, Cambridge, Mass, USA, 1986.
8. R. Fletcher and C. M. Reeves, "Function minimization by conjugate gradients," Computer Journal, vol. 7, pp. 149–154, 1964.
9. M. R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems," Journal of Research of the National Bureau of Standards, vol. 49, no. 6, pp. 409–436, 1952.
10. E. Polak and G. Ribière, "Note sur la convergence de méthodes de directions conjuguées," Revue Française d'Informatique et de Recherche Opérationnelle, vol. 16, pp. 35–43, 1969.
11. M. J. D. Powell, "Nonconvex minimization calculations and the conjugate gradient method," in Numerical Analysis, vol. 1066 of Lecture Notes in Mathematics, pp. 122–141, Springer, Berlin, Germany, 1984.
12. J. C. Gilbert and J. Nocedal, "Global convergence properties of conjugate gradient methods for optimization," SIAM Journal on Optimization, vol. 2, no. 1, pp. 21–42, 1992.
13. M. J. D. Powell, "Convergence properties of algorithms for nonlinear optimization," SIAM Review, vol. 28, no. 4, pp. 487–500, 1986.
14. J. Nocedal, "Theory of algorithms for unconstrained optimization," Acta Numerica, vol. 1, pp. 199–242, 1992.
15. D. H. Li and M. Fukushima, "A modified BFGS method and its global convergence in nonconvex minimization," Journal of Computational and Applied Mathematics, vol. 129, no. 1-2, pp. 15–35, 2001.
16. J. A. Ford and I. A. Moghrabi, "Multi-step quasi-Newton methods for optimization," Journal of Computational and Applied Mathematics, vol. 50, no. 1-3, pp. 305–323, 1994.
17. J. A. Ford and I. A. Moghrabi, "Using function-values in multi-step quasi-Newton methods," Journal of Computational and Applied Mathematics, vol. 66, no. 1-2, pp. 201–211, 1996.
18. G. Li, C. Tang, and Z. Wei, "New conjugacy condition and related new conjugate gradient methods for unconstrained optimization," Journal of Computational and Applied Mathematics, vol. 202, no. 2, pp. 523–539, 2007.
19. L. Zhang, "Two modified Dai-Yuan nonlinear conjugate gradient methods," Numerical Algorithms, vol. 50, no. 1, pp. 1–16, 2009.
20. L. Zhang, W. Zhou, and D. Li, "Some descent three-term conjugate gradient methods and their global convergence," Optimization Methods and Software, vol. 22, no. 4, pp. 697–711, 2007.
21. W. Zhou and L. Zhang, "A nonlinear conjugate gradient method based on the MBFGS secant condition," Optimization Methods and Software, vol. 21, no. 5, pp. 707–714, 2006.
22. J. A. Ford, Y. Narushima, and H. Yabe, "Multi-step nonlinear conjugate gradient methods for unconstrained minimization," Computational Optimization and Applications, vol. 40, no. 2, pp. 191–216, 2008.
23. H. Yabe and M. Takano, "Global convergence properties of nonlinear conjugate gradient methods with modified secant condition," Computational Optimization and Applications, vol. 28, no. 2, pp. 203–225, 2004.
24. L. Zhang, W. Zhou, and D. Li, "Global convergence of a modified Fletcher-Reeves conjugate gradient method with Armijo-type line search," Numerische Mathematik, vol. 104, no. 4, pp. 561–572, 2006.
25. L. Zhang, W. Zhou, and D. Li, "A descent modified Polak-Ribière-Polyak conjugate gradient method and its global convergence," IMA Journal of Numerical Analysis, vol. 26, no. 4, pp. 629–640, 2006.
26. L. Zhang and W. Zhou, "Two descent hybrid conjugate gradient methods for optimization," Journal of Computational and Applied Mathematics, vol. 216, no. 1, pp. 251–264, 2008.
27. W. W. Hager and H. Zhang, "A new conjugate gradient method with guaranteed descent and an efficient line search," SIAM Journal on Optimization, vol. 16, no. 1, pp. 170–192, 2005.
28. G. Yuan, "Modified nonlinear conjugate gradient methods with sufficient descent property for large-scale optimization problems," Optimization Letters, vol. 3, no. 1, pp. 11–21, 2009.
29. G. H. Yu, Nonlinear self-scaling conjugate gradient methods for large-scale optimization problems, Ph.D. thesis, Sun Yat-Sen University, 2007.
30. I. E. Livieris and P. Pintelas, "Performance evaluation of descent CG methods for neural network training," in Proceedings of the 9th Hellenic European Research on Computer Mathematics & its Applications Conference (HERCMA '09), E. A. Lipitakis, Ed., pp. 40–46, 2009.
31. I. E. Livieris and P. Pintelas, "An improved spectral conjugate gradient neural network training algorithm," International Journal on Artificial Intelligence Tools, in press.
32. I. E. Livieris, D. G. Sotiropoulos, and P. Pintelas, "On descent spectral CG algorithms for training recurrent neural networks," in Proceedings of the 13th Panhellenic Conference on Informatics, pp. 65–69, 2009.
33. E. Dolan and J. J. Moré, "Benchmarking optimization software with performance profiles," Mathematical Programming, vol. 91, no. 2, pp. 201–213, 2002.
34. J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, Mass, USA, 1991.
35. M. J. D. Powell, "Restart procedures for the conjugate gradient method," Mathematical Programming, vol. 12, no. 1, pp. 241–254, 1977.
36. P. M. Murphy and D. W. Aha, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, Calif, USA, 1994.
37. E. G. Birgin and J. M. Martínez, "A spectral conjugate gradient method for unconstrained optimization," Applied Mathematics and Optimization, vol. 43, no. 2, pp. 117–128, 2001.
38. D. F. Shanno and K. H. Phua, "Minimization of unconstrained multivariate functions," ACM Transactions on Mathematical Software, vol. 2, pp. 87–94, 1976.
39. G. Yu, L. Guan, and W. Chen, "Spectral conjugate gradient methods with sufficient descent property for large-scale unconstrained optimization," Optimization Methods and Software, vol. 23, no. 2, pp. 275–293, 2008.
40. D. Nguyen and B. Widrow, "Improving the learning speed of 2-layer neural networks by choosing initial values of adaptive weights," Biological Cybernetics, vol. 59, pp. 71–113, 1990.
41. R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 223–228, AAAI Press and MIT Press, 1995.
42. J. C. Gilbert and X. Jonsson, "LIBOPT: an environment for testing solvers on heterogeneous collections of problems, version 1," CoRR, abs/cs/0703025, 2007.
43. L. Prechelt, "PROBEN1: a set of benchmarks and benchmarking rules for neural network training algorithms," Tech. Rep. 21/94, Fakultät für Informatik, University of Karlsruhe, 1994.
44. J. Yu, S. Wang, and L. Xi, "Evolving artificial neural networks using an improved PSO and DPSO," Neurocomputing, vol. 71, no. 4-6, pp. 1054–1060, 2008.
45. R. P. Gorman and T. J. Sejnowski, "Analysis of hidden units in a layered network trained to classify sonar targets," Neural Networks, vol. 1, no. 1, pp. 75–89, 1988.
46. A. D. Anastasiadis, G. D. Magoulas, and M. N. Vrahatis, "New globally convergent training scheme based on the resilient propagation algorithm," Neurocomputing, vol. 64, no. 1-4, pp. 253–270, 2005.
47. P. Horton and K. Nakai, "Better prediction of protein cellular localization sites with the k nearest neighbors classifier," Intelligent Systems for Molecular Biology, pp. 368–383, 1997.
48. P. Liang, B. Labedan, and M. Riley, "Physiological genomics of Escherichia coli protein families," Physiological Genomics, vol. 9, pp. 15–26, 2002.

Copyright © 2012 Ioannis E. Livieris and Panagiotis Pintelas. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

