Advances in Artificial Neural Systems
Volume 2015, Article ID 931379, 16 pages
Research Article

Stochastic Search Algorithms for Identification, Optimization, and Training of Artificial Neural Networks

Faculty of Management, 21000 Novi Sad, Serbia

Received 6 July 2014; Revised 19 November 2014; Accepted 19 November 2014

Academic Editor: Ozgur Kisi

Copyright © 2015 Kostantin P. Nikolic. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


This paper presents certain stochastic search algorithms (SSA) suitable for effective identification, optimization, and training of artificial neural networks (ANN). The author introduces a modified algorithm of nonlinear stochastic search (MN-SDS). Its basic objective is to improve the convergence of the nonlinear stochastic search (N-SDS) method as originally defined by Professor Rastrigin. Given the vast range of possible algorithms and procedures, the so-called method of stochastic direct search (SDS) has been used (in the literature it is also called stochastic local search, SLS). MN-SDS converges considerably better than N-SDS; moreover, it converges better than a range of gradient optimization procedures. SDS, that is, SLS, has not been used enough in identification, optimization, and training of ANN. Its efficiency in some cases of purely nonlinear systems makes it suitable for optimization and training of ANN. The presented examples illustrate only partially the operability and efficiency of SDS, that is, MN-SDS. The backpropagation error (BPE) method was used for comparison.

1. Introduction

The main aim of this paper is to present a specific variant of direct stochastic search (SS) and its application to identification and optimisation of linear and nonlinear objects or processes. The method of stochastic search was introduced by Ashby [1] in connection with the homeostat. Until the 1960s, Ashby’s homeostat was adopted mostly as a philosophical concept in cybernetics, used to explain the stability of rather complex systems subject to influences of a stochastic nature [2].

For quite a long time the stochastic direct search (SDS) was not recognised as an advanced competitive option. The research and development work of Professor Rastrigin and his associates established SS as a competitive method for solving various problems of identification and optimization of complex systems [3].

It has been shown that SDS algorithms are not only competitive but can even outperform well-known methods. The basis for comparison is the convergence achieved while solving the given task. Gradient methods were used for comparison in [4], and the SDS method showed a remarkable advantage. For systems with noise, certain numerical options are offered by the method of stochastic approximation (MSA) [5]. In some cases SDS procedures are more efficient than MSA [6].

During the last 20 years, considerable interest has been shown in advanced SDS, especially in cases where classical deterministic techniques do not apply. Direct SS algorithms are one part of the SSA family. Important topics in random search have been developed: theorems of global optimization, convergence theorems, and applications to complex control systems [7–9].

The author has used SDS algorithms in several published papers on identification of complex control systems [10], as well as on synthesis and training of artificial neural networks [11–13].

Experience in applying certain SDS algorithms of basic definition motivated the author to introduce the so-called modified nonlinear SDS (MN-SDS), applicable as a numerical method for identification and optimization of substantially nonlinear systems. The main reason is the rather slow convergence of N-SDS of basic definition, a deficiency that MN-SDS overcomes.

The application of SDS is efficient for both deterministic and stochastic descriptions of systems.

The SDS algorithm is characterized by the introduction of random variables. A practical source of these is a random number generator [14, 15].

The above is supported by well-developed modern computer hardware and software, which provide a suitable environment for creating and implementing SDS methods and procedures.

2. Method and Materials

2.1. Definition of SDS Method

Solving theoretical and/or practical problems usually requires first an identification task, followed by a final stage, that is, system optimization. The analysis and synthesis of systems always take this into account [16, 17].

SDS methods are among the competitive numerical procedures for identification and optimization of complex control systems [18], and of ANN as well. Let us start with an internal system description in general form [19]:where and are nonlinear vector functions; and are vector functions of constraints (variables and parameters); is system state vector, is control vector, and is vector of disturbance; are parameters describing the system structure such as constants, matrices, and vectors; is real time; is noise usually added in (2).

Parameter identification of the above system requires certain measurements of the system variables, observing the criteria function:

The criteria function in (5) does not involve the constraints and ; where the constraints cannot be avoided, is introduced [20]:where and are Lagrange multipliers and , a [20].

When and are rather large and tend to , both and tend to the same value for variables , that is, to the corresponding optimal parameters.

For simplicity of presentation, the function from (5) is used from here on and is treated as a function of one vector variable , so .

Optimization methods start from an iterative form in which the system moves from the current state into by the following rule:

So, is a function stepping into the new state, where is a step and is the vector function guiding the search. For iterative optimization by the gradient method [21]:

In the case of SDS, relation (7) takes the following form:where the direction of search is a function of and a random vector (Figure 1).

Figure 1: A random variable ; (a) on sphere radius , (b) on cube edge , and (c) on cube edge ; dimension .

If the following terms are introduced, the general expression (9) gives some of the basic SDS algorithms.

2.1.1. Nonlinear SDS
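
A minimal sketch of this basic scheme, assuming the commonly cited form of nonlinear random search: a random step of fixed length is tried, kept when the criteria function decreases, and discarded (return to the previous point) otherwise. The function name `n_sds`, the step `alpha`, the stopping tolerance, and the quadratic test function are illustrative assumptions, not the notation of relation (11).

```python
import numpy as np

def n_sds(q, x0, alpha=0.1, tol=1e-2, max_tests=50_000, seed=0):
    """Basic nonlinear random search: accept a random step only if Q decreases."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    qx = q(x)
    tests = 0
    while qx > tol and tests < max_tests:
        zeta = rng.standard_normal(x.size)
        zeta /= np.linalg.norm(zeta)        # random direction on the unit sphere
        candidate = x + alpha * zeta
        q_new = q(candidate)
        tests += 1
        if q_new < qx:                      # successful test: move to the new state
            x, qx = candidate, q_new
        # failed test: stay at x, which is equivalent to stepping back
    return x, qx, tests

def sphere(x):
    """Assumed test criteria function Q(x) = |x|^2 with optimum at the origin."""
    return float(np.dot(x, x))

x_opt, q_opt, n_tests = n_sds(sphere, x0=[2.0, -1.0, 1.5])
```

With a fixed step length the search cannot resolve the optimum more finely than about half a step, which is why the tolerance in this sketch is kept coarse relative to `alpha`.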


2.1.2. Linear SDS


Some of the more complex forms of SDS are as follows.

2.1.3. SDS Stochastic Gradient

Consider where is the number of tests before an effective step forward.
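
As a hedged illustration of the idea, the sketch below assumes the statistical-gradient form usually attributed to the random search literature: several small trial steps are made in random directions, each direction is weighted by the observed change of the criteria function, and the working step is taken against the normalized sum. The names `m`, `g`, and `alpha` (number of tests, trial step, working step) are assumptions made for this example.

```python
import numpy as np

def stochastic_gradient_direction(q, x, m=20, g=1e-3, rng=None):
    """Estimate a unit descent direction from m random trial steps of size g."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    q0 = q(x)
    acc = np.zeros_like(x)
    for _ in range(m):
        zeta = rng.standard_normal(x.size)
        zeta /= np.linalg.norm(zeta)
        dq = q(x + g * zeta) - q0           # change of Q along this trial
        acc += zeta * dq                    # accumulate weighted directions
    norm = np.linalg.norm(acc)
    return -acc / norm if norm > 0 else acc

def sg_sds_step(q, x, alpha=0.1, m=20, g=1e-3, rng=None):
    """One effective step after m tests."""
    x = np.asarray(x, dtype=float)
    return x + alpha * stochastic_gradient_direction(q, x, m, g, rng)
```

For a linear criteria function the accumulated vector always has a nonnegative projection on the true gradient, so the resulting step is a descent step regardless of the random draws.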

2.1.4. SDS Fastest Descent

The iterative procedure for SDS fastest descent is

A graphical presentation of SDS behavior in the gradient field is given in Figure 2.

Figure 2: SDS algorithms in gradient field.

The gradient method starts from point , linear SDS starts from point , nonlinear SDS starts from point , and SDS stochastic gradient starts from point (vector pairs marked with dashes correspond to failed tests). Gradient fastest descent is shown from starting point . SDS fastest descent is not shown in Figure 2 (it is similar to linear SDS from ).

The random vector is created on an -dimensional sphere , or on an -dimensional cube; is the system dimension in Euclidean space . This is shown in Figures 1(a), 1(b), and 1(c) for .
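
The two ways of generating the random vector can be sketched as follows. Normalizing a Gaussian vector gives a direction uniformly distributed on the sphere; for the cube, this sketch assumes independent uniform coordinates inside a cube centred at the origin. The function names and the `half_edge` parameter are illustrative assumptions.

```python
import numpy as np

def random_on_sphere(n, radius=1.0, rng=None):
    """Random vector uniformly distributed on an n-dimensional sphere."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(n)
    return radius * v / np.linalg.norm(v)

def random_in_cube(n, half_edge=1.0, rng=None):
    """Random vector with independent uniform coordinates in [-half_edge, half_edge]."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-half_edge, half_edge, size=n)

rng = np.random.default_rng(0)
zeta_sphere = random_on_sphere(3, radius=1.0, rng=rng)
zeta_cube = random_in_cube(3, half_edge=1.0, rng=rng)
```

The sphere variant yields steps of constant length, whereas the cube variant yields steps whose length varies between tests, which is one reason the dispersion of the search depends on how the vector is generated.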

The SSA properties used for ranking against other competitive methods (gradient, fastest descent, Gauss-Seidel, scanning, and others) are (i) local: describing the algorithm in the -step, (ii) nonlocal: describing the algorithm from the start to the final stage of optimization.

The main property is dissipation, that is, losses in one step of search and displacement on the hypersurface of the function . These properties describe the convergence of the algorithm, that is, of the method itself.

The convergence is defined by the expression:that is, the ratio of the mathematical expectation of the number of tests in the -iteration to the relative change of (criteria function ) [22].

The reciprocal value of relation (15) gives the average value of in the -step of searching.

The local SDS properties include the probability of a failed search step:where is the vector of the initial state and is the working step of state change in . The probability of a failed step is rather important when choosing an algorithm for real process optimisation.

So, choosing the optimal algorithm from the point of view of local properties leads to a compromise among three quantities:

Besides local properties, the relevant ones are properties describing the search as a whole [23], which are (i) total number of steps, that is, number of effective steps , (ii) accuracy, that is, the allowed error at which the procedure ends, or relative error in %.

The choice of SDS option is guided by the local properties indicated above as well as the nonlocal ones .

It should also be observed that the dispersion of could be useful in some statistical estimates of the properties of the search procedure. Note also that the dispersion depends on how the vector is generated (hypersphere or cube) [23].

Finally, it is worth mentioning that the choice of algorithm is subject to the requirement that the procedures (identification and optimization) be effective, bearing in mind that the system model or criteria function is nonlinear. Substantially nonlinear process models, described by (1), have no linear approximation that reproduces the original behavior (1) within a narrow range of variation of variables and parameters for defined error limits. SDS methods are effective in this case as well [22, 23].

2.2. Modified Nonlinear SDS: The Definition

The good properties of SDS as per (11) are simplicity, reliability, and applicability to nonlinear systems, but slow convergence during the optimization procedure is its weak point. A rough picture of nonlinear SDS of basic definition within the gradient field is shown in Figure 2, with starting point . Compared with the regular gradient method, this SDS becomes competitive when the system dimension is over [24].

As the system dimension increases, the probability of acceptable tests decreases, so the target is reached only after rather numerous tests and effective steps . The stochastic gradient is an SDS with competitive performance in optimisation of nonlinear systems; however, the algorithm accumulates information and becomes burdensome in numerical processing.

The idea of converting failed tests into useful accumulated information leads to the so-called modified nonlinear SDS. This is illustrated in Figure 2: the 3 failed tests from starting point cause vector to turn toward the target at angle . So, if the accumulation of failed tests is , between and successful steps on the hypersurface of , then:

Now it is possible to write the following form of SSA search:where is the successful test after failed ones, , ; , form the accumulated information in the MN-SDS algorithm, which may also be used as follows:where corresponds to max .

The modification defined by (18) and (19) follows naturally from the basic definition of nonlinear SDS and is referred to from here on as MN-SDS. To a certain extent MN-SDS combines the properties of the nonlinear algorithm, of linear SDS with accumulation and extrapolation, and of the stochastic gradient.
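
The idea of reusing failed tests can be sketched in code. The version below is only an assumption about one plausible realization, not a transcription of relations (18) and (19): failed random directions since the last success are summed, and the reversed, normalized sum is tried as an additional step. All names (`mn_sds`, `failed_sum`, `alpha`) and the quadratic test function are hypothetical.

```python
import numpy as np

def _unit(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def mn_sds(q, x0, alpha=0.1, tol=1e-2, max_tests=20_000, seed=0):
    """Random search that turns accumulated failed tests into a trial direction."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    qx = q(x)
    failed_sum = np.zeros_like(x)           # accumulated failed directions
    tests = 0
    while qx > tol and tests < max_tests:
        zeta = _unit(rng.standard_normal(x.size))
        candidate = x + alpha * zeta
        q_new = q(candidate)
        tests += 1
        if q_new < qx:                      # ordinary successful test
            x, qx = candidate, q_new
            failed_sum[:] = 0.0
            continue
        failed_sum += zeta                  # remember the failed direction
        candidate = x - alpha * _unit(failed_sum)   # step against the failures
        q_new = q(candidate)
        tests += 1
        if q_new < qx:
            x, qx = candidate, q_new
            failed_sum[:] = 0.0
    return x, qx, tests

def sphere(x):
    return float(np.dot(x, x))

x_opt, q_opt, n_tests = mn_sds(sphere, x0=[2.0] * 5)
```

In this sketch every failed test costs one extra evaluation of the criteria function, so whether it pays off depends on how often the reversed direction succeeds; comparing its test count with a plain accept-or-reject search on the same function is a simple way to check.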

Since MN-SDS exploits accumulated information, it acquires some self-learning features (specifically when all data are memorized), which leads to the conclusion that the probability distribution of tests depends on the accumulated information:where , the memory vector, could be defined as

As a result the search is guided within the vicinity of the best probe or test [23]; the memory vector indicates adaptation features and can be calculated for step as follows:where is the coefficient of erasing and is the coefficient of learning intensity, or self-learning.

Now the vector guiding the search toward the target is where is the direction corrected by self-learning (sl) on the -step; , , are “steps of accumulation” ( failed tests after the step).
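
A hedged sketch of such a memory mechanism follows. It assumes the classic forgetting-plus-learning update from the random search literature, in which the memory vector is scaled by an erasing coefficient and corrected by the last step weighted by the change of the criteria function; the exact relations of this section are not reproduced here. The names `k`, `delta`, `update_memory`, and `biased_direction` are assumptions.

```python
import numpy as np

def update_memory(w, dx, dq, k=0.9, delta=1.0):
    """Assumed update: w_next = k * w - delta * dq * dx.

    k     -- coefficient of erasing (0 <= k <= 1)
    delta -- coefficient of learning intensity
    dq    -- change of the criteria function produced by the step dx
    """
    return k * np.asarray(w, dtype=float) - delta * dq * np.asarray(dx, dtype=float)

def biased_direction(w, rng):
    """Random unit direction shifted toward the memory vector w."""
    zeta = rng.standard_normal(w.size)
    zeta /= np.linalg.norm(zeta)
    v = zeta + w
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else zeta

w0 = np.zeros(3)
w_after_success = update_memory(w0, dx=[1.0, 0.0, 0.0], dq=-0.5, k=0.9, delta=2.0)
w_after_failure = update_memory(w0, dx=[1.0, 0.0, 0.0], dq=+0.5, k=0.9, delta=2.0)
direction = biased_direction(w_after_success, np.random.default_rng(0))
```

A step that lowered the criteria function pulls the memory toward that step, and a step that raised it pushes the memory away, so later random directions are biased toward regions that have worked so far.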

In practice the optimisation starts without the self-learning option. In that sense MN-SDS, even when the accumulated information on failed tests is not memorized, makes it possible to select the most useful of those tests.

3. Theoretical Importance for Application of MN-SDS on ANN

3.1. Basic Theoretical Notes of MN-SDS

The main result achieved by MN-SDS is improved convergence compared with that of the nonlinear SSA of basic definition (Section 2.1, relation (11)).

For SSA purposes a uniform distribution of random numbers within a chosen interval of the real axis is used. Most often it is the interval , Figure 1. More complex distributions cause difficulties in numerical calculation [25].

As the system dimension increases, the probability of success of each test decreases; . Consequently fewer and fewer tests are successful, and more steps are needed to reach the target.
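
The effect of dimension on the success of a single test can be checked numerically. The sketch below is a Monte Carlo estimate under stated assumptions: the criteria function is the quadratic Q(x) = |x|^2, the current point lies at unit distance from the optimum, and the random step has fixed length `alpha` on the sphere. These choices and the function name are illustrative only.

```python
import numpy as np

def success_probability(n, alpha=0.5, samples=20_000, seed=0):
    """Fraction of random steps of length alpha that decrease Q(x) = |x|^2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[0] = 1.0                                      # current point, Q(x) = 1
    zeta = rng.standard_normal((samples, n))
    zeta /= np.linalg.norm(zeta, axis=1, keepdims=True)
    q_new = np.sum((x + alpha * zeta) ** 2, axis=1)
    return float(np.mean(q_new < 1.0))

p_low = success_probability(2)
p_mid = success_probability(10)
p_high = success_probability(50)
```

Under these assumptions a step succeeds only if its direction falls inside a cone around the direction to the optimum, and the share of the sphere covered by that cone shrinks as the dimension grows.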

The idea of using failed SDS tests to increase the efficiency of nonlinear SDS led to the creation of the so-called modified nonlinear SDS (MN-SDS). In fact, failed tests provide accumulated information that improves N-SDS convergence. The rearranged iterative procedure for minimization of the criteria function described in Section 2.2 yields a rather effective procedure, that is, MN-SDS.

The MN-SDS procedure provides an acceptable choice with respect to the set of three characteristics: , number of steps , and set-up error in reaching the target.

The convergence for the N-SDS algorithm with prediction [23, 26] for system dimension is given by the following relations:

This is shown as curve ABC in Figure 3. The curve is the boundary case for search with the MN-SDS algorithm, when after the -step there is only one failed test [24]. The same Figure 3 gives for the gradient method with single testing (AD) and with pair () testing (AE). Some features in Figure 3 have been taken from reference [24].

Figure 3: Comparability of convergence MN SDS and N-SDS with gradient.

Nonlinear SSA (N-SDS) of basic definition (the curve in Figure 3) has better local convergence than the gradient method when the system dimension is . MN-SDS confirms that SS algorithms outperform the gradient method in efficiency as the system dimension increases; . SDS numerical procedures remain simple as the system dimension increases, specifically when has a uniform distribution. For a very large number of dimensions the balance of efficiency between SDS methods and the gradient method shifts in favor of the gradient [24]. MN-SDS retains its advantages over gradient procedures much longer than most SDS algorithms.

It is worthwhile to recognise when to use SDS and when some other optimisation method. In that sense Figure 4 shows that the gradient procedure is more effective in the vicinity of the optimum (Figure 4 has been taken from [26]). In certain situations MN-SDS incorporates features of nonlinear SSA and of the stochastic gradient (SG-SDS). The stochastic gradient is based on accumulation of a large amount of information, and as such it approaches the optimum with the same efficiency as the regular gradient method. The regular deterministic gradient method can be considered a special case of the stochastic gradient [26].

Figure 4: The applicability areas of gradient and SDS methods.

It should be mentioned that the random number generator must pass stricter tests whenever the system dimension is large [25]. Figure 5 shows a diagram of the MN-SDS numerical procedure. The random vector generator is shown as an external device just to point out the necessity of such a generator. The generator is a software solution within the computer system, in the form of an application package [15].

Figure 5: Diagram of procedure for computer processing of MN-SSA.

3.2. SDS Implementation for ANN Training

Hereinafter, an implementation of MN-SDS on a multilayer ANN with feedforward information flow through the network (FANN) is considered.

The FANN properties (well adopted by researchers and designers) enable a wide range of ANN to be transformed into FANN.

SDS, that is, MN-SDS, can be applied to both forms of the FANN model: oriented graph and matrix form.

For MN-SDS the first form is more practical, bearing in mind the available heuristic options that make MN-SDS more efficient.

In this part of the paper and onward, a multilayer FANN as shown in Figure 6 is considered.

Figure 6: The FANN of a general type (a) and the model of neuron (perceptron) used (b), where .

After adjustment of symbols (in expression (19)) for an ANN, the following form is obtained for MN-SDS: where are the increment vectors in optimization process iteration and are random vectors of uniform distribution. The cost function remains , noting that now .

The vector of parameters is changed at each step of the iterative optimization procedure as follows:

The dimension of vector is determined by the level of complexity of the ANN and also by the complexity of the optimization procedure, that is, training:where vector has coordinates as a random vector where is the set of all parameters in parametric space, while means transposition into a column matrix. The gradient method, and thus the backpropagation error (BPE) method, uses this iterative relation:

Stochastic methods (including MN-SDS) use, instead of , a random vector which is heuristically chosen, enabling iterative parametric optimization with a certain efficiency measured by convergence range. The rank of MN-SDS compared with other SSA of SDS type, as well as with gradient procedures, has been established above.

The application of MN-SDS to the FANN (Figure 6) uses the known linear interaction function and the corresponding output :where (i) are components of the weight vector , (ii) are neurons in layer , (iii) are neurons in layer , (iv) are all layers in the network, (v) and are the numbers of neurons for adjacent layers.

Applying the MN-SDS algorithm to training of the FANN involves introducing a random vector of the same size as vector :The correspondence between components and must be established before the feedforward numerical procedure:For each training pair , training is performed, that is, minimization of the criteria function . The training set contains pairs; , . When the network has been trained on the entire set , one epoch of training is completed.

If the FANN has more than one output, the error for each single output must first be formed:After this the criteria function can be formed:where and .

The increment of , by weights of the layer , for one iteration step can be presented as follows:where the indexes have the same meaning as in the previous expressions.

is the local gradient, an important characteristic in the BPE method [27, 44].

can be calculated via SDS procedures (by applying relations (37a) and (37b)), but only through MN-SDS or SDS gradient, which gives the primal version of BPE [27].

3.3. SDS Algorithms and FANN Synthesis

Synthesis of ANN is engineering design. An ANN design starts with the set-up of an initial architecture based on the experience and intuition of the designer. This Section 3.3 presents formal recommendations that give a relatively good orientation in the design of FANN.

An ANN architecture is determined by the number of inputs and outputs, neurons, the interlinks between them, and biases. A perceptron is a neuron with adjustable inputs and an activation function that need not be of step (threshold) type.

Experience shows, and it is quite clear, that solving a complex problem requires a complex ANN architecture.

Experience with SDS has confirmed this. SDS methods are more efficient than numerous known ones, specifically for complex optimization problems where complexity is measured by system dimension. Bearing in mind the significance of multilayer FANN, the structure shown in Figure 6 is considered hereinafter.

It is worth mentioning that a successful optimization process for an FANN does not mean that it has the necessary features: first of all the required capacity and generalization properties [29].

Capacity () is the ability of an ANN to memorize certain information during the process of training, that is, learning.

The generalization property of an FANN is the basic paradigm of its performance. In fact it is the essential property showing that the “network has learnt”: it should provide valid answers to inputs not contained in the training pairs. In other words, testing data should have the same distribution as the set of training pairs.

The synthesis issue is an open problem. In [29, 30] some theoretical results have been shown, mostly for three-layer networks processing binary data. Some researchers have attempted to extend these results to three-layer (and multilayer) networks processing analogue information, which has led to the so-called universal approximator [31–33].

The name universal approximator is linked to a three-layer network having perceptrons in hidden layer with nonlinear activation function (type sigmoid) and perceptrons at outputs with linear activation function (Figure 7).

Figure 7: Basic FANN architecture of universal approximation.

Introducing the designations for hidden neurons , for other neurons , for interlinks (synapses) , and for thresholds , and simplifying the theoretical results of [29, 31–33], certain indicators are obtained that give a relatively good orientation for creating a FANN architecture.

When a starting network structure has been set up, then its dimension is

The number of training pairs required for a level of generalization above 90% (expressions (39), (40), and (41) are a condensed, simplified form of the ideas in references [29, 31–33]) is

The ability of the FANN to memorize, that is, its capacity , is determined by the relationprovided the following condition is fulfilled:where and are the numbers of network inputs and outputs, respectively (Figure 7).
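
These sizing indicators can be turned into a small calculator. The sketch below rests on two assumptions that are not taken from relations (39)–(41): every non-input neuron carries one bias, and the number of training pairs needed for roughly 90% generalization is about ten times the number of adjustable parameters (a widely used rule of thumb). The function names and the `pairs_per_parameter` argument are hypothetical.

```python
def count_parameters(layers):
    """Weights plus one bias per non-input neuron of a fully connected FANN.

    layers -- neurons per layer, inputs first, e.g. [1, 3, 1]
    """
    weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))
    biases = sum(layers[1:])
    return weights + biases

def required_pairs(layers, pairs_per_parameter=10):
    """Assumed rule of thumb for the number of training pairs."""
    return pairs_per_parameter * count_parameters(layers)

small = [1, 3, 1]          # one input, three hidden neurons, one output
large = [1, 10, 10, 1]     # one input, two layers of ten neurons, one output

n_small, p_small = count_parameters(small), required_pairs(small)
n_large, p_large = count_parameters(large), required_pairs(large)
```

Under these assumptions the small network has 10 parameters and would call for about 100 training pairs, while the large one has 141 parameters and would call for about 1410, which shows how quickly the demand for data grows with the architecture.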

In case of a conflict between the ANN dimension and the required number of training samples , has to be changed. It should be pointed out that is in conflict with generalization . For training under the same conditions, better generalization is obtained for networks with fewer neurons. This indicates that the ANN capacity is reduced [34].

The preceding consideration, with a fixed configuration during training, is characterized as static.

Some methods approach optimization dynamically: network structures are changed during training, as in the cascade correlation algorithm [35], the tiling algorithm [36], dynamic learning by simulated annealing [37], and others. The most complex one is dynamic training by simulated annealing [37, 38]. That method resolves numerous FANN synthesis problems.

Training with MN-SDS is carried out through the forward phase only. The numerical procedure is simple and gives enough information. SDS thus points to a dynamic approach in training, optimization, and synthesis of artificial neural networks.

3.4. Examples

Example 1 (searching by matrix type). This example concerns systems and control theory [39]. It is presented to show how the SS model works when the system is in matrix form, as well as the difference in efficiency between the N-SDS and MN-SDS algorithms.
The linearized multivariable system described in internal presentation (see relations (1) and (2) in Section 2.1) is as follows:where are matrices of system parameters; are vectors of corresponding dimensions.
If the statics of the real system is described in matrix form, that is, in expression (42)  , then the reduced equations describing the steady-state behavior of the system are:where and are the matrices of parameters corresponding to the statics of the observed system.
Relation (43) in developed form can represent an energy network with passive consumers:where , , and are quantities that can be monitored at a checkpoint; includes a certain set of all measurable quantities in the system (42).
Linear forms of type (44) are numerically more complex than forms of type:The check of consistency and existence of solutions [40]:is inapplicable to form (44).
With this in mind, a numerical experiment for identification of the coefficients and ; ; ; has been created, using procedures with the SSA algorithm. The data of were collected by changing variables and and are given in Table 1.
For identification of the and matrices, the search uses a random matrix , generated on a hypersphere in -space with [25].
The current values of the quantities , , and are monitored through the checkpoint and registered at certain intervals. A certain number of such records forms the collection of required data (Table 1). A series of 28 randomly selected numbers fills the matrices and in each iteration of the optimization process. The use of either algorithm, N-SDS or MN-SDS, requires the formation of the error : , ; ; is the number of the last iteration , and the corresponding criteria function is:where are components for the randomly selected parameters and required measurement values.
For the iteration step the values ; 0.001; and 0.0001 were used, respectively. The initial random parameters of this procedure are
The final results after the random procedure with N-SDS are
The accuracy of % requires and a number of steps . There is no noise in the system.
Implementation of MN-SDS is possible after transforming the equation system (44) into matrix form:
With the MN-SDS method some = 2,156 steps are needed, that is, about 4 times fewer steps for the same accuracy of .

Table 1: Collection data of variables , , .

Example 2 (training of multilayer perceptron). This example deals with training a perceptron on the “XOR” logic circuit using SDS procedures. The example played an important role in R&D work in the field of artificial neural networks. In fact, Minsky and Papert (1969) showed that a perceptron cannot “learn” to simulate the XOR logic circuit, and concluded that not much should be expected of ANN [41].
If training pairs are formed from the XOR truth table, it is possible to reach the same conclusion as Minsky and Papert. Consider the definition of neuron (perceptron) as per McCulloch and Pitts [42] for a neuron with two inputs: (1) ; or , (2) ; or , (3) ; or , (4) ; or , where are training outputs: .
It is obvious that relations (2), (3), and (4) exclude one another. A perceptron cannot be trained to simulate the XOR logic circuit.
This was understood as an obstacle “in principle”.
This interrupted further development of ANN for more than 20 years, up to the works of Hopfield [43] and others.
The problem has been overcome by studying the properties of multilayer and recurrent ANN, as well as by creating advanced numerical procedures for training them [27].
This example presents the results of training a network with at least one hidden layer of neurons, which can be trained to simulate the XOR logic circuit.
Figure 8(a) shows an FANN configuration having one hidden layer of two neurons and one neuron at the ANN output. That ANN can learn to simulate XOR logic circuit.
For training pairs two options can be used (Figures 8(b1) and 8(b2)):
The results obtained with the training pairs shown in Figure 8(b1) are used below. Note that some of the variables should have fixed values since they do not contribute to solving the problem:
During training it was found that , , and vary very little around the value 1, so it is possible to use
The values of all other and are adjustable parameters, where indicates the neuron layer. In this way the dimension of the random vector is decreased. For the first training pair and an activation function of logistic type training is performed with the BPE method [27, 28] and with MN-SDS; denotes the linear interaction function of the neuron inputs.
The criteria function for both cases is
The SSA iterative procedures were used to minimize ; the results are presented in Figure 9(a). The cost functions are formed for the same training pairs and initial values, and they are normalized against the highest value of one of .
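A forward-only training loop of this kind can be sketched for a 2-2-1 network. The sketch below makes several assumptions that differ from or go beyond the text: all nine weights and biases are adjusted (none are fixed), the cost is half the summed squared error over all four XOR pairs, and the search is the plain accept-or-reject random search rather than MN-SDS. Names such as `forward`, `cost`, and `train` are hypothetical.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
T = np.array([0, 1, 1, 0], dtype=float)                       # XOR targets

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, x):
    """2-2-1 network: w[:6] are hidden parameters, w[6:] are output parameters."""
    w_hidden = w[:6].reshape(2, 3)          # per neuron: [w1, w2, bias]
    w_out = w[6:]                           # output neuron: [v1, v2, bias]
    h = logistic(w_hidden[:, :2] @ x + w_hidden[:, 2])
    return logistic(w_out[:2] @ h + w_out[2])

def cost(w):
    y = np.array([forward(w, x) for x in X])
    return 0.5 * float(np.sum((y - T) ** 2))

def train(iterations=3000, alpha=0.3, seed=0):
    """Random search over the weight vector using forward passes only."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=9)      # random initial parameters
    q = q_start = cost(w)
    for _ in range(iterations):
        zeta = rng.standard_normal(9)
        zeta /= np.linalg.norm(zeta)        # random vector, same size as w
        q_new = cost(w + alpha * zeta)
        if q_new < q:                       # keep only improving steps
            w, q = w + alpha * zeta, q_new
    return w, q_start, q

w_trained, q_start, q_end = train()
```

No derivative of the activation function is needed anywhere in the loop, so the same code would run unchanged with a step activation, which is the situation where gradient-based training cannot be applied directly.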
In Figure 9(a) the diagram marked with number 4 is the criteria function of training by the MN-SDS method.
The results of training are also shown for a step activation function (threshold):
The BPE method cannot be used here since is not differentiable. To apply BPE in this case, must be approximated with the logistic function ; . The optimization process finishes after more than 600 iterations (Figure 9(b)). The final normalized results for MN-SDS are shown in Figure 9(b); training was done with the set. The SS procedure with MN-SDS was initiated with random parameters (weights): ; ; ; ; .
The final results after iterations are , , , , , with a relative error of 2%.
The random vector of forward propagation, with dimension , is:
Let us consider the case when the activation function is given by relation (56), with training pairs . It shows that the training procedure depends on the choice of training pairs as well. The minimization process in this case is considerably faster (Figure 9(b)).
The BPE method was implemented with known procedures [27, 28]. BPE is used as the comparative method.
In this paper MN-SDS and BPE use ECR (error correction rule) procedures, which are more convenient for “online” ANN training [45].
Whenever the training coefficient () is small, ECR and GDR procedures give the same parameter estimates within the set-up range of acceptable error [44, 45]. GDR (gradient rule) is called a “batch” optimization procedure.

Figure 8: Optimization, for example, training of multilayer ANN; XOR problem.
Figure 9: Cost function; (a) for , (b) for step function; BPE optimization is finished after 600 iterations.

Example 3 (synthesis FANN to approximate model FSR). In this example some of theoretical R&D results of [2933] are applied.
The real problem is approximated model of an ANN training program related on technical term for process in reactor in which the concentrate heated to powder so that will behave like a fluid [46, 47].
When the said FSR is to be tempered either the first time or after service attempts, there is a program of tempering to temperature as per diagram (see Figure 10). After reaching working temperature it is maintained by control system. The tempering to is to be completed within 130 min (range ). In Figure 10 the whole cycle is within 240 min. The FSR was under operation 40 minutes (range ). Due to some maintenance problem the FSR is often shut down (point ). Before next campaign the FSR should be cooled down (range ) for maintenance purposes and afterwards the tempering operation is repeated.
The mapping in real conditions have not enough data to make FANN (Figure 11, [46]). There is not enough, to model approximations over ANN that learns. Overall number of pairs collected from the said diagram (Figure 10) is 80 pairs; sampling period is 3 min. More frequent sampling is not realistic. The task is to determine a FANN, which is to give the best approximate model of the FSR within required accuracy. The relative error of correct response should be more than 3% of obtained measuring data.
On Figure 12 is given a starting architecture of FANN : with one input and one output, with 10 neurons in the first and in second layers. Only neurons in the second layer has non-linear activation function. All neurons has a linear interaction function.
Application of the relations ((39)–(41)) of Section 3.2 on gives: interlinks between unknown parameters and biases. required training pairs for generalization !, memorized information under condition that , ; condition is satisfied.
Based on the aforesaid there is overnumbered required training pairs .
Basic moments in conjunction approximation model FSR through the use of MN-SDS in the training of FANN are that (i)assignment training   does of 80 pairs of training is achieved through the diagrams on Figure 10 that define Table 2, (ii)random vector replaces and at expression (32), so , unknown parameters, (iii)feedforward phase in layers for each training pair achieved numerical value for and , to the expression (32), (iv)cost function for sequence for each training par , , was calculated with the expression (36). Figure 13 presents the trend of for , (v)the procedure optimization, that is, training for FANN takes more than 1000 iterations.
Because of the large volume of numerical data, not all details are listed here.
Of two ANNs trained on the same training samples, the one with fewer neurons generalizes better.

Table 2: Data defining the warming program of the FSR.
Figure 10: Plan for bringing the FSR into operating mode (AB), operation (BC), and possible downtime (CD).
Figure 11: Real mapping data of the FSR in its operating environment.
Figure 12: Initial structure of the FANN for the approximation model of FSR tempering.
Figure 13: Trend of the cost functions of the multilayer FANNs.

For that reason, less complex network structures are considered:

, and .

One such structure could eventually be trained and would have some generalization ability, although, given the number of training pairs required for its interlinks (that is, unknown parameters), the condition is only barely met.

A more acceptable solution is a smaller structure, although it is a bare FANN architecture: it would require 100 training pairs, whereas only 80 are available, which is nevertheless acceptable.
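The sizing argument can be checked with a short calculation. Relations (39)–(41) are not reproduced here, so the sketch below assumes a common rule of thumb in the spirit of [29]: the number of training pairs should be about the number of unknown parameters (weights plus biases) divided by the tolerated error fraction, taken here as 0.1. The function names and the layer lists are illustrative assumptions. Under these assumptions a network with one input, three hidden neurons, and one output gives the figure of 100 pairs quoted above.

```python
import math
from fractions import Fraction


def count_parameters(layers):
    """Weights plus biases of a fully connected feedforward network.

    `layers` lists the number of units per layer, input layer first.
    """
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    biases = sum(layers[1:])
    return weights + biases


def required_pairs(layers, eps="0.1"):
    """Assumed rule of thumb: parameters divided by the error fraction."""
    # Fraction avoids floating-point rounding when dividing by 0.1.
    return math.ceil(Fraction(count_parameters(layers)) / Fraction(eps))


print(count_parameters([1, 3, 1]), required_pairs([1, 3, 1]))
# prints: 10 100
print(count_parameters([1, 10, 10, 1]), required_pairs([1, 10, 10, 1]))
# prints: 141 1410
```

With only 80 pairs available, the starting architecture with two layers of 10 neurons is far out of reach under this rule, while the three-neuron structure is close to the requirement.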

It would have been possible to continue to a minimal architecture, but the chosen structure provides better performance of the hidden layer.

The closest solution is the FANN shown in Figure 14. Its hidden layer has 3 perceptrons with a sigmoid-type activation function.

Figure 14: Oriented graph of .

Since determining the architecture of an ANN, and of a FANN in particular, is always an open problem, an adopted structure is best assessed once the final numerical results are available, including a validity test of the network training.

Theoretical results for the universal approximator are derived for a nonlinear activation function of the hidden neurons of type .

Since , the tanh function can also be used for the FANN model approximation.

Applying the relations for and in expression (32) to the structure in Figure 14 gives, in general form, expression (58), which represents an approximation model of the temperature regime of the FSR.
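The general form itself is not reproduced in this text. As a hedged sketch only, assuming a single input $x$, three hidden neurons with tanh activation, a linear output neuron, and the symbols $w_j$, $b_j$, $v_j$, and $c$ introduced here purely for illustration, a model of this kind would read:

```latex
\hat{y}(x) = c + \sum_{j=1}^{3} v_j \tanh\left(w_j x + b_j\right)
```

Under these assumptions the model has 10 unknown parameters: three input weights $w_j$, three hidden biases $b_j$, three output weights $v_j$, and one output bias $c$.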

Numerical data and training results are presented here for this network trained by MN-SDS only.

At the beginning of the numerical procedure, for practical reasons, the values of the training pairs should be scaled down by a factor of 100.

The symbolic representation of the vector of unknown parameters of the network at the beginning and at the end of the training procedure is given by:

The initial random values of the parameters are:

Random vector in this case is

After training the network with the MN-SDS algorithm for 700 iterations, the vector of unknown parameters is given by

These data characterize the approximation model (58) of the FSR tempering temperature, obtained by training the FANN with the MN-SDS algorithm, with a relative error of 5%. The trend of the cost function for this network is given in Figure 13.

Some responses to the test inputs used to check the validity of the obtained model deviate by a large error. Relating the number of such responses to the total number of test inputs gives a relatively rough estimate of the generalization capability of the corresponding network. In Figure 10 the test values are marked with special graphic symbols for MN-SDS and for BPE. For a training set of 80 pairs, the generalization ability of the network is about 70%. For the network it is about 20%. These values were obtained by training with MN-SDS.

Applying the BPE method gave the following generalization values: about 70% for network and below 20% for the network .
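The rough generalization estimate described above amounts to counting the test responses that stay within the tolerated relative error and dividing by the number of test inputs. A minimal sketch follows; the function name, the 5% tolerance, and the test values are hypothetical and are not the data marked in Figure 10.

```python
def generalization_estimate(targets, responses, tolerance):
    """Share of test responses whose relative error is within `tolerance`."""
    within = sum(
        1 for t, r in zip(targets, responses)
        if abs(r - t) <= tolerance * abs(t)
    )
    return within / len(targets)


# Hypothetical test values, for illustration only.
targets = [100.0, 200.0, 300.0, 400.0, 500.0]
responses = [102.0, 230.0, 296.0, 405.0, 610.0]
print(generalization_estimate(targets, responses, tolerance=0.05))
# prints: 0.6
```

Three of the five hypothetical responses fall within 5% of their targets, so the estimate is 60%. Percentages such as the 70% and 20% quoted above would be obtained in the same way from the actual test set.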

The FSR model presented above could be used, in a more sophisticated variant, for intelligent process monitoring.

4. Discussion

Why should one apply stochastic search methods (SSMs) to problems that arise in the optimization and training of ANN? Our answer is based on the demonstration that SSMs, including SDS (stochastic direct search), have proved very productive in solving problems of complex systems of different nature.

In particular, previous experience with ANNs of relatively simple architecture suggests that they can exhibit quite complex behavior, which can be traced to (i) a large number of neuron-perceptron elements in the ANN system, (ii) the complexity of the nonlinearity in the activation function of neurons within the ANN, (iii) the complexity of the neuron activation function model (i.e., higher level models), (iv) complexity in the optimization procedure due to the large volume of data in the training set, and (v) the complexity of the specification of the internal architecture of particular types of ANN.

The features listed above require competitive methods which can deal efficiently with such complexity. The SDS represent a combinatorial approach offering great potential via certain heuristics and algorithms they provide for numerical procedures.

For example, the various gradient-based methods that are considered competitive for complex systems cannot avoid the linear scaling of convergence in numerical implementations. In SDS, the trend is, roughly speaking, proportional to where represents the dimension of the vector of parameters in the parameter space of a complex system. This indicates that, with increasing complexity of the system, the relative advantage of the SDS method over the gradient scheme increases. That is the key reason why optimization and training of ANN should employ SDS.
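One way to see where the linear scaling of a numerical gradient scheme comes from is to count cost-function evaluations per step. The sketch below assumes a forward-difference gradient estimate and a single random trial step; the class and function names are illustrative. It shows the per-step cost only and says nothing about the overall convergence rate of either method.

```python
import random


class CountingCost:
    """Quadratic test function that counts how often it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return sum(v * v for v in x)


def forward_difference_gradient(f, x, h=1e-6):
    """Numerical gradient: one base evaluation plus one per parameter."""
    f0 = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - f0) / h)
    return grad


def random_trial(f, x, step, rng):
    """One stochastic trial step: a single evaluation regardless of dimension."""
    trial = [v + step * rng.uniform(-1.0, 1.0) for v in x]
    return trial, f(trial)


for n in (3, 30, 300):
    f_grad = CountingCost()
    forward_difference_gradient(f_grad, [1.0] * n)
    f_rand = CountingCost()
    random_trial(f_rand, [1.0] * n, 0.1, random.Random(0))
    print(n, f_grad.calls, f_rand.calls)
# prints:
# 3 4 1
# 30 31 1
# 300 301 1
```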

The author has previously used some algorithms belonging to the SDS methodology, such as nonlinear SDS (N-SDS) and statistical gradient (SG-SDS). Both of these methods have exhibited poor convergence. Starting from N-SDS, we have designed MN-SDS, which already for more than three parameters is superior to gradient descent (with the original N-SDS this is achieved only for a larger number of parameters).

In this paper, Section 2 overviewed the concept of SDS and SSA and then introduced MN-SDS. Section 3 examined the possibilities of the MN-SDS algorithm and its application to FANN as the target architecture. Section 3 also presents the steps in the synthesis of a FANN in order to emphasize that a performed optimization of the FANN does not guarantee that the network will achieve the required level of generalization (i.e., ability to learn). Generally speaking, the problem of ANN synthesis remains open.

The synthesis elements presented here are simplified theoretical results of recent studies. This is illustrated by Example 3, which is directly connected to practical applications. Example 2 gives insight into the relationship between N-SDS and MN-SDS, as well as the connection between the MN-SDS and BPE methods, where the latter was used as a reference. Example 1 confirms the efficiency of the MN-SDS method for problems outside of ANN.

Let us finally mention that there is increasing interest in using SSM, both in academia and in industry. This is because SSM, and SDS in particular, are finding increasing application in economics, bioinformatics, and artificial intelligence, the last area being intrinsically linked to ANN.

5. Conclusion

The central goal of this study is the presentation of a stochastic search approach applied to identification, optimization, and training of artificial neural networks. Based on the author’s extensive experience in applying the SDS approach to problems of identification and optimization of complex automatic control systems, a new algorithm derived from nonlinear SDS (N-SDS), termed MN-SDS, is proposed here. MN-SDS offers a significant improvement in convergence properties compared to N-SDS and some other SSA.

MN-SDS maintains all the other good features of the existing SDS: a relatively easy adaptation to problem solving; simple mathematical construction of algorithmic steps; low sensitivity to noise.

The convergence properties of MN-SDS make it superior to the majority of standard algorithms based on the gradient scheme. Note that convergence is the most suitable characteristic for comparing the efficiency of algorithms for systems with the same number of optimization parameters. For example, already for more than three parameters MN-SDS exhibits better convergence properties than most other algorithms, including those based on the gradient. This means that, in certain optimization and training procedures, MN-SDS is superior to the widely used BPE method for ANN and in the development of artificial intelligence.

In the optimization and training of ANN, MN-SDS employs only the feedforward phase of the information flow in the FANN. The parameters used in optimization within MN-SDS are changed using a random number generator. The efficiency of MN-SDS in numerical experiments suggests that it can be applied to very complex ANN. This study has shown its application to feedforward ANN (FANN). The obtained results were compared with results obtained with the BPE method applied to the same problems.

The numerical experiments performed here can be implemented even on a simple multicore PC using the MATLAB package.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.


Acknowledgment

The author has drawn on the results presented in Professor Rastrigin's courses and expresses his gratitude. Unfortunately, it is no longer possible to thank the Professor directly for the use of some figure elements taken from his books.

References


  1. W. R. Ashby, “Cybernetics today and its adjustment with technical sciences in the future,” in Computer Machine and Logical Thinking, Chapman and Hall, 1952.
  2. L. A. Rastrigin, “Ashby’s Gomeostat,” in Stochastics Search Methods, pp. 50–58, Science, Moscow, Russia, 1965.
  3. L. A. Rastrigin, The Theory and Application Stochastics Search Methods, Zinatne, Riga, Latvia, 1969.
  4. K. K. Ripa, “Comparison of steepest descent and stochastics search self-learning methods,” in Stochastic Search Optimization Problems, Zinatne, Riga, Latvia, 1968.
  5. A. Dvoretzky, “On stochastic approximation,” in Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, Calif, USA, 1956.
  6. L. A. Rastrigin and L. S. Rubinshtain, “Comparison of stochastic search and stochastic approximation method,” in The Theory and Application Stochastics Search Methods, pp. 149–156, Zinatne, Riga, Latvia, 1968.
  7. A. A. Zhigljavsky, Theory of Global Random Search, Kluwer Academic, Boston, Mass, USA, 1991.
  8. N. Baba, T. Shoman, and Y. Sawaragi, “A modified convergence theorem for a random optimization method,” Information Sciences, vol. 13, no. 2, pp. 159–166, 1977.
  9. J. C. Spall, “Introduction to stochastic search and optimization: estimation, simulation and control,” Automation and Remote Control, vol. 26, pp. 224–251, 2003.
  10. K. P. Nikolic, “An identification of complex industrial systems by stochastic search method,” in Proceeding of the ETAN '79, vol. 3, pp. 179–186, 1979.
  11. K. P. Nikolic, “Neural networks in complex control systems and stochastic search algorithms,” in Proceeding of the ETRAN '09 Conference, vol. 3, pp. 170–173, Bukovička Banja, Aranđelovac, Serbia, 2009.
  12. K. P. Nikolic and B. Abramovic, “Neural networks synthesis by using of stochastic search methods,” in Proceeding of the ETRAN '04, pp. 115–118, Čačak, Serbia, 2004.
  13. K. P. Nikolic, B. Abramovic, and I. Scepanovic, “An approach to synthesis and analysis of complex recurrent neural network,” in Proceedings of the 8th Seminar on Neural Network Applications in Electrical Engineering (NEUREL '06), Belgrade, Serbia, 2006.
  14. J. A. Gentle, Random Number Generation and Monte Carlo Method, Springer, New York, NY, USA, 2nd edition, 2003.
  15. C. B. Moler, Numerical Computing with MATLAB, SIAM, Philadelphia, Pa, USA, 2003.
  16. P. Eykhoff, “Some fundamental aspects of process-parameter estimation,” IEEE Transactions on Automatic Control, vol. 8, no. 4, pp. 347–357, 1963.
  17. C. S. Beighlar, Fundamental of Optimization, 1967.
  18. L. A. Rastrigin, Stochastic Model of Optimization of Multiple Parameters Objects, Zinatne, 1965.
  19. J. T. Tou, Modern Control Theory, McGraw-Hill, New York, NY, USA, 1964.
  20. J. Stanic, “Lagrange's method of multiplicators,” in Book Introduction in Techno-Economic Theory of Process Optimization, pp. 35–40, Faculty of Mechanical Engineering, Belgrade, Serbia, 1983.
  21. G. A. Korn, “Derivation operators,” in Mathematical Handbook for Scientists and Engineers, pp. 166–170, McGraw-Hill, New York, NY, USA, 1961.
  22. L. A. Rastrigin, “Stochastic local search algorithms,” in Book Stochastics Search Methods, pp. 64–102, Science, Moscow, Russia, 1968.
  23. L. A. Rastrigin, “Characteristics of effectiveness of stochastic search method,” in Stochastics Search Methods, pp. 32–41, Science Publishing, Moscow, Russia, 1986.
  24. L. A. Rastrigin, “Comparison of methods of gradient and stochastics search methods,” in Book Stochastics Search Methods, pp. 102–108, Science, Moscow, Russia, 1968.
  25. K. P. Nikolic, “An approach of random variables generation for an adaptive stochastic search,” in Proceeding of the ETRAN '96, pp. 358–361, Zlatibor, Serbia, 1996.
  26. L. A. Rastrigin, “Multistep algorithms in the central field,” in Book Stochastics Search Methods, pp. 95–103, Science, Moscow, Russia, 1968.
  27. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
  28. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representation by error propagation,” in Parallel Distributed Processing Explorations in the Microstructures of Cognition, D. E. Rumelhart and J. L. Mc Clelland, Eds., vol. 1, pp. 318–362, MIT Press, Cambridge, Mass, USA, 1986.
  29. E. B. Baum and D. Haussler, “What size net gives valid generalization?” Neural Computation, vol. 1, no. 1, pp. 151–160, 1989.
  30. E. B. Baum, “On the capabilities of multilayer perceptrons,” Journal of Complexity, vol. 4, no. 3, pp. 193–215, 1988.
  31. K. Hornik, M. Stinchcombe, and H. White, “Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks,” Neural Networks, vol. 3, no. 5, pp. 551–560, 1990.
  32. K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
  33. M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural Networks, vol. 6, no. 6, pp. 861–867, 1993.
  34. J. Flacher and Z. Obradović, “Constructively learning a near-minimal neural network architecture,” in Proceedings of the International Conference on Neural Networks, pp. 204–208, Orlando, Fla, USA, 1994.
  35. S. E. Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” in Advances in Neural Information Processing Systems, D. Touretzky, Ed., vol. 2, pp. 524–532, Morgan Kaufmann, San Mateo, Calif, USA, 1990.
  36. M. Mezard and J.-P. Nadal, “Learning in feedforward layered networks: the tiling algorithm,” Journal of Physics A: Mathematical and General, vol. 22, no. 12, pp. 2191–2203, 1989.
  37. S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
  38. S. Milenkovic, “The idea of adaptive selection of type perturbations in the algorithm of simulated annealing,” in Proceedings of the XXXVI YU Conference for ETRAN, vol. 2, pp. 67–74, Kopaonik, Serbia, 1992.
  39. K. P. Nikolic, “An implementation of stochastic search for complex systems identification and optimization,” in Proceedings of the ETRAN '82, vol. 3, pp. 221–227, Subotica, Serbia, 1982.
  40. G. A. Korn and T. M. Korn, Mathematical Handbook for Scientists and Engineers, McGraw-Hill, New York, NY, USA, 1961.
  41. M. Minsky and S. Papert, “Perceptrons,” in An Introduction to Computational Geometry, MIT Press, Cambridge, Mass, USA, 1969.
  42. W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics, vol. 5, pp. 115–133, 1943.
  43. J. J. Hopfield, “Neural network and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences of the United States of America, vol. 79, pp. 2554–2558, 1982.
  44. S. Haykin, “Summary of the back-propagation algorithm,” in Book Neural Networks (A Comprehensive Foundation), pp. 153–156, Macmillan College Publishing, New York, NY, USA, 1994.
  45. S. Milenkovic, “Algorithms for artificial neuron networks training,” in Ph.D. dissertation: Annealing Based Dynamic Learning in Second-Order Neuron Networks, (“Artificial Neuro Networks” Library Disertatio, Andrejevic, Belgrad, 1997), pp. 29–44, University of Nish, ETF, 1996.
  46. K. P. Nikolić, “An identification of non-linear objects of complex industrial systems,” in Proceedings of ETRAN '98, XLII Conference for Electronics, Computers, Automation, and Nuclear Engineering, pp. 359–362, Vrnjacka Banja, Yugoslavia, 1998.
  47. G. M. Ostrovsky and Yu. M. Volin, “The mathematical description of process in fluo-solid reactors,” in Methods of Optimization of Chemical Reactors, pp. 30–47, Chemie, Moscow, Russia, 1967.