Research Article  Open Access
Víctor Uc-Cetina, Francisco Moo-Mena, Rafael Hernandez-Ucan, "Composition of Web Services Using Markov Decision Processes and Dynamic Programming", The Scientific World Journal, vol. 2015, Article ID 545308, 9 pages, 2015. https://doi.org/10.1155/2015/545308
Composition of Web Services Using Markov Decision Processes and Dynamic Programming
Abstract
We propose a Markov decision process model for solving the Web service composition (WSC) problem. Iterative policy evaluation, value iteration, and policy iteration algorithms are used to experimentally validate our approach, with artificial and real data. The experimental results show the reliability of the model and the methods employed, with policy iteration being the best one in terms of the minimum number of iterations needed to estimate an optimal policy, with the highest Quality of Service attributes. Our experimental work shows that a WSC problem involving a set of 100,000 individual Web services, in which a valid composition requires the selection of 1,000 services from the available set, can be solved in the worst case in less than 200 seconds, using an Intel Core i5 computer with 6 GB RAM. Moreover, a real WSC problem involving only 7 individual Web services requires less than 0.08 seconds, using the same computational power. Finally, a comparison with two popular reinforcement learning algorithms, sarsa and Q-learning, shows that these algorithms require one or two orders of magnitude more time than policy iteration, iterative policy evaluation, and value iteration to handle WSC problems of the same complexity.
1. Introduction
A Web service is a software system designed to support interoperable machine-to-machine interaction over a network, with an interface described in a machine-processable format called Web Services Description Language [1]. A Web service is typically modeled as a software component that implements a set of operations. The emergence of this type of software component has created unprecedented opportunities to establish more agile collaborations between organizations, and as a consequence, systems based on Web services are growing in importance for the development of distributed applications designed to be accessed via the Internet.
When a Web service is requested, all available Web services descriptions must be matched with the requested description, so that an appropriate service with the desired functionality can be found. However, since the number of available Web services is continuously growing year by year, finding the best match is not a trivial problem anymore, especially if we take into account that the matching criteria must consider not only the desired functionality, but also other attributes such as execution cost, security, performance, and so forth.
If individual Web services are not able to meet complex requirements, they can be combined to create composite services [2]. A composite Web service has one initial task and one ending task, and between the initial and the ending tasks there can be individual tasks connected in sequential order. To create a composite Web service it is necessary to discover and select the most suitable services. The complexity of WSC involves three main factors: (i) the large number of dynamic Web service instances with similar functionality that may be available to a complex service; (ii) the different possibilities of integrating service instance components into a complex service process; and (iii) the various performance requirements (e.g., end-to-end delay, service cost, and reliability) of a complex service.
1.1. Related Work
Some approaches to solving the WSC problem have focused on different graph-based algorithms [3–8]. Others have proposed optimization methods specially designed for solving constraint satisfaction problems, such as integer programming [9], linear programming [10], or methods for solving the knapsack problem [11]. Artificial intelligence methods such as planning algorithms [12–14], ant colony optimization [15], fuzzy sets [2], and binary search trees [16] have been used too.
The use of methods based on Markov decision processes (MDPs) for the composition problem is certainly not new. In [17], the problem of workflow composition is modeled as an MDP and a Bayesian learning algorithm is used to estimate the true probability models involved in the MDP. In [18], the WSC is solved using QoS attributes in an MDP framework with two versions of the value iteration algorithm: one backward and recursive and one forward version. In [19], the authors proposed the use of what they call value of changed information. Their approach uses MDPs focusing on changes of the state transition function, in order to anticipate values of the service parameters that do not change the WSC. In [20], a combination of MDPs and HTN (Hierarchical Task Network) planning is proposed.
Solutions based on reinforcement learning are also relevant. For instance, in [21], reinforcement learning and preference logic were employed together to solve the WSC problem, obtaining a qualitative solution. The authors argue that computing a qualitative solution has many advantages over a quantitative one. Other methods using Q-learning are given in [22–24]. It is important to remember that reinforcement learning methods [25] belong to a family of algorithms closely related to MDPs. The main difference is that in these methods the state transition function is assumed to be unknown, and therefore the agents need to explore their state and action spaces by executing different actions in different states and observing the numerical rewards obtained after each state transition.
1.2. Contributions of This Paper
The goal of automatic WSC is to determine a sequence of Web services that can be combined to satisfy a set of predefined QoS constraints. For problems where we need to find the sequence of actions maximizing an overall performance function, MDPs are one of the most robust mathematical tools that we can use. Therefore, in this paper we propose an MDP model to solve the WSC problem. To show the reliability of our model, we conducted experiments with three of the most studied algorithms: policy iteration, iterative policy evaluation, and value iteration. Although all three algorithms provided good solutions, the policy iteration algorithm required the minimum number of iterations to converge to the optimal solutions. We also compared these three algorithms against sarsa and Q-learning, showing that the latter methods require one or two orders of magnitude more time to solve composition problems of the same complexity.
This paper is structured as follows. Section 2 provides the basics of the MDP framework and introduces the three algorithms that we tested. Section 3 introduces our MDP model for solving the WSC problem. Section 4 describes the experimental setup and presents the most relevant results. Section 5 presents comparative experiments with the sarsa and Q-learning algorithms. Finally, Section 6 concludes this paper by discussing the main findings and providing some advice for future research.
2. Markov Decision Processes
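The WSC problem can be abstracted as the problem of selecting a sequence of actions, in such a way that we maximize an overall evaluation function. Sequential decision problems of this kind can be defined and solved in an MDP framework. An MDP is a tuple $(S, A, P_{sa}, \gamma, R)$, where $S$ is a set of states, $A$ is a set of actions, $P_{sa}$ are the state transition probabilities for all states $s \in S$ and actions $a \in A$, $\gamma \in [0, 1)$ is a discount factor, and $R : S \times A \to \mathbb{R}$ is the reward function.
The MDP dynamics are the following. An agent in state $s_0 \in S$ performs an action $a_0$ selected from the set of actions $A$. As a result of performing action $a_0$, the agent receives a reward with expected value $R(s_0, a_0)$ and the current state of the MDP transitions to some successor state $s_1$, according to the transition probability $P_{s_0 a_0}$. Once in state $s_1$ the agent chooses and executes an action $a_1$, receiving reward $R(s_1, a_1)$ and moving to state $s_2$. The agent keeps choosing and executing actions, creating a path of visited states $s_0, s_1, s_2, \ldots$.
As the agent goes through states $s_0, s_1, s_2, \ldots$, it obtains the following rewards:
$$R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots \tag{1}$$
The reward at timestep $t$ is discounted by a factor of $\gamma^t$. By doing so, the agent gives more importance to those rewards obtained sooner. In an MDP we try to maximize the sum of expected rewards obtained by the agent:
$$E\left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots\right] \tag{2}$$
A policy $\pi$ is defined as a function $\pi : S \to A$ mapping from the states to the actions. A value function for a policy $\pi$ is the expected sum of discounted rewards, obtained by always performing the actions provided by $\pi$:
$$V^{\pi}(s) = E\left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots \mid s_0 = s, \; a_t = \pi(s_t)\right] \tag{3}$$
$V^{\pi}(s)$ is the expected sum of discounted rewards that the agent would receive if it starts in state $s$ and takes actions given by $\pi$. Given a fixed policy $\pi$, its value function $V^{\pi}$ satisfies the Bellman equation:
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') \, V^{\pi}(s') \tag{4}$$
The optimal value function is defined as
$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \tag{5}$$
This function gives the best possible expected sum of discounted rewards that can be obtained using any policy $\pi$. The Bellman equation for the optimal value function is
$$V^{*}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P_{sa}(s') \, V^{*}(s') \right] \tag{6}$$
The optimal value function is such that, for every state $s$, every policy $\pi$, and an optimal policy $\pi^{*}$, we have
$$V^{*}(s) = V^{\pi^{*}}(s) \geq V^{\pi}(s) \tag{7}$$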
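As a small numeric illustration of the discounted sum of rewards described above, the following sketch computes the return for a short reward sequence. The function name `discounted_return` and the example values are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t]."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Three equal rewards: the ones obtained later count for less.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # 1.0 + 0.5 + 0.25 = 1.75
```

With a discount factor closer to 1 the later rewards weigh almost as much as the first one, while a discount factor of 0 makes the agent consider only the immediate reward.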
2.1. Dynamic Programming Algorithms for MDPs
When the state transition probabilities are known, dynamic programming can be used to solve (6). Next, we present three efficient algorithms for solving finite-state MDPs by means of dynamic programming. The first one is iterative policy evaluation (given in Algorithm 1). The second one is the policy iteration algorithm (given in Algorithm 2). This algorithm repeatedly computes the value function for the current policy and then updates the policy using the current value function. The third one, shown in Algorithm 3 and called value iteration, can be thought of as an iterative update of the estimated value function using Bellman equation (6).
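The following is a minimal Python sketch of iterative policy evaluation and policy iteration, written from the textbook description of these methods rather than from the paper's Algorithms 1 and 2. The MDP encoding (a dictionary `P` mapping each state to its actions, and each action to a list of `(probability, next_state, reward)` triples), the toy two-class problem, and all names are assumptions made for illustration.

```python
def q_value(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, gamma, theta=1e-9):
    """Sweep over all states until the values of a fixed policy stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            new_v = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


def policy_iteration(P, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P if P[s]}  # arbitrary initial policy
    while True:
        V = policy_evaluation(P, policy, gamma)
        stable = True
        for s in policy:
            best = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            # Switch only on a strict improvement so ties cannot cause cycling.
            if q_value(P, V, s, best, gamma) > q_value(P, V, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical toy problem: pick one service from class 1, then one from class 2.
P = {
    "start": {"w1a": [(1.0, "after_w1a", 0.2)], "w1b": [(1.0, "after_w1b", 0.5)]},
    "after_w1a": {"w2a": [(1.0, "done", 1.0)], "w2b": [(1.0, "done", 0.1)]},
    "after_w1b": {"w2a": [(1.0, "done", 0.3)], "w2b": [(1.0, "done", 0.4)]},
    "done": {},
}
policy, V = policy_iteration(P, gamma=0.9)
print(policy["start"], round(V["start"], 3))
```

In this toy problem the greedy first choice (`w1b`, reward 0.5) is not the best one, because `w1a` leads to a state from which a reward of 1.0 is available.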



The last two algorithms usually converge faster than the first one. Moreover, policy iteration and value iteration are standard algorithms for solving MDPs, and there is currently no universal agreement over which algorithm is better [26, 27].
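For comparison, value iteration can be sketched in a few lines. This is a textbook-style sketch, not a reproduction of the paper's Algorithm 3; the MDP encoding (state to action to list of `(probability, next_state, reward)` triples) and the toy problem are hypothetical.

```python
def value_iteration(P, gamma, theta=1e-9):
    """Repeatedly apply the Bellman optimality backup, then read off a greedy policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            new_v = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    policy = {}
    for s in P:
        if P[s]:
            policy[s] = max(
                P[s],
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]),
            )
    return policy, V


# Hypothetical toy problem: pick one service from class 1, then one from class 2.
P = {
    "start": {"w1a": [(1.0, "after_w1a", 0.2)], "w1b": [(1.0, "after_w1b", 0.5)]},
    "after_w1a": {"w2a": [(1.0, "done", 1.0)], "w2b": [(1.0, "done", 0.1)]},
    "after_w1b": {"w2a": [(1.0, "done", 0.3)], "w2b": [(1.0, "done", 0.4)]},
    "done": {},
}
policy, V = value_iteration(P, gamma=0.9)
print(policy["start"], round(V["start"], 3))
```

Unlike policy iteration, this sketch never stores an explicit policy while iterating; the policy is extracted only once the value estimates have stopped changing.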
3. Web Service Composition Model
In this section we define the MDP model used to represent and solve the Web service composition problem by means of dynamic programming algorithms.
We begin by describing the WSC problem in more detail. Individual Web services can be categorized in classes by their functionality, input data, and output data. Given $n$ different classes of individual Web services, the WSC problem consists in finding a sequence of length $n$ of individual Web services $w_1, w_2, \ldots, w_n$, such that $w_i \in W_i$, for $i = 1, \ldots, n$, where $W_i$ is the set of all available Web services of class $i$. Thus, we are making the assumption that a valid composite Web service needs a Web service from each of the existing classes. We are also making the assumptions that all available Web services have been previously categorized into classes and that the ordering of the classes has been predefined. The ordering $W_1, W_2, \ldots, W_n$ means that a Web service from set $W_i$ must be executed before a Web service from set $W_{i+1}$ to ensure the correct operation of the selected Web services. The correct operation depends basically on their functionality and input and output data. Therefore, the output of $w_i$ must be fully compatible with the input of $w_{i+1}$.
Now, we are ready to introduce our model. We define a Web service composition problem as an MDP $(S, A, P_{sa}, \gamma, R)$, where $S$ is the set of states, $A$ is the set of actions, $P_{sa}$ is the state transition probability function, $\gamma$ is a discount factor such that $0 \leq \gamma < 1$, and $R$ is the reward function. Elements $S$, $A$, $P_{sa}$, and $R$ are defined next.
3.1. States
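$S$ is the set of states. Given a WSC problem with $n$ classes, where $W_i$ denotes the set of all available Web services of class $i$, $S$ consists of all compositions of length at most $n$. Thus, for $n = 1$, $S = \{(w_1)\}$, with $w_1 \in W_1$. A composition of length 1 is not really a composition; it is just a single Web service; however, we will relax the meaning of the word composition and will call it a composition of length 1. For $n = 2$, $S = \{(w_1), (w_1, w_2)\}$, with $w_1 \in W_1$ and $w_2 \in W_2$. For $n = 3$, $S = \{(w_1), (w_1, w_2), (w_1, w_2, w_3)\}$, with $w_1 \in W_1$, $w_2 \in W_2$, and $w_3 \in W_3$. In general, for a WSC problem with $n$ classes, $S = \{(w_1), (w_1, w_2), \ldots, (w_1, w_2, \ldots, w_n)\}$, with $w_i \in W_i$.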
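To make the size of this state space concrete, the sketch below enumerates every partial composition for a tiny problem. The service names are hypothetical, and including the empty composition (the state where nothing has been selected yet) is an assumption of this sketch.

```python
from itertools import product


def enumerate_states(classes):
    """List every composition prefix: lengths 0 (nothing selected) up to len(classes)."""
    states = [()]  # assumption: the empty composition is also treated as a state
    for k in range(1, len(classes) + 1):
        states.extend(product(*classes[:k]))
    return states


# Hypothetical example: 2 services in class 1 and 3 services in class 2.
classes = [["w1_a", "w1_b"], ["w2_a", "w2_b", "w2_c"]]
states = enumerate_states(classes)
print(len(states))  # 1 + 2 + 2 * 3 = 9
```

The number of states of length $k$ is the product of the sizes of the first $k$ classes, so this explicit enumeration is only practical for small problems.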
3.2. Actions
$A$ is the set of all actions. Given a state $s$, the set of actions available from $s$ is denoted by $A(s)$; thus $A = \bigcup_{s \in S} A(s)$. An action consists of selecting a Web service to be included in the current composition. If the current composition is of length $k$, all the possibilities of selecting a Web service of class $k + 1$ will constitute the set of current available actions.
Formally, we say that $A(s_k) = W_{k+1}$, the set of all available Web services of class $k + 1$, where $A(s_k)$ is read as the set of actions available from a state representing a composition of length $k$. Note that $A(s_0)$ refers to the set of actions available from a composition of length 0, which corresponds to the state where none of the Web services has been selected yet.
For example, if the current state represents the composition $(w_1, w_2)$, which is of length 2, then $A(s_2)$ is given by all the possibilities of selecting a Web service of class 3. In other words, we are in a situation where we have already selected Web services from class 1 and class 2, and now we need to select a Web service from class 3.
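The rule for available actions can be sketched as a small function: a composition of a given length may only be extended with a service from the next class. States are represented here as tuples of service names, which is an assumption of this sketch, as are the names themselves.

```python
def available_actions(state, classes):
    """Services that may be added next: those of the class following the current length."""
    k = len(state)
    if k >= len(classes):
        return []  # complete composition: nothing left to select
    return list(classes[k])


# Hypothetical example with three classes of services.
classes = [["w1_a", "w1_b"], ["w2_a"], ["w3_a", "w3_b"]]
print(available_actions((), classes))               # services of class 1
print(available_actions(("w1_a", "w2_a"), classes))  # services of class 3
```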
3.3. Transition Probabilities
$P_{sa}$ are the state transition probabilities for all states $s \in S$ and actions $a \in A(s)$ currently available from $s$. Note that the probability of going from a state $s_k$, representing a composition of length $k$, to the state $s_{k+1}$ obtained by adding the selected Web service is 1. Meanwhile, the probability of going from the same state to any other state is 0. In other words, we can only go from a composition state of length $k$ to another composition state of length $k + 1$.
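Because the transitions are deterministic, the transition function reduces to a simple check. The sketch below assumes that states are tuples of service names and that an action is the name of the service being added; both are illustrative choices.

```python
def transition_probability(state, action, next_state, classes):
    """1.0 if next_state is state extended by a valid action, otherwise 0.0."""
    k = len(state)
    if k >= len(classes) or action not in classes[k]:
        return 0.0  # the action is not available from this state
    return 1.0 if next_state == state + (action,) else 0.0


# Hypothetical example with two classes of services.
classes = [["w1_a", "w1_b"], ["w2_a", "w2_b"]]
print(transition_probability(("w1_a",), "w2_b", ("w1_a", "w2_b"), classes))  # 1.0
print(transition_probability(("w1_a",), "w2_b", ("w1_a", "w2_a"), classes))  # 0.0
```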
3.4. Reward Function
$R$ is the reward function: it gives the reward received when an action is executed and the environment makes a transition from the current state to a successor state. The reward function for our model is computed using three QoS attributes, as indicated in (8), which was originally proposed in [22]. The QoS attributes employed are availability, execution time, and throughput. Equation (8) is expressed in terms of the availability, average execution time, and throughput values of the last Web service added to the composition represented by the state, together with the minimum and maximum values of each attribute over all the Web services.
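One common way to combine QoS attributes using their minimum and maximum values is min-max normalization. The sketch below is only an illustration of that idea and should not be read as Equation (8): the equal weighting, the inversion of execution time (so that faster services score higher), and all names and values are assumptions.

```python
def normalize(value, low, high, higher_is_better=True):
    """Scale value into [0, 1] using the observed minimum and maximum."""
    if high == low:
        return 0.0  # all services share the same value, so it carries no information
    scaled = (value - low) / (high - low)
    return scaled if higher_is_better else 1.0 - scaled


def make_reward(services):
    """Build a reward function over service names from a table of QoS values."""
    keys = ("availability", "exec_time", "throughput")
    lows = {k: min(s[k] for s in services.values()) for k in keys}
    highs = {k: max(s[k] for s in services.values()) for k in keys}

    def reward(name):
        s = services[name]
        return (
            normalize(s["availability"], lows["availability"], highs["availability"])
            + normalize(s["exec_time"], lows["exec_time"], highs["exec_time"],
                        higher_is_better=False)
            + normalize(s["throughput"], lows["throughput"], highs["throughput"])
        )

    return reward


# Hypothetical QoS table: availability in [0, 1], time in seconds, throughput in calls/s.
services = {
    "slow": {"availability": 0.90, "exec_time": 2.0, "throughput": 10.0},
    "fast": {"availability": 1.00, "exec_time": 1.0, "throughput": 20.0},
    "mid": {"availability": 0.95, "exec_time": 1.5, "throughput": 15.0},
}
reward = make_reward(services)
print(reward("fast"), reward("slow"))
```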
4. Experimental Evaluation
In this section we provide the results of our experimental comparison using two scenarios, one real and one artificial. The experiments that we present in this section were performed by running the policy iteration, iterative policy evaluation, and value iteration algorithms on an Intel Core i5 2.5 GHz processor with 6 GB RAM, under the 64-bit Windows 8.1 operating system.
4.1. Real Scenario
The WSC problem considered as our first experimental scenario consists of 2 classes of Web services. One class is about weather services that can be used to obtain the current temperature in a city. The other class is about Web services that can be used to convert temperatures from one unit to another, for example, from Fahrenheit to Celsius. In the class of weather services we considered 3 different Web services.
(i) National Oceanic and Atmospheric Administration (NOAA) Web service, available at http://graphical.weather.gov/xml/SOAP_server/ndfdXMLserver.php.
(ii) GlobalWeather Web service, available at http://www.webservicex.net/globalweather.asmx.
(iii) Weather channel Web service, available at http://api.wunderground.com/.
In the class of unit conversion services we considered 4 different Web services.
(i) A simple calculator Web service such as the one available at http://www.dneonline.com/calculator.asmx. Since $C = (F - 32) \times 5/9$, we can use subtraction, multiplication, and division operations for the temperature conversion.
(ii) ConvertTemperature Web service, available at http://www.webservicex.net/ConvertTemperature.asmx.
(iii) TemperatureConversions Web service, available at http://webservices.daehosting.com/services/TemperatureConversions.wso.
(iv) TempConvert Web service, available at http://www.w3schools.com/webservices/tempconvert.asmx.
We obtained the QoS attribute values of all 7 Web services using a Java program designed to compute availability, average execution time, and throughput. The formulas are expressed in terms of the number of successful calls to the Web service, the total number of calls, and the total execution time for all the calls.
In order to obtain representative QoS values for the Web services, we took many measurements over several days, at different times of the day. We recorded the values of each parameter for every measurement and then calculated the average values of the QoS parameters.
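The measurement procedure can be sketched as follows. The paper's program was written in Java and its exact formulas are not reproduced here; this Python sketch assumes that availability is the fraction of successful calls, that average execution time is the total time divided by the number of calls, and that throughput is the number of successful calls per second. It also assumes a non-empty batch with positive total time. All names and sample values are hypothetical.

```python
def summarize_calls(samples):
    """Summarize one measurement batch of (succeeded, elapsed_seconds) records."""
    total_calls = len(samples)
    successful = sum(1 for ok, _ in samples if ok)
    total_time = sum(elapsed for _, elapsed in samples)
    return {
        "availability": successful / total_calls,
        "avg_execution_time": total_time / total_calls,
        "throughput": successful / total_time,  # assumed: successful calls per second
    }


def average_qos(summaries):
    """Average each QoS attribute over several measurement batches."""
    keys = summaries[0].keys()
    return {k: sum(s[k] for s in summaries) / len(summaries) for k in keys}


# Hypothetical batches taken at two different times of the day.
morning = summarize_calls([(True, 0.2), (True, 0.4), (False, 1.0), (True, 0.4)])
evening = summarize_calls([(True, 0.5), (True, 0.5)])
print(average_qos([morning, evening]))
```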
Once we gathered the information of the QoS attributes we used all 3 dynamic programming algorithms to learn the best composite Web service. With 7 Web services belonging to 2 different classes, there are 12 possible compositions. All these possibilities are represented with the graph illustrated in Figure 1.
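The count of 12 compositions follows from pairing each of the 3 weather services with each of the 4 conversion services. A short sketch, using shortened service labels chosen here for illustration:

```python
from itertools import product

# Labels abbreviate the services listed above; the labels themselves are illustrative.
weather_services = ["NOAA", "GlobalWeather", "Weather channel"]
conversion_services = [
    "Calculator",
    "ConvertTemperature",
    "TemperatureConversions",
    "TempConvert",
]

# Every valid composition takes one service from each class, in class order.
compositions = list(product(weather_services, conversion_services))
print(len(compositions))  # 3 * 4 = 12
```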
The graph of the real scenario illustrates each class of Web services as a layer. In this graph, each node represents an individual Web service. The initial node represents the state where none of the Web services has been selected yet. The final node represents the state where a full composition of Web services has been accomplished. A path from the initial node to the final node implies that a valid composite Web service has been generated.
Results with the real Web services scenario are plotted in Figure 2. All 3 algorithms found the solution for the Web service composition very quickly, in less than 0.07 seconds, with policy iteration being the winner.
4.2. Artificial Scenario
As our second scenario to test all 3 dynamic programming algorithms, we simulated data for three QoS attributes: availability, execution time, and throughput. We created a maximum of 100,000 individual Web services, classified into 100 hypothetical classes of Web services. We assumed that every Web service in a class can access all the Web services in the next class. Each of these classes is represented as a layer in Figure 3. Each layer contains 100 nodes or individual Web services.
As in the first scenario, the initial node of the graph represents a state where none of the Web services has been selected yet. The final node is reached when a valid composition has been accomplished. The nodes between the initial and the final nodes represent the available Web services. A route from the initial node to the final node gives a possible composite Web service.
Results of this second set of experiments are shown in Figures 4, 5, and 6, for , , and , respectively.
Each layer in the graph represents 100 Web services belonging to the same class. Therefore, when the number of nodes to be selected for a valid Web service composition is 1,000, we are really solving a problem with 100 × 1,000 = 100,000 Web services. We can see from the learning curves that the time needed to solve the MDP problem increases as the number of nodes is increased. Again, all 3 algorithms found the optimal solution, but policy iteration found it in less time. The best performances of the algorithms were obtained for and , requiring less than 180 seconds to find the optimal composition using iterative policy evaluation and value iteration and less than 120 in the case of policy iteration.
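The size of the largest artificial problem, and the layered structure of the graph, can be sketched as follows. This is not the experimental code: it assumes that each service carries a single precomputed reward, that every service in a layer can reach every service in the next layer, and it uses a single backward dynamic-programming pass instead of the three algorithms compared in the paper. Under these assumptions the value of a node depends only on its layer. All names are hypothetical.

```python
import random


def make_layers(n_layers, per_layer, seed=0):
    """Generate one random reward in [0, 1) per service, grouped by layer."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(per_layer)] for _ in range(n_layers)]


def best_composition(layers, gamma):
    """Backward pass: choose the best service per layer and accumulate discounted value."""
    value = 0.0
    choice = [0] * len(layers)
    for k in range(len(layers) - 1, -1, -1):
        best = max(range(len(layers[k])), key=lambda j: layers[k][j])
        choice[k] = best
        value = layers[k][best] + gamma * value
    return choice, value


layers = make_layers(n_layers=1000, per_layer=100)
print(sum(len(layer) for layer in layers))  # 100 * 1,000 = 100,000 services
choice, value = best_composition(layers, gamma=0.9)
print(len(choice))  # one selected service per layer
```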
5. Comparison with Sarsa and Q-Learning
In some related works [22–24], reinforcement learning algorithms were proposed to solve the Web service composition problem. In this section we compare the learning times required by sarsa and Q-learning against policy iteration, iterative policy evaluation, and value iteration.
5.1. Sarsa
Sarsa [25] is an on-policy temporal difference control algorithm which continually estimates the state-action value function $Q^{\pi}$ for the behavior policy $\pi$ and at the same time changes $\pi$ toward greediness with respect to $Q^{\pi}$. Algorithm 4 presents the sarsa algorithm as taken from [25].

If the policy is such that each action is executed infinitely often in every state, every state is visited infinitely often, and it is greedy with respect to the current action-value function in the limit, then by decaying the learning rate $\alpha$, the algorithm converges to $Q^{*}$ [28].
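A minimal sarsa sketch for a layered composition problem is given below. It follows the textbook update rule rather than reproducing Algorithm 4, and it uses a constant learning rate and a constant exploration rate for brevity, so the convergence conditions stated above are not met exactly. The environment (one reward per service, states as tuples of chosen indices) and all parameter values are assumptions.

```python
import random


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))


def sarsa(layers, episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """On-policy TD control: the target uses the action actually selected next."""
    rng = random.Random(seed)
    n = len(layers)
    Q = {}
    for _ in range(episodes):
        state = ()
        action = epsilon_greedy(Q, state, len(layers[0]), epsilon, rng)
        while True:
            reward = layers[len(state)][action]
            next_state = state + (action,)
            if len(next_state) == n:  # composition complete: no bootstrap term
                next_action = None
                target = reward
            else:
                next_action = epsilon_greedy(
                    Q, next_state, len(layers[len(next_state)]), epsilon, rng
                )
                target = reward + gamma * Q.get((next_state, next_action), 0.0)
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + alpha * (target - old)
            if next_action is None:
                break
            state, action = next_state, next_action
    return Q


def greedy_composition(Q, layers):
    """Follow the greedy action in every layer, starting from the empty composition."""
    state = ()
    for k in range(len(layers)):
        best = max(range(len(layers[k])), key=lambda a: Q.get((state, a), 0.0))
        state += (best,)
    return state


# Hypothetical rewards: 2 layers with 2 services each.
layers = [[0.2, 0.5], [1.0, 0.1]]
Q = sarsa(layers)
print(greedy_composition(Q, layers))
```

Only the state-action pairs that are actually visited receive an entry in `Q`, which is the main practical difference from the full sweeps performed by the dynamic programming algorithms.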
5.2. Q-Learning
Q-learning [29] is an off-policy temporal difference control algorithm which directly approximates the optimal action-value function, independently of the policy being followed. It is one of the most popular algorithms in reinforcement learning. Algorithm 5 reproduces the Q-learning algorithm as taken from [25].

If in the limit the action-values of all state-action pairs are updated infinitely often, with a decaying learning rate $\alpha$, then the algorithm converges to $Q^{*}$ with probability 1 [26, 30].
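A matching Q-learning sketch differs from sarsa only in the update target, which uses the best action value in the next state regardless of the action the behavior policy will take. As before, this is a textbook-style sketch rather than a reproduction of Algorithm 5; the constant learning rate, the layered environment, and all names and values are assumptions.

```python
import random


def q_learning(layers, episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Off-policy TD control: the target uses the maximum next action value."""
    rng = random.Random(seed)
    n = len(layers)
    Q = {}

    def best_value(state):
        k = len(state)
        if k == n:  # composition complete: nothing more to gain
            return 0.0
        return max(Q.get((state, a), 0.0) for a in range(len(layers[k])))

    for _ in range(episodes):
        state = ()
        while len(state) < n:
            n_actions = len(layers[len(state)])
            if rng.random() < epsilon:  # explore
                action = rng.randrange(n_actions)
            else:  # exploit the current estimates
                action = max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))
            reward = layers[len(state)][action]
            next_state = state + (action,)
            target = reward + gamma * best_value(next_state)
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + alpha * (target - old)
            state = next_state
    return Q


def greedy_composition(Q, layers):
    """Follow the greedy action in every layer, starting from the empty composition."""
    state = ()
    for k in range(len(layers)):
        best = max(range(len(layers[k])), key=lambda a: Q.get((state, a), 0.0))
        state += (best,)
    return state


# Hypothetical rewards: 2 layers with 2 services each.
layers = [[0.2, 0.5], [1.0, 0.1]]
Q = q_learning(layers)
print(greedy_composition(Q, layers))
```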
5.3. Learning Time Analysis
We have implemented the sarsa and Q-learning algorithms to solve the real scenario problem defined previously in the experimental section. A comparison graph illustrating the time required by sarsa, Q-learning, policy iteration, iterative policy evaluation, and value iteration is given using a logarithmic scale in Figure 7. From this graph we can clearly see that sarsa and Q-learning required two orders of magnitude more time to find the optimal composition.
Additionally, we ran experiments with a second artificially created scenario, with 3 layers of 20 Web services each. Once more, reinforcement learning methods required much more time than the dynamic programming algorithms. Logarithmic time curves given in Figure 8 show that sarsa and Q-learning required one order of magnitude more time than the dynamic programming algorithms. Furthermore, in some of the experiments, reinforcement learning algorithms failed to find the optimal solution, getting stuck in suboptimal compositions.
Dynamic programming methods converge faster than reinforcement learning methods simply because dynamic programming methods update every single state value at each iteration. Reinforcement learning methods update only the values of the states they happen to visit under their exploration policy, which here is epsilon-greedy.
Furthermore, in terms of the deployment of an automatic Web service composition system, it is worth mentioning that the gathering of QoS information can be performed at specific time intervals by a dedicated module of such system. Once we have gathered this information, which is fundamental for the evaluation of the reward function, there is no need to explore the state space of Web services as reinforcement learning methods do. We can simply run a dynamic programming algorithm to estimate the value function of the Web services and then compute the optimal composition of Web services.
6. Conclusion
In this paper we have proposed an MDP model to address the Web service composition problem. We used three dynamic programming algorithms, namely, iterative policy evaluation, value iteration, and policy iteration, to show the reliability of our approach. Experiments were conducted with both artificially created data and a set of real data involving seven publicly available Web services.
Our experimental results show that policy iteration is the best one in terms of the minimum number of iterations needed to estimate an optimal policy. The optimal policy indicates the sequence of combined individual Web services making up a composite Web service with the highest evaluation of their QoS attributes.
Although some approaches using reinforcement learning have also been proposed, we argue that dynamic programming methods are better suited for the Web service composition problem than reinforcement learning methods. The reason is that reinforcement learning methods such as sarsa and Q-learning require a lot of exploration of the state space and consequently they need more iterations to make a good estimation of the optimal policy. To illustrate this, we compared sarsa and Q-learning against policy iteration, iterative policy evaluation, and value iteration. The result of this comparison is that sarsa and Q-learning required one or two orders of magnitude more time than the dynamic programming methods to handle problems of the same complexity. Moreover, in some of the artificially created experiments, reinforcement learning algorithms got stuck in suboptimal Web service compositions.
None of the related works proposing the use of MDP-based methods to solve the Web service composition problem have provided a comparison study involving the five algorithms that we have analyzed in this work: iterative policy evaluation, value iteration, policy iteration, sarsa, and Q-learning. Moreover, we present experimental results using both a real scenario and a Web service composition scenario with artificially generated data. All other related works report experiments performed only with artificially created data.
Future research on this topic must address real Web services composition involving more nodes. Another interesting subject that deserves to be further investigated is the design of complex reward functions capable of handling an increasing number of QoS factors.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would also like to thank the Secretaria de Educacion of Mexico for the partial support through Grant PIFI201331MSU0098J14.
References
[1] W3C Working Group, Web Services Architecture, 2004, http://www.w3.org/TR/ws-arch/.
[2] V. X. Tran and H. Tsuji, “QoS based ranking for web Services: fuzzy approaches,” in Proceedings of the 4th International Conference on Next Generation Web Services Practices (NWeSP '08), pp. 77–82, Seoul, Republic of Korea, October 2008.
[3] S.Y. Hwang, E.P. Lim, C.H. Lee, and C.H. Chen, “Dynamic Web service selection for reliable Web service composition,” IEEE Transactions on Services Computing, vol. 1, no. 2, pp. 104–116, 2008.
[4] D.H. Shin, K.H. Lee, and T. Suda, “Automated generation of composite web services based on functional semantics,” Journal of Web Semantics, vol. 7, no. 4, pp. 332–343, 2009.
[5] Y. Yan, P. Poizat, and L. Zhao, “Self-adaptive service composition through graphplan repair,” in Proceedings of the IEEE 8th International Conference on Web Services (ICWS '10), pp. 624–627, July 2010.
[6] W. Jiang, S. Hu, D. Lee, S. Gong, and Z. Liu, “Continuous query for QoS-aware automatic service composition,” in Proceedings of the IEEE 19th International Conference on Web Services (ICWS '12), pp. 50–57, Honolulu, Hawaii, USA, June 2012.
[7] Y. Feng, A. Veeramani, and R. Kanagasabai, “Automatic DAG-based service composition: a model checking approach,” in Proceedings of the IEEE 19th International Conference on Web Services (ICWS '12), June 2012.
[8] Y. Yan, M. Chen, and Y. Yang, “Anytime QoS optimization over the PlanGraph for web service composition,” in Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC '12), pp. 1968–1975, March 2012.
[9] L. Zeng, B. Benatallah, A. H. H. Ngu, M. Dumas, J. Kalagnanam, and H. Chang, “QoS-aware middleware for Web services composition,” IEEE Transactions on Software Engineering, vol. 30, no. 5, pp. 311–327, 2004.
[10] D. Ardagna and B. Pernici, “Adaptive service composition in flexible processes,” IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 369–384, 2007.
[11] T. Yu, Y. Zhang, and K.J. Lin, “Efficient algorithms for Web services selection with end-to-end QoS constraints,” ACM Transactions on the Web, vol. 1, no. 1, article 6, 2007.
[12] S.C. Oh, D. Lee, and S. R. T. Kumara, “Effective Web service composition in diverse and large-scale service networks,” IEEE Transactions on Services Computing, vol. 1, no. 1, pp. 15–32, 2008.
[13] Y. Bo and Q. Zheng, “Semantic web service composition using graphplan,” in Proceedings of the 4th IEEE Conference on Industrial Electronics and Applications (ICIEA '09), pp. 459–463, Xi'an, China, May 2009.
[14] P. Rodriguez-Mier, M. Mucientes, and M. Lama, “Automatic web service composition with a heuristic-based search algorithm,” in Proceedings of the IEEE 9th International Conference on Web Services (ICWS '11), pp. 81–88, July 2011.
[15] F. Qiqing, P. Xiaoming, L. Qinghua, and H. Yahui, “A global QoS optimizing web services selection algorithm based on MOACO for dynamic web service composition,” in Proceedings of the International Forum on Information Technology and Applications (IFITA '09), pp. 37–42, Chengdu, China, May 2009.
[16] M. Oh, J. Baik, S. Kang, and H.J. Choi, “An efficient approach for QoS-aware service selection based on a tree-based algorithm,” in Proceedings of the 17th IEEE/ACIS International Conference on Computer and Information Science (ICIS '08), pp. 605–610, IEEE, Portland, Ore, USA, May 2008.
[17] P. Doshi, R. Goodwin, R. Akkiraju, and K. Verma, “Dynamic workflow composition using Markov decision processes,” in Proceedings of the IEEE International Conference on Web Services (ICWS '04), pp. 576–582, July 2004.
[18] A. Gao, D. Yang, S. Tang, and M. Zhang, “Web service composition using Markov decision processes,” in Advances in Web-Age Information Management: Proceedings 6th International Conference, WAIM 2005, Hangzhou, China, October 11–13, 2005, vol. 3739 of Lecture Notes in Computer Science, pp. 308–319, Springer, Berlin, Germany, 2005.
[19] J. Harney and P. Doshi, “Selective querying for adapting web service compositions using the value of changed information,” IEEE Transactions on Services Computing, vol. 1, no. 3, pp. 169–185, 2008.
[20] K. Chen, J. Xu, and S. Reiff-Marganiec, “Markov-HTN planning approach to enhance flexibility of automatic web service composition,” in Proceedings of the IEEE International Conference on Web Services (ICWS '09), pp. 9–16, Los Angeles, Calif, USA, July 2009.
[21] H. Wang, P. Tang, and P. Hung, “RLPLA: A reinforcement learning algorithm of web service composition with preference consideration,” in Proceedings of the IEEE Congress on Services Part II, 2008.
[22] H. Wang, X. Zhouy, X. Zhou, W. Liu, and W. Li, “Adaptive and dynamic service composition using Q-learning,” in Proceedings of the 22nd International Conference on Tools with Artificial Intelligence (ICTAI '10), pp. 145–152, Arras, France, October 2010.
[23] V. Todica, M.F. Vaida, and M. Cremene, “Formal verification in web services composition,” in Proceedings of the 18th IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR '12), pp. 195–200, May 2012.
[24] L. Yu, W. Zhili, L. Meng, W. Jiang, and X.S. Qiu, “Adaptive web services composition using Q-learning in cloud,” in Proceedings of the 9th IEEE World Congress on Services (SERVICES '13), pp. 393–396, Santa Clara, Calif, USA, July 2013.
[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, Mass, USA, 1998.
[26] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
[27] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley-Interscience, 1994.
[28] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Convergence results for single-step on-policy reinforcement-learning algorithms,” Machine Learning, vol. 38, no. 3, pp. 287–308, 2000.
[29] C. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, 1989.
[30] T. Jaakkola, M. I. Jordan, and S. Singh, “On the convergence of stochastic iterative dynamic programming algorithms,” Neural Computation, vol. 6, pp. 1185–1201, 1994.
Copyright
Copyright © 2015 Víctor Uc-Cetina et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.