## Coordinated Control and Estimation of Multiagent Systems with Engineering Applications


Wei Zeng, Hongtao Zhou, Mingshan You, "Risk-Sensitive Multiagent Decision-Theoretic Planning Based on MDP and One-Switch Utility Functions", *Mathematical Problems in Engineering*, vol. 2014, Article ID 697895, 11 pages, 2014. https://doi.org/10.1155/2014/697895

# Risk-Sensitive Multiagent Decision-Theoretic Planning Based on MDP and One-Switch Utility Functions

**Academic Editor:** Wei Zhang

#### Abstract

In high-stakes situations, decision-makers are often risk-averse, and decision-making processes often take place in group settings. This paper studies multiagent decision-theoretic planning under the Markov decision process (MDP) framework, taking into account how an agent’s risk attitude changes as his wealth level varies. Based on the one-switch utility function, which describes how an agent’s risk attitude changes with his wealth level, we give additive and multiplicative aggregation models of group utility and adopt maximization of expected group utility as the planning objective. The characteristics of the optimal policy as the wealth level approaches negative infinity are analyzed for the additive and multiplicative aggregation models, respectively. A backward-induction method is then proposed that divides the wealth level interval from negative infinity to the initial wealth level into subintervals and determines the optimal policy for each state and subinterval. The proposed method is illustrated by numerical examples, and the influence of the agents’ risk aversion parameters and weights on group decision-making is also analyzed.

#### 1. Introduction

Decision-theoretic planning computes an optimal policy, that is, a course of action that maximizes expected reward when actions have uncertain outcomes [1]. In high-stakes situations with the possibility of large wins and losses, such as emergency and crisis response, business and investment decisions, military battles, and lotteries, decision-makers are often risk-averse. In risk-sensitive decisions, the exponential utility function is one of the typical utility functions used to model a decision-maker’s risk aversion, and maximizing expected utility is the most commonly used rule instead of maximizing expected reward. However, the risk attitude of a decision-maker modeled by an exponential utility function is independent of his wealth level, whereas in reality a person’s risk attitude often changes with his wealth level [2–4]. Bell proposes a kind of utility function, named the one-switch utility function, to model an agent who is always risk-averse but becomes risk neutral as his wealth increases [2]. Bell and Fishburn further study the characteristics of a form of one-switch utility function that combines a linear utility function and an exponential utility function [5]. Liu and Koenig give a form of one-switch utility function that includes the agent’s risk aversion parameter $\gamma$; that is, $U(w) = w - D\gamma^{w}$, where $w$ denotes the wealth level and $D$ is a parameter that adjusts the tradeoff between risk neutrality and risk aversion. This form of one-switch utility function not only describes the change of the agent’s risk attitude but also expresses the degree of the agent’s risk aversion through the quantitative risk aversion parameter [6–8].

For decision-theoretic planning problems, the Markov decision process (MDP) framework is broadly adopted as the underlying model. Howard and Matheson, in their seminal paper, introduce risk-sensitive MDPs based on maximizing the expected exponential utility [9]. Follow-up studies investigate structural properties of the optimal solution and algorithms to compute the optimal policy based on the exponential utility function [10–12]. If an agent is risk-sensitive, it is necessary to consider how the agent’s risk attitude may change as his wealth level varies and how that change influences decisions in the next stage. Liu and Koenig study Markov decision processes in which the agent’s risk-sensitive attitude is modeled by a one-switch utility function and propose an exact backward-induction algorithm to compute the optimal policy [8].

In reality, decision-making processes often take place in group settings because a single decision-maker’s decision-making ability is limited. For the group decision-making problem, group utility is usually obtained by aggregating personal utilities, and group decisions are then made based on the group utility. The aggregation methods include the additive value model and the multiplicative value model. Other methods, such as multiobjective linear programming [13], fuzzy set methods [14, 15], and interactive approaches [16, 17], are used to aggregate individual decision information, including attribute weights and attribute values, into group decisions. In addition, some studies of group decision-making take time into consideration. Xu investigates multistage multiattribute group decision-making problems in which the weight information on a collection of attributes and the decision information on a finite set of alternatives with respect to the attributes are collected at different stages [18].

This paper focuses on decision-theoretic planning problems in which sequential decisions are made by a group of risk-sensitive members. Taking the agents’ risk-sensitive attitudes and wealth levels into account, it studies the risk-sensitive multiagent decision-theoretic planning problem based on the one-switch utility function and the MDP framework. Two group utility functions are given, based on the additive value model and the multiplicative value model of one-switch utility functions, respectively. For each of these two kinds of group utility function, a backward-induction algorithm is proposed to compute the optimal policy of risk-sensitive group decision-making under the MDP framework.

The rest of this paper is organized as follows. The one-switch utility function and the risk-sensitive MDP model augmented with wealth level are introduced in Section 2. In Section 3, the additive and multiplicative aggregation models of one-switch utility functions are given. Section 4 analyzes the characteristics of the optimal policy as the wealth level approaches negative infinity for the additive and multiplicative aggregation models, respectively. In Section 5, detailed backward-induction algorithms are proposed to solve the multiagent decision-theoretic planning problem based on the additive and multiplicative aggregation models, respectively. In Section 6, numerical examples illustrate the proposed method and are used to analyze the influence of the agents’ risk aversion parameters and weights on group decision-making. Finally, conclusions and suggested topics for future research are presented in Section 7.

#### 2. Risk-Sensitive MDP Model Augmented with Wealth Level

##### 2.1. One-Switch Utility Function

The one-switch utility function is a kind of utility function that describes the change of an agent’s risk attitude as his wealth level varies. In detail, there exists a wealth level below which the agent is risk-averse and above which the agent becomes risk neutral. For agent $i$, the one-switch utility function given by Liu and Koenig is as follows [6–8]:

$$u_i(w) = w - D\gamma_i^{w},$$

where $w$ is the wealth level, $\gamma_i$ is agent $i$’s risk aversion parameter, and $0 < \gamma_i < 1$. $D > 0$ is a constant that provides an adjustable tradeoff between risk neutrality (linear term) and risk aversion (exponential term). $u_L(w) = w$ is a linear utility function. $u_{e,i}(w) = -\gamma_i^{w}$ is agent $i$’s exponential utility function.
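The switch in risk attitude can be checked numerically. The following is a minimal sketch, not taken from the paper, that assumes the Liu and Koenig form $U(w) = w - D\gamma^{w}$ with $0 < \gamma < 1$ and $D > 0$; the function names and the sample values of $\gamma$ and $w$ are hypothetical. It evaluates the Arrow-Pratt measure of absolute risk aversion, $-U''(w)/U'(w)$, which under this assumed form tends to $-\ln\gamma$ for very low wealth and to zero (risk neutrality) for very high wealth.

```python
import math

def one_switch_utility(w: float, gamma: float, D: float = 1.0) -> float:
    """Assumed one-switch form U(w) = w - D * gamma**w, with 0 < gamma < 1, D > 0."""
    return w - D * gamma ** w

def absolute_risk_aversion(w: float, gamma: float, D: float = 1.0) -> float:
    """Arrow-Pratt measure -U''(w) / U'(w) for the assumed form."""
    ln_g = math.log(gamma)                      # negative because gamma < 1
    first = 1.0 - D * ln_g * gamma ** w         # U'(w), strictly positive
    second = -D * ln_g ** 2 * gamma ** w        # U''(w), strictly negative
    return -second / first

# Hypothetical sample: gamma = 0.9 at low, zero, and high wealth.
for w in (-50.0, 0.0, 50.0):
    print(w, one_switch_utility(w, 0.9), absolute_risk_aversion(w, 0.9))
```

The printed risk-aversion values decrease as the wealth level grows, which is the qualitative behavior the one-switch family is meant to capture.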

##### 2.2. Risk-Sensitive MDP Model Augmented with Wealth Level

In this paper, the goal-directed Markov decision problem (GDMDP) is adopted as the underlying model of the decision-theoretic planning problem [8]. A GDMDP is a kind of MDP with a finite set of goal states. When an agent reaches a goal state, he stops acting and receives no more rewards thereafter. The one-switch utility function is used to describe the agent’s risk-sensitive attitude, and maximizing expected utility is adopted as the planning objective instead of maximizing expected reward. Because the wealth level is an argument of the one-switch utility function, it is necessary to include the wealth level as a component of the system state of the GDMDP.

Formally, a GDMDP consists of a finite set of states $S$ with wealth levels $W$, so the augmented state set of the GDMDP is denoted by $S \times W$. The goal state set is $G$, where $G \subseteq S$. The nongoal state set is $S \setminus G$.

The agent’s action set is $A$. The agent chooses an action $a \in A$ to execute in its current state $s$.

The agent’s execution of action $a$ in state $s$ results in a finite reward $r(s, a, s')$ and a transition to successor state $s'$ with probability $P(s' \mid s, a)$. In this paper only costs are considered, so every reward is assumed to be strictly negative: $r(s, a, s') < 0$.

We also use $s_t$ and $a_t$ to denote the state and action at time step $t$, and $r_t$ to denote the reward for executing action $a_t$. After the agent reaches a goal state, $r_t = 0$.

$w_t$ is the agent’s wealth level at time step $t$, where the initial wealth level is denoted by $w_0$.

For the MDP model augmented with wealth level, the optimal policy maps every combination of a state $s$ and wealth level $w$ to an action that an agent in state $s$ with wealth level $w$ should execute to maximize expected utility.

For agent $i$’s exponential utility function $u_{e,i}$ and all policies $\pi$, we define the value $v^{\pi}_{e,i}(s, w)$ as the expected exponential utility of agent $i$ who starts in state $s$ with initial wealth level $w$ and follows policy $\pi$.

The optimal value $v^{*}_{e,i}(s, w)$ is defined as the highest possible expected exponential utility of agent $i$ with initial state $s$ and initial wealth level $w$. Assume that $v^{*}_{e,i}(s, w)$ is finite for all states $s$ and wealth levels $w$.

An optimal policy $\pi^{*}_{e,i}$ is defined as a policy with $v^{\pi^{*}_{e,i}}_{e,i}(s, w) = v^{*}_{e,i}(s, w)$ for all states $s$ and wealth levels $w$.

Similarly, for the linear utility function $u_L$, we use $v^{\pi}_{L}$, $v^{*}_{L}$, and $\pi^{*}_{L}$ to denote the expected utility, optimal value, and optimal policy, respectively.
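The expected linear and exponential utilities of a fixed policy can be computed by iterating standard policy-evaluation recursions of the kind used in risk-sensitive MDPs. The following is a minimal sketch under stated assumptions, not the paper's implementation: the exponential utility is taken to be $-\gamma^{w}$, values are normalized to wealth level zero (so a goal state has linear value 0 and exponential value $-1$), the policy does not depend on the wealth level, and the function name, the data layout, and the one-state example with its numbers are all hypothetical.

```python
def evaluate_policy(transitions, goal, gamma, sweeps=10_000, tol=1e-12):
    """Evaluate a fixed policy on a goal-directed MDP.

    transitions[s] is a list of (probability, reward, next_state) triples
    for the action the policy takes in nongoal state s.
    Returns (v_lin, v_exp): expected total reward and expected value of
    -gamma**(total reward), both measured from wealth level zero.
    """
    v_lin = {s: 0.0 for s in transitions}
    v_exp = {s: -1.0 for s in transitions}
    v_lin[goal], v_exp[goal] = 0.0, -1.0        # no further reward at the goal
    for _ in range(sweeps):
        delta = 0.0
        for s, outcomes in transitions.items():
            new_lin = sum(p * (r + v_lin[s2]) for p, r, s2 in outcomes)
            new_exp = sum(p * gamma ** r * v_exp[s2] for p, r, s2 in outcomes)
            delta = max(delta, abs(new_lin - v_lin[s]), abs(new_exp - v_exp[s]))
            v_lin[s], v_exp[s] = new_lin, new_exp
        if delta < tol:
            break
    return v_lin, v_exp

# Hypothetical example: from "s0" the action costs 2 and reaches the goal
# "g" with probability 0.5, otherwise the agent stays in "s0".
policy_model = {"s0": [(0.5, -2.0, "g"), (0.5, -2.0, "s0")]}
v_lin, v_exp = evaluate_policy(policy_model, goal="g", gamma=0.9)
print(v_lin["s0"], v_exp["s0"])
```

The exponential recursion only converges when the expected discount factor of staying in nongoal states is below one, which is the numerical counterpart of the finiteness assumption made in the text.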

It is worth noting that, unlike Liu and Koenig [8], this paper focuses on decision-making in a group setting. The value function of the MDP is therefore replaced by a group utility function, which is the aggregation of personal one-switch utilities, and the planning objective is to maximize the expected group utility.

#### 3. Utility Aggregation Model of One-Switch Utility Functions

Group utility is the aggregation of personal utilities. Common aggregation methods include the additive value model and the multiplicative value model. The following sections discuss the additive and multiplicative value models for the aggregation of personal one-switch utility functions, respectively.

##### 3.1. Additive Aggregation Model of One-Switch Utility Functions

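In general, the additive aggregation model of group utility is defined as follows:

$$u(w) = \sum_{i=1}^{n} k_i u_i(w),$$

where $k_i$ is the weight of agent $i$’s utility, $0 \le k_i \le 1$, and $\sum_{i=1}^{n} k_i = 1$.

Thus the additive aggregation model of one-switch utility functions is defined as follows:

$$u(w) = \sum_{i=1}^{n} k_i \left(w - D\gamma_i^{w}\right) = w - D\sum_{i=1}^{n} k_i \gamma_i^{w}.$$

According to the MDP model, the expected group utility for all policies $\pi$ and states $s$ is presented as follows, where the expectation is taken over the rewards $r_t$ received when starting in state $s$ with wealth level $w$ and following $\pi$:

$$v^{\pi}(s, w) = E\left[u\left(w + \sum_{t=0}^{\infty} r_t\right)\right].$$

Then, the optimal value is presented as follows:

$$v^{*}(s, w) = \max_{\pi} v^{\pi}(s, w).$$

Next, we derive the relationship between the expected group utility and the expected personal linear and exponential utilities for the additive aggregation model.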

For all policies , the expected group utility of the additive aggregation model of one-switch utility functions is presented as follows: where and satisfy the following policy-evaluation equations, respectively, [8, 19, 20]: From the above policy-evaluation equation (6), we obtain the relationship between and expected linear utility and expected personal exponential utility for the additive aggregation model, where and are independent of the wealth level .
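The separation of the expected group utility into a wealth-independent linear part and wealth-independent exponential parts can be illustrated numerically. The sketch below is an illustration under assumptions rather than the paper's own equations: it assumes one-switch utilities of the form $w - D\gamma_i^{w}$, weights that sum to one, and exponential values defined as $E[-\gamma_i^{R}]$ for the total reward $R$; the reward distribution, parameter values, and function names are hypothetical. Under these assumptions the group value at wealth $w$ equals $w + v_L + D\sum_i k_i \gamma_i^{w} v_{e,i}$, and the code compares that expression with a direct expectation.

```python
def one_switch(w, gamma, D=1.0):
    """Assumed one-switch utility w - D * gamma**w."""
    return w - D * gamma ** w

def additive_group_value(w, v_lin, v_exp, gammas, weights, D=1.0):
    """w + v_L + D * sum_i k_i * gamma_i**w * v_{e,i}, with v_exp[i] = E[-gamma_i**R]."""
    return w + v_lin + D * sum(
        k * g ** w * ve for k, g, ve in zip(weights, gammas, v_exp)
    )

# Hypothetical distribution of the total reward R under some fixed policy.
outcomes = [(0.5, -2.0), (0.5, -4.0)]           # (probability, total reward)
gammas, weights, w0 = (0.8, 0.9), (0.5, 0.5), -3.0

# Wealth-independent quantities.
v_lin = sum(p * r for p, r in outcomes)
v_exp = [sum(p * -(g ** r) for p, r in outcomes) for g in gammas]

decomposed = additive_group_value(w0, v_lin, v_exp, gammas, weights)
direct = sum(
    p * sum(k * one_switch(w0 + r, g) for k, g in zip(weights, gammas))
    for p, r in outcomes
)
print(decomposed, direct)
```

Because the linear and exponential values do not depend on the wealth level, they can be computed once per policy and reused for every wealth level, which is what makes a search over wealth thresholds tractable.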

##### 3.2. Multiplicative Aggregation Model of One-Switch Utility Functions

In the paper we adopt the following multiplicative aggregation model of group utility: where are constants and .

For simplicity, in this paper we only consider the case $n = 2$. For $n > 2$, the derivation of the multiplicative aggregation model is similar.

For , multiplicative aggregation model of one-switch utility functions is simplified as follows: According to MDP, the optimal value is presented as follows: For the multiplicative aggregation model of one-switch utility functions and all policies , the expected group utility is According to the fact that [8], we have From the above policy-evaluation equation (12), we obtain the relationship between expected group utility and expected linear utility and expected personal exponential utility for the multiplicative aggregation model.

#### 4. Preparation for Backward-Induction Method

To compute the optimal policy for the additive and multiplicative aggregation models of one-switch utility functions, a backward-induction method is adopted. In this paper the value range of the wealth level is the continuous interval $(-\infty, w_0]$, where $w_0$ is the initial wealth level. We first compute the optimal policy as the wealth level approaches negative infinity. We then increase the wealth level until that policy is no longer optimal, which gives a wealth level threshold. Increasing the wealth level further gives the next wealth level threshold in the same way. The backward-induction method ends when the wealth level exceeds the initial wealth level. The continuous wealth level interval is thus divided into subintervals by the thresholds, and each subinterval between consecutive thresholds is associated with the action executed in each state for wealth levels in that subinterval. In this section we analyze the characteristics of the optimal policy as the wealth level approaches negative infinity for the additive and multiplicative aggregation models, respectively.

Lemma 1. *For additive aggregation model of one-switch utility functions, if agent is the most risk-averse one, that is, , for any , , then
*

*Proof. *For all optimal policies , we have
As , , we can derive .

Thus .

On the other hand, for all optimal policies , according to the fact that for all policies , we have
Therefore, the lemma holds.

Lemma 1 implies that, as the wealth level approaches negative infinity, the optimal policy for the additive aggregation model of one-switch utility functions is the same as the most risk-averse agent’s optimal policy for the exponential utility function.
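The intuition behind this limit can be seen numerically. The following minimal sketch assumes exponential terms of the form $k_i\gamma_i^{w}$ with $0 < \gamma_i < 1$, so that the smallest $\gamma_i$ corresponds to the most risk-averse agent; the function name and parameter values are hypothetical. It computes the share of the weighted exponential terms contributed by that agent at different wealth levels.

```python
def share_of_most_risk_averse(w, gammas, weights):
    """Fraction of sum_i k_i * gamma_i**w contributed by the smallest gamma."""
    terms = [k * g ** w for k, g in zip(weights, gammas)]
    idx = min(range(len(gammas)), key=lambda i: gammas[i])
    return terms[idx] / sum(terms)

# Hypothetical group: the more risk-averse agent (gamma = 0.8) has the
# much smaller weight (0.1).
gammas, weights = (0.8, 0.9), (0.1, 0.9)
for w in (0.0, -20.0, -100.0):
    print(w, share_of_most_risk_averse(w, gammas, weights))
```

Even with a small weight, the share of the most risk-averse agent approaches one as the wealth level decreases, so that agent's exponential term ends up dominating the comparison between policies.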

Lemma 2. *For multiplicative aggregation model of one-switch utility functions, for all states and , where is the highest expected exponential utility with risk aversion parameter .*

*Proof. *For all optimal policies , we have
As , ,

Then , .

Therefore,
On the other hand, for all optimal policies , according to the fact that for all policies , we have
Therefore, the lemma holds.

Lemma 2 implies that, as the wealth level approaches negative infinity, the optimal policy for the multiplicative aggregation of one-switch utility functions is the same as the optimal policy for a virtual agent’s exponential utility function. The virtual agent’s risk aversion parameter is the product of the risk aversion parameters of all agents in the group.
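The appearance of a virtual agent can be illustrated for two agents. The sketch below assumes one-switch utilities $u_i(w) = w - D\gamma_i^{w}$ and looks only at the product $u_1(w)u_2(w)$, whose expansion contains the term $D^{2}(\gamma_1\gamma_2)^{w}$; the function name and parameter values are hypothetical, and the constants multiplying the product in the group utility are left out.

```python
def product_to_dominant_ratio(w, gamma1, gamma2, D=1.0):
    """Ratio of u1(w) * u2(w) to D**2 * (gamma1 * gamma2)**w."""
    u1 = w - D * gamma1 ** w
    u2 = w - D * gamma2 ** w
    dominant = D * D * (gamma1 * gamma2) ** w
    return (u1 * u2) / dominant

# Hypothetical risk aversion parameters for the two agents.
for w in (-10.0, -100.0, -300.0):
    print(w, product_to_dominant_ratio(w, 0.8, 0.9))
```

The ratio tends to one as the wealth level decreases, so for very low wealth the product behaves like an exponential term whose base is the product of the two risk aversion parameters.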

#### 5. Division of Wealth Level Interval and Backward-Induction Method

The preceding section gives the optimal policy as the wealth level approaches negative infinity for the additive and multiplicative aggregation models of one-switch utility functions, respectively. The next step is to divide the wealth level interval and to determine the wealth level thresholds and the optimal policies on the resulting intervals by using the backward-induction method. This section discusses the backward-induction method for the additive and multiplicative aggregation models.

##### 5.1. The Case of Additive Aggregation Model

For the additive aggregation model of one-switch utility functions, we first give the following theorem to prove the existence of a wealth level threshold and then give the backward-induction algorithm.

Theorem 3. *For all optimal policies , there exists a wealth level threshold such that it holds for all states and all wealth levels that
**
Please see Appendix A for the proof of Theorem 3.*

Theorem 3 shows the existence of the wealth level threshold . Next we will show how to determine the wealth level threshold .

After getting when wealth level , assume is the optimal policy for the wealth level interval ; then for wealth level in the interval , where is the reward got by executing one step action, is no longer the optimal policy, and assume is the optimal policy; according to (5); for all nongoal states, we have As , where For , , because the optimal policy is , not now under assumption, we have or, equivalently, We can get a wealth level threshold in equality case of the above weak inequality.

From the algorithm above, we can get the wealth level threshold : After getting , the next step is to divide further the wealth level interval into subintervals and solve the optimal policy for each subinterval similarly to the above algorithm. The main procedure of the backward-induction algorithm for group decision-making in the case of additive aggregation model is listed as follows.

*Step 1. *By maximizing the expected exponential utility, get the optimal policy of the most risk-averse agent as the wealth level approaches negative infinity.

*Step 2. *According to (6), we have
where ; get the optimal value and the values of for all states when .

*Step 3. *For all states , , get the values of by (7); then get by Expression (24).

*Step 4. *Calculate the wealth level threshold according to (25).

*Step 5. *For the wealth level interval , increase further the wealth level according to the reward got by executing one step action, and determine the wealth level threshold and optimal policy similarly to the above steps.

*Step 6. *If, for all , the wealth level is larger than , then end the algorithm.
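The idea of a wealth level threshold can be illustrated on a deliberately simplified problem. The sketch below is not the backward-induction algorithm described in the steps above: it assumes a single decision in which either action reaches the goal in one step, an additive group utility built from one-switch utilities $w - D\gamma_i^{w}$, and hypothetical rewards, probabilities, and parameter values. It locates by bisection the wealth level at which the group's preferred action switches from a sure cost to a lottery with a better expected reward.

```python
def advantage(w, gammas, weights, D=1.0):
    """E[u(w + R_risky)] - u(w + r_safe) for the additive group utility u."""
    def u(x):
        return sum(k * (x - D * g ** x) for k, g in zip(weights, gammas))
    risky = [(0.8, -1.0), (0.2, -10.0)]   # hypothetical lottery: (probability, reward)
    safe = -3.0                            # hypothetical sure reward
    return sum(p * u(w + r) for p, r in risky) - u(w + safe)

def switching_wealth(gammas, weights, lo=-200.0, hi=200.0, iters=200):
    """Bisection on the sign change of advantage(w), assumed negative at lo, positive at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if advantage(mid, gammas, weights) < 0.0:
            lo = mid                       # safe action still preferred at mid
        else:
            hi = mid                       # risky action preferred at mid
    return 0.5 * (lo + hi)

alone_1 = switching_wealth((0.8,), (1.0,))           # more risk-averse agent alone
alone_2 = switching_wealth((0.9,), (1.0,))           # less risk-averse agent alone
group = switching_wealth((0.8, 0.9), (0.5, 0.5))     # equal weights
print(alone_1, alone_2, group)
```

In this toy setting the group advantage is the weighted sum of the individual advantages, so the group switching level necessarily falls between the two individual ones; changing the weights or the risk aversion parameters moves it within that range.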

##### 5.2. The Case of Multiplicative Aggregation Model

For the multiplicative aggregation model of one-switch utility functions, we also have the following theorem that shows the existence of a wealth level threshold .

Theorem 4. *For all optimal policies , there exists a wealth threshold . For any state , wealth level ,
**
Please see Appendix B for the proof of Theorem 4.*

Similarly to additive aggregation model, we determine the wealth level threshold for the multiplicative aggregation model. Assume that is the optimal policy for the wealth level interval ; then for the wealth level in the interval , is no longer the optimal policy, and assuming is the optimal policy, according to (10), for all nongoal states, we have As , For , because the optimal policy is , not now under assumption, we have or, equivalently, We can get a wealth level point in equality case of the above weak inequality.

Then, we can get the wealth level threshold according to (25).

After getting , the next step is to divide further the interval into subintervals and compute the optimal policy for each subinterval. The main procedure of the backward-induction algorithm for group decision-making in the case of multiplicative aggregation model is listed as follows.

*Step 1. *By maximizing the expected exponential utility, get the optimal policy of the virtual agent as the wealth level approaches negative infinity.

*Step 2. *According to (12), we have
where ; get the optimal value and the values of , , for all states when .

*Step 3. *For all states , , get the values of , , by (7); then get by Expression (31).

*Step 4. *Calculate the wealth level threshold according to (25).

*Step 5. *For the wealth level interval , increase further the wealth level according to the reward got by executing one step action and determine the wealth level threshold and optimal policy similarly to the above steps.

*Step 6. *If, for all , the wealth level is larger than , then end the algorithm.

#### 6. Numerical Examples

Consider a simple GDMDP model. There are two agents named Agent_{1} and Agent_{2} with risk aversion parameters and , respectively. The state set of the GDMDP model includes initial state and goal state . Agent’s action set is . In state executing action results in a finite reward and a transition to goal state with possibility . When agent reaches the goal state it stops acting and receives no more rewards thereafter. Figure 1 shows the transitions of system states. Agents need to make an optimal policy together to reach the goal state.

Without loss of generality, the GDMDP model’s parameters are assumed as follows: , , , and ; the one-switch utility functions of two agents are defined as and with and , respectively. Initial wealth level of each agent is set 0.

First, consider the situation that each agent makes decisions alone. The agent’s optimal policy and the wealth level threshold are solved by utilizing the method proposed by Liu and Koenig [8]. The results are shown as follows:
Next, we consider the group decision-making based on additive and multiplicative aggregation model of one-switch utility functions. In the case of additive aggregation model, assume each agent has equal weight; that is, ; then the optimal policy and the wealth level threshold based on the proposed method in the paper are solved as follows:
The above result shows that, in the wealth level interval (−106.5, −16.4), if Agent_{1} makes decisions alone, the optimal policy is taking action in state but if Agent_{2} makes decisions alone. If they make decisions together, then action is taken.

Now we consider how the wealth level threshold of group decision-making changes as the weights of the agents vary. In detail, the weight of one agent increases from 0.05 to 0.95 while the weight of the other agent decreases from 0.95 to 0.05. The change of the wealth level threshold of group decision-making is shown in Figure 2.

Figure 2 shows that the wealth level threshold of group decision-making stays close to the wealth level threshold of Agent_{2}, who is more risk-averse, even if the weight of Agent_{2} is small and the weight of Agent_{1} is large. This means that the weights have little influence on group decision-making when the risk aversion parameters of the agents differ markedly, whereas the risk aversion parameters play an important role in this situation.

Consider the situation that two agents have similar risk attitude; that is, their risk aversion parameters are similar; for example, their one-switch utility functions are and , respectively. When each agent makes decisions alone their optimal policies and the wealth level thresholds are solved as follows:
Changing the weights in the same way as for Figure 2 gives the result shown in Figure 3. In contrast to Figure 2, Figure 3 shows that the wealth level threshold of group decision-making is close to the wealth level threshold of Agent_{1} when the weight of Agent_{2} is small and the weight of Agent_{1} is large. So if the difference between the risk aversion parameters of the agents is small, the weights of the agents play a critical role.

Finally, we consider group decision-making based on the multiplicative aggregation model and especially focus on the influence of product term of group utility, that is, , on the group decision-making. Given the same one-switch utility functions of agents in Figure 3 and assuming two sets of and value, in detail, , and , . If the value of is changed from −0.001 to −0.05, we get two curved lines of wealth level threshold of group decision-making in Figure 4.

The two curved lines gradually approach each other to a point when the absolute value of increases from 0.001 to 0.05. If compared with the result of additive aggregation model, we can find that the point is very close to the wealth level threshold of additive aggregation model with . This is because when the absolute value of is small the absolute values of and in group utility are larger than the absolute value of , so has little influence on group decision-making. When the absolute value of increases enough will mainly influence the group decision-making; furthermore, has same influence on Agent_{1} and Agent_{2}; therefore the wealth level threshold of multiplicative aggregation model will approach the threshold of additive aggregation model with = . This implies that the multiplicative aggregation model avoids group decision-making being dominated by the weights of individuals completely.

#### 7. Conclusion and Future Works

This paper has extended a single agent’s risk-sensitive decision-theoretic planning under the MDP framework to the multiagent problem. Based on the one-switch utility function, which describes an agent’s risk-sensitive attitude, additive and multiplicative aggregation models of group utility have been proposed. According to the characteristics of group utility, a backward-induction method has been presented that divides the wealth level interval and computes the optimal policy. Numerical examples have been given to discuss how the weights and risk aversion parameters influence group decision-making. They indicate that, for the additive aggregation model, the risk aversion parameters have an obvious influence on group decision-making if they differ between agents, while the weights of the agents play a critical role if the risk aversion parameters are similar. For the multiplicative aggregation model, group decision-making is not completely dominated by the weights of individuals; the product term of group utility also influences group decision-making.

In the future we intend to study multiattribute group decision-making under the MDP framework with one-switch utility functions. Building on the work of Tsetlin and Winkler [21], we will study how to extend our method to that problem.

#### Appendices

#### A. Proof of Theorem 3

*Proof. *Let ,
As , and
thus there exists a wealth level , for wealth level , ,
On the other hand, for all wealth levels , , let ; then .

Assume that there exists a state ; does not belong to ; then
According to the fact that , , and are all less than 0, we have
Therefore,
This is contradictory to (A.3). Thus for , , optimal action .

Additionally, as , and , then
Therefore, for , , .

#### B. Proof of Theorem 4

*Proof. *Let ,
As , and
there exists a wealth level , for all , ,
On the other hand, for all , let
then .

Assume that there exists some state ; then
According to the fact that , , and are all less than 0, we have
Therefore,
This is contradictory to (B.3). So for all , , optimal policy .

Additionally, as , and , so
Therefore, for all , , .

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant 70971048.

#### References

1. C. Boutilier, T. Dean, and S. Hanks, “Decision-theoretic planning: structural assumptions and computational leverage,” *Journal of Artificial Intelligence Research*, vol. 11, pp. 1–94, 1999.
2. D. E. Bell, “One-switch utility functions and a measure of risk,” *Management Science*, vol. 34, no. 12, pp. 1416–1424, 1988.
3. G. M. Gelles and D. W. Mitchell, “Broadly decreasing risk aversion,” *Management Science*, vol. 45, no. 10, pp. 1432–1439, 1999.
4. Y. Nakamura, “Sumex utility functions,” *Mathematical Social Sciences*, vol. 31, no. 1, pp. 39–47, 1996.
5. D. E. Bell and P. C. Fishburn, “Strong one-switch utility,” *Management Science*, vol. 47, no. 4, pp. 601–604, 2001.
6. Y. Liu, *Decision-theoretic planning under risk-sensitive planning objectives [Ph.D. thesis]*, College of Computing, Georgia Institute of Technology, 2005.
7. Y. Liu and S. Koenig, “Risk-sensitive planning with one-switch utility functions: value iteration,” in *Proceedings of the 20th National Conference on Artificial Intelligence*, pp. 993–999, 2005.
8. Y. Liu and S. Koenig, “An exact algorithm for solving MDPs under risk-sensitive planning objectives with one-switch utility functions,” in *Proceedings of the 7th International Conference on Autonomous Agents and Multi-Agent Systems*, 2008.
9. R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision processes,” *Management Science*, vol. 18, no. 7, pp. 356–369, 1972.
10. S. D. Patek, “On terminating Markov decision processes with a risk-averse objective function,” *Automatica*, vol. 37, no. 9, pp. 1379–1386, 2001.
11. D. Hernández-Hernández and S. I. Marcus, “Risk sensitive control of Markov processes in countable state space,” *Systems and Control Letters*, vol. 29, no. 3, pp. 147–155, 1996.
12. Y. Le Tallec, *Robust, risk-sensitive, and data-driven control of Markov decision processes [Ph.D. dissertation]*, Sloan School of Management, Massachusetts Institute of Technology, 2007.
13. P. H. Iz, “Two multiple criteria group decision support systems based on mathematical programming and ranking methods,” *European Journal of Operational Research*, vol. 61, no. 1-2, pp. 245–253, 1992.
14. S. J. Chen and C. L. Hwang, *Fuzzy Multiple Attribute Decision Making: Methods and Applications*, vol. 375 of *Lecture Notes in Economics and Mathematical Systems*, Springer, New York, NY, USA, 1992.
15. R. J. Li, “Fuzzy method in group decision making,” *Computers & Mathematics with Applications*, vol. 38, no. 1, pp. 91–101, 1999.
16. R. Benayoun, J. de Montgolfier, and J. Tergny, “Linear programming with multiple objective functions: step method (stem),” *Mathematical Programming*, vol. 1, no. 1, pp. 366–375, 1971.
17. A. M. Geoffrion, J. S. Dyer, and A. Feinberg, “An interactive approach for multi-criterion optimization, with an application to the operation of academic department,” *Management Science*, vol. 19, no. 4, pp. 357–368, 1972.
18. Z. Xu, “Approaches to multi-stage multi-attribute group decision making,” *International Journal of Information Technology and Decision Making*, vol. 10, no. 1, pp. 121–146, 2011.
19. D. P. Bertsekas and J. N. Tsitsiklis, “An analysis of stochastic shortest path problems,” *Mathematics of Operations Research*, vol. 16, no. 3, pp. 580–595, 1991.
20. S. D. Patek, “On terminating Markov decision processes with a risk-averse objective function,” *Automatica*, vol. 37, no. 9, pp. 1379–1386, 2001.
21. I. Tsetlin and R. L. Winkler, “Multiattribute one-switch utility,” *Management Science*, vol. 58, no. 3, pp. 602–605, 2012.

#### Copyright

Copyright © 2014 Wei Zeng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.