Abstract
Air combat is a continuous state space problem and is therefore difficult to solve with traditional dynamic programming (DP) over a discretized state space. This paper studies the approximated dynamic programming (ADP) approach to build a high-performance decision model for 1 versus 1 air combat. In this approach, the iterative process of policy improvement is replaced by mass sampling from historical trajectories and approximation of the utility function, which makes policy improvement highly efficient. A continuous reward function is also constructed to better guide the plane toward the “winner” state from any initial situation. In our experiments, the plane is more offensive when it follows the policy derived from the ADP approach rather than the baseline MinMax policy: the “time to win” is reduced greatly, but the cumulated probability of being killed by the enemy is higher. The reason is analyzed in this paper.
1. Introduction
Unmanned aerial vehicles (UAVs) play an important role on the modern battlefield. Over the past decades, UAVs have made significant advances in both hardware and software and have achieved mission capabilities ranging from “simple” ones like intelligence, surveillance, and reconnaissance (ISR) to “complex” ones like electronic attack, strikes on ground targets, and suppression or destruction of enemy air defense (SEAD/DEAD). According to the development roadmap [1] proposed by the leading countries in UAV technology, even air combat, which has been regarded as the exclusive domain of human pilots because of the dynamics and complexity of tactical decisions, may be carried out by autonomous UAVs in the near future.
However, the decision technologies supporting automatic air combat are far from mature. They still need better robustness, intelligence, autonomy, team cooperation, and adaptation to complex environments. The methods applied in this domain include game theory [2–4], knowledge-based decision [5–9], graph-based methods like influence diagrams [10], and others. Dynamic programming (DP) [11] is one of the most powerful methods because it adapts to dynamic environments and can improve its policy constantly by learning [12]. However, the traditional DP approach is not suitable for continuous state space problems like air combat, in which the computational complexity becomes intractable because of the curse of dimensionality.
In this paper, a tactical decision framework employing the approximated dynamic programming (ADP) method [13–15] is proposed for the air combat mission. The distinguishing trait of the ADP method is that the utility function is learned from mass-sampled states in the problem space rather than from scratch, which leads to high efficiency in policy convergence. As a result, ADP can be used as a second-stage tool to improve the policy derived from other decision systems (denoted here as the “first-stage tool”), for example, a knowledge-based system. If we treat the combat traces produced by the first-stage tool as the sampled states, then the ADP algorithm can learn its utility function and policy from these states directly. Given the optimizing capability that ADP inherits from the DP approach, the learned policy can be improved constantly to achieve better decision performance. Thus, the merits of different decision systems are combined.
The rest of this paper is arranged as follows. In Section 2, the 1 versus 1 air combat problem is formulated as a DP problem. In Section 3, the ADP method is briefly reviewed. Section 4 discusses the reward function for air combat, which is designed to guide the UAVs into goal states smoothly; some features are also specified to give insight into the engagement situation. In Section 5, the key algorithms of the ADP decision framework are proposed. The comparative experiments that follow (Section 6) validate the effectiveness of the proposed framework.
2. Problem Formulation
A 1 versus 1 air combat scenario involves two opposing planes (denoted by red and blue, where red is supposed to be “my” side). Omitting vertical movement, the kinematic equations of the plane are

$$\dot{x} = v \sin\psi, \qquad \dot{y} = v \cos\psi, \qquad \dot{\psi} = \frac{n g}{v}, \tag{1}$$

where $v$ is the scalar value of velocity, which is assumed to be constant during the combat, and $g$ is the gravitational acceleration. $\psi$ is the yaw angle, defined as the deviation of the velocity from north (the $y$ axis). $\psi$ is controlled by $n$, the plane’s normal overload, which always points right from the center of gravity of the plane and is orthogonal to the velocity. In our control scheme, $n$ takes one of three values at each decision step: $n \in \{-n_{\max}, +n_{\max}, 0\}$. With these values, the plane turns counterclockwise, turns clockwise, or keeps its current velocity direction, respectively.
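The planar kinematics described above can be illustrated with a simple Euler integration step. This is only a minimal sketch: the turn-rate law (yaw rate equal to overload times gravity divided by speed), the overload magnitude `N_MAX`, and the function name `step` are assumptions made for illustration and are not taken from the paper.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def step(x, y, psi, v, n, dt):
    """Advance one plane by one Euler step of the planar kinematics.

    psi is the yaw angle in radians, measured clockwise from north (+y).
    v is the constant speed in m/s and n is the normal overload in g.
    The turn-rate law psi_dot = n * G / v is an assumption of this sketch.
    """
    x_new = x + v * math.sin(psi) * dt
    y_new = y + v * math.cos(psi) * dt
    psi_new = (psi + n * G / v * dt) % (2.0 * math.pi)
    return x_new, y_new, psi_new

N_MAX = 3.0                      # hypothetical overload magnitude, in g
ACTIONS = (-N_MAX, N_MAX, 0.0)   # counterclockwise, clockwise, straight

# Fly north at 100 m/s for one second without turning.
x, y, psi = step(0.0, 0.0, 0.0, 100.0, ACTIONS[2], 1.0)

# Same start, but with the clockwise turn command.
_, _, psi_cw = step(0.0, 0.0, 0.0, 100.0, ACTIONS[1], 1.0)
```

With yaw measured clockwise from north, a positive overload increases the yaw angle, which is why the first action in `ACTIONS` is the negative one.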
The goal of each plane is to occupy an advantageous position through tactical decisions and to gain firing opportunities against its rival. The state space of air combat can be described with the vector

$$\mathbf{x} = [x_r, y_r, \psi_r, x_b, y_b, \psi_b]^{T}, \tag{2}$$

where the subscripts $r$ and $b$ refer to red and blue, respectively. Any state $s$ is an instance of $\mathbf{x}$. With (1), the state transition in the combat space can be represented as a function

$$s' = f(s, a_r, a_b), \tag{3}$$

which means that the current state $s$ transfers into a new state $s'$ after the red and blue planes perform the actions $a_r$ and $a_b$.
The goal state is reached when one plane gains the opportunity to fire at its opponent. The firing position is defined by three geometrical measures:
(a) Aspect angle (AA) is the relative angle between the longitudinal symmetry axis of the target plane (toward the tail) and the line connecting the target plane’s tail to the attacking plane’s nose. The bound on AA delimits the area where the kill probability is high when attacking from the rear, considering that most close-combat air-to-air missiles are infrared guided.
(b) Antenna train angle (ATA) is the angle between the attacking plane’s longitudinal symmetry axis and its radar’s line of sight (LOS), as Figure 1 shows. The bound on ATA defines an area from which the target plane can hardly escape once it is locked by radar.
(c) Relative range ($R$) between the two planes: the bound on $R$ makes sure that the target plane is within the attacking range of the air-to-air weapon.
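The three measures can be computed from the planar states as sketched below. The planes are treated as point masses, so the tail-to-nose line is approximated by the line between the two positions. The function names and the threshold values in the example are hypothetical and are not the bounds used in the paper.

```python
import math

def _angle_between(ax, ay, bx, by):
    """Unsigned angle in degrees between two 2-D vectors (0 if one is zero)."""
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        return 0.0
    cos_val = (ax * bx + ay * by) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_val))))

def geometry(attacker, target):
    """Return (AA, ATA, R) for planes given as (x, y, psi).

    psi is in radians, measured clockwise from north (+y).
    """
    xa, ya, psi_a = attacker
    xt, yt, psi_t = target
    los_x, los_y = xt - xa, yt - ya                  # attacker -> target
    r = math.hypot(los_x, los_y)
    head_a = (math.sin(psi_a), math.cos(psi_a))      # attacker heading
    tail_t = (-math.sin(psi_t), -math.cos(psi_t))    # target tail direction
    ata = _angle_between(head_a[0], head_a[1], los_x, los_y)
    aa = _angle_between(tail_t[0], tail_t[1], -los_x, -los_y)
    return aa, ata, r

def in_firing_position(attacker, target, aa_max, ata_max, r_min, r_max):
    """True if all three measures are inside their bounds."""
    aa, ata, r = geometry(attacker, target)
    return aa < aa_max and ata < ata_max and r_min < r < r_max

# Red is 500 m directly behind blue; both fly north.
red, blue = (0.0, 0.0, 0.0), (0.0, 500.0, 0.0)
# Hypothetical bounds: AA < 60 deg, ATA < 30 deg, 100 m < R < 1000 m.
red_can_fire = in_firing_position(red, blue, 60.0, 30.0, 100.0, 1000.0)
blue_can_fire = in_firing_position(blue, red, 60.0, 30.0, 100.0, 1000.0)
```

In this configuration red sits on blue's tail with blue on its nose, so both angles are zero for red and 180 degrees for blue.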
3. ADP Method Review
DP defines an adaptive learning process, and its mathematical model is the Markov decision process (MDP). In the DP formulation, air combat can be described as a discrete-time decision problem with the five-tuple $\{S, A, P, r, J\}$:
(1) $S$ is the problem space defined with the state variable $\mathbf{x}$; $s$ is an instance of $\mathbf{x}$;
(2) $A(s)$ is the finite action set available in state $s$, from which the plane selects one action to execute at each decision interval. In our problem, $A(s)$ is the same for every state and thus can be simply denoted as $A$;
(3) $P(s' \mid s, a)$ is the probability of transition from state $s$ to state $s'$ when action $a$ is taken;
(4) $r(s)$ is the reward of state $s$. If $s$ is visited multiple times during the combat, the rewards are discounted and accumulated to form the utility value of that state;
(5) $J(s)$ is the utility of state $s$. Its value is the accumulated reward of multiple visits. If every state is visited an adequate number of times, the utility distribution converges to the optimal one, from which the optimal policy is derived.
The decision process starts from an initial state and then selects an action to perform. The action interacts with the environment and leads to a new state, and so on. The utility of the starting state is then the expectation of the discounted cumulative rewards over all states that follow it:

$$J(s_0) = E\left[\sum_{k=0}^{\infty} \gamma^{k} r(s_k)\right], \tag{4}$$

where $\gamma \in (0, 1)$ is the discount coefficient, which makes sure that $J$ converges eventually. A policy $\pi$ is a mapping from the state space to the action space. For a fixed policy $\pi$, the utility satisfies the Bellman equation

$$J^{\pi}(s) = r(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, J^{\pi}(s'). \tag{5}$$

The optimal utility $J^{*}$ is the value function that simultaneously maximizes the expected cumulative reward in all states $s \in S$. Bellman proved that $J^{*}$ is the unique solution of the optimality form of (5):

$$J^{*}(s) = \max_{a \in A}\left[ r(s) + \gamma \sum_{s'} P(s' \mid s, a)\, J^{*}(s') \right]. \tag{6}$$

In practice, $J^{*}$ can be obtained through iterations on the Bellman equation:

$$J^{k+1}(s) = T J^{k}(s) = \max_{a \in A}\left[ r(s) + \gamma \sum_{s'} P(s' \mid s, a)\, J^{k}(s') \right], \tag{7}$$

where $T$ denotes the Bellman operator, representing one iterative improvement of $J^{k}$ by traversing the states throughout the space. During this process, $J^{k}$ converges to its “true” distribution $J^{*}$. Then, the optimal policy can be derived:

$$\pi^{*}(s) = \arg\max_{a \in A}\left[ r(s) + \gamma \sum_{s'} P(s' \mid s, a)\, J^{*}(s') \right]. \tag{8}$$
As we can see from (4)–(8), the traditional DP method needs to traverse discrete states iteratively, which results in a tabular utility function. This approach is not suitable for continuous state space problems. Discretizing the state space leads to two defects: (i) the unreasonable assumption that the utility function is constant within each discrete state cell and (ii) the curse of dimensionality.
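The tabular iteration that traditional DP relies on can be illustrated on a toy problem. The sketch below runs Bellman backups on a five-cell chain with deterministic moves; the chain, its reward, and all names are hypothetical and serve only to show the table-based procedure whose cost grows with the number of cells.

```python
GAMMA = 0.9
STATES = range(5)
ACTIONS = (-1, +1)   # step left, step right

def reward(s):
    return 1.0 if s == 4 else 0.0      # only the rightmost cell is rewarded

def transition(s, a):
    return min(max(s + a, 0), 4)       # deterministic move with walls

def backup(J, s):
    """One Bellman backup of state s against the current table J."""
    return max(reward(s) + GAMMA * J[transition(s, a)] for a in ACTIONS)

# Start from a zero table and sweep all states repeatedly.
J = [0.0] * 5
for _ in range(500):
    J = [backup(J, s) for s in STATES]

# Greedy policy with respect to the converged table.
policy = [
    max(ACTIONS, key=lambda a: reward(s) + GAMMA * J[transition(s, a)])
    for s in STATES
]
```

Every sweep visits every cell, so the work per iteration is proportional to the table size, which is what becomes intractable when a continuous multi-dimensional state space is discretized.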
The ADP method mitigates these problems with two operations: (i) sampling a mass of states effectively from the problem space, which reduces the time spent on space exploration; (ii) approximating the state utility from the sampled states, so that a near-optimal policy, rather than the optimal one, is employed to determine actions. Denoting the sampled states as a set $X$, we have

$$\hat{J}^{k+1}(X) = T J_{\text{approx}}^{k}(X), \tag{9}$$

where $J_{\text{approx}}^{k}$ is the current approximation of the utility and $\hat{J}^{k+1}$ is one Bellman iteration from $J_{\text{approx}}^{k}$. Then, $J_{\text{approx}}^{k+1}$ can be approximated based on $\hat{J}^{k+1}(X)$. There are multiple options for the approximation operation [16, 17]; least squares approximation is used here:

$$J_{\text{approx}}^{k+1}(s) = s^{T}\beta, \qquad \beta = (X^{T}X)^{-1}X^{T}\hat{J}^{k+1}(X), \tag{10}$$

where $\beta$ is the vector of approximation coefficients and $X$ is the set of sampled states. Normally, a set of features needs to be defined to gain insight into the characteristics of the studied problem. Because these features come from pilots’ combat experience in the real world, the approximated utility function converges more quickly and is more precise. We have

$$J_{\text{approx}}^{k+1}(s) = \phi(s)^{T}\beta, \qquad \beta = (\Phi^{T}\Phi)^{-1}\Phi^{T}\hat{J}^{k+1}(X), \tag{11}$$

where $\phi(s)$ is the feature vector and $\Phi$ stacks the feature vectors of the sampled states row by row.
In summary, the steps of the ADP method are as follows:
(a) sample the state set $X$ in the problem space;
(b) get the one-iteration improved utility $\hat{J}^{k+1}$ from the current utility $J_{\text{approx}}^{k}$; the initial value of $J_{\text{approx}}$ can be set as the reward of the initial states, that is, $J_{\text{approx}}^{0}(X_0) = r(X_0)$, where $X_0$ is the initial state set;
(c) update the next value of the approximated utility $J_{\text{approx}}^{k+1}$ following (10) and (11);
(d) if the policy still needs to be improved, go back to (a).
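The steps above can be sketched on a one-dimensional toy problem. The example alternates a Bellman backup on the sampled states with a least squares fit of a linear utility over the features [1, s]. The toy dynamics, the reward, and the names `fit_line`, `adp_learn`, and `greedy_action` are assumptions for illustration; the sample set is also kept fixed across rounds for brevity, and the air combat problem would use the combat features instead.

```python
GAMMA = 0.9
ACTIONS = (-0.1, +0.1)

def reward(s):
    return s                              # toy reward: larger s is better

def transition(s, a):
    return min(max(s + a, 0.0), 1.0)      # deterministic move on [0, 1]

def fit_line(xs, ys):
    """Ordinary least squares for J(s) ~ b0 + b1 * s."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def j_approx(beta, s):
    return beta[0] + beta[1] * s

def adp_learn(samples, rounds):
    # Initialise the utility with the reward of the sampled states.
    beta = fit_line(samples, [reward(s) for s in samples])
    for _ in range(rounds):
        # One Bellman backup per sampled state, using the current fit.
        targets = [
            max(reward(s) + GAMMA * j_approx(beta, transition(s, a))
                for a in ACTIONS)
            for s in samples
        ]
        # Project the improved values back onto the features.
        beta = fit_line(samples, targets)
    return beta

def greedy_action(beta, s):
    """Action that leads to the successor with the highest fitted utility."""
    return max(ACTIONS, key=lambda a: j_approx(beta, transition(s, a)))

samples = [i / 20 for i in range(21)]     # sampled states
beta = adp_learn(samples, rounds=100)
best = greedy_action(beta, 0.5)
```

Only the sampled states are ever backed up, and the fitted coefficients generalize the utility to states that were never visited.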
4. Reward Function and Combat Features
Before the ADP algorithm is given, the reward function needs to be discussed, since it is a necessary part of the ADP steps. As (4) shows, the utility is the discounted sum of rewards along the state trace. Thus, a properly defined reward function can better guide the plane toward the goal state from any starting state.
The computation of the reward is domain related. In our scenario, the simplest choice is that the attacking plane in its goal state gets reward +1, the target plane in the same state gets −1 as punishment, and the reward in all other states is 0. Intuitively, with these discrete rewards, the planes spend more time on space exploration to find a trace from the starting state to the goal state. To better guide the plane, a continuous reward function is defined in (12). Its parameters are the expected attacking range of the weapons, the relative distance between the planes, and a coefficient that adjusts the influence of the range term on the total reward.
With (12), the plane occupying the firing position gets the maximum reward, and the plane under attack gets the minimum reward. In other states, the reward increases continuously and monotonically from the worst state to the most advantageous state. To emphasize the punishment in bad states (the punishment guides planes to avoid these states), a simple linear transformation is applied to (12) to get the final reward function (13).
To construct the utility function, some geometric features [18] are defined to describe the combat situation, as Table 1 shows. These features are optional for utility approximation in the ADP steps because the sampled states can be used instead, as (10) shows. However, they capture the “true” utility of states more directly, which is why human pilots also use them to judge their situation in real-world combat. In other words, these well-defined features are more representative for approximating the utility function.
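As an illustration of a continuous reward with the properties described above, the sketch below combines an angular term with an exponential range term and then rescales the result linearly. The functional form, the rescaling to the interval from −1 to +1, and the names `raw_score` and `final_reward` are assumptions of this sketch and should not be read as the paper's equations (12) and (13).

```python
import math

def raw_score(aa_deg, ata_deg, r, r_d, k):
    """Continuous score in [0, 1] (assumed form, for illustration only).

    aa_deg, ata_deg: aspect angle and antenna train angle in degrees (0..180).
    r: relative distance, r_d: expected attacking range (same unit as r).
    k: coefficient that adjusts how strongly the range term matters.
    """
    angular = ((1.0 - aa_deg / 180.0) + (1.0 - ata_deg / 180.0)) / 2.0
    range_term = math.exp(-abs(r - r_d) / (180.0 * k))
    return angular * range_term

def final_reward(score):
    """Assumed linear rescaling of the score from [0, 1] to [-1, +1]."""
    return 2.0 * score - 1.0

# Ideal firing position: on the tail, nose on target, at the expected range.
best = final_reward(raw_score(0.0, 0.0, 500.0, 500.0, 0.1))
# Plane under attack: the same geometry seen from the other side.
worst = final_reward(raw_score(180.0, 180.0, 500.0, 500.0, 0.1))
```

The rescaling makes disadvantageous states negative, so they act as punishment rather than as merely small rewards.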
5. Method
In the combat scenario, the red plane is marked as “my” side, and the blue one is marked as the enemy. To describe the ADP approach, a reference decision algorithm, the MinMax search algorithm [19], is employed here. The MinMax algorithm looks a fixed number of steps into the future and uses domain knowledge to evaluate the consequences of each action before giving the final decision.
The ADP approach involves two algorithms: (i) the learning algorithm (ADP_Learn()), in which the utility function is approximated, and (ii) the decision algorithm (Rollout_ADP_Policy()), in which the final action is determined based on the ADP policy derived from the learned utility. The ADP_Learn() algorithm is displayed in Algorithm 1.
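A depth-limited MinMax search of the kind described above can be sketched generically. The helper names and the toy game used to exercise it are hypothetical; in the air combat setting, `transition` would be the two-plane kinematics and `score` the domain-knowledge evaluation of a situation.

```python
def minmax_value(state, depth, actions, transition, score):
    """Worst-case value of `state` for red when looking `depth` steps ahead."""
    if depth == 0:
        return score(state)
    return max(
        min(
            minmax_value(transition(state, a_r, a_b), depth - 1,
                         actions, transition, score)
            for a_b in actions
        )
        for a_r in actions
    )

def minmax_action(state, depth, actions, transition, score):
    """Red action whose worst outcome over all blue replies is best (depth >= 1)."""
    def worst_case(a_r):
        return min(
            minmax_value(transition(state, a_r, a_b), depth - 1,
                         actions, transition, score)
            for a_b in actions
        )
    return max(actions, key=worst_case)

# Toy game: the state is a number, red adds its action, blue subtracts its own.
TOY_ACTIONS = (-1, 0, 1)

def toy_transition(s, a_r, a_b):
    return s + a_r - a_b

def toy_score(s):
    return s

best = minmax_action(0, 2, TOY_ACTIONS, toy_transition, toy_score)
value = minmax_value(0, 2, TOY_ACTIONS, toy_transition, toy_score)
```

Because every red action is judged by its worst blue reply, the search cost grows exponentially with the look-ahead depth.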

In ADP_Learn(), the utility function is approximated with sampled states. These states should be sampled from the frequently visited part of the state space so that the changes of the utility values in these areas are fully captured. One option is to use trajectories produced in real-world combat or by other authoritative decision tools for air combat, since such trajectories indicate the regions that are visited with high probability in combat. In this paper, a scenario is built to obtain combat trajectories, in which both rival planes take MinMax policies, and their trajectories are recorded as the sampled state set.
The initial value of the approximated utility is assigned as the reward of the sampled states (line 1 in the code section of Algorithm 1) and is then improved over a number of learning rounds. In each round, the blue plane’s action is first determined by the MinMax policy (line 3). Secondly, the red plane’s action is selected by applying the one-step Bellman operator, and the changed utility is recorded (lines 4–5) for further use. Thirdly, the feature vector is updated (line 6), with which the least squares approximation is performed to approximate the utility from the recorded values (lines 7–8).
The approximated utility function returned by ADP_Learn() can already be used to make decisions, as (8) shows. However, a rollout procedure is employed here to further improve the quality of the final decisions, as Algorithm 2 shows.

Assume that the red plane is making a decision in Rollout_ADP_Policy(). It does not follow the ADP policy directly. Instead, it tries each possible action (line 1 in the code section of Algorithm 2). For each possible action, the red plane’s future state is rolled out for a fixed number of steps (lines 5–7). The red plane follows the ADP policy during this process, and the blue one follows the MinMax policy. The sum of the reward and the utility of the rolled-out state is compared with the best value found so far: if the former is larger, the best value is updated and the corresponding action is recorded as the best action (lines 8–11). Having tried all possible actions, the red plane obtains the best action in the current state.
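The rollout procedure described above can be sketched as a generic function. This is not the paper's Algorithm 2: the signature, the helper names, and the toy setting are hypothetical, and the red and blue policies are passed in as plain functions.

```python
def rollout_policy(state, actions, n_steps, transition,
                   red_policy, blue_policy, reward, utility):
    """Pick red's action by trying each one and rolling the game forward.

    After the candidate first action, red follows `red_policy` (the ADP
    policy in the paper) and blue follows `blue_policy` (MinMax in the paper)
    for `n_steps` steps; the end state is scored by reward plus utility.
    """
    best_value, best_action = float("-inf"), None
    for first in actions:
        s = transition(state, first, blue_policy(state))
        for _ in range(n_steps):
            s = transition(s, red_policy(s), blue_policy(s))
        value = reward(s) + utility(s)
        if value > best_value:
            best_value, best_action = value, first
    return best_action

# Toy setting: states are integers and the utility peaks at a target state.
def toy_transition(s, a_r, a_b):
    return s + a_r - a_b

def make_utility(target):
    return lambda s: -abs(s - target)

chosen_far = rollout_policy(
    0, (-1, 0, 1), 3, toy_transition,
    red_policy=lambda s: 1, blue_policy=lambda s: 0,
    reward=lambda s: 0.0, utility=make_utility(5),
)
chosen_near = rollout_policy(
    0, (-1, 0, 1), 3, toy_transition,
    red_policy=lambda s: 1, blue_policy=lambda s: 0,
    reward=lambda s: 0.0, utility=make_utility(2),
)
```

In the toy run the base policy always moves right, so when the target is close the rollout prefers to step left first, a choice that a greedy one-step look would not make.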
6. Simulation and Analysis
The initial state of 1 versus 1 air combat can be classified into 4 basic situations (from red plane’s perspective): offensive, neutral, defensive, and confronting, as Figure 2 shows.
In our experiments, all four initial situations are configured (Table 2) to compare the performance of ADP policy and MinMax policy.
The experiments are arranged as follows. Firstly, a baseline experiment is conducted in which both the red and blue planes take the MinMax policy. The decision performance of this MinMax-versus-MinMax setting is treated as the baseline against which the ADP policy is compared.
Secondly, the learned ADP policy is applied to the same initial situations. Each ADP policy is identified by its number of learning rounds; for example, the 40-round policy is the one approximated after 40 rounds. The decision performance is measured with 2 metrics: (a) the average time to win (TTW), where the winning states are defined in Section 2 and the attacking plane needs to hold such a state for at least 10 seconds to win; (b) the accumulated probability of being killed (APK), which indicates the total risk to one plane during the combat, obtained by accumulating the probability of being killed by the enemy. A good policy results in both a small TTW and a small APK.
To speed up the experiments, a larger normal overload is assigned to the red plane, which means it can change direction more quickly. This measure avoids a long standoff when both planes follow the same policy. This performance advantage does not influence the policy comparison, since both sets of experiments use planes with the same configuration.
The baseline experiments are conducted first. Because of space limitations, only the result of Setup 4 (confronting) is displayed here, as Figure 3 shows.
Figures 4, 5, 6, and 7 show the results of each initial setup when the red plane follows the ADP policy and the blue plane follows the MinMax policy. Comparing Figures 7 and 3, we can see that the performance of the red plane is improved greatly by the ADP policy: the TTW is reduced from 23 s to 10.5 s.
The comparison of decision performance is displayed in Table 3. As we can see, the TTW of the ADP policy is reduced in all setups compared to the MinMax baseline, especially in Setup 3. This means the ADP policy can guide the red plane to shake off its pursuer quickly and find its way to the firing position. On the other hand, the APK is slightly higher with the ADP policy.
These results show that a plane is more offensive when following the ADP policy: it tends to occupy the firing position even at the risk of being killed. This phenomenon can be explained by the working mechanisms of the two decision approaches. In the MinMax algorithm, the decision is made by considering every possible reaction of the opponent, which leads to a conservative style of decision making. By contrast, the ADP approach uses the utility to guide the plane toward the best profit while avoiding punishment. This makes the plane abandon conservative choices in favor of high future reward. The results also indicate that the ADP approach is effective and performs well on the air combat problem.
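The TTW metric with its 10-second hold requirement can be computed from a recorded engagement as sketched below. The function name, the sampling convention, and the choice to report the instant at which the hold is completed (rather than the instant the winning state is first entered) are assumptions of this sketch.

```python
def time_to_win(in_win_state, dt, hold_time=10.0):
    """Return the time in seconds at which a win is declared, or None.

    in_win_state: sequence of booleans, one per decision step, telling
    whether the attacker was in a winning state after that step.
    dt: duration of one decision step in seconds.
    """
    needed = max(1, int(round(hold_time / dt)))   # consecutive steps required
    run = 0
    for i, flag in enumerate(in_win_state):
        run = run + 1 if flag else 0              # reset when the state is lost
        if run >= needed:
            return (i + 1) * dt
    return None

# Winning state reached after 2 s and then held without interruption.
held = [False] * 4 + [True] * 20
ttw_held = time_to_win(held, dt=0.5)

# Winning state lost halfway through, so the hold is never completed.
broken = [True] * 10 + [False] + [True] * 10
ttw_broken = time_to_win(broken, dt=0.5)
```

Averaging this value over repeated runs of a setup gives the average TTW for that setup under this convention.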
7. Conclusion
This paper studied the 1 versus 1 air combat decision problem and employed the ADP approach to solve it quickly and effectively. The ADP approach involves two operations: (i) learning the utility function from mass-sampled states rather than from scratch; (ii) making the final decision by evaluating the future return of every possible action. Comparative experiments show that the policy initially produced by the MinMax algorithm is improved greatly after the ADP process.
In future work, we plan to build a hybrid decision framework for UAVs in which the decision functionality is divided into two parts. The first part is a knowledge-based decision module that is responsible for making the initial acting policy for a specific task and can handle large-scale, complex task environments. The second part is the ADP module proposed in this paper. This ADP module gets state samples from the first part, with which the utility function is approximated and the acting policy is improved. This process combines the advantages of different decision frameworks.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research is supported by the Armament Research Foundation (9140A04010112HK01041).