Abstract

In this paper, a variable-structure multiple-model (VSMM) filtering algorithm based on a long short-term memory (LSTM) regression network and a deep Q-network (L-DQN) is proposed to accurately track strongly maneuvering targets. By reasonably designing the reward function, state space, and network structure, the algorithm maps the selection of the model set to the selection of an action label, so that a deep reinforcement-learning agent replaces the model switching of the traditional VSMM algorithm. At the same time, the algorithm introduces an LSTM network that compensates the error of the tracking results based on the historical information of the models. Simulation results show that, compared with the traditional VSMM algorithm, the proposed algorithm captures target maneuvers quickly, has a short response time, significantly improves the calculation accuracy, and adapts to a wider range of motions, achieving precise tracking of maneuvering targets.

1. Introduction

Strong maneuvering target tracking (MTT) is an important research direction in the field of state estimation. At present, filtering algorithms for maneuvering target tracking fall mainly into two categories: improved single-model algorithms dominated by the Kalman filter and multiple-model filtering algorithms represented by the interacting multiple model (IMM). Because of its strong model-covering ability, the multiple-model filtering algorithm performs prominently in strong maneuvering target tracking. However, multiple-model algorithms also have shortcomings. For example, the interacting multiple-model algorithm usually adopts a fixed model set, which differs considerably from the changeable motion space encountered in actual motion.

To solve this problem, Li et al. proposed the concept of variable-structure multiple-model filtering [13], which expands the original model set with one or more adaptive models that closely follow the real mode and uses a directed graph to represent transitions between models, thereby realizing a variable model set. In reference [3], the peak error, steady-state error, and response time of the filtering algorithm under different model sets are studied via deterministic schemes (DS). That work shows that larger model sets yield a smaller steady-state bias error, whereas smaller model sets lose some jump accuracy. At the same time, limited by the performance of the algorithm itself, the performance gain from oversized model sets and complex topological relationships is not proportional to the added computational complexity, and the response of the algorithm is slow.

Wang et al. [4] proposed an improvement based on expected mode augmentation (EMA) that increases accuracy in a few cases compared with the original EMA algorithm, but it sacrifices much of EMA's fast computational response, a cost that is not proportional to the accuracy improvement. Wang et al. [5] pointed out that with too few models the coverage ability of the model set is poor for strongly maneuvering targets, and good tracking is obtained only when the target motion happens to fall within the model set. In the actual motion of the target, however, the change of model is unknown and difficult to cover; moreover, model partitioning depends on a prior division of the model set and a fixed topology.

Making filtering intelligent is a hot research direction for complex filtering in tracking problems. Recurrent neural networks, reviewed by Lipton et al. [6], show outstanding performance in temporal data processing such as speech recognition, text generation, and machine translation.

Wei et al. [7] solved the problem of cooperative target hunting in underwater environments by using the deep Q-network (DQN) algorithm. Fang et al. [8] adopted adaptive time slot (TS) and power allocation schemes to switch between different operating modes of machine communication devices, optimizing the trade-off between peak age of information (AoI) and power consumption in energy harvesting (EH)-assisted large-scale multiaccess networks. Zhang et al. [9] and Wang et al. [10] achieved innovative applications of adaptive techniques in nonorthogonal multiple access, ad hoc networks, and other fields. At the same time, because of the advantages of recurrent neural networks in processing sequential data, they can be combined with the time-series characteristics of filtering. Therefore, introducing recurrent neural networks to improve filtering algorithms is a popular direction in intelligent filtering research. Reference [11] combines an LSTM network with a hypersonic vehicle tracking filter algorithm. Kim et al. [12] developed a human posture estimation algorithm combining Kalman filtering with a recurrent neural network, which improved the accuracy of the tracking results. Reinforcement learning is a process in which an agent continuously interacts with the environment, driven by a reward function, to maximize the cumulative reward and learn an optimal strategy. DQN has been widely used in the field of decision control because of its advantages [13, 14].

Compared with the traditional VSMM algorithm, the improvements of the L-DQN-based VSMM algorithm proposed in this article are as follows: (1) the DQN algorithm is used to handle the selection and decision-making problems of the model set; (2) an additional LSTM network is used for error compensation. Compared with traditional algorithms, this algorithm adds three online-trained neural networks, so its computational complexity is significantly higher than that of traditional algorithms.

The remainder of this article is structured as follows. Variable-structure multiple-model (VSMM) filtering and the Markov decision process (MDP) are detailed in Section 2. In Section 3, the L-DQN algorithm is introduced into the model selection process of VSMM, and an LSTM network is used to correct the error. Experimental results and their analysis are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. Problem Description

2.1. Variable-Structure Multiple-Model Filtering

The core of the VSMM method is the addition and deletion of models in the model set. When and how to change the model set is determined by the model-set decision of the VSMM algorithm. The main steps of the algorithm are input interaction, filtering, model probability update, interacting output, and model-set decision, as sketched below.
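To make the flow of one VSMM cycle concrete, the following minimal Python sketch outlines the five steps listed above. The container interfaces (a per-model filter object with attribute `x` and a method `predict_and_update`) and the function names are illustrative assumptions, not the authors' implementation, and the covariance mixing is omitted for brevity.

```python
import numpy as np

def vsmm_cycle(models, mu, trans_prob, z, decide_model_set):
    """One cycle of variable-structure multiple-model filtering (illustrative sketch).

    models          : list of per-model Kalman filters (each holds x, P, F, H, Q, R; assumed interface)
    mu              : current model probabilities, shape (N,)
    trans_prob      : model transition probability matrix, shape (N, N)
    z               : current measurement vector
    decide_model_set: callable implementing the model-set decision (e.g., LMS/EMA/L-DQN)
    """
    N = len(models)

    # Step 1: input interaction (mixing) -- mixing probabilities and mixed initial states.
    c_bar = trans_prob.T @ mu                              # predicted model probabilities
    mix = (trans_prob * mu[:, None]) / c_bar[None, :]      # mix[i, j] = p_ij * mu_i / c_bar_j
    x0 = [sum(mix[i, j] * models[i].x for i in range(N)) for j in range(N)]

    # Step 2: per-model Kalman filtering from the mixed initial conditions.
    likelihoods = np.empty(N)
    for j, m in enumerate(models):
        likelihoods[j] = m.predict_and_update(x0[j], z)    # returns Gaussian likelihood of the residual

    # Step 3: model probability update.
    mu_new = likelihoods * c_bar
    mu_new /= mu_new.sum()

    # Step 4: interacting output (probability-weighted combination).
    x_out = sum(mu_new[j] * models[j].x for j in range(N))

    # Step 5: model-set decision (add/delete models for the next cycle).
    models, mu_new = decide_model_set(models, mu_new)
    return models, mu_new, x_out
```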

2.1.1. Step 1: Input Interaction [15]

$$\bar{c}^{j}=\sum_{i=1}^{N}p_{ij}\,\mu^{i}_{k-1},\qquad \mu^{i|j}_{k-1}=\frac{p_{ij}\,\mu^{i}_{k-1}}{\bar{c}^{j}},$$
$$\hat{x}^{0j}_{k-1}=\sum_{i=1}^{N}\mu^{i|j}_{k-1}\,\hat{x}^{i}_{k-1},\qquad P^{0j}_{k-1}=\sum_{i=1}^{N}\mu^{i|j}_{k-1}\left[P^{i}_{k-1}+\left(\hat{x}^{i}_{k-1}-\hat{x}^{0j}_{k-1}\right)\left(\hat{x}^{i}_{k-1}-\hat{x}^{0j}_{k-1}\right)^{T}\right],$$
where $p_{ij}$ is the transition probability from model $i$ to model $j$, $\mu^{i}_{k-1}$ is the model probability of model $i$ at time $k-1$, and $\mu^{i|j}_{k-1}$ is the mixing probability of the models at time $k-1$.

At the same time, the superscript represents the model label, and the subscript represents the time. The notation in the following filtering equations is the same and will not be repeated.

2.1.2. Step 2: Filtering Algorithm

The one-step prediction of the state vector is
$$\hat{x}^{j}_{k|k-1}=F^{j}\,\hat{x}^{0j}_{k-1},$$
where $F^{j}$ is the driving matrix.

The one-step prediction of the error covariance is
$$P^{j}_{k|k-1}=F^{j}P^{0j}_{k-1}\left(F^{j}\right)^{T}+Q^{j},$$
where $Q^{j}$ represents the process noise covariance matrix.

The filter is
$$\hat{x}^{j}_{k}=\hat{x}^{j}_{k|k-1}+K^{j}_{k}\,\tilde{z}^{j}_{k},\qquad \tilde{z}^{j}_{k}=z_{k}-\hat{z}^{j}_{k|k-1},$$
where $K^{j}_{k}$ represents the Kalman filter gain matrix, $\tilde{z}^{j}_{k}$ represents the filter residual, $z_{k}$ is the measurement vector, and $\hat{z}^{j}_{k|k-1}$ is the estimate of the measurement vector.

The filtering covariance is
$$P^{j}_{k}=\left(I-K^{j}_{k}H\right)P^{j}_{k|k-1}.$$

The estimation of the observation equation is as follows:
$$\hat{z}^{j}_{k|k-1}=H\,\hat{x}^{j}_{k|k-1}.$$

The Kalman gain is as follows:
$$K^{j}_{k}=P^{j}_{k|k-1}H^{T}\left(S^{j}_{k}\right)^{-1},\qquad S^{j}_{k}=HP^{j}_{k|k-1}H^{T}+R,$$
where $S^{j}_{k}$ represents the filter residual covariance and $R$ represents the measurement noise covariance.

2.1.3. Step 3: Model Probability Update

$$\Lambda^{j}_{k}=\mathcal{N}\!\left(\tilde{z}^{j}_{k};\,0,\,S^{j}_{k}\right),\qquad \mu^{j}_{k}=\frac{\Lambda^{j}_{k}\,\bar{c}^{j}}{\sum_{i=1}^{N}\Lambda^{i}_{k}\,\bar{c}^{i}},$$
where $\Lambda^{j}_{k}$ represents the likelihood function, which defines a Gaussian distribution (also known as a normal distribution) with variable $\tilde{z}^{j}_{k}$, mean 0, and variance $S^{j}_{k}$.

2.1.4. Step 4: Interacting Output

$$\hat{x}_{k}=\sum_{j=1}^{N}\mu^{j}_{k}\,\hat{x}^{j}_{k},\qquad P_{k}=\sum_{j=1}^{N}\mu^{j}_{k}\left[P^{j}_{k}+\left(\hat{x}^{j}_{k}-\hat{x}_{k}\right)\left(\hat{x}^{j}_{k}-\hat{x}_{k}\right)^{T}\right],$$
where $\hat{x}_{k}$ represents the combined state vector and $P_{k}$ represents the error covariance.

2.1.5. Step 5: Model Set Decision

At present, there are two main approaches to adaptive model-set decision: the likely model set (LMS) method [1, 2] and the expected mode augmentation (EMA) method [3]. The EMA method extends the existing model set with an expected model, while the LMS method ranks the models by their probabilities so as to approximate the true mode with as few models as possible. The LMS method uses a directed-graph algorithm as the switching mechanism of the VSMM: only adjacent models can be converted into each other, nonadjacent models cannot, and the model set participating in the filtering calculation is selected by deleting and adding models according to the model probabilities. The EMA algorithm relies on a fixed topology and selects models based on it; it is suitable for model spaces that are additive and continuous. In each cycle, the existing models are weighted to obtain an expected model, which is then appended to the existing motion models.
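As a rough illustration of the LMS-style decision described above, the sketch below activates and deletes models based on their probabilities. The thresholds and the adjacency (directed-graph) structure are assumptions for illustration, not the values used in the cited works.

```python
import numpy as np

def lms_model_set_decision(active, mu, adjacency, t_delete=0.05, t_activate=0.3):
    """Likely-model-set style decision (illustrative sketch).

    active     : list of currently active model indices
    mu         : probabilities of the active models, same order as `active`
    adjacency  : dict mapping model index -> indices of adjacent models (directed graph)
    t_delete   : models below this probability are considered unlikely and removed
    t_activate : models above this probability are "principal"; their neighbours are activated
    """
    mu = np.asarray(mu, dtype=float)

    # Keep models whose probability is not negligible.
    kept = [m for m, p in zip(active, mu) if p >= t_delete]

    # Activate neighbours (in the directed graph) of the principal models.
    principal = [m for m, p in zip(active, mu) if p >= t_activate]
    candidates = set(kept)
    for m in principal:
        candidates.update(adjacency.get(m, []))

    new_active = sorted(candidates)

    # Re-normalise probabilities over the new model set (new models get a small prior).
    new_mu = np.array([mu[active.index(m)] if m in active else t_delete for m in new_active])
    return new_active, new_mu / new_mu.sum()
```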

2.2. Markov Decision Process

From the formulas given above, the filtering process itself has the Markov property. It is a Markov process, which can be represented by the tuple $\langle S, P\rangle$, where $S$ represents the finite state set and $P$ represents the state transition matrix [16, 17].

The Markov Reward Process (MRP) adds the reward $R$ and the attenuation (discount) coefficient $\gamma$ (used to calculate the cumulative reward) to the Markov process. The Markov Decision Process (MDP) adds the decision process to the Markov Reward Process. Compared with the MRP, the MDP adds the action set $A$ and is represented by the tuple $\langle S, A, P, R, \gamma\rangle$. The reinforcement-learning process based on the MDP is the theoretical basis for solving reinforcement-learning problems and the underlying mathematical model of reinforcement learning. The Markov nature of both filtering and reinforcement learning provides the feasibility of combining the two.

The reinforcement-learning decision problem can be understood as a nonlinear mapping from states to actions according to the strategy, $\pi: S \rightarrow A$, where $s \in S$ is the state information of the environment and $a \in A$ is the action instruction given by the strategy in state $s$. The strategy $\pi$ represents the probability of selecting an action in a certain state, and $\pi(a \mid s)$ represents the execution probability of action $a$ in state $s$. The strategy calculation equation is as follows:
$$\pi(a \mid s)=P\left[A_{t}=a \mid S_{t}=s\right].$$

In order to evaluate the return of each strategy $\pi$, the cumulative reward function, called the value function, is defined as
$$v_{\pi}(s)=\mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\,S_{t}=s\right],$$

where $R$ represents the reward function and $\gamma$ represents the discount factor of the reward.

The value function based on policy $\pi$ is divided into the state value function $v_{\pi}(s)$, which represents the long-term return in state $s$, and the action value function $q_{\pi}(s,a)$, which represents the long-term return of taking action $a$ in state $s$.

3. Variable-Structure Multimodel Filtering Based on L-DQN Algorithm

The structure diagram of the improved L-DQN algorithm is shown in Figure 1. The algorithm uses the DQN algorithm to make judgments and decisions on model set switching and uses LSTM to directly compensate the error of the filtering and positioning results of the target.

In the figure, white represents the original module of the VSMM algorithm, green represents the newly added submodule of L-DQN, and blue represents the improved module of L-DQN.

3.1. Markov Decision Process

The DQN algorithm combines deep learning with reinforcement learning and uses a deep neural network to approximate the action value function. The algorithm uses experience replay and a target network to ensure stable learning with the nonlinear function approximator.

3.1.1. Design of Action Space

Different model sets are assigned different labels, and different labels correspond to different actions; the selection of models is thus converted into the selection of agent actions. Since the size of the action space determines the coverage ability of the model set, during offline training we put as many model-set labels into the action space as possible and use the value function to output the most probable action and its probability. The actions selected by the agent fall into two categories: model-selection actions and a termination action, which is used to end the exploration.

The exploration of actions adopts an $\varepsilon$-greedy strategy, which balances the relationship between exploration and exploitation and makes a compromise between them with a certain probability [18]:

$$\pi(a \mid s)=\begin{cases}1-\varepsilon+\dfrac{\varepsilon}{\left|A(s)\right|}, & a=\arg\max_{a'}Q(s,a'),\\ \dfrac{\varepsilon}{\left|A(s)\right|}, & \text{otherwise},\end{cases}$$
where $\varepsilon$ is the exploration factor, usually taken as a small positive value; $A(s)$ is the set of optional actions of the agent; $s$ is the current state of the agent; and $\pi$ is the strategy adopted by the algorithm.
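A minimal sketch of the $\varepsilon$-greedy action selection described by the equation above; `q_values` would come from the DQN's current network, and the exploration factor $\varepsilon$ is assumed to be supplied by the training schedule.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Select a model-set action label: with probability epsilon explore uniformly,
    otherwise exploit the action with the largest Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action label
    return int(np.argmax(q_values))               # exploit: greedy action
```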

3.1.2. Design of State Space

In the process of target tracking, the model probabilities are the basis for ranking the importance of the models. It can be seen from Equations (8) and (9) that the calculation of the model probability is mainly determined by the filter residual and the residual covariance. Therefore, we take the residual and the residual covariance of each model as the observation for model decision-making, so that the agent learns to make model decisions from the residuals and covariances of the models at the current time. At the same time, we take the output action probability value as the new model probability to obtain an accurate model probability.
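The observation fed to the agent is built from the per-model residuals and residual covariances, as described above. A possible construction, assuming the per-model quantities are simply flattened and concatenated (an illustrative assumption, not the authors' exact encoding), is sketched below.

```python
import numpy as np

def build_agent_state(residuals, residual_covs):
    """Concatenate each model's filter residual and residual covariance
    into a single flat observation vector for the DQN agent."""
    parts = []
    for r, S in zip(residuals, residual_covs):
        parts.append(np.ravel(r))   # residual of model j at the current time
        parts.append(np.ravel(S))   # residual covariance of model j
    return np.concatenate(parts)
```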

3.1.3. Design of Reward Function

The reward function determines the agent’s performance in the environment; it quantifies the target task numerically and determines whether the agent can learn the optimal strategy.

Moreover, the reward function influences the exploration-exploitation trade-off in reinforcement learning. A well-designed reward function encourages the agent to explore different actions and states in order to discover optimal policies.

In the reward function, $f(\cdot)$ denotes the mapping of the real model set, and $g(\cdot)$ denotes the mapping from the agent's selected action to a model set.
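The exact reward expression is not reproduced here. One plausible form, offered only as an illustrative assumption consistent with the descriptions of $f(\cdot)$ and $g(\cdot)$, rewards the agent when the model set mapped from its action matches the real model and penalizes it otherwise.

```python
def reward(action, true_model, g, f, match_reward=1.0, mismatch_penalty=-1.0):
    """Illustrative reward: +1 when the agent's chosen model set covers the real model,
    -1 otherwise. g(action) maps the action label to a model set; f(true_model)
    maps the real motion to its model; both interfaces and values are assumptions."""
    return match_reward if f(true_model) in g(action) else mismatch_penalty
```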

3.1.4. Network Structure

The strategy network of the DQN consists of two parallel networks with the same structure: one is used to generate the current value, and the other is used to generate the target value. The two networks differ only in their parameters; the target network's parameters are a copy of the current network's parameters from several time steps earlier. The network structure is shown in Figure 2.
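A minimal PyTorch-style sketch of the structure described above: two networks with identical architecture, where the target network's parameters are a delayed copy of the current network's parameters. The layer sizes, hidden width, and synchronization interval are assumptions for illustration, not the authors' settings.

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps the agent's observation (model residuals and covariances) to Q-values,
    one per model-set action label (layer sizes are illustrative)."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

class DQN:
    """Current network generates Q(s, a); the target network, a delayed copy,
    generates the target values used in the temporal-difference update."""
    def __init__(self, state_dim, num_actions, sync_every=100):
        self.current = QNetwork(state_dim, num_actions)
        self.target = copy.deepcopy(self.current)      # same structure, delayed copy
        self.sync_every = sync_every
        self.steps = 0

    def target_value(self, next_state, reward, gamma=0.9):
        # TD target computed with the (frozen) target network.
        with torch.no_grad():
            return reward + gamma * self.target(next_state).max(dim=-1).values

    def maybe_sync(self):
        # Copy the current parameters into the target network every `sync_every` steps.
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target.load_state_dict(self.current.state_dict())
```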

3.2. Error Modification Network
3.2.1. Introduction to LSTM Network

RNN is a neural network used to process sequence data, which can fuse historical information with new input information. The schematic diagram of the network structure is shown in Figure 3.

After the RNN receives the input $x_t$ at time $t$, the value of the hidden layer is $s_t$ and the output value is $o_t$. The key point is that the value of $s_t$ depends not only on $x_t$ but also on $s_{t-1}$.

As a result of gradient vanishing and gradient explosion during training on long sequences, it is difficult for an RNN to learn useful information that is far away from the information being processed. Compared with the traditional RNN, the LSTM performs better on longer sequences.

The LSTM algorithm is therefore used to design the error modification network, which serves as a regression network on the tracking and positioning results to compensate for the nonlinear modeling error and modeling uncertainty.

3.2.2. Network Input/Output Selection

In the filtering equations, the filtering covariance $P_{k}$ is a quantitative description of the quality of the estimate, and the gain matrix $K_{k}$ is chosen to minimize the estimated mean-square error matrix $E\left[\tilde{x}_{k}\tilde{x}_{k}^{T}\right]$, where $\tilde{x}_{k}=x_{k}-\hat{x}_{k}$ is the estimation error [15]. Since the filtering covariance (i.e., the mean-square error matrix) reflects the quality of each filtering step, and the gain matrix is chosen precisely to minimize it, the errors caused by model uncertainty and by the nonlinearity of the driving matrix in the tracking filter are mainly reflected in the filter gain and the filtering covariance.

At the same time, according to the estimation Equation (4), the residual also determines the accuracy of target tracking.

Therefore, this paper uses the three variables mentioned above as the inputs of the error compensation network for training. The improvement is to embed the error modification network into the subfilter of the model with the highest probability, that is, the model corresponding to the action with the highest probability selected by the agent; we call it the main filter and ignore the minor filters. Error modification is applied to this filter at every time step. Provided that the intelligent model decision is correct, the nonlinear error of the model in the filtering algorithm is minimal and the model description is more accurate.

The error modification network maps these inputs to a correction that is added to the state estimate of the main filter at each time step.
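A minimal sketch, under the assumption that the error modification network takes the filter gain, filtering covariance, and residual at each time step as input features and regresses an additive correction to the state estimate. The layer sizes and the way the inputs are flattened are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ErrorModificationNet(nn.Module):
    """LSTM regression network: a sequence of flattened (K, P, residual) features
    is mapped to a correction term added to the main filter's state estimate."""
    def __init__(self, feature_dim, state_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, features):
        # features: (batch, time, feature_dim) built from K_k, P_k, and the residual.
        out, _ = self.lstm(features)
        return self.head(out[:, -1, :])   # correction for the latest time step

# Illustrative usage: corrected estimate = filter estimate + predicted correction.
# x_corrected = x_filtered + net(feature_sequence)
```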

4. Experimental Results and Result Analysis

4.1. Setting of Network Parameters

The setting of the reward function reflects whether the decision made by the agent at the current moment is correct. The discount factor should consider not only the current reward but also future rewards, so future rewards are attenuated. At the initial moment of the simulation, the discount factor is set to 0.9 to prevent local optima, and it attenuates as the step length increases during interaction with the environment.

At the initial stage of the algorithm, the agent collects experience before training; training is not carried out until enough experience has been collected. Moreover, training is not conducted after every interaction with the environment but after every 4 interactions.

$\varepsilon$-greedy decision-making: when estimating the action from the action values, an $\varepsilon$-greedy strategy is used. The exploration factor decreases gradually as training progresses: it starts from a relatively large value, decreases linearly to a small final value, and is then kept unchanged. In this way, during training the agent's strategy at each step is determined mainly by the maximum action value of the current state.
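The specific values of the schedule are not reproduced above; the helper below illustrates the linear decay mechanism with hypothetical start and end values and a hypothetical step count.

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.1, decay_steps=10_000):
    """Linearly decay the exploration factor from eps_start to eps_end over
    decay_steps training steps, then keep it unchanged (values are hypothetical)."""
    if step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * step / decay_steps
```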

A fixed learning rate is used, and the RMSProp optimizer is used to train for 20,000 rounds.

The data come from the target's 2-D motion trajectories, including turning motion, CA motion, CV motion, and their combinations. At the same time, noise is added to the training data to increase the amount of available training data and improve the adaptability of the algorithm. The added noise is white Gaussian noise, and the synthetic datasets consist of the sum of the measured data and the added noise.
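As a small illustration of the data augmentation described above, white Gaussian noise can be added to the measured trajectories; the noise standard deviation and the number of copies below are hypothetical values.

```python
import numpy as np

def augment_with_noise(trajectory, sigma=10.0, copies=5, rng=np.random.default_rng(0)):
    """Create noisy copies of a measured 2-D trajectory (NumPy array of shape (T, 2))
    by adding zero-mean white Gaussian noise; sigma and copies are illustrative."""
    return [trajectory + rng.normal(0.0, sigma, size=trajectory.shape) for _ in range(copies)]
```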

4.2. Error Modification Network

Since the maneuvering of the aircraft is complex, the number of training rounds is increased. After 20,000 rounds of training, the output reward function image is shown in Figure 4.

4.3. Comparative Analysis

Three deterministic schemes, called DS1, DS2, and DS3, are selected to evaluate the pros and cons of the improved algorithm proposed in this paper by studying its peak error, steady-state error, and response time. In DS1 the tracking model set exactly covers the target model set, in DS2 the tracking model set is larger than the target model set, and in DS3 the tracking model set does not completely cover the target model set, so that the advantages of the proposed algorithm can be compared in different simulation scenarios.

In order to design more complex maneuvers to verify the advantages and disadvantages of the algorithm, we assume that the target moves in the two-dimensional horizontal plane (the z component is zero). The initial position and velocity of the target are set to (5000 m, 5000 m) and (500 m/s, 400 m/s), respectively, and the flight time is 200 s.

The position and velocity of the target are selected as the filter state. The sampling interval is 1 s, the measurement data are the positions of the target, and the measurement matrix extracts the position components from the state vector; its explicit form is given together with the motion models below.

Circular turning and uniform linear motion are selected to form the motion model set.

The CT turning model and the constant-velocity (CV) moving model are specified by their state transition matrices, given below.
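For reference, under the assumed state ordering $\mathbf{x}=[x,\dot{x},y,\dot{y}]^{T}$, sampling interval $T=1\,\mathrm{s}$, and turn rate $\omega$, the measurement matrix and the standard CV and CT transition matrices take the following textbook forms (a reconstruction under these assumptions, not necessarily the exact parameterization of the original simulations):
$$H=\begin{bmatrix}1&0&0&0\\0&0&1&0\end{bmatrix},\qquad F_{\mathrm{CV}}=\begin{bmatrix}1&T&0&0\\0&1&0&0\\0&0&1&T\\0&0&0&1\end{bmatrix},\qquad F_{\mathrm{CT}}=\begin{bmatrix}1&\frac{\sin\omega T}{\omega}&0&-\frac{1-\cos\omega T}{\omega}\\0&\cos\omega T&0&-\sin\omega T\\0&\frac{1-\cos\omega T}{\omega}&1&\frac{\sin\omega T}{\omega}\\0&\sin\omega T&0&\cos\omega T\end{bmatrix}.$$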

The maneuver scheme is shown in Table 1, and the tracking model set in EMA and LMS algorithms is consistent with DS1.

The tracking result diagram of DS1 is shown in Figure 5.

The tracking result diagram of DS2 is shown in Figure 6.

The tracking result diagram of DS3 is shown in Figure 7.

In Figures 5–7, the error curves indicate the positioning error.

The tracking results are summarized in Table 2.

Through comparative analysis, it can be concluded that when the tracking model set just covers the target motion model set, the EMA and LMS algorithms and the L-DQN algorithm proposed in this paper all achieve relatively accurate tracking. Meanwhile, compared with the first two schemes, the improved L-DQN effectively reduces the mean-square tracking error, improves the jump accuracy of the model, and shortens the response time of the algorithm. Scheme DS2 is simpler than DS1, so the accuracy of all three algorithms improves to a certain extent; however, compared with DS1, the simulation shows that both the LMS and EMA algorithms lose more jump accuracy than L-DQN. When the target model falls outside the algorithm's model set, in addition to the loss of jump accuracy, both LMS and EMA also diverge to a certain degree, and the tracking error becomes greater than the positioning error.

The improved L-DQN can use offline training to solve the problems of model coverage and online response time, so that response speed is positively correlated with model coverage ability and model accuracy.

According to the simulation results, it can be concluded that the calculation time required for the system to perform localization is much smaller than the time interval between each localization, which can meet the real-time requirements in the case of limited computing resources.

5. Conclusion

In order to improve the tracking accuracy of strongly maneuvering targets, this paper proposes a VSMM algorithm based on L-DQN. On the basis of the variable-structure filtering algorithm, the model selection of the target is mapped to the choice of action labels, and reasonable state-space observations, network structure, and reward function are designed so that the DQN algorithm replaces the traditional model decision. At the same time, the LSTM algorithm is introduced to compensate the tracking and positioning error. Experimental results show that the proposed algorithm can solve the problem of incomplete model coverage in the face of unknown maneuver mutations, shorten the response time caused by overly large model sets, and achieve a positive correlation between calculation accuracy and response speed. The algorithm does not depend on a fixed model topology or inherent prior knowledge such as fixed thresholds, so it has good adaptability and stability.

The LSTM algorithm still has limitations in tracking maneuvering targets. LSTM models are designed to capture dependencies in sequential data, but they may struggle to capture the complex and nonlinear dynamics of maneuvering targets. Besides, sensitivity to input representation cannot be ignored. LSTM models heavily rely on the quality and representation of input features. If the input features do not adequately capture the relevant characteristics of maneuvering targets, the LSTM model’s performance may be limited.

The RL algorithm used in target tracking has several potential limitations. RL algorithms typically require a great number of interactions with the environment to learn effective policies. In target-tracking scenarios, collecting sufficient data can be challenging, especially if the target’s behavior is rare or difficult to observe. Designing an appropriate reward function is crucial in RL. For target tracking, defining a reward function that accurately reflects the desired tracking behavior can be difficult. It may be challenging to strike a balance between rewarding the agent for successful tracking and penalizing it for incorrect or inefficient actions.

Due to the limitations in algorithm performance and other factors, the application of this algorithm in practical engineering is currently limited. However, the approach of using the L-DQN algorithm in VSMM together with the LSTM-based error modification network has clearly enhanced accuracy and reduced errors compared with traditional methods, and it can be gradually extended to practical engineering after further improvement.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation (NNSF) of China under Grant 62101579.