Abstract

Radar multitarget tracking in dense clutter remains a difficult problem. Most existing solutions rely on complex motion models and prior knowledge of the underlying distributions. In this paper, we propose a new online tracking method based on long short-term memory (LSTM) networks that combines state prediction, measurement association, and trajectory management in an end-to-end manner. We employ LSTM networks to model target motion and trajectory association, relying on their strong learning ability to learn target motion properties and the long-term dependence of trajectory associations from noisy data. Moreover, because radar targets lack appearance information, we propose an LSTM-based architecture that computes a similarity function from long-term motion features. This similarity is then used in trajectory association to improve its robustness. The proposed method is validated in simulation scenarios, where it outperforms three baseline methods in terms of the OSPA distance.

1. Introduction

In an environment with heavy clutter, multitarget tracking based on radar detection results is a challenging task. An important family of solutions is multitarget tracking based on data association, which can generally be decomposed into three parts: modeling the motion of the tracked target to estimate trajectory parameters and filter measurements; associating the measurements with the motion trajectories to distinguish measurements of different targets from background noise and clutter; and managing the motion trajectories to determine the birth, identity retention, and termination of each target trajectory.

For modeling the motion of the tracked target, the traditional approach is based on Bayesian filtering theory [1, 2]. The Kalman filter was used first to obtain optimal unbiased estimates for linear Gaussian targets. For targets with nonlinear motion, later methods include the extended Kalman filter (EKF) [3], the unscented Kalman filter (UKF) [4], the interacting multiple model (IMM) algorithm [5], and so on. All of these methods require a limited set of motion models to be specified in advance. Particle filtering (PF) [6] does not need these assumptions, but it has low computational efficiency and suffers from sample impoverishment.

Trajectory association and trajectory management can theoretically be regarded as a maximum matching problem on a bipartite graph. For this problem, the Hungarian algorithm (HA) [7] and the Kuhn-Munkres (KM) algorithm were used first. More complex algorithms appeared later, including the multiple hypothesis tracker (MHT) [8], which builds a tree of potential tracking hypotheses for each candidate target and calculates the tracking probabilities to select the most likely combination. Another popular technique is joint probabilistic data association (JPDA) [9]. By building a validation matrix, it calculates the probabilities of all feasible joint association events and associates targets based on the obtained scores. However, these traditional methods not only require knowledge of the number of targets, their starting positions, and the clutter distribution but also suffer from a combinatorial explosion in computation when the number of targets or the amount of clutter grows. This reduces the tracking accuracy and may also lead to track association errors.

Recently, deep learning has made great progress in classification and detection applications in many fields. For example, it is widely used for identifying interference in wireless communication systems [10] and in lightweight radio frequency fingerprint identification (RFFID) systems [11]. In computer vision in particular, deep learning-based target tracking in video has also improved, for example in pedestrian monitoring [12, 13], car driving monitoring [14, 15], and biological sequence tracking [16]. However, deep learning-based multitarget tracking algorithms for radar targets are relatively rare. There are several likely reasons. First, deep learning requires a large amount of annotated training data, but real data is difficult to obtain in radar multitarget scenarios. Second, radar measurements carry almost no appearance information, which rules out the appearance features commonly used in video target tracking. Third, radar target detection and tracking have real-time requirements, so the tracking algorithm generally needs to run online.

The main contributions of this paper are as follows:

(1) To overcome the problems of traditional radar multitarget tracking based on data association, and inspired by deep learning, we propose an end-to-end multitarget tracking structure based on recurrent neural networks. We use separate LSTM networks to model the state prediction and measurement association parts of the multitarget tracking algorithm. Relying on the powerful learning ability of LSTM networks, the structure learns the long-term dependence of target motion states and measurement associations from a large amount of training data. It then combines state update and track management into a unified network structure to realize multitarget tracking. The advantages of this method are that the target motion models and clutter distribution need not be known in advance and that the combinatorial explosion of traditional data association methods is alleviated. Moreover, the tracking process runs online without caching any future frame information.

(2) To solve the problem of missing appearance information when a neural network is applied to radar multitarget tracking, we design a motion feature extraction LSTM that extracts motion features from the target velocities and calculates similarity scores, which are used to learn long-term measurement associations.

(3) To address the problem of insufficient training data, we propose a method for obtaining extensive training data from simulation models. Our multitarget tracking architecture is also validated by an analysis on the simulated data.

2. Related Work

Lately, target tracking in video applications has been extended from single-target to multitarget tracking [17]. The methods have also moved from constructing complex appearance models to considering the motion model and the interaction of the targets simultaneously [18]. In these applications, deep learning methods play an important role, mainly in the following three ways:

(1) Deep neural networks are used to extract high-order features of the tracked targets, especially for the appearance model, which can effectively improve tracking performance. Reid [8] uses AlexNet to extract deep features from targets for use in the MHT framework. Leal-Taixe et al. [19] use a Siamese network structure to extract deep features. By combining the deep features and motion information with a gradient boosting algorithm, the tracking problem is expressed as a linear program and solved effectively. However, radar signals carry no appearance information, so these methods are not suitable for extracting deep features from radar signals.

(2) Deep neural networks are used to directly learn key components of a multitarget tracking framework. Chen et al. [20] construct two CNN-based classifiers that take Faster R-CNN [21] features from the VGG-16 model [22] as input and use the classifier confidence as particle weights to obtain tracking results by particle filtering. Xiang et al. [23] propose a CNN trained with a triplet loss to learn the distance measure between tracker and detection; this distance forms the cost of the bipartite graph, which can be solved effectively by the Hungarian algorithm. However, radar targets differ from video targets, and extracting radar target features and classifying them with a CNN often performs poorly.

(3) Deep neural networks are designed directly in an end-to-end manner to obtain multitarget tracking results. The multitarget tracking task involves many interwoven components and is difficult to model as a whole for learning, but some preliminary research has appeared recently. Milan et al. [24] unify target state prediction, state update, and existence probability calculation in a single RNN. They also design a set of LSTM networks that compute the data matching matrix used in the state update process. These architectures achieved good results. Sadeghian et al. [25] design a more complex RNN structure with three subnetworks that extract multiframe appearance features, motion features, and interaction features, respectively, and integrate them into a top-level RNN for time-dependent inference to obtain the final matching probability. Experimental results show that this method is more robust.

Our architecture is similar to that of [24] but differs in three key aspects. First, [24] uses a plain RNN for motion state prediction, whose time dependence weakens when the time interval becomes long. Instead, we use an LSTM network for motion state prediction, which strengthens the ability to learn long-term dependence in target states. Second, in the measurement association part, we use an LSTM-based motion feature extraction network to calculate the similarities for data association. This is more robust to interference than the method in [24], which uses only the Euclidean distance between the target and each measurement. Third, we use different loss functions because the network architectures differ.

3. Our Approach

To address the aforementioned radar target tracking problem, we propose an online end-to-end radar multitarget tracking network architecture based on LSTM (see Figure 1). Our architecture implements target state prediction, measurement association, state update, and tracking path management within a unified LSTM network architecture, and all model learning is completed in an online end-to-end manner.

3.1. Notations

In our application scenario, the target state vector is $x_t \in \mathbb{R}^{N \times D}$. The state of each target consists of the $x$ coordinate, the $y$ coordinate, the $x$-axis velocity $v_x$, and the $y$-axis velocity $v_y$, that is, $(x, y, v_x, v_y)$ with $D = 4$, and $N$ is the maximum number of targets tracked simultaneously in a frame. The measurement vector is $z_t \in \mathbb{R}^{M \times D}$, whose dimension $D$ is the same as that of the target state vector, and $M$ is the maximum number of detections in a frame, including targets and clutter. The association probability matrix of the measurement association part is $A \in [0, 1]^{N \times (M+1)}$, each row of which gives the probabilities that the measurements belong to a certain target; that is, $A_{ij}$ is the probability that the $j$th measurement belongs to the $i$th target. The added column gives the probability that the target has no measurement, so that each row satisfies $\sum_{j} A_{ij} = 1$. The path management part uses a probability vector $\varepsilon \in [0, 1]^{N}$ to express the probability of each target's existence, which describes the target birth and termination processes.

3.2. Motion-Association Multitarget Tracking with LSTMs (MA-LSTM)

In our architectural design (see Figure 1), we group state prediction, state update, and path management into a target motion module and place measurement association in a separate target association module. This separation accommodates the diversity of measurement association algorithms and makes it easy to replace the association part when comparing tracking effectiveness. In addition, the two modules are trained separately, which eases convergence.

3.2.1. State Prediction in Motion Module

State prediction is the first submodule of the motion module. Its task is to learn a complex motion pattern, which can be noisy and nonlinear, and to predict future motion states from the trend of past motion states. This is essentially a time series prediction problem, which we solve with an LSTM network structure, as shown in Figure 2. We train the network on noisy motion trajectories so that it learns the long-term dependent motion pattern from the training data without prior knowledge.

More specifically, at the current moment $t$, the hidden state and the cell state of the prediction LSTM carry what was learned from the previous target motion states, and the input is the target motion state at moment $t$. Through the internal processing of the LSTM network, the predicted motion state at moment $t+1$ is obtained, together with the updated hidden state and cell state, which are passed on to the next step.

The formula is expressed as follows, where $\sigma$ is the sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, and the learnable parameters are the weights and biases of the network.
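For reference, a standard LSTM cell that matches this description can be written as below. The symbol names ($W_{\ast}$, $b_{\ast}$, $x^{*}_{t+1}$) and the linear output layer that maps the hidden state to the predicted state are assumed notation for illustration; the exact parameterization used in the architecture may differ.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right), \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right), \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right), \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t), \\
x^{*}_{t+1} &= W_x\, h_t + b_x .
\end{aligned}
```

Here $[h_{t-1}, x_t]$ denotes concatenation and $\odot$ the elementwise product; the forget gate $f_t$ is what allows the cell state to retain motion information over long intervals.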

3.2.2. State Update and Track Management in Motion Module

State update and track management form the second submodule of the motion module. Its task is to update the motion state of each target and to decide when a target trajectory starts and ends, on the basis of the target prediction, the measured values, and the association matrix. This is an important step in the multitarget tracking architecture because it accounts for both the clutter interference in the multitarget measurements and the association of the measured data. The network structure we designed is shown in Figure 3.

In detail, the measured values at the current moment are combined with the predicted values output by the state prediction submodule. The dot product of this combined vector with the association matrix output by the target association module gives the possible state of each target, which is then multiplied by the target existence probability. From this product and the hidden state of the prediction module, the updated motion state is obtained after a nonlinear transformation, and the target existence probability is also calculated.

The formula is expressed as follows, where $\sigma$ is the sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, and the learnable parameters are the weights and biases of the update network.
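The data flow that precedes the learned transformation can be illustrated with a minimal sketch. The function name `candidate_state`, the example numbers, and the choice to pair the last ("measurement missing") association column with the predicted state are assumptions made for illustration; the learned nonlinear transformation itself is not reproduced here.

```python
def candidate_state(measurements, prediction, assoc_row, existence):
    """Blend measurements and the prediction for one target.

    measurements: list of M state vectors (each a list of D floats)
    prediction:   predicted state of this target (D floats)
    assoc_row:    M + 1 association probabilities; the last entry is the
                  "measurement missing" column and weights the prediction
    existence:    existence probability of this target
    """
    candidates = measurements + [prediction]
    if len(assoc_row) != len(candidates):
        raise ValueError("assoc_row needs one weight per measurement plus one")
    dim = len(prediction)
    # Dot product of the association row with the stacked candidates.
    blended = [sum(w * c[d] for w, c in zip(assoc_row, candidates))
               for d in range(dim)]
    # Scale by the existence probability before the learned transformation.
    return [existence * v for v in blended]


# Hypothetical frame: two measurements, the first one close to the prediction.
measurements = [[10.0, 5.0, 1.0, 0.0], [50.0, 40.0, 0.0, 0.0]]
prediction = [11.0, 5.0, 1.0, 0.0]
print(candidate_state(measurements, prediction, [0.8, 0.0, 0.2], 1.0))
```

With most of the association weight on the first measurement, the blended state lies between that measurement and the prediction, while the clutter point with zero weight has no influence.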

3.2.3. Loss of Motion Module

In our architecture, we are interested in losses related to tracking performance and propose the following loss function to meet our needs. It involves the value predicted by the prediction module, the value output by the update module, and the predicted existence probability, together with the corresponding true values.

The loss function in formula (3) consists of four parts. The first part covers the prediction of the target motion state without measured values, for which we take the mean square error (MSE) between the predicted values and the true values. The second part covers the state after the measurement update, for which we take the MSE between the updated values and the true values. The third part covers the predicted probability of the target's existence at each moment, for which we use the binary cross-entropy loss, as shown in formula (4). The fourth part is an additional smoothing term that penalizes the absolute difference between two consecutive existence probabilities, so as to prevent the target from being lost during tracking.
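A minimal sketch of such a four-part loss is given below. The function name `motion_loss`, the weighting factors `w_*`, and the averaging conventions are assumptions for illustration, since the weights and normalization are not stated in the text.

```python
import math


def motion_loss(pred, upd, truth, exist, exist_true,
                w_pred=1.0, w_upd=1.0, w_exist=1.0, w_smooth=1.0):
    """Four-part loss over one trajectory.

    pred, upd, truth: lists of T state vectors (prediction, update, ground truth)
    exist, exist_true: lists of T existence probabilities and 0/1 labels
    The w_* weights are hypothetical; the source does not list their values.
    """
    def mse(a, b):
        n = sum(len(row) for row in a)
        return sum((x - y) ** 2 for ra, rb in zip(a, b)
                   for x, y in zip(ra, rb)) / n

    eps = 1e-12  # guards against log(0)
    bce = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
               for p, y in zip(exist, exist_true)) / len(exist)
    # Smoothness: absolute change between consecutive existence probabilities.
    smooth = sum(abs(b - a) for a, b in zip(exist, exist[1:]))
    return (w_pred * mse(pred, truth) + w_upd * mse(upd, truth)
            + w_exist * bce + w_smooth * smooth)


truth = [[0.0, 0.0], [1.0, 1.0]]
print(motion_loss(truth, truth, truth, [0.9, 0.9], [1, 1]))   # only the BCE term is non-zero
print(motion_loss(truth, truth, truth, [0.9, 0.1], [1, 1]))   # a flickering existence is penalized
```

The second call shows the purpose of the smoothing term: an existence probability that drops abruptly between frames raises the loss even when the state estimates are exact.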

3.2.4. Target Association Module

The target association module is the most challenging and creative part of the multitarget tracking architecture. We design it as an independent association module and further subdivide it into two parts: motion extraction LSTM and association LSTM.

The fundamental task of target association is to uniquely assign the corresponding measurements to each tracked target in a cluttered environment, which is essentially a maximum matching problem on a bipartite graph. Unlike the traditional solutions mentioned above, we construct the association LSTM in an end-to-end manner. Relying on the powerful memory and learning ability of LSTM, the network learns the long-term dependence of target and measurement associations from abundant data, predicts the association probability of each target with respect to all measurements, and finally produces the association matrix of all targets. Our association method satisfies the one-to-one association constraint and runs online without viewing future frames. The network structure is shown in Figure 4.

To be more specific, at time $t$, the input is the feature vector from the extraction LSTM (see below), and the hidden state and cell state both come from learning the previous target assignments. At the final softmax layer of the LSTM network, we obtain for each target an assignment probability vector over all available measurements, and the vectors of all targets together constitute the association matrix.
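The final softmax step can be sketched as follows. The scores are hypothetical stand-ins for the LSTM outputs, and the sketch shows only the row-wise normalization, not the network that produces the scores or any constraint across targets.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one target's association scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def association_matrix(logits_per_target):
    """Stack one softmax row per target; each row has M + 1 entries."""
    return [softmax(row) for row in logits_per_target]


# Two targets, three measurements plus the "missing" column (hypothetical scores).
A = association_matrix([[4.0, 0.5, 0.1, 0.2], [0.3, 0.2, 3.5, 0.1]])
for row in A:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 6))
```

Each row sums to one, so it can be read directly as the probability distribution of one target over the available measurements and the missing-measurement case.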

The input of the association LSTM requires feature vectors that represent similarity measures. Traditional similarity functions are handcrafted, and, as mentioned above, radar targets provide no appearance features. To solve these problems, we put forward the motion extraction LSTM, an architecture that computes the similarity function with an LSTM network (see Figure 5). By learning the long-term time dependence of the target velocity features, it calculates the similarity score between a target and a measurement end-to-end, without manually specified parameters or weights, and thereby determines whether a new measurement is consistent with the motion features of the target over the preceding period.

The motion extraction LSTM is shown in Figure 5. One input is the sequence of motion velocity features of the $i$th tracked target over the specified time steps, which the LSTM network processes into an $H$-dimensional output. The other input is the $j$th measured velocity vector at the current time step, which we map to an $H$-dimensional vector via a fully connected (FC) layer. The two vectors are then concatenated and passed to another FC layer, which transforms the $2H$-dimensional vector into a $k$-dimensional feature vector. In the pretraining process, we place a softmax classifier at the end and train the model parameters to judge whether the measured velocity feature corresponds to the velocity feature of the real trajectory.
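The shapes involved in this concatenation can be checked with a small sketch. The LSTM output is replaced by a random stand-in vector, the weights are random, and the names `dense` and `rand_layer` are hypothetical; only the hidden size of 128 and the feature size of 100 come from the implementation details in Section 3.3.

```python
import random


def dense(vec, weights, bias):
    """Fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def rand_layer(n_out, n_in, rng):
    """Random weights and zero biases, used only to exercise the shapes."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
H, K = 128, 100            # hidden size and feature size quoted in Section 3.3
track_summary = [rng.uniform(-1, 1) for _ in range(H)]  # stand-in for the LSTM output
meas_velocity = [120.0, -35.0]                          # hypothetical (vx, vy) in m/s

w1, b1 = rand_layer(H, 2, rng)        # FC layer: velocity -> H dimensions
w2, b2 = rand_layer(K, 2 * H, rng)    # FC layer: 2H -> K dimensions

joined = track_summary + dense(meas_velocity, w1, b1)   # concatenation, length 2H
feature = dense(joined, w2, b2)
print(len(joined), len(feature))
```

Mapping the measurement to the same width as the track summary before concatenation gives both inputs equal weight in the following layer, which is one plausible reason for this design.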

3.2.5. Loss of Association Module

For motion extraction LSTM, we use the classifier and cross-entropy loss function for pretraining.

For the association LSTM, in order to measure the cost of inappropriate associations, we adopt the common negative log-likelihood loss of the correct assignment, where $\tilde{a}$ is the correct assignment of target $i$ and $A_{ij}$ is the probability of measurement $j$ being assigned to target $i$ at time $t$.
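A negative log-likelihood of this kind can be sketched as follows. The function name `association_nll` and the averaging over targets are assumptions for illustration; the text does not state whether the loss is summed or averaged.

```python
import math


def association_nll(assoc, correct):
    """Mean negative log-likelihood of the correct assignments.

    assoc:   association matrix, one row of M + 1 probabilities per target
    correct: for each target, the column index of its true measurement
             (index M means "no measurement")
    """
    eps = 1e-12  # avoids log(0) when a correct entry has zero probability
    losses = [-math.log(row[j] + eps) for row, j in zip(assoc, correct)]
    return sum(losses) / len(losses)


A = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]]
print(association_nll(A, [0, 1]))   # small loss: correct columns have high probability
print(association_nll(A, [1, 0]))   # larger loss: correct columns have low probability
```

The loss depends only on the probability placed on the correct column of each row, so it pushes the softmax output toward the true measurement without requiring a handcrafted distance.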

3.3. Implementation Details

We use the TensorFlow framework to implement our design. The state prediction LSTM has a single layer of 256 hidden units. The measurement association task requires more representation ability, so the association LSTM uses two layers of 256 hidden units, and the motion extraction LSTM uses one layer of 128 hidden units. The output feature vector has 100 dimensions, and the length of the extracted sequence is 10. We use grid search to select the network hyperparameters [26] and Adam to optimize our framework. The learning rate is initially set to 0.001 and decreases by 10% every 10 epochs. We set the maximum number of iterations to 10,000, which is enough to achieve convergence. Training these network architectures takes about 10 hours on a GPU.
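The learning rate schedule can be written as a simple step decay. This sketch assumes that "decreasing by 10%" means multiplying the current rate by 0.9 at every tenth epoch; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.9, step=10):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 9, 10, 20, 100):
    print(epoch, learning_rate(epoch))
```

Under this reading the rate stays at 0.001 for the first ten epochs and falls to roughly a third of its initial value after 100 epochs.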

3.3.1. Training Data

As mentioned above, training a deep network requires a large amount of training data. However, there are very few open annotated datasets for multitarget tracking of radar signals. Therefore, we generate training data from radar motion simulation models. The generated training trajectories contain multiple targets with random birth times and lifetimes and incorporate a large amount of uniformly distributed random clutter. The initial position and velocity of each target are randomly distributed within a certain range, and the conventional constant velocity and constant acceleration motion models of radar targets are adopted. The radar measurements are sampled every 2 seconds with added Gaussian measurement noise, while the annotations for multitarget associations are added manually.

4. Experiments

We have proposed a radar multitarget tracking network architecture. To demonstrate its functionality, we first present experiments on simulated data and then give more insights and analysis of our performance.

In the simulation experiment, we use a 2D radar to track multiple moving targets, assuming that the target movements are independent of each other and that trajectories may cross. Each target follows a linear equation of motion in its state vector, driven by process noise that represents random acceleration in the $x$- and $y$-axis directions.

We assume that each target follows either the constant acceleration (CA) model or the constant velocity (CV) model [27]. In the corresponding process equations and process noise terms, $T$ is the sampling period.
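One step of each motion model can be sketched as below. The state layout $(x, y, v_x, v_y)$ follows Section 3.1 and the default sampling period of 2 s follows Section 3.3.1; the function names, the Gaussian form of the random acceleration, and the treatment of the acceleration as an external input in the CA step are assumptions for illustration.

```python
import random


def cv_step(state, T=2.0, accel_std=0.0, rng=random):
    """One constant-velocity step with random acceleration as process noise.

    state is (x, y, vx, vy); T is the sampling period in seconds.
    """
    x, y, vx, vy = state
    ax = rng.gauss(0.0, accel_std)
    ay = rng.gauss(0.0, accel_std)
    return (x + vx * T + 0.5 * ax * T * T,
            y + vy * T + 0.5 * ay * T * T,
            vx + ax * T,
            vy + ay * T)


def ca_step(state, accel, T=2.0):
    """One constant-acceleration step with a fixed acceleration (ax, ay)."""
    x, y, vx, vy = state
    ax, ay = accel
    return (x + vx * T + 0.5 * ax * T * T,
            y + vy * T + 0.5 * ay * T * T,
            vx + ax * T,
            vy + ay * T)


state = (0.0, 0.0, 100.0, 50.0)
for _ in range(3):
    state = cv_step(state)      # noise-free: the target moves in a straight line
print(state)
```

Both steps apply the same kinematic update; they differ only in whether the acceleration is a zero-mean disturbance (CV) or a sustained value (CA).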

The target observation equation relates the measurement to the target state through the observation matrix; the measurement noise is zero-mean Gaussian white noise with a specified covariance matrix.

The radar observation vector consists of the azimuth and the distance of the target as observed by the sensor, and the target observation model maps the target position to these two quantities.
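An azimuth and range observation of this kind can be sketched as follows. The sketch assumes a sensor at the origin, an azimuth measured counterclockwise from the $x$ axis, and independent Gaussian noise on both components; the function names and the noise parameters are hypothetical.

```python
import math
import random


def observe(state, sigma_theta=0.0, sigma_r=0.0, rng=random):
    """Return (azimuth in radians, range) of a target seen from the origin."""
    x, y = state[0], state[1]
    theta = math.atan2(y, x) + rng.gauss(0.0, sigma_theta)
    r = math.hypot(x, y) + rng.gauss(0.0, sigma_r)
    return theta, r


def to_cartesian(theta, r):
    """Convert a polar measurement back to (x, y) for association."""
    return r * math.cos(theta), r * math.sin(theta)


theta, r = observe((3000.0, 4000.0, 100.0, 0.0))
print(theta, r)
print(to_cartesian(theta, r))
```

Converting the polar measurement back to Cartesian coordinates keeps the measurement in the same space as the target position, which is convenient when the measurement vector is required to match the dimension of the state vector.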

On this basis, we set an observation area of , 5 radar tracking targets with random birth and death times. Each target has detection probability . In the observed region, the number of clutter points follows a Poisson distribution and their positions are uniformly distributed, and the clutter intensity is set as .
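Clutter of this type can be generated as in the following sketch. The mean clutter count, the area bounds, and the function names are hypothetical placeholders, not the values used in the experiments.

```python
import math
import random


def poisson_sample(lam, rng):
    """Knuth's method; adequate for the small means used in this sketch."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def clutter_frame(mean_count, x_range, y_range, rng):
    """Draw a Poisson number of clutter points, uniform over the area."""
    n = poisson_sample(mean_count, rng)
    return [(rng.uniform(*x_range), rng.uniform(*y_range)) for _ in range(n)]


rng = random.Random(1)
# Hypothetical values: 10 clutter points per frame on average, 10 km x 10 km area.
frame = clutter_frame(10.0, (0.0, 10000.0), (0.0, 10000.0), rng)
print(len(frame), frame[:2])
```

The mean count per frame equals the clutter intensity multiplied by the observation area, so the same routine covers any intensity once the area is fixed.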

To generate large amounts of training data, the initial position of each target is randomly distributed within the observation area, the initial linear velocity is randomly distributed from 100 m/s to 300 m/s, and the acceleration is randomly distributed from to . The times of birth and death of each target are randomly distributed between 3 and 18 s. The number of targets at each sampling time is annotated manually. In this way, we produce 10,000 random trajectories together with their observation data, which is sufficient to train our network architecture.

The tracking results are shown in Figures 6 and 7. Qualitatively, our method can track multiple moving radar targets in heavy clutter, even when the trajectories of these targets cross. We also see that the predicted trajectories always start and end one or two points later than the real trajectories, which is difficult to avoid with an online method.

For quantitative analysis, we compare our proposed method (MA-LSTM) with three baseline methods. The first baseline is the traditional KF-HA, which combines a Kalman filter with data association via the Hungarian algorithm. The second baseline is the JPDA method mentioned above. The third baseline (M-HA) combines our motion module with data association via the Hungarian algorithm. The optimal subpattern assignment (OSPA) [28] distance is calculated for each method. OSPA is a consistent metric for evaluating the overall performance of a target tracking system; it measures the error between the real tracks and the estimated tracks and separates the total error into distance errors and association errors. For the tunable distance sensitivity parameter and association sensitivity parameter , we set and , respectively.
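For a single frame, the OSPA distance between the true and estimated target positions can be computed as in the sketch below. It uses the standard definition with a cutoff `c` and an order `p`; the default values are placeholders rather than the settings of the experiments, and the brute-force search over assignments is chosen for brevity and is practical only for a handful of targets.

```python
import math
from itertools import permutations


def ospa(truth, est, c=100.0, p=2):
    """OSPA distance between two sets of 2-D points.

    c is the cutoff and p the order; the defaults are placeholders, not
    the values used in the experiments. Brute force over permutations,
    so only suitable for a handful of targets.
    """
    if not truth and not est:
        return 0.0
    small, large = sorted((truth, est), key=len)
    m, n = len(small), len(large)
    # Best assignment of the smaller set to distinct points of the larger set.
    best = min(
        sum(min(c, math.dist(s, large[j])) ** p for s, j in zip(small, perm))
        for perm in permutations(range(n), m)
    )
    # Every unassigned point is charged the cutoff c.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


truth = [(0.0, 0.0), (100.0, 100.0)]
print(ospa(truth, [(3.0, 4.0), (100.0, 100.0)]))   # localization error only
print(ospa(truth, [(0.0, 0.0)]))                   # one missed target
```

The second call illustrates why the metric penalizes track management errors: a missed target contributes the full cutoff even though the remaining estimate is exact.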

The calculation results are shown in Table 1. MA-LSTM outperforms the other methods in all three aspects. It has a clear advantage in overall OSPA and association OSPA and is slightly better than JPDA in distance OSPA. It is worth noting that, compared with data association by the HA method, using an LSTM for data association and motion velocity features for the similarity calculation has a positive impact on the overall tracking performance, especially in reducing the association error.

Figure 8 shows how the OSPA values of the different methods evolve over time on a long test sequence. In the early stage of the test, the OSPA values of the learning-based methods are high and then decrease rapidly, because the motion properties are still being learned in the early stages of tracking. The OSPA value of our method drops significantly from the 10th frame, because the velocity feature extraction sequence length in the data association module is set to 10. That is, data association becomes fully effective only after 10 frames; from then on, our method performs better than the other methods.

5. Conclusion

This paper presents an LSTM-based network architecture for radar multitarget tracking. The architecture handles state prediction, measurement association, and trajectory management for radar multitarget tracking in heavy clutter. In addition, we propose a motion extraction LSTM that extracts motion features to calculate similarity scores, which are used to learn long-term target associations and reduce the association error in particular. Our architecture accomplishes tracking tasks online and has been verified in simulation scenarios. In future work, we plan to extend it to video tracking and to make the association strategy more robust by combining more cues.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding authors upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This research was supported by the Fundamental Research Funds for the Central Universities under Grant 3102019ZX015 and in part by the Fundamental Research Funds for the Central Universities under Grant D5000220131.