Abstract

Multiple-object tracking is a challenging problem in the computer vision community. In this paper, we propose a multiobject tracking algorithm for videos based on long short-term memory (LSTM) and deep reinforcement learning. First, multiple objects are detected by the object detector YOLO V2. Second, the problem of single-object tracking is cast as a Markov decision process (MDP), since this setting provides a formal strategy for modeling an agent that makes a sequence of decisions. Each single-object tracker is composed of a network consisting of a CNN followed by an LSTM unit; each tracker, regarded as an agent, is trained using deep reinforcement learning. Finally, for each frame, we perform LSTM-based data association between the results of the object detector and the results of the single-object trackers. The experimental results show that our tracker achieves better performance than other state-of-the-art methods: multiple targets are tracked steadily even under frequent occlusions, similar appearances, and scale changes.

1. Introduction

Multiobject tracking in videos plays an important role in a wide range of applications, for example, video surveillance, robot navigation, intelligent transportation systems, and video analysis, to name a few [1, 2]. Although the field has made tremendous progress since the early work, visual multiobject tracking is still regarded as a challenging problem due to frequent occlusions, appearance similarity between objects, varying numbers of objects, and environmental noise within measurements [3, 4].

1.1. Related Work

Tracking-by-detection methods [5–7] have emerged as one of the most successful strategies due to recent advances in methods for object detection [8–10]. Most recent tracking-by-detection algorithms decompose multiobject tracking into two stages: object detection and data association. These algorithms apply the object detector in each frame and associate the detector's results over time. Therefore, this kind of multiobject tracking method can more easily recognize objects that appear or disappear in a video sequence, and the search space of object hypotheses can be greatly reduced.

Tracking-by-detection methods are roughly classified into two categories: offline approaches and online approaches. Offline approaches often use the detections of all the frames of the video sequence together to build long trajectories that are robust against false detections and occlusion. A crowded or cluttered scene usually causes some detection failures, which in turn decrease the accuracy of data association. To compensate for these problems, many multiobject tracking algorithms using global data association have been proposed [11–14]. However, the performance of offline approaches is still limited, and it is hard to apply them to real-time applications. Online approaches, by contrast, perform data association between detections and trackers frame by frame, so they can be applied to real-time applications. Bae and Yoon [15] proposed a novel online visual multiobject tracking approach that can handle the similarity between multiple objects.

Data association is the major issue of tracking-by-detection methods [16]. Classical data association approaches include the joint probabilistic data association filter (JPDAF) and multihypotheses tracking (MHT) [17]. JPDAFs consider all possible associations between objects to make the best assignment in each time step. MHT considers multiple possible associations over several time steps, but its application is usually limited by its complexity. Many recent multiobject tracking algorithms have concentrated on enhancing the performance of the object detector or designing better data association schemes [18–20].

In recent years, LSTM has attracted increasing attention in modeling sequential data. Applications cover feature selection [21], machine translation [22], action recognition [23], video captioning [24], and human trajectory prediction [25]. The main advantages of LSTMs for modeling sequential data are that they allow end-to-end fine-tuning and are not confined to fixed-length inputs or outputs. Inspired by the successful works that have applied LSTM in computer vision, we adopt a data association method based on LSTM in this paper. LSTM includes nonlinear transformations and memory cells, which make it effective for data association.

Most previous multiobject tracking methods represent objects using raw pixels and low-level handcrafted features, such as histograms of oriented gradients (HOG) [26], Haar-like features [27], and local binary patterns (LBP) [28]. Although computationally efficient, these methods are limited because handcrafted features cannot capture the more complex characteristics of objects. Recently, deep learning has received much attention with state-of-the-art results in complicated tasks such as object detection [29], image classification [30], object recognition [31], and object tracking [32]. A deep-learning tracker (DLT) was proposed in [33], which uses a stacked denoising autoencoder to learn generic features from a large number of auxiliary images offline. However, the DLT tracker cannot describe the temporal invariance of deep features, which is important for visual object tracking. In [34], a deep-learning tracking method was developed that uses a two-layer convolutional neural network (CNN) to learn hierarchical features from auxiliary video sequences; this method takes appearance variations and complicated motion transformations of objects into account. In [35], the authors present a visual tracking algorithm that includes a specific feature extractor with CNNs trained on an offline training set; the CNNs learn both spatial and temporal features jointly from image pairs of two adjacent frames. These deep-learning trackers often overlook how to search the region of interest around objects and select the best candidate as the tracking result.

With the recent exciting achievements of deep learning, integrating deep-learning methods with RL has shown very promising results on decision-making problems, giving rise to deep reinforcement learning (DRL). Deep neural networks make reinforcement learning algorithms perform more effectively because they provide deep feature representations. DRL algorithms have achieved unequalled success in many challenging domains, for example, Atari games [36] and the board game Go [37]. In the computer vision community, there have also been many attempts to apply DRL to traditional tasks, such as action recognition [38], object localization [39], object tracking [40], and region proposal [41]. Yun et al. propose an end-to-end active object tracking algorithm via reinforcement learning, which addresses tracking and camera control simultaneously [42]. In [43], the authors present action-decision networks for visual tracking with deep reinforcement learning. However, these tracking methods based on deep reinforcement learning usually focus on a single object; there is little work on multiobject tracking. Unlike the aforementioned methods, our method exploits deep reinforcement learning to solve the online multiobject tracking problem.

1.2. Summary of Contributions

Our motivation is to design a real-time multiple-object tracker via LSTM and DRL, which incorporates appearance through DRL and learns a more effective association strategy through LSTM to improve tracking performance. The key contributions of this paper can be summarized as follows:

(i) We propose a novel visual multiobject tracking algorithm based on LSTM and deep reinforcement learning to solve the problems in existing methods; it is model-free and requires no prior knowledge. To the best of our knowledge, we are the first to combine these concepts to overcome problems in visual multiobject tracking.

(ii) The proposed multiobject tracker includes three modules: an object detection module, a number of single-object trackers, and a data association module. We adopt YOLO V2 as the object detector because it is a real-time detection system. Each single-object tracker is treated as an agent trained using DRL. An LSTM-based architecture is adopted to solve the joint data association problem.

(iii) To compare our multiobject tracker with other state-of-the-art methods qualitatively and quantitatively, we conducted extensive experiments on publicly available challenge benchmark datasets.

The rest of our paper is structured as follows: Section 2 reviews the background. Section 3 introduces the proposed multiobject tracking framework. Section 4 demonstrates the experimental results and analysis. Finally, we draw conclusions in Section 5.

2. Background

2.1. Long Short-Term Memory (LSTM)

Traditional recurrent neural networks (RNNs) contain cyclic connections that make them a powerful tool for learning complex temporal dynamics, as shown in Figure 1. The computation in an RNN is governed by the following formulas:

$$h_t = \varphi(W_{xh} x_t + W_{hh} h_{t-1}),$$
$$z_t = \varphi(W_{hz} h_t),$$

where $\varphi$ is an element-wise nonlinearity function, $x_t$ and $z_t$ represent the input vector and the output vector at time step $t$, and $h_t$ is the hidden-layer vector with $N$ hidden units at time step $t$. $W_{xh}$, $W_{hh}$, and $W_{hz}$ are the weight matrices of the connections from input nodes to hidden nodes, hidden nodes to hidden nodes, and hidden nodes to output nodes, respectively.
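For concreteness, the following is a minimal NumPy sketch of one forward step of such a vanilla RNN, using tanh as the element-wise nonlinearity (the dimensions are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hz):
    """One forward step of a vanilla RNN with tanh as the nonlinearity."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # hidden-state update
    z_t = np.tanh(W_hz @ h_t)                  # output
    return h_t, z_t

# Illustrative dimensions: 3-d input, 5 hidden units, 2-d output.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(5, 3))
W_hh = rng.normal(size=(5, 5))
W_hz = rng.normal(size=(2, 5))
h = np.zeros(5)
for x_t in rng.normal(size=(4, 3)):  # a toy sequence of four steps
    h, z = rnn_step(x_t, h, W_xh, W_hh, W_hz)
```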

Though RNNs have been successfully used for sequence modeling tasks, they can only model the data within a fixed-size window. At the same time, training conventional RNNs is difficult due to the problem of exploding and vanishing gradients. These problems limit the capability of RNNs to learn long-term dynamics. LSTM was proposed in [44] to solve these problems. The LSTM unit is used in this paper as described in [45], as shown in Figure 2.

In this subsection, we provide the equations of LSTM for a single memory unit only. Let $x = (x_1, \ldots, x_T)$ be an input sequence and $h = (h_1, \ldots, h_T)$ represent an output sequence; an LSTM network iteratively computes a mapping between $x$ and $h$ using the following equations:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f),$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o),$$
$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t,$$
$$h_t = o_t \odot \tanh(c_t),$$

where $\sigma$ is the logistic sigmoid function, $g_t$ is the cell input activation vector, $i_t$ describes the input gate, $f_t$ represents the forget gate, and $o_t$ the output gate. All of the above are the same size as the hidden vector $h_t$. That is, in addition to the hidden vector $h_t$, the LSTM includes an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$. The subscripts of each weight matrix indicate its meaning; for example, $W_{hi}$ is the hidden-to-input-gate matrix and $W_{xo}$ is the input-to-output-gate matrix. $b_i$, $b_f$, $b_o$, and $b_c$ are the bias terms added to the input gate, forget gate, output gate, and cell, respectively.
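The following NumPy sketch implements exactly these gate equations for a single memory unit (the dimensions are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W is a dict of weight matrices (e.g., W['hi'] is the hidden-to-input-gate
    matrix, W['xo'] the input-to-output-gate matrix); b holds the bias terms.
    """
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])   # cell input activation
    c = f * c_prev + i * g                                   # memory cell update
    h = o * np.tanh(c)                                       # hidden state
    return h, c

# Illustrative sizes: 3-d input, 4 hidden units.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) if k.startswith('x') else rng.normal(size=(4, 4))
     for k in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
b = {k: np.zeros(4) for k in 'ifoc'}
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=3), h, c, W, b)
```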

2.2. Deep Reinforcement Learning (DRL)

Reinforcement learning (RL) is typically used to solve sequential decision-making problems. The process of reinforcement learning is shown in Figure 3. Recently, significant progress has been made by combining reinforcement learning with the feature-representation learning ability of deep learning. Deep Q network (DQN) and policy gradient are two popular families of DRL algorithms. DQN is a form of Q-learning with function approximation: a neural network learns the state-action value function by minimizing temporal-difference errors. To improve performance and stability, various network architectures build on the DQN algorithm, such as dueling DQN [46] and double DQN [47].
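As an illustration of the DQN idea (not the method used in this paper), the temporal-difference target that the network regresses toward can be sketched as follows; the discount factor is an illustrative assumption:

```python
import numpy as np

def dqn_td_target(reward, next_q_values, done, gamma=0.99):
    """Temporal-difference target y = r + gamma * max_a' Q(s', a').

    next_q_values: the network's Q estimates for the next state; if the
    episode has ended, the target is just the immediate reward.
    """
    return reward if done else reward + gamma * np.max(next_q_values)
```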

A policy gradient approach is a type of reinforcement learning method that directly optimizes parametrized policies by gradient ascent on the expected return [48]. Policy gradient methods have several advantages over traditional reinforcement learning approaches: for example, they need fewer parameters to represent the optimal policy than the corresponding value function, and they do not suffer from the difficulties caused by uncertain state information.

3. Proposed Visual Multiobject Tracking Algorithm

Subsection 3.1 first gives a brief overview of the architecture of our proposed multiobject tracking algorithm; Subsections 3.2 and 3.3 then describe the details of the method.

3.1. Architecture of the Proposed Multiobject Tracking Algorithm

Our method consists of three major components: an object detection module, a number of single-object trackers, and a data association module, as shown in Figure 4. First, we choose YOLO V2 [49] as the object detector because it is a state-of-the-art, real-time object detection system. YOLO V2 is applied to every frame and outputs a set of detections $Z_t$ at time step $t$. In each frame, YOLO V2 may output many kinds of detections. To identify the detections that correspond to the tracked objects, the intersection-over-union (IoU) distance is computed between the ground truth and the detections in the first frame; in subsequent frames, the IoU distance is computed between the mean of each object's short-term history of validated detections and the current detections. Second, each single-object tracker is composed of a network that includes a CNN followed by an LSTM unit; each tracker, regarded as an agent, is trained using deep reinforcement learning. Finally, inspired by [50], we adopt an LSTM-based architecture that learns to solve the joint data association problem from training data.
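The IoU computation and gating described above can be sketched as follows (the [x, y, w, h] box format and the gating threshold are our assumptions):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, w, h]."""
    xa1, ya1, xa2, ya2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb1, yb1, xb2, yb2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def gate(reference_box, detections, max_distance=0.5):
    """Keep detections whose IoU distance (1 - IoU) to the reference box is
    small. The reference is the ground truth in the first frame or the mean
    of recent validated detections afterwards; the threshold is an assumption.
    """
    return [d for d in detections if 1.0 - iou(reference_box, d) < max_distance]
```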

3.2. Single-Object Tracker via Deep Reinforcement Learning

We cast the problem of object tracking as a Markov decision process (MDP), since this setting provides a formal strategy for modeling an agent that makes a sequence of decisions. In our formulation, a single-frame image is considered the environment, in which the agent transforms a bounding box using a set of actions. The MDP includes a set of actions $A$, a set of states $S$, a state transition function $T$, and a reward signal $R$. Our single-object tracking framework is illustrated in Figure 5. This section presents the details of these components.

In our paper, the set of actions $A$ is composed of six actions that can be applied to the bounding box and one action that terminates the search process, as shown in Figure 6. Each action is encoded as a 7-dimensional vector. The six box actions are organized in three subsets: horizontal moves {right, left}, vertical moves {up, down}, and scale changes {scale up, scale down}.

The state is defined as a tuple $s = (o, h)$, where $o$ is the image patch (specified by a 4-dimensional bounding-box vector $b$) of the object and $h$ is a vector with the history of taken actions. The history vector stores the past 10 actions, so $h$ has 70 dimensions, since each action vector has 7 dimensions. At time step $t+1$, the state $s_{t+1}$ is determined by $s_t$ and the state transition functions, where $o_{t+1} = f_o(o_t, a_t)$ and $h_{t+1} = f_h(h_t, a_t)$.
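A minimal sketch of the bounding-box transition $f_o$ and the history update $f_h$ follows; the step size alpha and the one-hot action encoding are our assumptions:

```python
import numpy as np

ACTIONS = ['left', 'right', 'up', 'down', 'scale_up', 'scale_down', 'stop']

def apply_action(box, action, alpha=0.03):
    """Apply one discrete action to a box [x, y, w, h].

    Moves translate the box by a fraction alpha of its size; scale actions
    grow or shrink it around its center; 'stop' leaves it unchanged.
    """
    x, y, w, h = box
    dx, dy = alpha * w, alpha * h
    if action == 'left':
        x -= dx
    elif action == 'right':
        x += dx
    elif action == 'up':
        y -= dy
    elif action == 'down':
        y += dy
    elif action in ('scale_up', 'scale_down'):
        s = 1.0 + alpha if action == 'scale_up' else 1.0 - alpha
        nw, nh = w * s, h * s
        x, y = x + (w - nw) / 2.0, y + (h - nh) / 2.0  # keep the center fixed
        w, h = nw, nh
    return [x, y, w, h]

def update_history(history, action):
    """Shift the 70-d history vector and append the 7-d one-hot action code."""
    one_hot = np.eye(len(ACTIONS))[ACTIONS.index(action)]
    return np.concatenate([history[len(ACTIONS):], one_hot])
```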

The agent receives a reward signal from the environment during the training process. In our method, the reward is given at the end of a tracking episode, when the object has been tracked successfully; the reward signal is zero during the intermediate iterations of the MDP within a time step. When the "stop" action is selected at termination step $T$, the reward signal is a thresholding function of IoU as follows:

$$r_T = \begin{cases} +1 & \text{if } \mathrm{IoU}(b_T, g) > \tau, \\ -1 & \text{otherwise}, \end{cases}$$

where $\mathrm{IoU}(b_T, g)$ represents the overlap ratio of the final bounding box $b_T$ and the ground truth $g$ of the object, and $\tau$ is the overlap threshold.
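A minimal sketch of this reward scheme (the threshold value tau = 0.7 is an assumption; the text only specifies that the reward thresholds the IoU):

```python
def terminal_reward(iou_value, tau=0.7):
    """Reward at the termination step: +1 if the final box overlaps the
    ground truth by more than tau (an assumed threshold), else -1."""
    return 1.0 if iou_value > tau else -1.0

def step_reward():
    """Non-terminal actions within a tracking episode receive zero reward."""
    return 0.0
```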

We adopt policy-based reinforcement learning methods, as they have a better capability of learning stochastic policies and better convergence properties. Our whole network is parameterized by $\theta$; the policy-based method models the policy function $\pi(a_t \mid s_t; \theta_p)$ and the value function $V(s_t; \theta_v)$, and the aim of training this network is to maximize the overall tracking performance by policy gradient approximation. At each time step $t$, the goal of the agent is to learn the policy function $\pi(a_t \mid s_t; \theta_p)$. An approximation of the policy function can be obtained by a stochastic gradient ascent algorithm. As there are very limited amounts of labelled data for multiobject tracking, we use synthetic data as a supplement to the real data during training. The parameters $\theta_p$ and $\theta_v$ can be learned according to the following equations:

$$\theta_p \leftarrow \theta_p + \alpha \nabla_{\theta_p} \left[ \log \pi(a_t \mid s_t; \theta_p) \left( R_t - V(s_t; \theta_v) \right) + \beta H\left(\pi(\cdot \mid s_t; \theta_p)\right) \right],$$
$$\theta_v \leftarrow \theta_v - \alpha \nabla_{\theta_v} \left( R_t - V(s_t; \theta_v) \right)^2,$$

where $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the sum of future rewards up to $T$ time steps, $\gamma \in (0, 1]$ is the discount factor, $\alpha$ is the learning rate, $H$ is an entropy regularizer, and $\beta$ is the regularizer factor.
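A PyTorch-style sketch of one such actor-critic update, consistent with the equations above (how the episode rollout is collected, the discount factor, and the loss weighting are our assumptions):

```python
import torch

def policy_gradient_update(log_probs, values, rewards, entropies,
                           optimizer, gamma=0.99, beta=0.01):
    """One REINFORCE-with-baseline update over a finished episode.

    log_probs: log pi(a_t|s_t) per step; values: V(s_t) predictions;
    rewards: per-step rewards; entropies: policy entropy per step.
    gamma and the exact loss weighting are illustrative assumptions.
    """
    returns, R = [], 0.0
    for r in reversed(rewards):            # R_t = discounted sum of future rewards
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)

    policy_loss, value_loss = 0.0, 0.0
    for logp, v, R, H in zip(log_probs, values, returns, entropies):
        advantage = R - v.detach()         # baseline-subtracted return
        policy_loss += -logp * advantage - beta * H   # entropy-regularized
        value_loss += (R - v) ** 2         # critic regression toward R_t

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```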

Our deep CNN is built on the VGG-16 network, which includes five convolutional stages, Conv1-2, Conv2-2, Conv3-3, Conv4-3, and Conv5-3, each followed by a pooling layer. The spatial resolution decreases gradually as the depth of the layers increases because each pooling layer in the VGG-16 model has a 2 × 2 kernel and a stride of 2. For example, for an input image of size 224 × 224, the output feature maps of pooling 5 have a size of 7 × 7. In our model, we use the feature maps from Conv3-3, Conv4-3, and Conv5-3, which are brought to the same size using bilinear interpolation.
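A torchvision-based sketch of this multi-layer feature extraction (the layer indices used to tap Conv3-3, Conv4-3, and Conv5-3, and the common output size, are our assumptions):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

backbone = vgg16(weights='IMAGENET1K_V1').features.eval()
# Indices of the ReLU outputs following Conv3-3, Conv4-3, and Conv5-3 in
# torchvision's vgg16 layer list (our assumption about the tap points).
TAPS = (15, 22, 29)

@torch.no_grad()
def multi_scale_features(image, out_size=(28, 28)):
    """Run VGG-16 and bilinearly resize the three tapped maps to one size."""
    feats, x = [], image
    for idx, layer in enumerate(backbone):
        x = layer(x)
        if idx in TAPS:
            feats.append(F.interpolate(x, size=out_size, mode='bilinear',
                                       align_corners=False))
    return torch.cat(feats, dim=1)   # channel-wise concat: 256 + 512 + 512

# Example: a 224x224 RGB input yields a (1, 1280, 28, 28) feature tensor.
features = multi_scale_features(torch.randn(1, 3, 224, 224))
```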

3.3. Data Association

Let $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\}$ represent the set of all outputs of the single-object trackers at time step $t$, where $x_t^i$ refers to the state of the $i$th output of a single-object tracker and $N$ is the number of objects that can be tracked simultaneously in one time step. The state of the $i$th object is represented by the 4-dimensional bounding-box vector $x_t^i = [x, y, w, h]$. We define $Z_t = \{z_t^1, z_t^2, \ldots, z_t^M\}$ as the set of detections from the object detector, with $z_t^j$ the $j$th detection and $M$ the number of detections. Let $C_t \in \mathbb{R}^{N \times M}$ denote the similarity matrix for data association that measures the relation between an output of a single-object tracker and a detection, where $C_t(i, j)$ is the Euclidean distance between $x_t^i$ and $z_t^j$. Data association based on LSTM for object $i$ is illustrated in Figure 6.
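A minimal sketch of building $C_t$, computing the Euclidean distance directly on the 4-dimensional box vectors per the definition above:

```python
import numpy as np

def similarity_matrix(tracker_states, detections):
    """C[i, j] = Euclidean distance between tracker output i and detection j.

    Both inputs are arrays of 4-d [x, y, w, h] vectors: tracker_states has
    shape (N, 4) and detections has shape (M, 4); the result is (N, M).
    """
    T = np.asarray(tracker_states, dtype=float)
    D = np.asarray(detections, dtype=float)
    return np.linalg.norm(T[:, None, :] - D[None, :, :], axis=-1)
```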

The task of data association is to predict the assignment for each object using the temporal step-by-step functionality of LSTM. The inputs at each step $t$ are the hidden state $h_{t-1}$, the cell state $c_{t-1}$, and the similarity matrix $C_t$. The outputs are the hidden state $h_t$, the cell state $c_t$, and the assignment probability vector $p_t^i$. $p_t^i$ is a vector of assignment probabilities for object $i$ over all available measurements, obtained by applying a softmax layer with normalization to the predicted values; $p_t^i(j)$ is the probability that object $i$ is assigned to the $j$th detection, and $\sum_j p_t^i(j) = 1$. Let $\tilde{a}_i$ be the correct assignment; we adopt the negative log-likelihood loss as the cost function to measure the misassignment cost:

$$\mathcal{L} = -\sum_i \log p_t^i(\tilde{a}_i).$$
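A PyTorch sketch of one association step and its loss, using the layer sizes given in this subsection (flattening the similarity matrix as the LSTM input and the maximum numbers of objects and detections are assumptions):

```python
import torch
import torch.nn as nn

N, M = 10, 10                        # assumed max numbers of objects and detections
lstm = nn.LSTM(input_size=N * M, hidden_size=512, num_layers=2)
head = nn.Linear(512, M)             # scores of the M detections for one object
nll = nn.NLLLoss()

def associate_step(C_t, state):
    """One temporal step: similarity matrix in, assignment log-probabilities out.

    state carries (h, c) across steps; pass None to start from zeros.
    """
    x = C_t.reshape(1, 1, -1)                            # (seq=1, batch=1, N*M)
    out, state = lstm(x, state)
    log_p = torch.log_softmax(head(out[0, 0]), dim=-1)   # softmax-normalized
    return log_p, state

# Negative log-likelihood of the correct assignment index.
C_t = torch.randn(N, M)
log_p, state = associate_step(C_t, None)
loss = nll(log_p.unsqueeze(0), torch.tensor([3]))        # index 3 is illustrative
```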

Data association is a more complex task and thus requires more representational power: the LSTM-based data association module includes two layers with 512 hidden units. Training all the modules of our tracker takes approximately 40 hours on a CPU; training can be sped up significantly by using GPUs.

4. Experiments

4.1. Qualitative Evaluation

In this section, we compare our visual multiobject tracker with several state-of-the-art methods on the MOT Challenge benchmark [51] to show the performance of our algorithm. The synthetic datasets OVVV [52] and virtual KITTI [53] are used as a supplement to the real data during training. In the single-object tracker, the learning rate for the CNN is set to 0.0001, and that for the fully connected layers is set to 0.001. In the DRL network, the learning rate is set to 0.0001 and the regularizer factor is set to 0.01.

The PETS09-S2L2 sequence consists of 436 frames of 768 × 576 pixels with heavy crowd density and illumination changes; the pedestrians undergo severe occlusion and scale changes. The ADL-Rundle-3 sequence consists of 625 frames of 1920 × 1080 pixels; it shows a crowded pedestrian street captured from a stationary camera, with frequent occlusions, missed detections, and illumination variation among the multiple objects. The TUD-Crossing sequence shows a road crossing from a side view; it consists of 201 frames of 640 × 480 pixels and includes nonlinear motion, objects in close proximity, and occlusions. The AVG-Town Center sequence contains 450 frames of 1920 × 1080 pixels; it shows a busy town-center street from a single elevated camera, with medium crowd density, frequent dynamic occlusions, and scale changes.

We compare our method (LSTM_DRL) with other state-of-the-art trackers including RNN-LSTM [50], LP_SSVM [54], MDPSubCNN [55], and SiameseCNN [56]. Figures 7, 8, 9, and 10 demonstrate the qualitative tracking results of our tracker on PETS09-S2L2, ADL-Rundle-3, TUD-Crossing, and AVG-Town Center. Figures 11, 12, 13, and 14 show the sample tracking results of other trackers on PETS09-S2L2, ADL-Rundle-3, TUD-Crossing, and AVG-Town Center.

From these experimental results, we can see that our tracker performs well most of the time despite frequent occlusions, similarity among objects, scale changes, and illumination changes. Nevertheless, there are still some unavoidable tracking failures, as illustrated in Figure 15. For example, the brightness of the environment causes object detection to fail in frame 285 of the PETS09-S2L2 dataset, and there are some missed detections in frame 255 of the AVG-Town Center dataset.

To illustrate the contribution of each component, the detection result and the tracking result of the single-object trackers are shown in Figure 16. Due to space limitations, we only list the results on ADL-Rundle-3.

From the results, we can see that the object is missed by the detector, while it is still tracked by the single-object tracker based on DRL.

4.2. Quantitative Evaluation

The CLEAR MOT performance metrics are used in this section for quantitative evaluation: the multiple-object tracking accuracy (MOTA), the multiple-object tracking precision (MOTP), false negatives (FN), false positives (FP), and identity switches (IDSW). MOTA evaluates the accuracy as a combination of false negatives, false positives, and identity switches:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},$$

where $\mathrm{FN}_t$, $\mathrm{FP}_t$, $\mathrm{IDSW}_t$, and $\mathrm{GT}_t$ are the numbers of false negatives, false positives, identity switches, and ground truth objects at frame $t$.

MOTP is the average dissimilarity between all true positives and their corresponding ground truth objects, calculated as the intersection area over the union area of the bounding boxes. It is computed as

$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},$$

where $d_{t,i}$ denotes the bounding box overlap of object $i$ with its assigned ground truth object and $c_t$ is the number of matches in frame $t$.
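A direct sketch of computing both metrics from per-frame counts (the input format is an assumption):

```python
def clear_mot(frames):
    """Compute MOTA and MOTP from per-frame statistics.

    Each frame dict carries: fn, fp, idsw, gt (counts) plus 'overlaps',
    the list of IoU values d_{t,i} of matched object/ground-truth pairs.
    """
    errors = sum(f['fn'] + f['fp'] + f['idsw'] for f in frames)
    gt = sum(f['gt'] for f in frames)
    mota = 1.0 - errors / gt

    total_overlap = sum(sum(f['overlaps']) for f in frames)
    matches = sum(len(f['overlaps']) for f in frames)
    motp = total_overlap / matches
    return mota, motp

# Example with two toy frames:
frames = [dict(fn=1, fp=0, idsw=0, gt=5, overlaps=[0.8, 0.9, 0.7, 0.85]),
          dict(fn=0, fp=1, idsw=1, gt=5, overlaps=[0.75, 0.9, 0.8, 0.7, 0.95])]
mota, motp = clear_mot(frames)
```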

Table 1 reports the quantitative comparison results of our tracker (LSTM_DRL) with other state-of-the-art trackers on the 11 sequences of the MOT Challenge dataset.

From the results in Table 1, we can see that our proposed method achieves the highest MOTP value and the lowest FN value on the PETS09-S2L2 dataset; the highest MOTA value and the lowest FN and IDSW values on the ADL-Rundle-3 dataset; the highest MOTA value and the lowest FP and FN values on the TUD-Crossing dataset; and the highest MOTP value and the lowest IDSW value on the AVG-Town Center dataset. The better performance of the proposed method can mainly be attributed to the three parts of the tracker: YOLO V2 is a state-of-the-art object detector, the LSTM-based data association strategy can find a globally optimal assignment, and the single-object trackers are able to locate the objects via deep reinforcement learning.

We implemented the experiments of our proposed multiobject tracking algorithm on the Windows 10 operating system, using MATLAB R2016b as the software platform. The computer is configured with an Intel® Core™ i7-4712MQ CPU and a GeForce GTX TITAN X GPU with 12.00 GB of VRAM.

The running times on the MOT Challenge test dataset are shown in Table 2 and compared with some state-of-the-art trackers. Our method is a real-time tracking system; although it is slower than RNN-LSTM, which does not incorporate appearance information, it outperforms RNN-LSTM on the other metrics.

5. Conclusion

This paper proposes a visual multiobject tracking algorithm based on LSTM and deep reinforcement learning to overcome the problems of existing algorithms, such as the limits of handcrafted features, which cannot capture the more complex characteristics of objects, and tracking failures when the number of objects varies. We adopt the object detector YOLO V2 to detect the multiple objects. Each single-object tracker is composed of a network that includes a CNN followed by an LSTM unit; each tracker, regarded as an agent, is trained using deep reinforcement learning. For each frame, we perform LSTM-based data association between a pretrained object detector and a number of single-object trackers. The experimental results show that the proposed multiobject tracking method improves robustness and accuracy.

Conflicts of Interest

The authors declare no conflict of interest.

Authors’ Contributions

Ming-xin Jiang, Chao Deng, and Zhi-geng Pan conceived and designed the experiments; Lan-fang Wang and Xing Sun performed the experiments; and Ming-xin Jiang wrote the paper.

Acknowledgments

This work was supported by the National Key R&D Project under Grant no. 2017YFB1002803, the National Natural Science Foundation of China under Grant no. 61332017, the Six Talent Peaks Project in Jiangsu Province under Grant no. 2016XYDXXJS-012, the Natural Science Foundation of Jiangsu Province under Grant no. BK20171267, the 533 Talents Engineering Project in Huaian under Grant no. HAA201738, and a project funded by the Jiangsu Overseas Visiting Scholar Program for University Prominent Young & Middle-Aged Teachers and Presidents. This work also received support from the Major Program of the Natural Science Research of Jiangsu Higher Education Institutions of China (18KJA520002), a project funded by the Jiangsu Laboratory of Lake Environment Remote Sensing Technologies (JSLERS-2018-005), and the fifth issue 333 high-level talent training project of the Government of Jiangsu Province (BRA2018333).