Abstract

Speed and accuracy are key requirements of a human tracking system. This paper proposes a system that tracks a target human in real time and is light enough to run on mobile devices. First, real-time human detection is performed by combining MobileNet-v2 with the single-shot multibox detector (SSD). Subsequently, the particle filter algorithm is applied to track the target human. The proposed system is evaluated with shirts of different colors and under complex background conditions. In addition, the system is evaluated with a Kinect-v2 depth camera. The experimental results indicate that the proposed system remains effective under changes in color, background, and lighting. Moreover, the system continues to track the target when the human temporarily disappears or the size of the target changes significantly, and frame rates of 12 FPS (Kinect-v2 camera) and 22 FPS (conventional camera) allow the system to work in real time.

1. Introduction

In recent years, human tracking has become an important part of the computer vision field. Human tracking algorithms estimate the position of target humans in a video sequence [1]. Human tracking has a wide variety of applications in areas such as self-driving cars, the military, and robotics. The development of deep learning has brought more effective object tracking algorithms. Human tracking algorithms are required to cope with total occlusion, complex environments, and so on [2, 3]. They also have to run on mobile devices without loss of performance, which is particularly demanding when deep learning algorithms must work in real time. Thus, human tracking remains an open challenge.

Over the decades, many algorithms have been proposed to address tracking problems such as real-time operation, target disappearance, size change, and accuracy. For example, the CAMshift [4] algorithm tracks objects based on their original color. This algorithm has a low computational cost and is easy to implement. However, it is strongly affected by similar background colors and by texture changes [5]. Kernelized correlation filtering (KCF) [6] is a good tracking algorithm that is robust, accurate, and fast. However, it becomes unstable when the illumination, the background color, or the shape of the target human changes [6]. The Kalman filter algorithm [7] predicts the position of the target from previous movement information, but it is usually applied to linear systems. The particle swarm method with the histogram of oriented gradients (HOG) [8] is often used to solve the full occlusion problem. However, this method has difficulty tracking the target when the target size changes too much. The particle filter [9] is a simple and flexible method. However, when the target is partially occluded or its local appearance changes, the particle filter cannot track the target accurately.

With the rise of deep learning in recent years, deep learning architectures have been used to detect and classify objects. Methods such as Fast-RCNN [10], Faster-RCNN [11], YOLO [12], SSD [13], and RetinaNet [14] are object detection algorithms with high accuracy and high speed. These algorithms share the use of a convolutional neural network to extract features. They offer effective processing models, extract features automatically, and improve accuracy dramatically. However, these deep learning methods require a high-performance computer because the amount of data to be processed is very large. For instance, training YOLO can involve millions of images. On a low-performance computer, YOLO, the RCNN family, and other deep learning algorithms are expensive and slow to train, and real-time operation is hard to achieve.

For human tracking, some methods combine deep learning with particle filters. For example, the YOLO-particle filter [15] is an efficient method for detecting and tracking humans, but it is difficult to run on low-performance devices. The SSD-particle filter [9] uses a VGG-16 backbone. This is also a good model, but the SSD model has difficulty predicting small objects and needs a lot of training data. In our recent paper [16], the combination of SSD-MobileNet-v2 [17, 18] and the particle filter was proposed. That algorithm was applied to target humans wearing different colors and to full occlusion problems. In this paper, we extend the study to evaluate the efficiency of the proposed algorithm against a complex background, in low light, and when combined with the Kinect-v2 depth camera.

In this paper, the particle filter is combined with SSD-MobileNet-v2 to track target humans. SSD-MobileNet-v2 automatically determines the presence of humans in a video. Then, the particle filter is adopted to track the target human. The contributions of this paper compared to existing works include the following:
(1) A tracker based only on the particle filter [19] can track human targets only through its particles. It requires manual intervention to determine the initial position and initialize the particles, which is time-consuming and labor-intensive. To determine the initial position of the human target more accurately and adjust the target position over time, the proposed model updates the size of the human target and acquires the target automatically when it appears in the camera area.
(2) The proposed algorithm has a small memory footprint and operates at high speed without a GPU when compared with other algorithms such as VGG-16 [20], ResNet-50 [21], and YOLO [12]. The experiments show that the proposed algorithm is suitable for tracking humans at real-time speed (22 FPS).
(3) Unlike other tracking algorithms that face difficulties in tracking human targets in complex conditions (occlusion, scale change, color change, scene change), such as YOLOv3-Camshift [22, 23], YOLO-CSRT [24], and VGG16-KCF [25], the proposed system handles full occlusion even if the size of the target changes greatly or the target is obscured by a similar human. Our system also runs well in complex backgrounds and different light conditions.
(4) The proposed model is tested with a Kinect-v2 camera, and the results are compared with those obtained with a conventional camera. The experimental results show that the proposed model with the Kinect-v2 camera is robust in low-light conditions and reaches higher accuracy.

The remainder of this paper is organized as follows: (1) the pros and cons of traditional algorithms, deep learning algorithms, and the combination of the two to achieve a more efficient system are reviewed; (2) SSD-MobileNet-v2 and the particle filter are described in detail, together with their application to human tracking; (3) the results achieved by the algorithm are discussed; and (4) conclusions are drawn and tasks for future improvement are outlined.

2. Materials and Methods

2.1. System Architecture

In the literature, to improve the performance of the particle filter for human tracking in complex conditions, the particle filter has been combined with other algorithms such as Mean-Shift [26], the Kalman filter [7], histograms [27], SURF [28], and so on. For example, as illustrated in the studies by Iswanto and Li [29] and Lin et al. [30], these algorithms share common problems such as the lack of automatic feature extraction, real-time operation, scale variation, scene change, similar appearance, cluttered background, and so on. In this paper, to overcome these drawbacks and increase the accuracy of human tracking, SSD-MobileNet-v2 combined with the particle filter is proposed. SSD-MobileNet-v2 is used to detect target humans automatically, and the particle filter is used to track the target human.

The flowchart of the proposed algorithm is shown in Figure 1. First, the SSD-MobileNet-v2 model is used to detect humans. If a human target is present, a bounding box is created for the target, and a region of interest (ROI) is created at the same time. In the next stage, the HSV histogram of the ROI is computed. The particle filter algorithm randomly initializes 500 particles within the bounding box of the human target. In the following frames, SSD-MobileNet-v2 detects all humans, and the state of each particle is updated. After that, the distance of each particle is calculated from the HSV histogram, and its weight is derived from that distance. Next, the algorithm estimates the new state from the particles and their weights and uses it as the centroid of the predicted bounding box. Intersection over Union (IOU) values between the predicted bounding box and the detected bounding boxes are computed, and the bounding box with the highest IOU value is kept. Then, to obtain an accurate trajectory of the target, the Kalman filter is applied to the centroid of the human target bounding box. Based on the predicted value and the newly measured location of the center, the Kalman filter calculates the center’s location. If the distances of all particles are higher than a certain threshold, all particles are reinitialized; otherwise, the algorithm resamples the particles, replacing those with low weight. Finally, the system stops if the current frame is the last one; otherwise, it repeats the process from human detection.

2.2. SSD-MobileNet-v2 Algorithm

To detect the human target, a deep learning model with six layers is used with MobileNet-v2 as the base network [31]. MobileNet-v2 is selected because it can be easily deployed on mobile devices. The architecture of this model is inspired by the single-shot multibox detector (SSD) model, with some modifications to make it compatible with MobileNet-v2. Instead of normal convolutions, depthwise separable convolutions are used to reduce the number of parameters and the computation time [16, 32].
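To make the parameter saving concrete, the following minimal sketch counts the weights of a standard convolution and of a depthwise separable convolution (a depthwise k × k convolution followed by a 1 × 1 pointwise convolution). The layer sizes are hypothetical and chosen only for illustration; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 256 input channels, 512 output channels.
std = standard_conv_params(3, 256, 512)
sep = depthwise_separable_params(3, 256, 512)
print(std, sep, round(std / sep, 1))
```

For this hypothetical layer, the separable form needs roughly one ninth of the weights, which is the kind of reduction that makes the detector practical on CPU-only devices.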

The proposed method consists of the base network MobileNet-v2 and the SSD extra feature layers. The system takes input images of size 300 × 300 × 3. The base network MobileNet-v2 extracts high-level features from the input images. The extra layers progressively reduce the size of the feature map, which allows the algorithm to detect objects at various scales [16, 33]. Figure 2 shows the architecture of the SSD-MobileNet-v2 algorithm.

The processing steps, as shown in Figure 2, are as follows:
(1) Step 1: Read the input image of size 300 × 300 × 3.
(2) Step 2: The base network is MobileNet-v2 without fully connected layers. This network extracts image features with an output size of 38 × 38 × 512.
(3) Step 3: Apply a convolutional layer of size 3 × 3 × 1,024 to the previous feature map to obtain an output feature map of size 19 × 19 × 1,024. At the same time, a classifier with a 3 × 3 convolutional filter is applied to detect objects on the feature map.
(4) Step 4: Apply the same process as in step 3 to the remaining feature maps, whose sizes are 19 × 19 × 1,024, 10 × 10 × 512, 5 × 5 × 256, 3 × 3 × 256, and 1 × 1 × 256, respectively. The shape of each feature map is determined by the convolution in the previous layer. A classifier is also used on the feature maps of sizes 19 × 19 × 1,024 and 3 × 3 × 256.
(5) Step 5: The nonmax suppression algorithm is used to eliminate duplicate detections and select the best bounding box out of a set of overlapping boxes.

In MobileNet-v2, the inverted residual and linear bottleneck blocks [34] are the new layers that enable the model to work well on mobile devices. The inverted residual block is designed to reduce the number of parameters compared with the original residual block [35]. Its structure is the reverse of the original one. First, the input feature map is widened using a 1 × 1 convolution. This convolution layer expands the input feature maps to make them suitable for nonlinear activations. Next, a 3 × 3 depthwise convolution is performed to reduce the number of parameters. Finally, a 1 × 1 convolution is used again to squeeze the network so that the output matches the initial number of channels [36]. Batch normalization [37] and ReLU6 [37, 38] are added after each convolution block to speed up the training of deep neural networks and reduce generalization error [39]. ReLU6 is linear for values between 0 and 6 and clips values outside this range. The structure of the bottleneck block is shown in Figure 3.

The system uses the feature maps of the last few layers to predict the location and class of objects, and the output is then passed to various convolution layers. The size of these layers gradually decreases through the network. Finally, the predictions from each of these convolutional layers are combined. The system outputs the locations and confidences of objects. Each location is evaluated by a loss value. Localization loss is defined as follows:

$$L_{loc}(x, p, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{x, y, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\left(p_i^{m} - \hat{g}_j^{m}\right)$$

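Smooth L1 loss can be interpreted as a combination of L1 loss and L2 loss, as follows:

$$\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5\,z^{2}, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases}$$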

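The localization loss is only computed for positive matches ($i \in Pos$) between the predicted bounding box p and the ground truth bounding box g. (x, y) is the center of a bounding box, (w, h) are its width and height, and d is the default bounding box. $L_{loc}$ is the total distance between the predicted box p and the ground truth box g, and $x_{ij}^{k}$ is the indicator for matching the default bounding box i to the ground truth box j for label k. The regression targets are

$$\hat{g}_j^{x} = \frac{g_j^{x} - d_i^{x}}{d_i^{w}}, \quad \hat{g}_j^{y} = \frac{g_j^{y} - d_i^{y}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}},$$

where $\hat{g}_j^{x}$ and $\hat{g}_j^{y}$ are the center offsets of the ground truth box relative to the center of the default box $d_i$, and $\hat{g}_j^{w}$ and $\hat{g}_j^{h}$ are the scale values of its width and height relative to the default box $d_i$.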
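As a minimal sketch of the localization term for a single matched pair, the code below implements the smooth L1 function and the standard SSD box encoding relative to a default box. Boxes are assumed to be given as (center x, center y, width, height); the function names are illustrative and are not taken from the authors' implementation.

```python
import math


def smooth_l1(z):
    """Quadratic near zero (L2-like), linear elsewhere (L1-like)."""
    az = abs(z)
    return 0.5 * z * z if az < 1.0 else az - 0.5


def encode_box(box, default):
    """Encode a (cx, cy, w, h) box relative to a default box, as in SSD."""
    bx, by, bw, bh = box
    dx, dy, dw, dh = default
    return ((bx - dx) / dw, (by - dy) / dh, math.log(bw / dw), math.log(bh / dh))


def localization_loss(pred_offsets, gt_box, default):
    """Smooth L1 distance between predicted offsets and the encoded ground truth."""
    target = encode_box(gt_box, default)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target))


# Hypothetical boxes, only for illustration.
default_box = (150.0, 150.0, 60.0, 120.0)
gt_box = (156.0, 144.0, 66.0, 132.0)
perfect = encode_box(gt_box, default_box)
print(localization_loss(perfect, gt_box, default_box))          # 0.0 for a perfect prediction
print(localization_loss((0.0, 0.0, 0.0, 0.0), gt_box, default_box))
```

In a full detector this quantity is summed over all positive matches; the sketch only shows one pair.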

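Confidence loss is defined as follows:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{k} \log\left(\hat{c}_i^{k}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{k} = \frac{\exp\left(c_i^{k}\right)}{\sum_{q} \exp\left(c_i^{q}\right)},$$

where c is the set of predictions for the probabilities of belonging to different object classes, $x_{ij}^{k}$ is the indicator for matching the default bounding box i to the ground truth box j for label k, $\hat{c}_i^{k}$ is the predicted confidence score of class k for the ith positive match, and $\hat{c}_i^{0}$ is the predicted confidence score of class 0 (background).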

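The final loss function is computed as follows:

$$L(x, c, p, g) = \frac{1}{N}\left(L_{conf}(x, c) + L_{loc}(x, p, g)\right),$$

where N is the number of default boxes that match the ground truth, $L_{conf}$ is the confidence loss, and $L_{loc}$ is the localization loss.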

After prediction, many of the predicted bounding boxes overlap. To discard superfluous bounding boxes, the nonmax suppression algorithm is applied [31]. All boxes whose confidence and IOU values fall below a given bound are then discarded.
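The suppression step can be sketched as the usual greedy procedure: drop low-confidence boxes, then repeatedly keep the highest-scoring box and reject any box that overlaps a kept box too strongly. Boxes are assumed to be (x1, y1, x2, y2) corner tuples, and both thresholds below are hypothetical values, not the settings used in the experiments.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Candidates above the confidence threshold, best score first.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))
```

In this toy input, the second box is a near-duplicate of the first and the fourth has low confidence, so only the first and third survive.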

2.3. Particle Filter Tracking Algorithm

A particle filter is used for tracking the human target [40]. This method finds the position of the target human by using random particles. A weight is then computed for each particle based on its accuracy. Together, the particles and their weights describe the probability distribution of the actual target position over a region of the state space [40]. Each particle is defined by:

$$s = [x, y, v_x, v_y]^{T},$$

where (x, y) are the coordinates of the center of the rectangle box and $(v_x, v_y)$ are the velocities.

The initial step defines N as the number of particles. All particle coordinates are chosen at random within the bounding box of the human target, with the particles indexed by $i = 1, 2, \ldots, N$.
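A minimal sketch of this initialization is given below. It assumes that the bounding box is given as (x, y, width, height) with (x, y) the top-left corner, that positions are drawn uniformly, and that initial velocities are zero; none of these details is specified in the text, and `init_particles` is an illustrative name.

```python
import random


def init_particles(box, n=500, seed=None):
    """Draw n particles [x, y, vx, vy] uniformly inside an (x, y, w, h) box."""
    rng = random.Random(seed)
    x, y, w, h = box
    # Assumption: velocities start at zero until motion has been observed.
    return [[rng.uniform(x, x + w), rng.uniform(y, y + h), 0.0, 0.0] for _ in range(n)]


particles = init_particles((100, 50, 40, 80), n=500, seed=0)
print(len(particles), particles[0])
```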

A linear stochastic difference equation is used to update all particles in each frame as follows:

$$s_t = A s_{t-1} + w_{t-1},$$

where $s_{t-1}$ is the previous state of each particle, $w_{t-1}$ is an array of Gaussian random variables, and A is the transition matrix.
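The propagation step can be sketched as follows. The transition matrix used here is an assumption: a constant-velocity model with a time step of one frame, which is a common choice for a state of the form [x, y, vx, vy]. The noise standard deviations are hypothetical tuning values.

```python
import random

# Assumed constant-velocity transition matrix (time step of one frame):
# x' = x + vx, y' = y + vy, velocities unchanged.
A = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def propagate(particle, noise_std=(5.0, 5.0, 1.0, 1.0), rng=random):
    """Apply s_t = A s_{t-1} + w, with w drawn from zero-mean Gaussians."""
    moved = [sum(A[r][c] * particle[c] for c in range(4)) for r in range(4)]
    return [m + rng.gauss(0.0, s) for m, s in zip(moved, noise_std)]


# With the noise switched off, the particle simply moves by its velocity.
print(propagate([10.0, 20.0, 2.0, -3.0], noise_std=(0.0, 0.0, 0.0, 0.0)))
```

Larger position noise spreads the particles more widely, which helps recover after occlusion at the cost of a less precise estimate.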

The weight of each particle is used to evaluate its effectiveness. The weight is obtained by comparing the HSV histogram of the rectangle box centered on the particle with that of the template bounding box of the human target. This similarity is calculated by using the Hellinger distance as follows [41]:

$$d(H_1, H_2) = \sqrt{1 - \frac{1}{\sqrt{\bar{H}_1 \bar{H}_2 M^{2}}} \sum_{I=1}^{M} \sqrt{H_1(I)\, H_2(I)}},$$

where $H_1$ and $H_2$ are the two HSV histograms being compared, $\bar{H}_k = \frac{1}{M} \sum_{I} H_k(I)$ is the mean bin value of histogram k, and M is the total number of bins in a histogram.
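The distance can be sketched in plain Python on flattened histograms, following the form that OpenCV uses for its Hellinger (Bhattacharyya) histogram comparison. The guard for empty histograms is an added assumption, and the toy histograms are for illustration only.

```python
import math


def hellinger_distance(h1, h2):
    """Hellinger distance between two flattened histograms with equal bin counts."""
    m = len(h1)
    mean1, mean2 = sum(h1) / m, sum(h2) / m
    if mean1 == 0 or mean2 == 0:
        return 1.0  # assumption: an empty histogram is treated as maximally distant
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    value = 1.0 - overlap / math.sqrt(mean1 * mean2 * m * m)
    return math.sqrt(max(0.0, value))  # clamp tiny negative rounding errors


print(hellinger_distance([1, 2, 3, 4], [1, 2, 3, 4]))  # identical histograms
print(hellinger_distance([1, 0], [0, 1]))              # no overlap at all
```

The distance is 0 for identical histograms and 1 for histograms with no common bins, so a small value means the particle's patch looks like the template.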

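In this paper, the HSV histogram uses 8 × 8 × 4 bins, which gave the best results. The weight of each particle is calculated from its distance value as follows:

$$w_t^{i} = w_{t-1}^{i} \, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(d^{i}\right)^{2}}{2\sigma^{2}}\right),$$

where $d^{i}$ is the Hellinger distance of the ith particle, $\sigma$ is the standard deviation of the distance, and $w_{t-1}^{i}$ is the particle’s previous weight.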

The weights are then normalized so that they sum to one, which indicates which particles are more likely to correspond to the target. The final estimated state is given as follows:

$$\hat{s}_t = \sum_{i=1}^{N} w_t^{i} s_t^{i},$$

where $s_t^{i}$ is the state of the ith particle at t and $w_t^{i}$ is the normalized weight of the ith particle at t.
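The weighting and estimation steps can be sketched together. The code assumes a Gaussian likelihood of the histogram distance multiplied by the previous weight; the constant Gaussian prefactor is omitted because it cancels during normalization. The value of `sigma` is a hypothetical tuning parameter.

```python
import math


def update_weights(prev_weights, distances, sigma=0.1):
    """Scale previous weights by a Gaussian likelihood of the distance, then normalize."""
    raw = [w * math.exp(-(d * d) / (2.0 * sigma * sigma))
           for w, d in zip(prev_weights, distances)]
    total = sum(raw)
    if total == 0:
        # Assumption: fall back to uniform weights if every likelihood underflows.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def estimate_state(particles, weights):
    """Weighted mean of the particle states (weights are assumed normalized)."""
    dim = len(particles[0])
    return [sum(w * p[k] for p, w in zip(particles, weights)) for k in range(dim)]


weights = update_weights([0.5, 0.5], [0.1, 0.4])
print(weights)  # the particle with the smaller distance gets the larger weight
print(estimate_state([[0.0, 0.0, 0.0, 0.0], [10.0, 20.0, 0.0, 0.0]], [0.5, 0.5]))
```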

Particles are resampled in every new frame [42]. Only particles with low distances are kept, and all others are discarded. The weight value is assessed against a threshold; in this paper, the weight threshold is 0.3. New particles are then generated around some of the particles with the highest weights. All particles are randomly reinitialized if all distances exceed the threshold or the entire human target is obscured.
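One possible reading of this resampling rule is sketched below. The sketch assumes that the threshold is applied to the histogram distance, that replacement particles are scattered with Gaussian noise around the best surviving particles, and that the caller reinitializes all particles when nothing survives. The parameters `top_k` and `spread` are hypothetical.

```python
import random


def resample(particles, weights, distances, dist_thresh=0.3, top_k=10, spread=5.0, rng=random):
    """Keep low-distance particles and refill around the best ones.

    Returns None when every particle exceeds the threshold, which signals
    that the caller should reinitialize all particles.
    """
    kept = [(w, p) for p, w, d in zip(particles, weights, distances) if d <= dist_thresh]
    if not kept:
        return None
    kept.sort(key=lambda wp: wp[0], reverse=True)   # highest weight first
    best = [p for _, p in kept[:top_k]]
    new_particles = [list(p) for _, p in kept]
    while len(new_particles) < len(particles):
        x, y, vx, vy = rng.choice(best)
        # New particle placed near a high-weight survivor.
        new_particles.append([x + rng.gauss(0.0, spread), y + rng.gauss(0.0, spread), vx, vy])
    return new_particles


demo_particles = [[float(i), float(i), 0.0, 0.0] for i in range(10)]
demo_weights = [0.1] * 10
demo_distances = [0.1, 0.2] + [0.9] * 8
resampled = resample(demo_particles, demo_weights, demo_distances, rng=random.Random(1))
print(len(resampled), resampled[:2])
```

The particle count stays constant, so the per-frame cost of the tracker does not change after resampling.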

2.4. Update Size of Target

The IOU is computed between each bounding box produced by the detection model and the bounding box created by the particle filter. These overlap rates are compared with each other, and the detected bounding box with the highest rate is used to update the true size of the target.

Figure 4 shows an example of using IOU to update the size of the target. The green bounding boxes are obtained from the proposed method. The yellow box is the bounding box created by the particle filter with the prior size of the target. The red box is the bounding box of the target after the update.
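The size update can be sketched as a search for the detection that best overlaps the box predicted by the particle filter. Boxes are assumed to be (x1, y1, x2, y2) corner tuples, and keeping the predicted box when no detection overlaps it is an added assumption rather than a documented behavior.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def update_target_box(predicted, detections):
    """Return the detection with the highest IOU against the predicted box."""
    best = max(detections, key=lambda det: iou(predicted, det), default=None)
    if best is None or iou(predicted, best) == 0.0:
        return predicted  # assumption: keep the prior box when nothing overlaps
    return best


predicted_box = (10, 10, 50, 90)                       # from the particle filter
detected = [(12, 8, 60, 110), (200, 10, 240, 90)]      # from the detector
print(update_target_box(predicted_box, detected))
```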

2.5. Kalman Filter for Accurate Trajectory

To obtain an accurate trajectory of the target, the Kalman filter is applied to the centroid of the human target bounding box. Based on the predicted value and the newly measured location of the center, the Kalman filter calculates the center’s location [43]. The filter operates in two main stages: prediction and correction.

The state variable is denoted as $x_t$, and the measurement variable is denoted as $z_t$. The state variable is updated in each frame based on the equation of motion without acceleration as follows:

$$\hat{x}_t^{-} = A \hat{x}_{t-1},$$

where A is the transition matrix and $\hat{x}_t^{-}$ is the predicted value of $x_t$.

The predicted covariance is given as follows:

$$P_t^{-} = A P_{t-1} A^{T} + Q,$$

where $P_t^{-}$ is the predicted value of the covariance and Q is the interference (process noise) factor.

The Kalman filter uses three equations in the correction stage: the Kalman gain equation, the state update equation, and the covariance update equation. The Kalman gain is used to correct the state estimate and the covariance estimate. It is computed as follows:

$$K_t = P_t^{-} H^{T} \left(H P_t^{-} H^{T} + R\right)^{-1},$$

where H is the measurement matrix and R is the measurement noise covariance matrix.

Based on the Kalman gain and the predicted state, the estimated state of the center point can be calculated as follows:

$$\hat{x}_t = \hat{x}_t^{-} + K_t \left(z_t - H \hat{x}_t^{-}\right)$$

The covariance is also updated by the following equation:

$$P_t = \left(I - K_t H\right) P_t^{-}$$
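The prediction and correction stages can be sketched for one coordinate of the centroid. The sketch assumes a constant-velocity state [position, velocity] in which only the position is measured, so the matrix equations reduce to scalar arithmetic; one such filter would be run for x and another for y. The noise values `q` and `r` are hypothetical.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate, state [position, velocity]."""

    def __init__(self, position, q=1e-2, r=1.0, dt=1.0):
        self.x = [position, 0.0]              # state estimate
        self.P = [[1.0, 0.0], [0.0, 1.0]]     # estimate covariance
        self.q, self.r, self.dt = q, r, dt    # process noise, measurement noise, time step

    def predict(self):
        """Prediction stage: x = A x, P = A P A^T + Q with A = [[1, dt], [0, 1]]."""
        dt = self.dt
        (p00, p01), (p10, p11) = self.P
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        a00 = p00 + dt * p10
        a01 = p01 + dt * p11
        self.P = [[a00 + a01 * dt + self.q, a01],
                  [p10 + p11 * dt, p11 + self.q]]
        return self.x[0]

    def correct(self, z):
        """Correction stage with H = [1, 0]: gain, state update, covariance update."""
        (p00, p01), (p10, p11) = self.P
        s = p00 + self.r                      # innovation covariance
        k0, k1 = p00 / s, p10 / s             # Kalman gain
        y = z - self.x[0]                     # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]


# Hypothetical centroid moving 2 pixels per frame along one axis.
kf = AxisKalman(position=0.0)
for t in range(1, 51):
    kf.predict()
    smoothed = kf.correct(2.0 * t)
print(round(smoothed, 2), round(kf.x[1], 2))
```

Because the assumed motion model matches the toy trajectory, the estimate settles on both the position and the velocity after a short transient.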

2.6. Filter Background with an RGB-D Camera

The background significantly influences the accuracy of the system. Therefore, eliminating the background is an essential step. Background removal is facilitated by the RGB-D camera Kinect-v2, as shown in Figure 5. Let $d_t$ denote the distance from the Kinect-v2 camera to the human target in frame t. This value plays the most important role in determining the depth range kept from the Kinect-v2 camera. However, it is not yet available for the current frame. Instead, the distance from the previous frame, $d_{t-1}$, is used together with the velocity of the target.

$T_{below}$ and $T_{above}$ are the lower and upper thresholds of this range.
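The exact range equation is not reproduced in this text, so the following is only a sketch under stated assumptions: the expected distance is taken as the previous distance plus the distance travelled in one frame, and a pixel is kept when its depth lies between that value minus a lower threshold and that value plus an upper threshold. Depths are assumed to be in millimetres, and all numeric values are hypothetical.

```python
def depth_mask(depth_rows, prev_distance, velocity=0.0, t_below=300.0, t_above=300.0):
    """Binary mask keeping pixels whose depth is near the expected target distance."""
    expected = prev_distance + velocity       # assumed one-frame motion model
    low, high = expected - t_below, expected + t_above
    return [[1 if low <= d <= high else 0 for d in row] for row in depth_rows]


# Tiny hypothetical depth image (mm); 0 marks pixels with no depth reading.
depth = [
    [500, 2000, 2100],
    [0, 1900, 4000],
]
print(depth_mask(depth, prev_distance=2000.0))
```

Multiplying the RGB image by such a mask removes pixels that are clearly in front of or behind the target before the color histogram is computed.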

The remaining sections are organized as follows: Section 3 presents the results of the proposed algorithm and compares its performance with other algorithms, and Section 4 concludes the paper and outlines tasks for future improvement.

3. Results and Discussion

3.1. Training Deep Learning Model

To train the deep learning model for human detection, 2,500 images were downloaded from Google and 3,500 images were captured with our camera. To reduce training time, transfer learning is applied: a model pretrained on the COCO dataset is trained further on our dataset. The training parameter values of SSD-MobileNet-v2 and MobileNet-v2 are shown in Table 1.

3.2. Running System and Evaluating Results

The proposed system is run on a computer with 16 GB RAM, an Intel Core i7-4800MQ CPU 2.7 GHz x8, and a camera (2 Mpx). To improve processing speed, all images are resized to 300 × 300. The proposed algorithm is implemented in Python 3.7 with the OpenCV 4.4 and NumPy libraries. In addition, the proposed system is evaluated with the support of a 2 Mpx Kinect-v2 camera (max 30 FPS). When the system uses the Kinect-v2 camera, the RGB image is calibrated and resized to 270 × 520 so that it can be combined with the depth image.

The SSD-MobileNet-v2 model is used to detect target humans, and its output is then used by the particle filter tracking algorithm. The accuracy of the proposed system depends on the number of particles. The more particles are used, the higher the accuracy and the fewer frames are missed. However, the result no longer improves once the number of particles reaches a certain threshold. As shown in Table 2, the accuracy remains 97.4% even with 1,000 particles.

Compared with the particle filter method [44], the proposed system still tracks the target human even when the size of the target changes greatly. Furthermore, by updating the size of the target, the bounding box of the target is extracted accurately. The experiments are shown in Figure 6, where the light blue bounding box is created by SSD-MobileNet-v2, the red one by the particle filter, and the purple one by MobileNet-v2.

The algorithm is also tested with shirts of different colors and with different backgrounds to verify its performance. Figure 7 shows the experimental results when the target human wears a black shirt and a blue shirt. Yellow bounding boxes are drawn for each human in the frames, and blue boxes are drawn for the human target. The yellow points are the particles used for tracking the human. The results indicate that the algorithm remains efficient even with a background of the same color.

Additionally, to evaluate the effectiveness of the proposed system with the Kinect-v2 camera, shirts of various colors and low-light conditions are tested. Figure 8 shows the experiments with blue shirts and yellow shirts. As shown in Figure 8(a), the proposed system is evaluated in low-light conditions in which the colors of the background and the shirts are similar. The yellow bounding boxes mark humans detected by SSD-MobileNet-v2, and the blue bounding box marks the target human selected for tracking. The results show that the algorithm still tracks the human target efficiently in low-light conditions. Figures 8(b) and 8(d) show RGB images, and Figures 8(c) and 8(e) show the color masks created from the depth and RGB images of the Kinect-v2 camera. The yellow points, as shown in Figures 8(c) and 8(e), are the particles used to find the position of the human target.

Three algorithms are compared in terms of accuracy and speed: the particle filter [44], the particle filter with MobileNet-v2 [45], and the proposed algorithm. The evaluation involved 458 frames with 500 particles. The particle filter algorithm runs at 38 FPS. This is the highest FPS of the three algorithms, but its accuracy is only 78.60%. The other algorithms run at 21–22 FPS, except for the proposed system using a Kinect-v2 camera (12 FPS). However, the proposed system using the Kinect-v2 camera has the highest accuracy (96.94%), followed by the proposed system using a conventional camera (96.06%) and then the particle filter-MobileNet-v2 [45] with an accuracy of 94.54%. The proposed system runs at 22 FPS (conventional camera) and 12 FPS (Kinect-v2 camera), so it still works well in real time. The comparison result is shown in Figure 9.

4. Conclusions

This paper presents a method for human tracking that combines a particle filter with the SSD-MobileNet-v2 model. The experimental results show that the proposed system tracks the human when colors are similar, when the target temporarily disappears, and when the size of the target changes greatly. In addition, with the Kinect-v2 depth camera, the system works better in low-light conditions. In tests against the particle filter and the particle filter-MobileNet-v2 algorithm, our method performs better than both. The accuracy of the proposed algorithm is greatly improved compared with using only the traditional particle filter. The tracking speed is 12 FPS (Kinect-v2 camera) and 22 FPS (conventional camera), which is sufficient for real-time operation. Increasing the tracking speed and tracking multiple targets at the same time are left for future work.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.