Abstract

You only look once (YOLO) is one of the most efficient target detection networks. However, the performance of the YOLO network decreases significantly when the variation between the training data and the real data is large. To automatically customize the YOLO network, we suggest a novel transfer learning algorithm with the sequential Monte Carlo probability hypothesis density (SMC-PHD) filter and Gaussian mixture probability hypothesis density (GM-PHD) filter. The proposed framework can automatically customize the YOLO framework with unlabelled target sequences. The frames of the unlabelled target sequences are automatically labelled. The detection probability and clutter density of the SMC-PHD filter and GM-PHD are applied to retrain the YOLO network for occluded targets and clutter. A novel likelihood density with the confidence probability of the YOLO detector and visual context indications is implemented to choose target samples. A simple resampling strategy is proposed for SMC-PHD YOLO to address the weight degeneracy problem. Experiments with different datasets indicate that the proposed framework achieves positive outcomes relative to state-of-the-art frameworks.

1. Introduction

Learning-based detection algorithms have proven important in several subject areas, including smart surveillance systems [1], wireless sensors [2, 3], and secure transportation systems [4]. Over the past several years, convolutional neural networks (CNNs) have achieved excellent results in multiple computer vision tasks. You only look once (YOLO) is an effective visual detection method [5]. Compared with other detection networks, the YOLO network predicts class probabilities and bounding boxes directly from the input frame in a single evaluation. YOLO detectors, however, are trained on annotated generic datasets that are chosen to cover as much of the target's variability as possible. When these detectors are applied to a specific scene, such as footage from a closed-circuit television (CCTV) camera, the distribution of the targets captured by the camera may not be covered by the initial training set. Therefore, the resulting Generic YOLO detector may not function effectively, especially when the amount of training data is limited [6].

To address this problem, transfer learning with cross-domain adaptation is proposed. A specific training dataset is needed to generate a specific detector. Normally, these positive samples of the specific training dataset are manually selected from the target dataset. However, a large amount of labelled data is needed to tune the detector in each frame, and labelling is a labor-intensive task. A typical solution for reducing the collection time is to automatically provide the sample labels with the target frame. Labelled samples are iteratively collected from the unlabelled sequence and added to the training dataset [7].

We propose a novel transfer learning method with a probability hypothesis density (PHD) filter, which can automatically retrain a YOLO network for a special object. The scene-specific detector is generated with a Generic YOLO detector trained by labelled frames and sequences without labelled information. The parameters of the YOLO detector are estimated by an iterative process. After automatic and iterative training, the final specialized YOLO detector is produced and can run without the SMC-PHD filter. Figure 1 illustrates the structure of our method.

Although improving the YOLO with the SMC method has been employed for transfer learning [8], the detection probability and clutter density are not considered in the target sequence. In the update step of our proposed method, the occluded targets are selected and collected as positive samples for training. The primary benefit of our method is that the recognition model can learn the appearance of occluded targets and clutter. As shown in the experimental results in Section 4, our proposed SMC-PHD YOLO can detect some occluded targets with the SMC-PHD filter-based occlusion strategy, while the SMC Faster region-based CNN (R-CNN) [8] cannot detect the occluded targets. In addition, when positive samples are collected, some false samples (clutter) may be added to the positive training dataset. The performance of the SMC Faster R-CNN [8] is affected by this clutter: when there is clutter in the training dataset, the SMC Faster R-CNN produces false detections. In our proposed method, such clutter is assigned a low weight based on the clutter density, so the false samples can be disregarded. Our proposed PHD YOLO network has four main contributions:
(i) To address the bias between the training dataset and target set, we propose a PHD-based transfer learning method for YOLO. For nonlinear tasks, a scene-specialized multitarget detector, SMC-PHD YOLO, is proposed. For linear systems and Gaussian noise tasks, we extend our method to GM-PHD YOLO to eliminate concerns about SMC dependence.
(ii) In SMC-PHD YOLO, we show that the detection probability and clutter density of the SMC-PHD filter improve the performance of the retrained YOLO networks for the occluded targets and multiscale targets. When the image quality of the target scenes is unsatisfactory, even with noise, the specialized YOLO network can still detect the target with the posterior density.
(iii) A novel likelihood is proposed to verify the selected samples in PHD YOLO. To collect positive samples for training, the confidence probability of the YOLO detector and visual context indications are applied.
(iv) For the weight degeneracy problem of SMC YOLO, we also propose a novel and simple resampling strategy that can collect samples from the target sequence based on their weights, and the proposed distribution is assumed to be the target distribution. With the detection distribution, the strategy can function effectively even when a small number of samples is employed.

The remainder of this document is structured as follows: Section 2 introduces the current approaches applied in this field and details the benefits of our proposed method over other specialization methods. Section 3 describes our proposed strategy in detail. Section 4 details the configuration of the simulation and presents experimental outcomes, and concluding comments are provided in Section 5. We adhere to the convention that scalar variables, such as confidence, are presented in lowercase italics. Symbols for vector-formed states and their densities are shown in lowercase bold italics, and multitarget states are represented by uppercase bold italics. Uppercase nonbold letters represent polynomials. Symbols for matrices, such as the transition matrix, are shown in uppercase bold letters.

2. Background

2.1. Specialization Frameworks

If the distribution of the training samples is different from that of target scenes, then a traditional visual detector may not function effectively [9]. To address this problem, specialization frameworks are utilized to automatically create scene-specific detectors for a target scene. Transfer learning algorithms based on state-of-the-art theories use the annotated model and expertise gained through prior assignments. There are three main types of transfer learning methods [10]. First, by changing the parameters of the source learning model, the model is improved in a target domain [11, 12]. Second, the variation between the source and target distributions is decreased, and the source learning model is adapted to the target domain [13, 14]. Third, the training samples are manually or automatically chosen, and the model is retrained with a subset of selected samples [15]. We focus on the third category because it can automatically label the selected samples and the training parameters remain unchanged.

However, the new training dataset may contain some incorrectly labelled samples because the labels of the samples are not manually verified. With this type of dataset, the accuracy of the detection framework may decrease. To address this problem, various contextual indications, such as the visual appearance of objects, pedestrian movement, road model, size, and place, are used to verify favourable samples for retraining the training dataset; however, this method is sensitive to occlusion [16]. Moreover, some techniques may only use samples from the target domain and waste helpful samples [17]. Htike and Hogg employed a background subtraction algorithm to train a particular detector [9] to select the target samples from the source and target datasets. To automatically label target information, tracklet chains are utilized to link the proposed samples to tracklets [15] predicted by an appearance-target detector. However, for each target scene, this framework, which includes many manual parameters and thresholds, may affect the specialization performance. Alternatively, Maâmatou et al. [10] collected fresh samples. To train a fresh dedicated retrained sensor, an SMC transfer learning method was employed to create a new dataset [8].

2.2. YOLO Network

In this work, we used the YOLO (V3) network [5] since it passes the image only once through a fully convolutional neural network (FCNN), which enables it to achieve real-time performance. YOLO (V3) was developed based on YOLO [18] and YOLO (V2) [19]. The YOLO network treats the detection problem as a regression problem. Therefore, the network directly generates a bounding box for each class via regression without any region proposals, which decreases the computational cost compared to Faster R-CNN.

The YOLO detection model is shown in Figure 1, where the network divides each input image of the training set into grids. When the centre of the target ground truth falls in a grid cell, that cell is used to detect the object. For each grid cell, several bounding boxes and their confidence scores are predicted. The confidence is defined as

$$C = \Pr(\text{Object}) \times \mathrm{IoU}_{\text{pred}}^{\text{truth}}. \qquad (1)$$

If the target is in the grid cell, $\Pr(\text{Object}) = 1$; otherwise, $\Pr(\text{Object}) = 0$. $\mathrm{IoU}_{\text{pred}}^{\text{truth}}$ (the intersection over union of the prediction and the ground truth) represents the overlap between the predicted bounding box and the reference bounding box, which indicates whether the grid cell contains targets. If several bounding boxes detect the same target, then nonmaximum suppression (NMS) is applied to select the best bounding box.
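The overlap measure and the suppression step described above can be illustrated with a minimal sketch. It assumes boxes given as corner coordinates and a greedy NMS with a fixed overlap threshold; the names `iou`, `nms`, and `iou_threshold` are illustrative and are not taken from the YOLO implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0, 0, 2, 2), (0, 0, 2, 2.1), (5, 5, 6, 6)]
    print(nms(candidates, [0.9, 0.8, 0.7]))  # indices of the boxes that survive
```

The threshold of 0.5 is only a common default; the value used by a given detector may differ.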

YOLO has a lower computational cost than Faster R-CNN; however, it makes more errors. To address this problem, YOLO (V2) adopts the “anchor” concept of Faster R-CNN and generates suitable prior bounding boxes with k-means clustering. The adoption of the anchor boxes slightly decreases the mean average precision (mAP) but improves recall. In addition, unlike YOLO, YOLO (V3) uses batch normalization, multiscale prediction, a high-resolution classifier, dimension clusters, direct location prediction, fine-grained features, multiscale training, and other methods that greatly improve the detection accuracy.

2.3. Random Finite Set and PHD Filters

In this subsection, we discuss the random finite set and PHD filters for scene-specialized transfer learning. The probability hypothesis density and random finite set were proposed for multitarget tracking [20–22]. The random finite set (RFS) is a flexible framework that can be combined with any object detector to generate positional and dimensional information on objects of interest. Maggio et al. used detectors such as background subtraction, AdaBoost classifiers, and a statistical change detector to track objects with an RFS [23, 24]. For handling occlusion problems during tracking, Kim et al. proposed the labelled RFS [25]. An RFS is a set of random variables (or vectors) with random cardinality [20]. Because the multitarget Bayes filter based on the RFS is computationally expensive, the PHD filter propagates only the PHD, which is the first-order moment of the RFS. An alternative derivation of the PHD filter based on classical point process theory was given in [26]. In multitarget research, the Gaussian mixture PHD (GM-PHD) filter [27] and SMC-PHD filter [28] are widely utilized. The GM-PHD filter is a closed-form solution, as it assumes that the model is linear and Gaussian. By limiting the number of considered partitions and possible alternatives, Granstrom et al. proposed a GM-PHD filter for tracking extended targets [29]. Since different objects have different levels of clutter, an N-type GM-PHD filter was proposed for real video sequences by integrating object detector information into this filter for two scenarios [30]. However, the accuracy of the GM-PHD filter may decrease for nonlinear problems. To address nonlinear problems, the SMC-PHD filter was proposed based on the Monte Carlo method. With the weights of the samples (particles), the SMC-PHD filter can track a varying and unknown number of targets.

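The PHD filter propagates the intensity $v_k(\boldsymbol{x}_k)$, whose integral over the state space estimates the number of targets. The PHD filter involves a prediction step and an update step that recursively propagate the intensity function. The PHD prediction step is defined as

$$v_{k|k-1}(\boldsymbol{x}_k) = \gamma_k(\boldsymbol{x}_k) + \int \phi_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1})\, v_{k-1}(\boldsymbol{x}_{k-1})\, d\boldsymbol{x}_{k-1}, \qquad (2)$$

where $\boldsymbol{x}_k$ is the target bounding box state and $\gamma_k(\boldsymbol{x}_k)$ is the intensity of the birth RFS. $\phi_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1})$ is the analogue of the state transition probability,

$$\phi_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1}) = p_S(\boldsymbol{x}_{k-1})\, f_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1}) + \beta_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1}), \qquad (3)$$

where $p_S(\boldsymbol{x}_{k-1})$ is the survival probability and $f_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1})$ is the transition density. $\beta_{k|k-1}(\boldsymbol{x}_k \mid \boldsymbol{x}_{k-1})$ is the intensity function of the spawn RFS with the previous state $\boldsymbol{x}_{k-1}$. The PHD update equation is given as

$$v_k(\boldsymbol{x}_k) = \left[ 1 - p_D(\boldsymbol{x}_k) + \sum_{\boldsymbol{z} \in \boldsymbol{Z}_k} \frac{p_D(\boldsymbol{x}_k)\, g_k(\boldsymbol{z} \mid \boldsymbol{x}_k)}{\kappa_k(\boldsymbol{z}) + \int p_D(\boldsymbol{\xi})\, g_k(\boldsymbol{z} \mid \boldsymbol{\xi})\, v_{k|k-1}(\boldsymbol{\xi})\, d\boldsymbol{\xi}} \right] v_{k|k-1}(\boldsymbol{x}_k), \qquad (4)$$

where $g_k(\boldsymbol{z} \mid \boldsymbol{x}_k)$ is the likelihood defining the probability of $\boldsymbol{z}$ given $\boldsymbol{x}_k$ and $p_D(\boldsymbol{x}_k)$ is the detection probability. The intensity of the clutter RFS is $\kappa_k(\boldsymbol{z}) = \lambda_k c_k(\boldsymbol{z})$, where $\lambda_k$ is the average number of Poisson clutter points per scan and $c_k(\boldsymbol{z})$ is the probability distribution of each clutter point. The PHD recursion involves multiple integrals in equations (2) and (4), which have no closed-form solution in general. To address this issue, the SMC-PHD filter has been proposed and widely utilized [28]. In the SMC-PHD filter, at time $k$, the target PHD is represented by a set of particles, $\{\boldsymbol{x}_k^{(i)}, w_k^{(i)}\}_{i=1}^{L_k}$, where $L_k$ is the number of particles at $k$. To the best of our knowledge, this article is the first study to use the PHD filter to train a scene-specialized, multitarget detector. As the number of targets is unknown in our unlabelled dataset and the sample collection is nonlinear and non-Gaussian, the SMC-PHD filter is applied to collect the unlabelled training data and customize the YOLO network.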

3. Proposed Framework

This section introduces our proposed framework, which customizes the YOLO model based on the PHD filter. The PHD filter is used to label the target in unlabelled videos based on the YOLO output. The positive samples estimated by the PHD filter are used to build a new custom dataset. The YOLO network is fine-tuned on this custom dataset, which may contain occluded targets and targets of different styles. Since the number of unlabelled videos is large, the bias between the training dataset and the real data decreases. Compared to the state-of-the-art method, our proposed framework is not sensitive to occlusion and target shape. The overall framework of the proposed method is shown in Figure 2.

To be more specific, assume that a Generic YOLO network $\Theta_0$ is trained with generic datasets, such as Common Objects in Context (COCO) [31]. For the target sequence, unlabelled frames are represented as $I_k$, where $k$ is the index of the frame. The detection output of $\Theta_0$ at frame $k$ is $\boldsymbol{Z}_k$. $\boldsymbol{Z}_k = \{\boldsymbol{z}_k^{1}, \ldots, \boldsymbol{z}_k^{M_k}\}$ is a detection set at frame $k$, where $\boldsymbol{z}_k^{j}$ is a bounding box state of the detected target. $j$ is the index of the detected target, and $M_k$ is the number of detected targets. Furthermore, the PHD filter updates $\boldsymbol{Z}_k$ to the estimated target state $\boldsymbol{X}_k$. $\boldsymbol{X}_k = \{\boldsymbol{x}_k^{1}, \ldots, \boldsymbol{x}_k^{N_k}\}$ is an estimated target set, where $N_k$ is the number of estimated targets at $k$ and $i$ is the index of the estimated targets. Note that $N_k$ is not necessarily equal to $M_k$: the PHD filter removes some clutter from $\boldsymbol{Z}_k$ and adds some missed targets. The images with an estimated target bounding box set $\boldsymbol{X}_k$ are applied to fine-tune the YOLO network. The fine-tuned YOLO is referred to as $\Theta_t$, where $t$ is the index of the fine-tuning iteration. The training pipeline of the PHD YOLO detector can be found in Figure 3.
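The iterative pipeline described above can be summarized as a short sketch. It is only a schematic under stated assumptions: `detect`, `phd_filter`, and `fine_tune` are hypothetical callables supplied by the caller (they stand in for the YOLO forward pass, the PHD filter, and the fine-tuning routine), and the fixed number of rounds is an illustrative stopping rule.

```python
from typing import Any, Callable, List, Sequence, Tuple


def specialize(
    model: Any,
    frames: Sequence[Any],
    detect: Callable[[Any, Any], List[Any]],          # hypothetical: model, frame -> detections
    phd_filter: Callable[[List[Any], Any], Tuple[List[Any], Any]],  # hypothetical: detections, state -> estimates, state
    fine_tune: Callable[[Any, List[Tuple[Any, List[Any]]]], Any],   # hypothetical: model, labelled frames -> model
    rounds: int = 3,
) -> Any:
    """Alternate automatic labelling and fine-tuning for a fixed number of rounds."""
    for _ in range(rounds):
        labelled: List[Tuple[Any, List[Any]]] = []
        state = None  # filter state carried from frame to frame
        for frame in frames:
            detections = detect(model, frame)
            estimates, state = phd_filter(detections, state)
            if estimates:  # keep only frames with at least one estimated target
                labelled.append((frame, estimates))
        model = fine_tune(model, labelled)
    return model  # the specialized detector runs without the filter afterwards
```

After the last round the filter is no longer needed, which mirrors the statement that the final detector is applied on its own.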

The challenge is how to select the samples with the SMC-PHD filter. In this section, the iterative process is divided into three steps: prediction, updating, and resampling. In the following subsections, the details of the three primary steps are outlined. Since the SMC-PHD filter is more robust than the GM-PHD filter in the tracking task, PHD YOLO is mainly implemented as an SMC-PHD filter. To extend our proposed method to linear systems, GM-PHD YOLO is briefly discussed at the end of this section.

3.1. Prediction Step

To build the custom dataset, , several particles are applied. At frame , particles are represented as , where is the particle weight. Our work considers only two kinds of particles: survival particles and birth particles. The spawn particles of the SMC-PHD filter are disregarded. For survival particles, the particle state is calculated by the transition function :

For birth particles, the particle state is normally set in the tracking area. The particle weight is calculated by

However, if the new birth particle is located near the survival particles, then one target is repeatedly estimated by survival particles and birth particles. Thus, the number of targets would exceed the ground truth. To address this problem, we propose a novel birth density function based on the target state history:wherewhere is the survival probability and is the birth probability. represents the probability that the sample still exists. When , a sample still exists in the new dataset. When , samples are resampled, and samples in different iterations are independent.
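For orientation, the sketch below shows a generic bootstrap-style SMC-PHD prediction step with survival and birth particles only. It is not the history-based birth density proposed here: birth particles are simply drawn uniformly in the tracking area, and `transition`, `p_survive`, `birth_mass`, and `area` are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = List[float]


def predict_particles(
    states: Sequence[State],
    weights: Sequence[float],
    transition: Callable[[State, random.Random], State],  # assumed motion model
    p_survive: float,
    n_birth: int,
    area: Sequence[Tuple[float, float]],  # (low, high) bounds per state dimension
    birth_mass: float,                    # assumed expected number of new targets
    rng: random.Random,
) -> Tuple[List[State], List[float]]:
    """Propagate survival particles and append uniformly drawn birth particles."""
    # Survival particles: move each state and scale its weight by p_survive.
    survived = [transition(x, rng) for x in states]
    w_survived = [p_survive * w for w in weights]
    # Birth particles: uniform in the tracking area, sharing the birth mass equally.
    born = [[rng.uniform(lo, hi) for lo, hi in area] for _ in range(n_birth)]
    w_born = [birth_mass / n_birth] * n_birth if n_birth > 0 else []
    return survived + born, w_survived + w_born


if __name__ == "__main__":
    rng = random.Random(0)
    random_walk = lambda x, r: [v + r.gauss(0.0, 1.0) for v in x]
    xs, ws = predict_particles([[10.0, 20.0]] * 4, [0.25] * 4, random_walk,
                               0.9, 2, [(0.0, 100.0), (0.0, 50.0)], 0.1, rng)
    print(len(xs), sum(ws))
```

Because uniform births can land next to survival particles, this plain version exhibits exactly the double-counting issue that motivates a birth density conditioned on the target state history.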

3.2. Update Step

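In the update step, the particle states are further updated according to the output of YOLO, $\boldsymbol{Z}_k$. The update step of the PHD recursion is approximated by updating the weight of the predicted particles when the likelihood $g_k(\boldsymbol{z} \mid \boldsymbol{x}_k^{(i)})$ is obtained. The predicted weights are updated as

$$w_k^{(i)} = \left[ 1 - p_D(\boldsymbol{x}_k^{(i)}) + \sum_{\boldsymbol{z} \in \boldsymbol{Z}_k} \frac{p_D(\boldsymbol{x}_k^{(i)})\, g_k(\boldsymbol{z} \mid \boldsymbol{x}_k^{(i)})}{\kappa_k(\boldsymbol{z}) + C_k(\boldsymbol{z})} \right] w_{k|k-1}^{(i)},$$

where

$$C_k(\boldsymbol{z}) = \sum_{j=1}^{L_k} p_D(\boldsymbol{x}_k^{(j)})\, g_k(\boldsymbol{z} \mid \boldsymbol{x}_k^{(j)})\, w_{k|k-1}^{(j)}.$$

The detection probability is abbreviated as $p_D$ in the following. The number of targets is estimated as the sum of the weights, $\hat{N}_k = \sum_{i=1}^{L_k} w_k^{(i)}$.

To ignore the clutter, the clutter density function $\kappa_k(\boldsymbol{z})$ is applied, and its value is varied for the different detections $\boldsymbol{z}$. $\kappa_k(\boldsymbol{z})$ indicates the level of clutter and is a set value. When $\boldsymbol{z}$ has a high probability of being clutter, $\kappa_k(\boldsymbol{z})$ is a high value. If the detection is not clutter, then $\kappa_k(\boldsymbol{z})$ is given as 0. Normally, $\kappa_k(\boldsymbol{z})$ is set as a constant or estimated by the Beta-Gaussian mixture model [32].

$p_D$ is the detection probability, which is chosen based on the sample and can be estimated by the Gaussian mixture model [32]. If the sample is occluded, then $p_D$ has a low value (near 0). The missed-detection term $1 - p_D$ is then close to 1, so the occluded samples keep high weights and are selected for retraining the YOLO network. If the sample is not occluded, then $p_D$ is equal to 1, and the weight is determined by the detections alone.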
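The roles of the detection probability and the clutter density in the weight update can be seen in a small sketch of the standard SMC-PHD update. The function signature and the callables `likelihood`, `p_detect`, and `clutter` are assumptions made for illustration; the likelihood actually used is the one defined in Section 3.3.

```python
import math
from typing import Callable, List, Sequence


def update_weights(
    particles: Sequence[float],
    weights: Sequence[float],
    detections: Sequence[float],
    likelihood: Callable[[float, float], float],  # g(z | x), assumed form
    p_detect: Callable[[float], float],           # detection probability per particle
    clutter: Callable[[float], float],            # clutter density per detection
) -> List[float]:
    """Standard SMC-PHD weight update for scalar states (illustrative only)."""
    pd = [p_detect(x) for x in particles]
    # Missed-detection term: particles with low p_D keep most of their weight.
    new_w = [(1.0 - p) * w for p, w in zip(pd, weights)]
    for z in detections:
        terms = [p * likelihood(z, x) * w for p, x, w in zip(pd, particles, weights)]
        denom = clutter(z) + sum(terms)  # a large clutter density shrinks every share
        if denom > 0.0:
            for i, term in enumerate(terms):
                new_w[i] += term / denom
    return new_w


if __name__ == "__main__":
    gauss = lambda z, x: math.exp(-0.5 * (z - x) ** 2)
    w = update_weights([0.0, 10.0], [0.5, 0.5], [0.0], gauss,
                       lambda x: 1.0, lambda z: 0.0)
    print(sum(w))  # estimated number of targets
```

With no clutter and full detectability, one detection contributes a total weight of one; raising the clutter density for a detection pushes its contribution toward zero, and lowering the detection probability preserves the weight of particles that received no detection.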

3.3. Likelihood Function

In addition to the detected probability and clutter density, the likelihood density determines whether the sample is selected for retraining. Samples with high weights are employed to retrain the YOLO network, while samples with low weights are disregarded. The likelihood density is applied to represent the relationship between the detections of the YOLO network and the samples. Therefore, we define the likelihood aswhere

During the iterative process, is decreased. When the selected sample applied to retrain the YOLO detector has a high associated score, the sample likelihood is maximized. The confidence scores are provided by the YOLO network output layer. When , the weight of the sample is set to 0, and the sample is removed from the specialized dataset. indicates whether the sample was detected by the YOLO network. For visual cues, we calculate the Euclidean distance between the selected sample and the previous sample .wherewhere is the state of the detection . To select high-score samples , we use a dynamic threshold:where and are the target class label s calculated by . is the associated score, and is the initial threshold.

3.4. Resampling Step

The SMC-PHD filter is utilized to construct a new, specific dataset for retraining according to the resampling approach, in which samples drawn from the weighted dataset are included in the generated dataset. However, the traditional SMC-PHD filter suffers from the weight degeneracy problem, and the number of samples decreases during the retraining step. To generate a new, unweighted dataset with the same number of samples as the weighted dataset, a sampling strategy is employed. Moreover, the effective sample size (ESS) of the weighted particle set is calculated:

When the ESS is greater than 0.5, the particles can be considered to be positive samples for the special training dataset. When the ESS is less than 0.5, the particles should be resampled via the Kullback–Leibler distance (KLD) sampling [33]:
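The resampling test can be sketched as follows. Two assumptions are made here: the ESS is normalized by the number of particles so that it lies in (0, 1], which is what makes a threshold of 0.5 meaningful, and the sample size follows the usual KLD-sampling bound with `k_bins` occupied bins, error bound `epsilon`, and confidence `1 - delta`. The default values are illustrative.

```python
import math
from statistics import NormalDist
from typing import Sequence


def normalized_ess(weights: Sequence[float]) -> float:
    """Effective sample size divided by the number of particles (assumed form)."""
    total = sum(weights)
    if total <= 0.0:
        return 0.0
    norm = [w / total for w in weights]
    return 1.0 / (len(norm) * sum(w * w for w in norm))


def kld_sample_size(k_bins: int, epsilon: float = 0.05, delta: float = 0.01) -> int:
    """Number of samples suggested by the KLD-sampling bound for k occupied bins."""
    if k_bins < 2:
        return 1
    z = NormalDist().inv_cdf(1.0 - delta)  # upper standard normal quantile
    a = 2.0 / (9.0 * (k_bins - 1))
    return math.ceil((k_bins - 1) / (2.0 * epsilon) * (1.0 - a + math.sqrt(a) * z) ** 3)


if __name__ == "__main__":
    weights = [0.7, 0.1, 0.1, 0.1]
    if normalized_ess(weights) < 0.5:
        print("resample with", kld_sample_size(k_bins=20), "particles")
    else:
        print("keep particles as positive samples")
```

Uniform weights give a normalized ESS of 1, while a single dominant particle drives it toward 1/N, which is the degenerate case that triggers resampling.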

An extra k-means method is used to estimate based on the particles . Note that the aspect ratio of the positive training sample may differ from the initial anchors , as we use the IoU overlap as the positive sample. We employ the k-means method to cluster the aspect ratio of samples to update the anchors. To decrease the computational cost, only three anchors are used to retrain the YOLO network; they are set to . These proposals are employed to retrain the YOLO network, which is produced by fine-tuning the specific dataset. In the next iteration, these networks will become the input of the forecast phase and be used to create target proposals (bounding boxes) in the target scene.
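The anchor update can be illustrated by clustering sample aspect ratios into three groups. This is a minimal one-dimensional k-means with a deterministic initialization and an absolute-difference distance; both choices are assumptions for the sketch, and the distance used in practice may be IoU-based instead.

```python
from typing import List, Sequence


def kmeans_1d(values: Sequence[float], k: int = 3, iterations: int = 50) -> List[float]:
    """Cluster scalar values (e.g. width/height ratios) into k centres."""
    if len(values) < k:
        raise ValueError("need at least k values")
    data = sorted(values)
    # Deterministic initialization: evenly spaced order statistics.
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Empty clusters keep their previous centre.
        updated = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)


if __name__ == "__main__":
    ratios = [0.5, 0.52, 0.48, 1.0, 1.02, 0.98, 2.0, 2.1, 1.9]
    print(kmeans_1d(ratios))  # three representative aspect ratios
```

The three centres would then replace the aspect ratios of the initial anchors before the next fine-tuning round.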

3.5. GM-PHD YOLO

SMC-PHD is mainly discussed and applied to improve the YOLO network since it is more robust than the GM-PHD filter for nonlinear systems. However, for linear systems, the GM-PHD filter can provide a higher accuracy rate than the SMC-PHD filter. Therefore, in this subsection, we briefly discuss how to use the GM-PHD filter to improve the YOLO network. The pipeline of the GM-PHD YOLO is similar to that of SMC-PHD YOLO. YOLO is pretrained on the generic dataset, and GM-PHD assists in building the custom dataset from the unlabelled target sequences. YOLO is fine-tuned on this custom dataset. When the GM-PHD filter selects the samples, the steps include the prediction step, update step, and pruning.

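In the GM-PHD filter, the intensity $v_k$ is represented over the state space by a mixture of Gaussian densities $\mathcal{N}(\boldsymbol{x}; \boldsymbol{m}, \mathbf{P})$, where $\boldsymbol{m}$ and $\mathbf{P}$ are the mean and variance, respectively. In the prediction step, for existing targets, $\boldsymbol{m}$ and $\mathbf{P}$ are predicted as $\boldsymbol{m}_{k|k-1} = \mathbf{F} \boldsymbol{m}_{k-1}$ and $\mathbf{P}_{k|k-1} = \mathbf{Q} + \mathbf{F} \mathbf{P}_{k-1} \mathbf{F}^{T}$, respectively, where $\mathbf{F}$ is the transition matrix and $\mathbf{Q}$ is the transition noise variance. Their weight is calculated as $w_{k|k-1} = p_S w_{k-1}$. Birth targets are randomly chosen in the tracking area. In the update step, for undetected targets, the mean and variance retain their values, and their weights are calculated as $w_k = (1 - p_D)\, w_{k|k-1}$. For detected targets, the mean is calculated as

$$\boldsymbol{m}_k(\boldsymbol{z}) = \boldsymbol{m}_{k|k-1} + \mathbf{K}_k \left( \boldsymbol{z} - \mathbf{H} \boldsymbol{m}_{k|k-1} \right),$$

where $\mathbf{H}$ is the observation matrix, $\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}^{T} \mathbf{S}_k^{-1}$ is the gain, and $\mathbf{S}_k = \mathbf{H} \mathbf{P}_{k|k-1} \mathbf{H}^{T} + \mathbf{R}$ with measurement noise variance $\mathbf{R}$. The variance is updated as

$$\mathbf{P}_k = \left( \mathbf{I} - \mathbf{K}_k \mathbf{H} \right) \mathbf{P}_{k|k-1}.$$

The component weight is updated as

$$\tilde{w}_k(\boldsymbol{z}) = p_D\, w_{k|k-1}\, \mathcal{N}\!\left( \boldsymbol{z}; \mathbf{H} \boldsymbol{m}_{k|k-1}, \mathbf{S}_k \right).$$

The weight is normalized as

$$w_k(\boldsymbol{z}) = \frac{\tilde{w}_k(\boldsymbol{z})}{\kappa_k(\boldsymbol{z}) + \sum_{j} \tilde{w}_k^{(j)}(\boldsymbol{z})}.$$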

A simple pruning procedure is further employed to reduce the number of Gaussian components. The high weight targets are set to and are utilized to build the custom dataset.
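A scalar sketch of the GM-PHD update and pruning steps is given below. It assumes a one-dimensional state observed directly (observation matrix equal to 1), a constant detection probability, and a constant clutter density; the tuple layout and the pruning threshold are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance), assumed layout


def gaussian(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmphd_update(predicted: Sequence[Component], detections: Sequence[float],
                 p_detect: float, meas_var: float, clutter: float) -> List[Component]:
    """Scalar GM-PHD update: missed-detection copies plus one Kalman-updated
    component per (detection, predicted component) pair."""
    updated = [((1.0 - p_detect) * w, m, p) for w, m, p in predicted]
    for z in detections:
        candidates = []
        for w, m, p in predicted:
            s = p + meas_var          # innovation variance
            gain = p / s              # Kalman gain for a directly observed state
            candidates.append((p_detect * w * gaussian(z, m, s),
                               m + gain * (z - m),
                               (1.0 - gain) * p))
        norm = clutter + sum(c[0] for c in candidates)
        if norm > 0.0:
            updated.extend((cw / norm, cm, cp) for cw, cm, cp in candidates)
    return updated


def prune(components: Sequence[Component], threshold: float = 1e-3) -> List[Component]:
    """Drop components whose weight falls below the threshold."""
    return [c for c in components if c[0] >= threshold]


if __name__ == "__main__":
    post = gmphd_update([(1.0, 0.0, 1.0)], [0.5], p_detect=0.9, meas_var=1.0, clutter=0.0)
    print(prune(post, threshold=0.5))
```

Components that survive pruning with high weight play the role of the estimated targets used to build the custom dataset.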

4. Experimental Results

This section introduces the test results obtained on several public and private datasets. First, the implementation details of our proposed method are given. Second, the dataset and baseline algorithms are introduced. Third, the ablation study of the SMC-PHD YOLO filter is discussed. Our proposed SMC-PHD YOLO detector and several baseline methods are compared.

4.1. Implementation Details

The initial YOLO in our proposed SMC-PHD YOLO filter is pretrained on the COCO dataset [31]. The Adam optimizer is applied, where the weight decay is 0.0005 and the momentum is 0.9. Although the transition matrix F differs substantially across the different object classes in the different datasets, to simplify the problem, we assume F to be

The YOLO network is fine-tuned on our evaluation dataset for the different tasks with the help of the SMC-PHD-based transfer learning method. The YOLO detector is tuned with a 64 GB NVIDIA GeForce GTX TITAN X GPU.

4.2. Evaluation Methodology and Dataset

We train the YOLO detector on a training collection containing 80k training frames and 500k instance annotations from the COCO dataset, which contains 2.5 million labelled instances among 328k images of 91 object categories. Because the COCO dataset does not contain continuous frames, it is only used to pretrain the YOLO network before the experiments. In the evaluation step, datasets should contain continuous frames. The evaluation was performed with three different datasets.

GOT-10k [34] is a large-scale visual dataset with broad coverage of real-world objects. It contains 10k videos of 563 categories, and its categories are more than one order of magnitude wider than those of counterparts of a similar scale. Some of its categories are not included in the COCO dataset. Therefore, GOT-10k is suitable for fine-tuning the YOLO network pretrained on the COCO dataset. The annotations that we tested include birds, cars, tapirs, and cows. YouTubeBB [35] is a large, diverse dataset with 380,000 video segments and 5.6 million human-drawn bounding boxes in 23 classes from 240,000 distinct YouTube videos. Each video includes time-localized, frame-level features, so classifier predictions at segment-level granularity are feasible. The annotations that we tested include cars and zebras. In the MIT Traffic dataset [36], a 90-minute video is provided. A total of 420 frames from the first 45 minutes are employed for specialization, and 420 frames from the last 45 minutes are utilized for testing. The video was recorded by a stationary camera. The size of the scene is 720 by 480 pixels, and the video is divided into 20 clips. The annotation that we tested includes only the cars. False positives per image (FPPI) curves and receiver operating characteristic (ROC) curves are used to evaluate our proposed detector and baseline methods. The pipeline of the data preparation for the PHD YOLO experiment is shown in Figure 4.

4.3. Baseline Method

The algorithms compared with the SMC-PHD YOLO algorithm are Generic YOLO [5], Generic Faster R-CNN [37], SMC Faster R-CNN [8], that of Singh et al. [38], that of Deshmukh and Moh [39], that of Kang et al. [40], that of Maâmatou et al. [10], spatiotemporal sampling network (STSN) [41], salient object detection (SOD) [42], that of Lee et al. [43], that of Jie et al. [44], and that of Ghahremani et al. [45]. Table 1 shows the comparison between baseline methods and our method. The detector pretrained on the general dataset is presented in the second column. Some methods automatically fine-tune the network with the target dataset collected by the methods shown in the third column. For example, the algorithm of Kang et al. [40] does not include a fine-tuning step, and there is no information in its block. The computational complexity of fine-tuning with the target dataset is shown in the last column, where is the number of frames in the video, is the number of particles for the SMC method, is the average number of targets in each frame, is the size (length ∗ width) of the frame, and is the number of auxiliary networks.

4.4. SMC-PHD Filter YOLO for Multitarget Detection

In this subsection, we discuss the contribution of the SMC-PHD in our proposed method via three experiments. In these three experiments, we evaluate the performance of the detection probability and clutter density. Note that for a fixed label dataset and fixed YOLO, these parameters are also fixed and can be measured from the dataset. To show the contribution of the detection probability and clutter density, we set different values in the experiments.

4.4.1. Detection Probability

To evaluate the detection probability performance, we set the detection probability as different constants. The detection probability in the SMC-PHD is incrementally increased from 0 to 1, and six situations are considered: 0, 0.2, 0.4, 0.6, 0.8, and 1. The YouTubeBB dataset is selected since it includes several situations. For example, the vehicles in traffic videos are frequently occluded by other vehicles, while airplanes at an airport always appear in the scene.

Table 2 shows the FPPI of the SMC-PHD YOLO network versus the detection probability and category. A correctly estimated detection probability can produce a high FPPI. For example, since the airplanes are always shown in the centre of the scene in the airplane sequences, the lowest FPPI for the airplane category is . The best results for the car category are due to the occluded cars. Therefore, if targets are frequently occluded, then the detection probability should be of high value. Furthermore, for the airplane category, the FPPI at is only 85% of that at . Thus, if the detection probability is too high, such as 1, then the FPPI of the detection would decrease.

4.4.2. Clutter Density Function

The clutter density function is employed to address the clutter problem. For the PHD filter, the clutter density function is varied based on the detection results, and it is given a constant value in many references [26, 28, 32, 47]. In these experiments, clutter density is a constant value for all detections. However, a large may decrease the weights of the targets, which causes an insufficient number of samples to be included in the training dataset. A low cannot address the clutter problem, and the retrained YOLO model is still sensitive to clutter. Since is normally set to a value from 0 to infinity, we test 8 different values on the boat and bicycle sequences of the YouTubeBB dataset. Distant buildings may be detected as boats, and the bicycle detection performance is also easily affected by the surroundings. The results are shown in Table 3. The highest FPPIs for the boat sequence and bicycle sequence are 0.3 and 0.1, respectively, since the level of clutter varies for different categories. For “boat,” if is lower than 0.3, the FPPI would slightly decrease since clutter is added to the specialized training data and the retrained model is still sensitive to the clutter. If exceeds 0.3, the FPPI also decreases since the weight of the target samples decreases and the retraining dataset does not include sufficient training samples.

4.5. Error Analysis of the SMC-PHD YOLO Network

Since the target dataset is automatically generated by an SMC-PHD filter, it may include some error samples with incorrect labels. To analyse whether the error samples affect the final performance, we test our SMC-PHD YOLO network with the YouTube dataset. The annotations that we employ comprise cars and zebras. The video for each annotation is 20 min long and contains 36000 frames. These frames are manually labelled by researchers and automatically labelled by our method. Because multiple targets may appear in the same frame, manual labelling yielded 831,615 and 88,234 positive target samples for cars and zebras, respectively. For the labels assigned by our method, “cars” includes 797,660 true-positive samples and 212 false-positive samples, while “zebras” includes 69,821 true-positive samples and 17 false-positive samples. These results show that the algorithm assigns fewer labels than humans because some tiny targets and low-probability targets are considered clutter and disregarded. “Car” has a higher recall rate (96%) than “zebra” (79%) since cars with a regular profile are easier to detect. To further analyse these error samples, we plot their data distributions. The selected features comprise the input of the last fully connected layer of YOLO. Two main dimensions are selected by t-distributed stochastic neighbour embedding. Figure 5 shows the data distribution of true positives, false positives, and false negatives. This finding suggests that tiny targets are treated as outliers and disregarded. We also discovered that some clutter (green points) in the target dataset is considered positive samples (false positives). After the clutter is manually removed from the target dataset, the YOLO performance does not change. The main potential reason for this is the high threat score (99%): the SMC-PHD filter disregards the most uncertain samples. 
However, this approach does not fundamentally solve the problem of clutter since some low-possibility positive samples are considered to be false negatives (red points). Some researchers suggest the use of extra information, such as audio information, to address the clutter problem [48]. Addressing the clutter problem will be one of our future research topics.
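The recall rates quoted above follow directly from the reported sample counts, as the short calculation below shows. The precision figures are an additional derived quantity, computed here only for illustration from the same counts.

```python
def recall(true_positives: int, ground_truth: int) -> float:
    """Fraction of manually labelled samples recovered by automatic labelling."""
    return true_positives / ground_truth


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of automatically assigned labels that are correct."""
    return true_positives / (true_positives + false_positives)


# Counts reported in the text: manual labels, true positives, false positives.
counts = {
    "car": {"manual": 831_615, "tp": 797_660, "fp": 212},
    "zebra": {"manual": 88_234, "tp": 69_821, "fp": 17},
}

if __name__ == "__main__":
    for name, c in counts.items():
        print(f"{name}: recall={recall(c['tp'], c['manual']):.2%}, "
              f"precision={precision(c['tp'], c['fp']):.4%}")
```

Rounded to whole percentages, the recall values are 96% for cars and 79% for zebras, matching the figures in the text.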

4.6. Scene-Specialized Multitarget Detector

To show the performance of the PHD method for transfer learning, we compare the baseline YOLO network, the SMC YOLO network, SMC R-CNN, our proposed SMC-PHD YOLO network, and GM-PHD YOLO on the YouTubeBB dataset. Since SMC R-CNN cannot address occluded samples, we also build SMC-PHD R-CNN, which uses SMC-PHD to improve the performance of Faster R-CNN, to show the effect of the PHD method. We first train the YOLO network on a general training set (the COCO dataset), which contains a limited amount of target data. SMC-PHD then augments the dataset with previously unseen data, whose automatically assigned labels may contain errors. YOLO is fine-tuned on this target dataset and is then applied without an SMC-PHD filter; in this work, the SMC-PHD filter is used only to augment the data. The parameters of the PHD filter are chosen according to the Beta-Gaussian mixture model [32]. We test these methods on the airplane, bicycle, boat, and car categories of the YouTubeBB dataset. For each category, we train a separate SMC-PHD YOLO network with independent parameters. The results for YOLO and R-CNN fine-tuned by the SMC-PHD, GM-PHD, and SMC filters are shown in Table 4. Our proposed method has the highest FPPI value of all methods for the boat and car categories, and SMC-PHD YOLO performs similarly to SMC-PHD R-CNN. According to the results, SMC improves the performance of YOLO and R-CNN by approximately 8%, and PHD further improves their performance by approximately 6%. Although GM-PHD YOLO has an 8% higher FPPI than YOLO, its FPPI is still lower than that of SMC-PHD YOLO. We speculate that this is because GM-PHD YOLO identifies 4% more bounding boxes than SMC-PHD YOLO does. These results suggest that SMC-PHD YOLO is more robust than GM-PHD YOLO. Therefore, in the following experiments, we mainly test SMC-PHD YOLO.

Some results of the proposed method and baseline methods are shown in Figure 6. The first and second rows of each subfigure show the detections of the Generic YOLO and the specialized YOLO, respectively. In Figure 6(a), the flapping bird is detected only by the specialized YOLO detector. Thus, our proposed method can customize the detector for a moving target because the dataset is selected from a sequence with the likelihood function. In addition, some occluded cars are detected by our proposed method due to the detection probability. In Figure 6(b), cars and zebras are successfully detected by the specialized YOLO detector, even though only parts of the cars and zebras are visible in the images. For the traffic sequences shown in Figure 6(c), the number of cars detected with the specialized YOLO detector is higher than that detected with the Generic YOLO detector. With the SMC-PHD filter, our proposed method can detect occluded cars and certain small vehicles.

To further evaluate our proposed method, we compare it with other baseline methods, namely those of Singh et al. [38], Deshmukh and Moh [39], Kang et al. [40], Maâmatou et al. [10], Lee et al. [43], Jie et al. [44], and Ghahremani et al. [45], as well as STSN [41] and SOD [42].

Figure 7 shows the ROC curves of the methods for the different annotations. In this experiment, we chose the bird and boat categories from the GOT-10k and YouTubeBB datasets and the car category from the MIT Traffic dataset. Owing to space limitations, Figures 7(a) and 7(b) only show a comparison between SMC-based detectors, such as SMC-PHD YOLO, and generic detectors, such as YOLO. The comparison between our proposed method and state-of-the-art methods is shown in Figures 7(c) to 7(e). In Figure 7(a), the method of Kang achieves a higher true-positive rate than those of Kumar and Dalal because the former is specially designed for boat detection. Compared with the Generic YOLO for boat detection, the SMC-PHD YOLO detector achieves an ROC improvement of 13%. As the boat is often occluded in the bay, the SMC-PHD YOLO detector with the detection probability performs better than the other methods. The boat detection results on the YouTubeBB dataset are similar to those on the GOT-10k dataset. Compared with generic methods, specialized methods achieve ROC improvements of approximately 10%. More baseline transfer learning methods are considered in Figure 7(c), where they are shown as dashed lines. The transfer methods achieve better performance than the generic R-CNN or YOLO methods. SMC based on R-CNN achieves an ROC value similar to those of the other transfer detectors. Based on SMC, the SMC R-CNN detector and SMC-PHD YOLO detector achieve increases in the ROC values of 3.8% and 5.8%, respectively, compared with their baseline methods. For car detection, we test the methods only on the MIT Traffic dataset. As shown by the ROC curves in Figure 7(e), the SMC-PHD YOLO detector outperforms all other car detection frameworks. It also outperforms the four other specialized detectors, i.e., SMC Faster R-CNN and those of Kumar, Dalal, and Maâmatou, by 5%, 6%, 9%, and 2%, respectively.
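
For readers who want to reproduce curves of this kind, the following sketch shows one common way to build ROC points from detector confidences and ground-truth labels, together with the area under the curve as a scalar summary. The text does not state which scalar "ROC value" is reported, so the use of the area under the curve here is an assumption, as are the function names and the toy scores.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs.

    scores: detector confidences, higher means more target-like.
    labels: 1 for a true target, 0 for clutter.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (hypothetical, not taken from the experiments).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_curve = roc_points(toy_scores, toy_labels)
print(toy_curve)
print(f"AUC = {auc(toy_curve):.3f}")
```

Comparing two detectors then amounts to computing these points for each on the same test frames and overlaying the curves.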

Table 5 reports the average detection rate of our proposed method and other state-of-the-art methods for the different datasets. We list ten annotations from GOT-10k and YouTubeBB. As the Kang and Maâmatou methods are designed for boat and traffic detection, they are not included in this table. Our proposed method achieves the highest detection rate, especially for the MIT Traffic dataset, because SMC-PHD YOLO can detect occluded targets, such as cars. Although SMC R-CNN achieves a detection rate similar to that of the SMC-PHD YOLO detector, the number of frames per second (FPS) of the SMC-PHD YOLO network is 100 times that of SMC R-CNN. Overall, the SMC-PHD YOLO detector considerably outperforms the generic detector for several annotations on all public datasets. Compared to the baseline YOLO detector, the SMC-PHD YOLO detector achieves a 12% higher detection rate.

Although our proposed method has the highest detection rate and large ROC values among all methods, the performance of the proposed SMC-PHD YOLO depends on its hyperparameters, such as the detection probability and clutter density. These parameters must be set before training based on prior experience. Some researchers have proposed solutions for estimating the parameters of the SMC-PHD filter. For example, Lian et al. [49] used the expectation-maximization algorithm to estimate the unknown clutter probability, and Li et al. [50] used the gamma Gaussian mixture model to estimate the detection probability. Applying this kind of estimation method to improve the SMC-PHD YOLO filter will be addressed in our future work.
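
To make concrete how the detection probability and clutter density enter the filter, the following sketch implements the standard SMC-PHD measurement update for particle weights. It is the textbook form, not necessarily the exact variant used in this work, which additionally relies on a confidence-based likelihood and a resampling step. The function name, the argument names, and the assumption that both hyperparameters are constants are ours.

```python
def phd_weight_update(weights, likelihoods, p_detect, clutter_density):
    """One SMC-PHD measurement update (textbook form).

    weights: predicted particle weights w_j.
    likelihoods: likelihoods[m][j] = g(z_m | x_j) for measurement m, particle j.
    p_detect: detection probability, assumed constant over the state space.
    clutter_density: clutter intensity, assumed constant and positive.
    """
    if clutter_density <= 0:
        raise ValueError("clutter_density must be positive")
    # Per-measurement normaliser: clutter plus detected-target mass.
    denominators = [
        clutter_density + sum(p_detect * g * w for g, w in zip(row, weights))
        for row in likelihoods
    ]
    updated = []
    for j, w in enumerate(weights):
        missed = 1.0 - p_detect  # target exists but was not detected
        detected = sum(p_detect * row[j] / d
                       for row, d in zip(likelihoods, denominators))
        updated.append((missed + detected) * w)
    return updated


# Toy example (hypothetical values): two particles, one measurement that
# matches only the first particle.
prior = [0.5, 0.5]
g = [[1.0, 0.0]]
low_clutter = phd_weight_update(prior, g, p_detect=0.9, clutter_density=0.1)
high_clutter = phd_weight_update(prior, g, p_detect=0.9, clutter_density=1.0)
print(low_clutter, sum(low_clutter))
print(high_clutter, sum(high_clutter))
```

In this toy setting, raising the clutter density lowers the weight given to the particle that explains the measurement, so the summed weight (the expected number of targets) drops. The particle that explains no measurement keeps only the missed-detection share of its weight, (1 - p_detect) times its prior weight. This is why both hyperparameters directly influence which samples are kept for retraining.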

5. Conclusion

To customize the YOLO detector for scene-specific target detection, we proposed an effective and precise framework based on the SMC-PHD filter and GM-PHD filter. Using the proposed confidence score-based likelihood and novel resampling strategy, the framework selects appropriate samples from target datasets to retrain the detector, which is then used to detect targets. Given only a Generic YOLO detector and some target videos, the framework automatically produces a strong specialized detector. The tests showed that the proposed framework can generate a specialized YOLO detector that considerably outperforms the Generic YOLO detector on a distinct dataset for bird, boat, and vehicle detection. Correlated clutter is still challenging for SMC-PHD filters. Our future research will focus on expanding the algorithm with multimodal information to address the correlated clutter problem.

Data Availability

The data used to support this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 51879055) and Heilongjiang Touyan Innovation Team Program.