Existing target detection algorithms used in smart cities suffer from low detection accuracy and are susceptible to occlusion. To address this, this paper presents a target detection algorithm for smart cities that combines deep learning with feature extraction. Building on the traditional SSD algorithm, an adaptive strategy is introduced that adjusts the search window as the operating conditions of the target change. To improve the accuracy of the objective function, the algorithm is further combined with a weighted-correlation feature fusion method, which fuses deep appearance features with deep motion features. Experimental results show that the algorithm is more robust to occlusion and achieves higher detection accuracy than the conventional SSD algorithm. In addition, it is more stable in a changing environment.

1. Introduction

The concept of a smart city originated from the idea of a Smarter Planet proposed by IBM in 2008. Target detection is one of the key technologies of smart cities. However, existing computer vision algorithms struggle with target detection against complex backgrounds, where illumination, changes in target size, and target occlusion all degrade performance. The introduction of deep learning has opened up a new path for target detection, and in recent years more and more researchers have begun to study deep learning algorithms for target detection in depth [1, 2]. Because its deep structure can learn effectively from large amounts of data, deep learning avoids the drawbacks of traditional methods that rely on manually extracted features [3, 4]. Currently, the literature on deep-learning-based target detection is still limited. From the perspective of the deep model, however, the existing work can be broadly classified into target detection algorithms based on CNN and target detection algorithms based on SAE [5–7]. Target detection algorithms based on the SAE deep model are usually combined with traditional classical algorithms. They use hidden layers to learn a representation of the data, and they retain more effective information through nonlinear feature extraction that does not use class labels. Because it ignores labels, this approach is not conducive to classification, whereas visual tracking inherently needs to distinguish the target from the background. Therefore, target tracking is not a strength of SAE-based algorithms. Target detection algorithms based on CNN combine artificial neural networks with convolution operations. They can recognize a wide variety of target patterns and are robust to a certain degree of distortion and deformation. Therefore, this article uses a CNN-based target detection algorithm.
Among CNN-based detectors, the SSD algorithm outperforms comparable algorithms in mAP and training speed [8, 9]. By using anchors of different sizes and aspect ratios at different levels, the algorithm can find the anchor that best matches the ground truth for training. However, its recognition of small targets is relatively poor, and it is easily affected by occluders, which limits the use of the algorithm in practical applications.

To address these issues, this article proposes a target detection method based on the SSD algorithm and feature extraction fusion. The method builds on the traditional SSD algorithm [10, 11]. Its search window is adjusted dynamically by an adaptive strategy that follows the operating conditions of the target, which avoids the accuracy problems that arise when the window stays fixed during the entire detection process. At the same time, to improve the classification ability of the features, a weighted-correlation feature fusion method combines the deep appearance feature with the deep motion feature to improve the accuracy of the objective function. Experiments are performed in image environments of different complexity. The algorithm is more time-consuming than the traditional SSD algorithm. As image complexity increases, the gap in detection accuracy and success rate between this algorithm and the traditional SSD algorithm gradually widens, while the detection accuracy of this algorithm remains at about 86%. This indicates that the algorithm can be applied in a variety of environments while maintaining good stability, and that it has good practical value for the development of smart cities.

2. Principle of Algorithm

The traditional SSD is based on a feed-forward CNN. It produces a fixed-size collection of bounding boxes together with scores for the presence of object instances in each box, and then performs non-maximum suppression to obtain the final predictions. The model network structure is shown in Figure 1.

As the structure diagram in Figure 1 shows, the SSD network can be divided into two parts: the base network and the additional functional layers. The former is a standard image classification network from which all classification layers have been removed; the latter mainly achieves the following goals [12]:

Multiscale feature maps for detection: convolutional feature layers are added to obtain feature layers of different scales so as to achieve multiscale target detection.

Convolutional predictors for detection: for each added feature layer, a set of convolution filters is used to obtain a fixed set of target detection predictions.

Each of these convolutions produces either a set of class scores or a set of coordinate offsets relative to the default boxes. Finally, by combining the detection and classification results, the position and category of each object in the image can be obtained.
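To make the role of the coordinate offsets concrete, the following minimal sketch turns one predicted offset vector into an absolute box. The text does not state the offset encoding, so the center/size parameterization used here (center shifts scaled by the default box size, log-space width and height) is an assumption borrowed from the common SSD formulation, and the name `decode_box` is hypothetical.

```python
import math


def decode_box(default_box, offsets):
    """Convert predicted offsets into an absolute box.

    default_box: (cx, cy, w, h) of the default box, in relative image units.
    offsets: (t_cx, t_cy, t_w, t_h) predicted for that default box.
    Returns (xmin, ymin, xmax, ymax).
    Assumption: standard centre/size encoding without variance scaling.
    """
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w   # shift the centre in units of the box width
    cy = d_cy + t_cy * d_h   # shift the centre in units of the box height
    w = d_w * math.exp(t_w)  # log-space scaling keeps the width positive
    h = d_h * math.exp(t_h)  # log-space scaling keeps the height positive
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

With all offsets equal to zero, the decoded box is simply the default box expressed by its corner coordinates.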

3. The Algorithm of This Paper

3.1. Select the Aspect Ratio of the Default Box

Feature maps become smaller and smaller at deeper layers. This not only reduces the computational and memory requirements but also gives the last extracted feature map some degree of translation and scale invariance [13].

In the SSD structure, the default boxes do not necessarily correspond to the receptive fields of each layer. Suppose $m$ feature maps are used for prediction; the scale of the default boxes in the $k$-th feature map is then computed as:

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where $s_{\min}$ is 0.2 and $s_{\max}$ is 0.95.
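As a worked illustration of the default box sizing, the sketch below spaces the scales linearly between the two values given in the text (0.2 and 0.95), following the scale rule of the original SSD formulation. The number of feature maps `m`, the function names, and the width/height rule for a given aspect ratio (width = scale × √ratio, height = scale / √ratio) are assumptions taken from that common formulation and are not stated in this paper.

```python
import math


def default_box_scales(m, s_min=0.2, s_max=0.95):
    """Return the default box scale for each of the m feature maps."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if m == 1:
        return [s_min]
    step = (s_max - s_min) / (m - 1)
    # k runs from 1 to m in the formula; here the index starts at 0.
    return [s_min + step * k for k in range(m)]


def default_box_size(scale, aspect_ratio):
    """Width and height of a default box with the given scale and ratio."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


if __name__ == "__main__":
    for k, s in enumerate(default_box_scales(6), start=1):
        print(k, round(s, 3), default_box_size(s, 2.0))
```

Because the width is multiplied and the height divided by the same factor, the box area stays equal to the squared scale for every aspect ratio.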

Usually, however, the size of each default box is not adjusted after this calculation. This is clearly unfavorable for detecting objects whose size changes, and it degrades the detection performance. Figure 2 shows the target detection structure of the traditional SSD algorithm.

The objective loss function in Figure 2 is a weighted sum of the localization loss $L_{loc}$ and the confidence loss $L_{conf}$:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

In the formula, $N$ is the number of matched default boxes; $x \in \{0, 1\}$ indicates whether the matched box belongs to category $p$; $l$ is the predicted box; $g$ is the ground truth box; and $c$ is the confidence that the selected target belongs to category $p$. The weight term is $\alpha = 1$.
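The weighted sum of the two losses can be sketched as follows. The text only states that the loss combines a localization term and a confidence term, normalized by the number of matched default boxes, with weight 1; the specific choices below (smooth L1 for localization, softmax cross-entropy for confidence, class index 0 as background) are assumptions borrowed from the common SSD formulation, and all function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under a softmax over `logits`."""
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]


def detection_loss(matched, negatives, alpha=1.0):
    """Weighted sum of confidence and localization loss.

    matched: list of (pred_offsets, true_offsets, class_logits, class_label)
             for default boxes matched to a ground truth box.
    negatives: list of class_logits for boxes treated as background
               (assumption: background is class index 0).
    """
    n = len(matched)
    if n == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    loc = sum(smooth_l1(p - t)
              for pred, true, _, _ in matched
              for p, t in zip(pred, true))
    conf = sum(softmax_cross_entropy(logits, label)
               for _, _, logits, label in matched)
    conf += sum(softmax_cross_entropy(logits, 0) for logits in negatives)
    return (conf + alpha * loc) / n
```

A box that is localized perfectly but classified with uniform scores over two classes contributes only the confidence term, which equals ln 2.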

The candidate boxes must then be filtered to obtain the final detections.

The pseudocode is as follows.

for every conv box:
    for every class:
        if class_prob < threshold:
            continue
        predictive box = decode(conv box)
nms(predictive boxes)  # Remove very close boxes

In this way, the target coordinates can be found effectively, which improves the detection performance of the algorithm.
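The filtering pseudocode above can be made concrete with a small pure-Python sketch. It assumes the boxes have already been decoded to corner coordinates, that class index 0 is the background, and that suppression is greedy and applied per class; the function names and the two threshold values are illustrative assumptions, not settings reported in this paper.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, class_probs, threshold=0.5, iou_threshold=0.45):
    """Score-threshold the boxes, then apply greedy per-class NMS.

    boxes: decoded boxes, one per default box.
    class_probs: per-box list of class probabilities (index 0 = background).
    Returns a list of (probability, class index, box), best first.
    """
    candidates = []
    for box, probs in zip(boxes, class_probs):
        for cls, prob in enumerate(probs):
            if cls == 0 or prob < threshold:
                continue  # skip background and low-confidence classes
            candidates.append((prob, cls, box))
    candidates.sort(key=lambda c: c[0], reverse=True)

    kept = []
    for prob, cls, box in candidates:
        # Keep the box unless it overlaps a better box of the same class.
        if all(k_cls != cls or iou(box, k_box) <= iou_threshold
               for _, k_cls, k_box in kept):
            kept.append((prob, cls, box))
    return kept
```

Sorting by probability first means that, among heavily overlapping boxes of one class, only the most confident one survives.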

3.2. Adaptive Strategy

Usually, the width and height of each default box are fixed during the entire target detection process. However, when the behavior of the controlled object changes because its characteristics change, fixed parameters of this kind tend to produce an undesirable control effect, so an adaptive strategy is needed to adjust them dynamically.

In this paper, the width and height are adaptively adjusted starting from the default box size obtained above. The adjustment formula is expressed in terms of the number of blocks in the image, the horizontal and vertical components of the objective function, the newly generated width and height of the default boxes, and the motion intensity of the whole frame.

The motion complexity of a block is determined from the predicted motion vector and the motion vector of the current block, so the degree of motion of the current block can be judged from how much the two differ in their horizontal and vertical components.

The formula uses the median of the three macroblocks to the left, top, and upper right of the current block, together with the objective function of the current block obtained above. It yields the horizontal and vertical motion complexity separately, as well as the overall motion complexity of the current block.
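One possible reading of this description is sketched below. The exact formula is not reproduced in the text, so everything here is an assumption for illustration: the predicted vector is taken as the component-wise median of the left, top, and upper-right neighbours, the per-axis complexity as the absolute difference between the current and predicted components, and the overall complexity as their sum. The function names are hypothetical.

```python
from statistics import median


def predicted_vector(left, top, top_right):
    """Component-wise median of the three neighbouring motion vectors."""
    return (median([left[0], top[0], top_right[0]]),
            median([left[1], top[1], top_right[1]]))


def motion_complexity(current, left, top, top_right):
    """Return (horizontal, vertical, overall) motion complexity.

    Assumption: complexity is the absolute deviation of the current
    block's motion vector from the median-predicted vector, per axis,
    and the overall value is the sum of the two axes.
    """
    pred_x, pred_y = predicted_vector(left, top, top_right)
    horizontal = abs(current[0] - pred_x)
    vertical = abs(current[1] - pred_y)
    return horizontal, vertical, horizontal + vertical
```

Under this reading, a block whose motion matches its neighbours has complexity close to zero, so a small search window would suffice, while a large deviation would call for a larger window.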

The search window size is as follows:

3.3. Feature Extraction of Weighted Correlation

To improve the classification ability of the features, a feature fusion method based on weighted correlation is used to combine the deep appearance feature and the deep motion feature into a multidimensional feature vector, with a weighting factor controlling the contribution of each feature to the merged feature. The weighting coefficients are determined according to intraclass consistency and class separability.

Intraclass consistency: It is generally expected that samples of the same class lie as close together as possible in the feature space. However, the features of samples in the same class usually have a large variance, so it is not necessary to require all samples of the same class to be close to each other. A trade-off is to ensure that samples in the same neighborhood within the same class are as close as possible.
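A minimal sketch of weighted feature fusion is given below. The merged-feature formula is not reproduced in the text, so the form used here is an assumption: a single weight `w` in [0, 1] scales the appearance feature, its complement `1 - w` scales the motion feature, and the two scaled vectors are concatenated. The function name `fuse_features` is hypothetical.

```python
def fuse_features(appearance, motion, w):
    """Concatenate two feature vectors after weighting them.

    appearance, motion: sequences of floats (the two deep features).
    w: weight of the appearance feature; the motion feature gets 1 - w.
    Assumption: the weights sum to one and fusion is by concatenation.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [w * a for a in appearance] + [(1.0 - w) * m for m in motion]


if __name__ == "__main__":
    # The fused vector has one entry per entry of each input feature.
    print(fuse_features([1.0, 2.0], [4.0], 0.25))
```

Concatenation keeps both sources of information available to the classifier, while the weight determines how strongly each one influences distances in the fused feature space.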

Assume and .

The intraclass consistency is defined over pairs of samples in which the second sample belongs to the index set of the nearest-neighbor samples of the first within the same class.
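The neighborhood-based notion of intraclass consistency can be illustrated with the sketch below. The defining formula is not reproduced in the text, so the measure used here is an assumption: for every sample, the squared Euclidean distances to its `k` nearest neighbours of the same class are summed, and a smaller total means better consistency. The function names and the choice of squared Euclidean distance are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intraclass_consistency(samples, labels, k=2):
    """Sum of squared distances to the k nearest same-class neighbours.

    Assumption: only neighbours within the same class are considered,
    so distant samples of the same class do not affect the result.
    """
    total = 0.0
    for i, (x, label) in enumerate(zip(samples, labels)):
        same_class = sorted(
            squared_distance(x, other)
            for j, (other, other_label) in enumerate(zip(samples, labels))
            if j != i and other_label == label
        )
        total += sum(same_class[:k])  # keep only the k closest
    return total
```

Restricting the sum to the nearest neighbours reflects the trade-off described above: samples must be close within a neighborhood, not across the whole class.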

Since good target features should have good intraclass consistency, this paper determines the weighting coefficients by solving the following optimization problem:

among them, , , and is a control parameter.

Combining the above equations, the optimization problem is solved by gradient descent. The update is expressed in terms of the iteration number, the iteration step size, and the objective function.
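The general shape of such a gradient descent update can be sketched as follows: at each iteration the weight moves against the gradient of the objective, scaled by the step size. The objective used in the example is a toy quadratic chosen only for illustration and is not the objective function of this paper; the clipping of the weight to [0, 1] and the function name are likewise assumptions.

```python
def gradient_descent(grad, w0, step=0.1, iterations=100):
    """Minimise an objective over a single weight by gradient descent.

    grad: function returning the derivative of the objective at w.
    w0: initial weight.
    step: iteration step size.
    iterations: number of update steps.
    """
    w = w0
    for _ in range(iterations):
        w = w - step * grad(w)     # move against the gradient
        w = min(1.0, max(0.0, w))  # assumption: keep the weight in [0, 1]
    return w


if __name__ == "__main__":
    # Toy objective J(w) = (w - 0.3)^2 with derivative 2 * (w - 0.3).
    print(gradient_descent(lambda w: 2.0 * (w - 0.3), w0=0.9))
```

For a convex objective such as the toy quadratic, the iterates approach the unique minimizer regardless of the starting point, provided the step size is small enough.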

Among them

Thus, the final objective function is obtained:

The results of repeated experiments are summarized in Figure 3. They show that blindly increasing the control parameter brings no further improvement in detection accuracy, so an appropriate number of key-region control parameters is chosen on the basis of these experiments. This choice ensures that gradient descent converges to a unique solution at the global optimum.

4. Simulation Experiment

4.1. Data Sets and Test Standards

To test the speed and accuracy of the algorithm, the training data in this paper are taken from the ImageNet dataset, which contains more than 14 million images covering more than 20,000 categories [15], as illustrated in Figure 4. ImageNet is a very large image library for visual training, and most research on image classification, localization, and detection is based on it. It has been widely used in computer vision research papers and has almost become the “standard” dataset for testing the performance of deep learning algorithms on images.

4.2. Experiments and Results

The experiments were run on a laptop computer using Python 3.6, TensorFlow v0.12.0, Pickle, OpenCV-Python, and Matplotlib (optional), and the data were analyzed using MATLAB 2014a. To evaluate the detection performance of the algorithm, this paper compares it with the traditional SSD algorithm and with the algorithm from the literature [14].

To test the retrieval efficiency of the algorithm, the time taken to evaluate the image windows and the time required to detect the target are analyzed. Table 1 shows that, in the testing phase, the window evaluation of this algorithm takes 0.04 s longer than that of the traditional SSD algorithm, while its target detection time is 0.1 s shorter. In the training phase, its evaluation time is 0.05 s shorter and its target detection time is 0.65 s shorter than those of the traditional SSD algorithm.

Table 1 shows that the time consumption of this algorithm in the testing and training phases is better than that of the traditional SSD algorithm but slightly worse than that of the algorithm in [14]. Besides time consumption, a target detection algorithm must also be evaluated on the accuracy and success rate of detection. Therefore, this paper measures the detection precision and success rate in scenarios of different complexity. In this test, the selected image material was ordered by complexity from low to high (food, vegetable, bird, person). The data are shown in Table 2.

As Table 2 shows, the target detection accuracy of this algorithm is better than that of the traditional SSD algorithm in all of the tested environments, and the gap in detection accuracy and success rate widens as the complexity of the image increases. The detection accuracy of this algorithm is maintained at about 86%, whereas that of the algorithm in [14] gradually decreases as the complexity increases. The detection performance across a variety of complex scenarios shows that the algorithm has a certain degree of universality.

To test whether the algorithm remains accurate when the target is occluded, two sets of experiments were carried out. One set uses food images, which have low image complexity, and the other uses person images, which have the highest complexity. Data were collected with each of the above algorithms, and the results are shown in Figures 5 and 6.

Figures 5 and 6 show that the detection rate of every algorithm gradually decreases as the target coverage rate increases, but the values obtained with this algorithm stay above those of the other two algorithms. When the target coverage rate reaches 60% or more, the detection rates of both the algorithm in [14] and the traditional SSD algorithm begin to decline drastically. The detection rate of the proposed algorithm also drops, but more slowly than that of the other two methods, which supports the feasibility of the feature extraction.

To test the detection performance of the algorithm in a real environment, we selected everyday traffic scenes and extracted the people and vehicles appearing in the traffic in real time. The resulting detections are shown in Figures 7 and 8.

Figures 7 and 8 show that the algorithm can accurately locate the targets in a complicated traffic area and then track and identify them. Although targets are occluded by vehicles or pedestrians during the process, they are not lost, which validates the feasibility of the algorithm.

5. Conclusion

We propose a target detection method based on the SSD algorithm and feature extraction fusion. Building on the traditional SSD algorithm, the method adopts an adaptive strategy that dynamically adjusts the search window according to changes in the operating state of the image, and it uses weighted-correlation feature fusion to combine the deep appearance feature with the deep motion feature, which improves the accuracy of the objective function. The experiments show that the algorithm maintains high and stable detection performance across image environments of different complexity and is well suited to target detection in changeable environments. However, the time consumption of the algorithm has not yet been effectively reduced. This will be a focus of future research.

Data Availability

The data used to support the findings of this study are available with the consent of the corresponding author.

Conflicts of Interest

The authors declare that they have no competing interests.

Authors’ Contributions

The authors have equally contributed to the manuscript. All authors read and approved the final manuscript.


This article is supported by Sun Yat-Sen University Xinhua College 2017 School-Level Scientific Research Startup Fund General Project: Research and Design of Target Trajectory Tracking System Based on Depth and Visual Information Fusion (Project Code: 2017YB001).