Abstract

To solve the problems of low pedestrian detection accuracy for real traffic cameras and the high missed detection rate of small-target pedestrians, this paper combines an autoencoding neural network with AdaBoost to construct a fast pedestrian detection algorithm. Aiming at the problems that a single high-level output feature map cannot adequately express pedestrian features and that existing methods cannot effectively select appropriate multilevel features, this paper improves the traditional AdaBoost algorithm structure, that is, it resets the sample weight update formula and the strong classifier output formula, and proposes a two-input AdaBoost-DBN classification algorithm. Moreover, in view of the problem that the fused video does not play smoothly, this paper considers the motion information of the video objects, performs pixel interpolation by motion compensation, and restores the frame rate of the original video by reconstructing the dropped interframe images. Experimental research shows that the algorithm constructed in this paper is effective.

1. Introduction

Pedestrian detection is an enduring research direction in the field of computer vision and an extremely practical subproblem of the larger problem of target detection. Its applications have a great impact on our daily lives and can be directly applied to indoor and outdoor mobile robots, self-driving cars, security monitoring, and other scenarios [1].

Pedestrian detection technology can detect pedestrians in front of the vehicle in time so that corresponding measures can be taken, which makes it a vital research topic in vehicle-assisted driving. At present, combined with other technologies, automatic vehicle driving has been realized. This technology can not only ensure traffic safety but also free people from the work of driving a car. Research on vehicle driving technology has been favored by more and more research institutes, universities, and companies at home and abroad and has become a focus of common concern in industry and academia [2].

Pedestrian detection technology enables intelligent robots to perceive the people around them, analyze their behavior, and respond to human instructions so that they can better serve humans. It is applied not only in intelligent monitoring, vehicle-assisted driving, and intelligent robots; as research into intelligence continues, it can be applied to any field where people appear and services need to be provided. In particular, in recent years, the Internet and other media have rapidly transitioned from text and images to video, which has rapidly increased the amount of video information. Pedestrians are the part of a video sequence that attracts the most attention, so pedestrian detection in video sequences has become an indispensable and important task in machine vision research and holds a pivotal position for subsequent target tracking and behavior analysis [3].

2. Related Work

Pedestrian detection technology, as a research hotspot and focus in the field of computer vision, has attracted the attention of many research institutions at home and abroad. Domestic and foreign researchers have proposed a large number of excellent pedestrian detection algorithms and obtained good research results. The literature [4] uses the template matching method to establish a layered human body template for the target to be detected, then compares and matches each area of the image to be detected, and judges whether it is the required detection target according to the final matching similarity. However, this algorithm is computationally complicated and easily affected by external factors such as the environment, so it has not been greatly developed or applied. The literature [5] combined Haar features and SVM classifiers to realize pedestrian detection. The literature [6] proposed the influential histogram of oriented gradients (HOG) feature; this method detects pedestrians through the combination of the image's HOG features and a single SVM classifier and achieves a good detection effect. The literature [7] proposed a pedestrian detection algorithm based on HOG features and edge symmetry. This method first filters out possible pedestrian areas in the image based on the symmetry of pedestrians' vertical edges and then uses the HOG + SVM method to detect pedestrians in those areas, which improves the detection rate to a certain extent. The literature [8] used the image's multilevel local binary pattern (LBP) and multilevel histogram of oriented gradients (HOG) as feature sets to detect the head and shoulders of the human body in the image, reducing the dimensionality of the multilevel feature set and achieving accurate statistics on the number of people in complex scenes.
With the rapid development of deep learning and related theories, methods such as artificial neural networks have gradually been applied to pedestrian detection. The literature [9] replaced the sliding-window detection method of traditional algorithms with a neural network method and accurately classified the image features to be detected, thereby determining the region to be detected. Currently widely used pedestrian detection methods can be roughly divided into three types: motion-analysis-based methods, template-matching-based methods, and statistical-learning-based methods [10]. The motion-analysis-based method mainly uses the differences in image information of moving objects between frames, performing differential operations on corresponding pixels in adjacent frames of the video sequence. It then compares the difference result with a preset threshold, obtains the motion information in the video sequence from the comparison result, and uses the characteristic information of the motion area to further detect whether the object is a pedestrian [11]. The template-matching-based method defines a hierarchical human body template for the pedestrian target to be detected, compares and matches each area of the image against the template, and finally determines whether the target is a pedestrian based on the matching similarity [12].

The statistical-learning-based method obtains a classifier representing pedestrians by training on image sample data and then uses the classifier to classify the image to be detected [13]. Statistical methods can be divided, according to their classifiers, into support vector machine (SVM)-based, AdaBoost-based, and neural network-based methods. SVM is based on the principle of structural risk minimization; its main idea is to determine a linear function that correctly separates new sample data with the maximum-margin hyperplane [14]. AdaBoost is an iterative algorithm. When applied to pedestrian detection, its main idea is to combine different weak classifiers trained on the same set of sample data into an ultimate strong classifier and then use the strong classifier to search the input image and determine whether it contains pedestrians [15]. The neural network is a research hotspot in machine vision and has been widely applied; in pedestrian detection, it mainly extracts high-dimensional image features and classifies and recognizes pedestrians according to those features. With the continuous development of computer hardware, combined with research results in pedestrian detection theory, researchers and institutions at home and abroad have proposed a series of practical designs [16]. The literature [17] used algorithms such as Gaussian mixture background modeling and moving target detection and tracking. Moreover, it used CCS compiler optimization, software pipelining, algorithm code optimization, the TI function library, and other optimization methods to implement a complete video target tracking system on the DM6437 hardware platform.
The literature [18] designed an effective pedestrian detection system on the DM6437-embedded DSP platform by analyzing the operating cycles of the related modules and combining them with the characteristics of the hardware platform. The literature [19] used the DM8168 hardware platform to propose a pedestrian classification and detection algorithm combining a foreground enhancement detection algorithm with the CENTRIST operator, which realizes effective pedestrian detection on the DSP platform. The literature [20] designed a detection system for pedestrians waiting to cross the road and implemented it on the TMS320C40 platform; the experimental results proved that it has a good detection rate. Based on the TMS320C6455 platform, the literature [21] proposed an algorithm based on wavelet pyramid decomposition, combined with a tunnel filtering algorithm, to achieve multitarget detection of cars, pedestrians, and bicycles in complex scenes. The literature [22] designed a pedestrian detection hardware platform based on the DM6437 and studied a DSP car-assisted driving system that can automatically detect pedestrians and issue warnings with high detection efficiency.

3. Basic Theory of Texture Feature Extraction

The texture feature describes how an image repeats or changes in space and reflects the homogeneity of the image; it is a feature description consistent with the human visual perception system. Texture features can describe the detailed information and the changing trend of local areas of an image, so they are used effectively in the field of image classification.

Local binary pattern (LBP) is mainly used for texture classification. Because of its simple calculation, it has also been applied to target tracking, facial expression recognition, medical imaging, and image classification. The method considers a square neighborhood in the image and encodes, in binary format, the intensity difference between the center pixel and the N neighboring pixels on a circle of radius R. It assigns a binary value (0 or 1) to each neighboring pixel based on the intensity difference and finally performs a weighted summation to obtain the LBP value of the center pixel. In the same way, every pixel value in the image is replaced with an LBP value. The calculation formula of LBP is as follows:

LBP_{N,R} = Σ_{p=0}^{N−1} s(g_p − g_c) · 2^p,  where s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise.

In the above formula, R and N represent the radius and the number of neighboring pixels, respectively, g_c represents the center pixel, and g_p represents a neighborhood pixel. An example of the LBP calculation is shown in Figure 1, where R = 1 and N = 8.
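As a concrete illustration, the 3 × 3 LBP computation above can be sketched in a few lines of Python. The clockwise neighbor ordering starting at the top-left corner is an assumption; implementations differ in where the weighting starts and in which direction it runs.

```python
# Illustrative sketch of the 3x3 LBP computation (R = 1, N = 8).

def lbp_3x3(patch):
    """patch: 3x3 list of lists of pixel intensities; returns the LBP
    value of the center pixel."""
    center = patch[1][1]
    # Neighbors in clockwise order starting from the top-left corner
    # (an assumed convention).
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    value = 0
    for p, (r, c) in enumerate(coords):
        bit = 1 if patch[r][c] >= center else 0  # threshold against the center
        value += bit << p                        # weighted summation: bit * 2^p
    return value
```

A flat patch maps to all-ones (value 255), while a center brighter than every neighbor maps to 0, matching the thresholding rule in the formula.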

For each neighboring pixel g_i of the center pixel g_c, the LNDP mode value is calculated according to the following process:

The differences between each neighboring pixel g_i and its two ring-adjacent neighboring pixels are computed, and each neighboring pixel is assigned a binary value F based on these two difference values:

For the center pixel g_c, these binary values are then combined to calculate the LNDP mode value:

Figure 2 shows the calculation process of the LNDP mode over the windows. Figures 2(a) and 2(b) show the numbering and pixel intensities of the neighboring pixels, and the remaining windows illustrate the calculation of the LNDP mode for each neighboring pixel. In window (f), the differences between one neighboring pixel and its two adjacent pixels are computed through formula (2), giving "1" and "5," respectively; since both differences are positive, the pixel is assigned "1" by formula (3). The mode values obtained in the same way for the other pixels are shown in window Figure 2(c). As shown in window Figure 2(d), each mode value is multiplied by its weight, and the LNDP value is obtained by summing the weighted values in window Figure 2(e).
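The per-neighbor LNDP encoding can be sketched as follows. Since formulas (2) and (3) are not reproduced above, the rule for mixed-sign differences is an assumption: a neighbor receives bit 1 only when both of its differences with the two ring-adjacent pixels are nonnegative, which matches the worked example in Figure 2.

```python
# Rough sketch of the LNDP window encoding; the mixed-sign rule is assumed.

def lndp(neighbors):
    """neighbors: the 8 ring pixels of a 3x3 window, in circular order;
    returns the LNDP value for the window's center pixel."""
    n = len(neighbors)
    value = 0
    for i in range(n):
        d1 = neighbors[i] - neighbors[(i - 1) % n]  # difference with previous ring pixel
        d2 = neighbors[i] - neighbors[(i + 1) % n]  # difference with next ring pixel
        bit = 1 if (d1 >= 0 and d2 >= 0) else 0     # assumed binary rule
        value += bit << i                           # weight and sum, as with LBP
    return value
```

Note that, unlike LBP, the center pixel never enters the encoding; only relationships among the neighbors matter.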

The traditional LBP mode compares the center pixel with its neighboring pixels to generate a binary mode value. This mode only considers the magnitude relationship between pixels, not the amount of difference between them. Figure 3 shows two completely different local structures that yield the same binary code after LBP encoding. This shows that LBP encoding loses a lot of local information, and LBP also ignores the influence of neighboring pixels on one another's binary encoding.

The local neighborhood intensity pattern (LNIP) calculates the LNIP mode value, as shown in Figure 4. The figure shows the adjacency relationship of each of the 8 neighborhood pixels of the center pixel g_c: when i is odd, the neighboring pixel g_i has 4 adjacent pixels, and when i is even, g_i has 2 adjacent pixels. Its mathematical definition is as follows:

For the sign mode of LNIP, we first calculate the signs of the relative differences between each neighboring pixel g_i of the center pixel and its corresponding adjacent pixels, obtaining an M-bit pattern, where M is the number of adjacent pixels, as shown in formula (6). Similarly, the signs of the relative differences between the central pixel g_c and the same adjacent pixels are calculated, as shown in the following formula:

By performing a bitwise XOR operation on the two bit patterns, the structural change between them is calculated; the M-bit mode value can then be obtained by formula (7). The count symbol represents the number of 1s in the resulting binary number, and formula (7) finally determines the binary value of the window. Among them, when i is odd, M = 4, and when i is even, M = 2. The sign mode value of LNIP can be obtained by the following formula:

When calculating the amplitude mode of the window, the concept of statistical dispersion is taken into account. To measure this dispersion, the mean absolute deviation of a particular pixel is computed and compared with that of the center pixel. Specifically, the mean absolute deviation between each neighboring pixel and its corresponding adjacent pixels, and the mean absolute deviation of those adjacent pixels relative to the center pixel, are calculated. The former is then compared against the latter as a threshold to determine the binary bit value of the neighboring pixel in the window. The calculation formula is as follows:
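The amplitude comparison described above can be sketched as follows. Since the paper's exact formula is not reproduced here, the choice of threshold (the center pixel's mean absolute deviation from the same adjacent pixels) is an assumption based on the description.

```python
# Sketch of the per-neighbor amplitude bit; the threshold choice is assumed.

def lnip_magnitude_bit(neighbor, center, adjacent):
    """adjacent: the ring pixels adjacent to `neighbor` (2 or 4 of them).
    Returns 1 when the neighbor deviates from its adjacent pixels at least
    as much as the center does."""
    dev_neighbor = sum(abs(neighbor - a) for a in adjacent) / len(adjacent)
    dev_center = sum(abs(center - a) for a in adjacent) / len(adjacent)
    return 1 if dev_neighbor >= dev_center else 0
```

The full amplitude mode would weight and sum these bits over the 8 neighbors, in the same fashion as the sign mode.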

Figure 3 illustrates this with an example: two very different pixel blocks yield the same code under the LBP mode.

In Figure 4, Figures 4(a) and 4(b) represent the distribution of the center pixel and the neighboring pixels, and Figures 4(c)–4(j) represent the calculation process of the sign mode and amplitude mode of LNIP.

4. AdaBoost Algorithm

In the AdaBoost algorithm process, the original dataset is first initialized: each sample is assigned an equal initial weight, and these sample weights are used in the next iteration of the algorithm. If a sample is misclassified, its weight increases; otherwise, its weight decreases. Through this mechanism, the AdaBoost algorithm focuses on the samples that are difficult to classify. Figure 5 is a schematic diagram of the AdaBoost algorithm. First, the training set is initialized with weights and weak learner 1 is trained; the sample weights are then updated according to the classification errors of weak learner 1, that is, the weights of correctly classified samples decrease and the weights of misclassified samples increase. In this way, misclassified samples receive more attention in subsequent iterations. The same procedure is repeated until the iterations end, and the weak learners are integrated to form a strong learner.

The core idea of the algorithm is to linearly weight several homogeneous classifiers to form a strong classifier, so the heart of the algorithm is calculating the weight of each classifier: after a weak classifier is given, the initial dataset is first fitted by the weak classifier, the classifier's weight is obtained from the fitting result, and the sample weights of the dataset are adjusted accordingly.

The AdaBoost algorithm process is as follows:

(1) Initialize the sample weights: with no prior knowledge about the initial training samples, each of the N training samples is given the weight w_i = 1/N, and D = {w_1, ..., w_N} represents the weight set.
(2) Calculate the classifier error: the weight of the classifier is computed from the accuracy of each classification; the higher the classification accuracy, the larger the classifier's weight. Here h_t(x_i) represents the predicted category of sample x_i in the tth iteration and y_i represents the true label of sample x_i: e_t = Σ_i w_i · I(h_t(x_i) ≠ y_i).
(3) Calculate the weight of the classifier: according to the classification error obtained in step (2), the proportional coefficient of the weak classifier in the final strong classifier is α_t = (1/2) ln((1 − e_t)/e_t).
(4) Update the sample weights: adjust the weight of each sample according to the classification result in step (2), where w_i^(t+1) represents the weight of the ith sample in the next iteration: w_i^(t+1) = w_i^(t) exp(−α_t y_i h_t(x_i)) / Z_t, with Z_t a normalization factor.
(5) Output the strong classifier: the classifier weights and the corresponding classifications are linearly superimposed to form the strong classifier H(x) = sign(Σ_t α_t h_t(x)).
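The steps above can be sketched for the two-class case as follows. The one-dimensional threshold "stump" weak learner is an illustrative stand-in, not the paper's DBN base classifier.

```python
import math

def train_adaboost(xs, ys, rounds=10):
    """xs: list of floats; ys: labels in {-1, +1}.
    Returns a list of (alpha, threshold, sign) weak classifiers."""
    n = len(xs)
    w = [1.0 / n] * n                            # step 1: equal initial weights
    model = []
    for _ in range(rounds):
        best = None
        for thresh in sorted(set(xs)):           # enumerate candidate stumps
            for sign in (1, -1):
                preds = [sign if x >= thresh else -sign for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thresh, sign, preds)
        err, thresh, sign, preds = best          # step 2: weighted error
        err = max(err, 1e-10)                    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # step 3: classifier weight
        w = [wi * math.exp(-alpha * y * p)       # step 4: reweight samples
             for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]             # normalize
        model.append((alpha, thresh, sign))
    return model

def predict(model, x):
    # step 5: sign of the weighted vote of all weak classifiers
    score = sum(a * (s if x >= t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1
```

Misclassified samples gain weight after each round, so later stumps concentrate on the hard cases, exactly the mechanism described for Figure 5.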

As an improved boosting algorithm, AdaBoost has a strong integration capability. For a variety of homogeneous classifiers, the algorithm can build a good strong classification model, and it is not prone to overfitting even after repeated training.

As a representative boosting algorithm, AdaBoost has been widely used. However, because the AdaBoost algorithm places overly strict requirements on the classification accuracy of the base classifier, it lacks universal applicability. This section improves the algorithm with respect to this shortcoming. In addition, combined with the DBN base classifier, the strong classifier output formula of the AdaBoost algorithm is reset.

Aiming at the problem that the AdaBoost algorithm is too strict about the classification accuracy of weak classifiers, Zhu et al. proposed the SAMME algorithm, which calculates the classifier weight differently, as shown in the following formula:

α_t = ln((1 − e_t)/e_t) + ln(M − 1).

The classifier weight adds a positive term ln(M − 1) to the original formula, where M represents the total number of categories. This term brings a great improvement to the performance of the algorithm. The original AdaBoost algorithm requires the accuracy of each weak classifier to be greater than 1/2, which is difficult for general multiclass classifiers to meet. The improved classifier weight only requires the classification error of the classifier to be less than (M − 1)/M, that is, the classifier only needs to do better than random guessing among M classes, so as the number of categories increases, the accuracy requirement on the classifier becomes lower and lower. At the same time, according to the sample weight update rule, the SAMME algorithm gives larger weights to misclassified samples when the sample weights are updated. When M = 2, the SAMME algorithm is the same as the AdaBoost algorithm.
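The effect of the extra term can be checked numerically. This small sketch assumes the SAMME weight takes the form stated above; note that the classical two-class AdaBoost weight carries an extra conventional factor of 1/2.

```python
import math

def samme_alpha(err, M):
    """SAMME classifier weight for error rate `err` and M categories."""
    return math.log((1 - err) / err) + math.log(M - 1)

# For M = 2 the extra term vanishes and the weight matches the classical
# AdaBoost form (up to the conventional factor of 1/2):
assert abs(samme_alpha(0.3, 2) - math.log(0.7 / 0.3)) < 1e-12

# For M = 10, even a 75% error rate still yields a positive weight,
# because 0.75 is below the random-guessing bound (M - 1) / M = 0.9:
assert samme_alpha(0.75, 10) > 0
```

This is exactly why the relaxed requirement matters: a multiclass weak learner barely better than chance can still contribute positively to the ensemble.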

It can be seen from the above analysis that the regular term added to the classifier weight formula greatly increases the universality of the AdaBoost algorithm. However, this regular term is not given arbitrarily; it is derived from the forward stagewise additive model with a multiclass exponential loss function. The expression of the two-class exponential loss function is as follows:

L(h, f(x)) = exp(−h · f(x)).

In the above formula, f(x) is the output of the classifier, and h ∈ {−1, +1} is the classification label.

The above formula is extended to the multiclass exponential loss function, as follows:

L(h, f(x)) = exp(−(1/M) · h^T f(x)).

In the above formula, the value of the label vector h is

h_m = 1 if the sample belongs to category m, and h_m = −1/(M − 1) otherwise.

In the formula, f_m(x) is the output value for the mth category determined by the classifier, and M is the total number of categories in the dataset. Here, the classifier outputs need to be restricted; according to the above formula, the symmetric constraint f_1(x) + f_2(x) + ... + f_M(x) = 0 is added. In this case, when M = 2, the multiclass exponential loss function reduces to the two-class exponential loss function.

The design of the ensemble model uses the DBN classifier as the base classifier. According to the way the DBN structure outputs classification results, the formula for the output strong classifier is improved. The original formula is

H(x) = argmax_m Σ_t α_t · I(h_t(x) = m).

Decomposing this formula: the indicator I(a) equals 1 when the logical expression a is true and 0 otherwise, so the output of the formula is a matrix composed of 0s and 1s, where 1 means the classification is correct and 0 means the classification is wrong. Finally, the argmax function takes, for each sample, the category with the maximum row value. The improvement made in this article is

H(x) = argmax_m Σ_t α_t · p_t^(m)(x),

where p_t^(m)(x) represents the proportion the DBN assigns to category m before output, usually a decimal. The classifier weight then weights these proportions rather than hard 0/1 votes, which better reflects the result of each classification.
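The improved output rule can be sketched as follows; the probability vectors in the example are made-up stand-ins for DBN outputs.

```python
def strong_classifier_output(alphas, prob_rows):
    """alphas: weight of each base classifier; prob_rows: one per-class
    probability vector per base classifier, for a single sample.
    Returns the index of the winning class."""
    n_classes = len(prob_rows[0])
    scores = [0.0] * n_classes
    for alpha, probs in zip(alphas, prob_rows):
        for m, p in enumerate(probs):
            scores[m] += alpha * p  # weighted sum of class proportions
    return max(range(n_classes), key=scores.__getitem__)
```

Compared with 0/1 indicator votes, a confident base classifier (large probability) can outweigh several lukewarm ones even when their hard votes disagree.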

The DBN network structure is determined by four quantities: the number of input layer nodes, the number of output layer nodes, the number of hidden layers, and the number of hidden layer nodes. This article uses the following empirical formula to determine the network parameters:

S = √(m + n) + k.

Among them, S is the number of hidden layer nodes, m is the number of input layer nodes, n is the number of output layer nodes, and k is a constant between 1 and 10. This empirical formula gives candidate values for the number of hidden layer nodes, and the final number is determined through experiments. Since m and n are known, S is determined by the value of k. According to the above formula, there are p candidate values for the number of nodes in the first hidden layer:

Once the number of nodes in the first hidden layer is determined, that layer serves as the input layer for determining the number of nodes in the second hidden layer. According to the above formula, the second hidden layer likewise has q candidate values:

Similarly, the number of nodes in each subsequent hidden layer can be determined from the number of nodes in the layer before it. The number of hidden layer nodes is then adjusted according to the training results, and the optimal value is finally determined.
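Assuming the empirical rule takes the common form S = √(m + n) + k with k ranging over 1 to 10, the candidate sizes for one layer can be enumerated as follows; the paper then narrows the candidates down experimentally, layer by layer.

```python
import math

def hidden_size_candidates(m, n):
    """m: input nodes of this layer; n: output nodes.
    Returns the candidate hidden sizes for k = 1..10."""
    return [round(math.sqrt(m + n)) + k for k in range(1, 11)]

# Once a first-layer size is chosen from these candidates, it plays the
# role of m when sizing the next hidden layer.
```

For example, with m = 20 inputs and n = 5 outputs the rule yields candidates 6 through 15 for the first hidden layer.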

Since the DBN network first trains the RBM layers and then performs error-feedback training on the BP layer, the network parameters of the RBM layers and the BP layer need to be set separately. Since the initial parameter values need to be adjusted during training, the DBN training process is also a batch training process.

The maximum number of training cycles is another factor that affects the DBN classification effect and also needs to be adjusted during training. The learning rate and momentum factor are likewise important parameters that affect network performance.

5. Fast Pedestrian Detection Algorithm Based on Autoencoding Neural Network and AdaBoost

The framework of the autoencoding and decoding deep network is shown in Figure 6. It can be seen from the figure that the network is similar to the autoencoding network and consists of two processes: encoding and decoding.

The codec deep network structure is shown in Figure 7. Its network structure differs from the autoencoding network: the coding process is composed of multiple nonlinear mapping layers [h1, h2, h3]. The three nonlinear mapping layers compress the feature dimensions layer by layer so that the coding process removes redundant features as much as possible and retains the features that remain after the face pose changes are removed.

The convolutional autoencoder is an unsupervised model. Unlike a standard autoencoder, it no longer uses full connections in encoding and decoding but uses convolution operations instead. The convolutional autoencoder likewise contains an input layer, hidden layer, and output layer, and its training steps also include encoding, decoding, and parameter learning. The encoder applies a convolution operation to the input layer to obtain the hidden layer; this process is called convolutional encoding. The decoder applies a deconvolution operation to the hidden layer to reconstruct an output layer with the same dimensions as the input layer; this is called convolutional decoding. Finally, the error between the output layer and the input layer is calculated, and several iterations are performed to minimize this error and obtain the convolutional autoencoder parameters. Figure 8 is a schematic diagram of the operation of the convolutional autoencoder.

The motion estimation algorithm based on block matching is simple and easy to implement. Its basic principle is to find, within the search range of the reference frame and according to a certain matching criterion, the block with the smallest matching error relative to the current block. The relative displacement between the current block and the matched block is the motion vector (MV), which describes the motion track of the block, as shown in Figure 9.
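A minimal full-search version of this block-matching idea is sketched below, with frames as plain 2D lists and the sum of absolute differences (SAD) as the matching criterion; both the exhaustive search pattern and the SAD criterion are illustrative choices, since the paper does not fix them here.

```python
def best_motion_vector(ref, cur, top, left, block, search_range):
    """Find the motion vector (dy, dx) of the `block` x `block` patch of `cur`
    whose top-left corner is (top, left), by exhaustive search in `ref`."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ty, tx = top + dy, left + dx
            if ty < 0 or tx < 0 or ty + block > h or tx + block > w:
                continue  # candidate block would leave the reference frame
            sad = sum(abs(ref[ty + r][tx + c] - cur[top + r][left + c])
                      for r in range(block) for c in range(block))
            if best is None or sad < best[0]:
                best = (sad, dy, dx)  # keep the smallest matching error
    return best[1], best[2]           # the motion vector (dy, dx)
```

Repeating this for every block of the current frame yields the motion field used for motion-compensated interpolation of the dropped frames.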

Among commonly used classification algorithms, the Naive Bayes classifier is simple to implement, but its assumption of independently distributed data is rarely met in practical applications. The K-nearest neighbor algorithm is simple and effective, but its data processing is comparatively expensive. Multilayer neural networks achieve high classification accuracy at the cost of setting a large number of initial parameters. The support vector machine (SVM) focuses on classification with small sample sets and is the simplest choice for linear classification problems. Therefore, the SVM classifier is the most suitable for the two-class problem in this paper, which only needs to distinguish pedestrians from nonpedestrians. The flow chart of pedestrian classification is shown in Figure 10.

The target detection network detects pedestrians in a given frame of the night-vision video sequence, and the KCF tracking model then tracks the targets for several frames, forming an alternating detection-tracking mode. The cycle is triggered intermittently: 2 seconds after the previous detection completes, the system detects and tracks the subsequent video again, and the cycle continues until the video ends or is terminated. The 2-second interval is chosen because the entire process of a target pedestrian appearing, being noticed by the human eye, and leaving the field of view takes at least 2 seconds. Detecting once every two seconds therefore reduces the number of detection frames while avoiding missed detection of new targets entering the field of view. The multitarget rapid detection-tracking cycle system is shown in Figure 11.
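The detect-track cycle can be outlined schematically as follows; the function names and the frame-counter trigger are illustrative stand-ins for the detection network and the KCF trackers, not the paper's implementation.

```python
def run_pipeline(frames, detect, track, fps=25, interval_s=2):
    """Alternate between full detection (every `interval_s` seconds) and
    cheap tracking on the frames in between."""
    results = []
    trackers = None
    for i, frame in enumerate(frames):
        if i % (fps * interval_s) == 0:
            trackers = detect(frame)            # fresh detections; new targets picked up
        else:
            trackers = track(trackers, frame)   # cheap tracking between detections
        results.append(trackers)
    return results
```

At 25 fps with a 2-second interval, only 1 frame in 50 pays the full detection cost, which is the frame-reduction effect described above.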

The idea of multitarget tracking is to run multiple tracker objects at the same time. Since this article targets roads where cars drive at night and pedestrians are few, the tracking algorithm remains very efficient. The multitarget tracking process is shown in Figure 12.

6. Algorithm Performance Verification

Next, the performance of the fast pedestrian detection algorithm based on the autoencoding neural network and AdaBoost constructed in this paper is verified. To verify the efficiency of the frame-selection fusion and adaptive partition fusion proposed in this paper, they are compared with direct fusion from the perspective of video fusion speed. The speed of video fusion is measured in frames per second (FPS); on the same test video, the average time each fusion algorithm needs to fuse one frame indirectly reflects its fusion speed. To avoid interference from the three factors of scene, halo area, and vehicle speed, the experiment is divided into the following four groups, holding two factors fixed in each group to study the influence of the third on fusion speed: (1) driving at normal speed on main urban roads with a small halo area; (2) driving at normal speed on main urban roads with a large halo area; (3) driving at normal speed in the suburbs with a small halo area; (4) driving at a faster speed in the suburbs with a small halo area. The average time to fuse one frame in each group of recorded videos and the direct fusion processing speed are shown in Table 1 and Figure 13, and the frame-selection partition fusion processing speed is shown in Table 2 and Figure 14.

In summary, the subjective and objective evaluation results of the fused video show that the algorithm proposed in this paper achieves the best video fusion effect: the video is smooth, its content is synchronized, and the speed of video fusion is also significantly improved.

Next, this paper analyzes the pedestrian detection accuracy of the constructed algorithm. A total of 90 sets of pedestrian videos are set up, and the algorithm is used to detect pedestrians and count the detection results. The results are shown in Table 3 and Figure 15.

From the above results, it can be seen that the algorithm constructed in this paper has a high accuracy rate in the pedestrian detection process, so the algorithm in this paper can be applied to practice.

7. Conclusion

To solve the problems of low pedestrian detection accuracy for real traffic cameras and the high missed detection rate of small-target pedestrians, and to improve the speed and accuracy of pedestrian detection, this paper studies pedestrian detection technology. By analyzing the classification performance of the classifier, this paper combines the AdaBoost ensemble idea with the DBN classifier and proposes a classification algorithm based on an autoencoding neural network and AdaBoost-DBN. Aiming at the problem that the AdaBoost algorithm is too strict about the classification accuracy of weak classifiers, this paper improves the classifier weight formula, which effectively relaxes the accuracy requirement on weak classifiers. Moreover, this paper improves the traditional AdaBoost algorithm structure, that is, it resets the sample weight update formula and the strong classifier output formula and proposes a two-input AdaBoost-DBN classification algorithm. In addition, to solve the problem of the fused video not playing smoothly, this paper considers the motion information of video objects, performs pixel interpolation with motion compensation, and restores the frame rate of the original video by reconstructing the dropped interframe images. Finally, this paper designs experiments to verify the performance of the proposed algorithm. The results show that the research meets the expected goals of the algorithm construction.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Science Research and Innovation Team of Fuyang Normal University (kytd202004), the Outstanding Talent Development Programme of the College of Information Engineering, Fuyang Normal University (2018FXJT02), the Anhui Provincial Education Department's Excellent Youth Talent Support Program (gxyqZD2020054), the Project of the Industry-University-Research Innovation Fund (2018A01010), and the Anhui Provincial Education Department's Excellent Youth Talent Support Program (no. gxyq2017159).