Abstract

Accurately identifying previously observed vehicles in massive surveillance data has become a challenging research topic. The challenge is that vehicles in images vary widely in pose, viewing angle, illumination, and other conditions, and these complex variations seriously degrade recognition performance. In recent years, the convolutional neural network (CNN) has achieved great success in the field of vehicle reidentification. However, because vehicle reidentification datasets contain only a small number of annotated vehicles, existing CNN models are not fully trained, which limits the discriminative ability of the learned model. To solve this problem, a double-channel symmetric CNN vehicle reidentification algorithm is proposed by improving the network structure. In this method, two samples with complementary characteristics are taken as input at the same time. With limited training samples, the input combinations become more diverse, and the CNN model is trained more fully. Experiments show that the recognition accuracy of the proposed algorithm is better than that of the existing methods compared, which verifies the effectiveness of the proposed algorithm.

1. Introduction

In recent years, society has paid more and more attention to public security, and monitoring equipment has become increasingly widespread. A large number of surveillance cameras are deployed in crowded places prone to public security incidents, such as traffic intersections, parks, large shopping malls, stations, and airports. Surveillance cameras have brought great convenience to the investigative work of public security organs, such as suspect vehicle pursuit, cross-scene vehicle search, and abnormal event detection [1, 2]. Together, these cameras form a huge surveillance network. Although monitoring systems have developed rapidly, they have also brought great challenges to the management and analysis of monitoring data [3, 4]. At present, most monitoring systems rely on real-time cameras watched by human operators. The massive volume of monitoring data is a serious problem for the personnel in charge of watching the video, for two reasons: (1) the monitoring system generates data in real time, resulting in a huge amount of data; (2) the real-time data record scenes that change randomly, and it is difficult for monitoring staff to stay attentive over long periods of observation. This kind of monitoring mechanism with human participation is therefore no longer adequate for the management and analysis of monitoring data. Vehicle reidentification technology overcomes this deficiency of human-in-the-loop supervision.

In recent years, deep learning models represented by the convolutional neural network (CNN) have achieved great success in the field of computer vision. CNNs have also led research in the field of vehicle reidentification. Compared with traditional vehicle reidentification methods based on handcrafted features, CNN-based methods cope with the complex appearance changes of vehicles more effectively and achieve higher performance. However, vehicle reidentification differs from other computer vision tasks in that annotating vehicles is very difficult, so existing datasets contain only a small number of annotated vehicles. With such limited training sets, an existing single-channel CNN model cannot be trained sufficiently. To make the input more diverse, multiple combinations of images can be used as input so that the CNN is trained more fully. At the same time, the recognition rate improves because a double-channel CNN takes in more features.

This study designs a double-channel symmetric CNN structure for vehicle reidentification by improving the network structure. In this double-channel structure, two samples are input at a time. Compared with the previous single-channel CNN model, the input combinations of the double-channel CNN model are more diverse, which helps the deep learning model learn a more discriminative representation.

The task of vehicle reidentification [5, 6] is to accurately identify, in massive monitoring data, a vehicle that has previously appeared on a particular occasion; the monitoring data are mainly image data. The challenge of the task is that vehicles in images undergo large changes in pose, viewing angle, and other complex variations. In addition, different lighting conditions during capture can greatly change the appearance of a vehicle. These changes seriously affect the performance of vehicle recognition. At present, research on target reidentification focuses mainly on pedestrian reidentification [7–10] and is rarely applied to other targets. Since 2015, a small number of scholars have entered the field of vehicle reidentification, but their methods can only be applied to images of the same scale and angle, are weakly robust to environmental changes, or are based on small datasets.

To improve re-ID capability, some methods use additional attribute information, such as the model/type and color, to guide vision-based representation learning [11]. For example, [12] introduced a two-branch retrieval pipeline to extract differences between models and instances. Yan et al. [13] studied the multi-grain relationships of vehicles with multilevel attributes. Other works study temporal and spatial associations, which derive additional benefits from the topological information of the cameras [14]. In addition, some methods use a GAN [15] to generate images from the required viewpoint, so as to achieve viewpoint alignment. These works thus address the problem of viewpoint change through viewpoint alignment.

In addition, [16] argued that handcrafted features are complementary to deep features, so the two kinds of features were combined to obtain an improved representation. Liu et al. [17] used multimodal information, including visual features, license plate, camera position, and other contextual information, in a coarse-to-fine vehicle retrieval framework. To augment the training data and achieve robust training, [18] used a generative adversarial network to synthesize vehicle images with different orientations and appearance variations. Zhou and Shao [15] learned a viewpoint-aware representation for vehicle re-ID through an attention model and adversarial learning. Zhang et al. [19] proposed an improved triplet loss that is jointly optimized with an auxiliary classification loss, which acts as a regularizer to represent the intra-sample variance.

3. Single-Channel CNN Structure

The single-channel CNN structure is introduced first in this section; the double-channel CNN structure is detailed in the next section.

On the vehicle reidentification training set, an identification-based single-channel CNN model is trained so that the resulting deep learning model can distinguish different vehicles. The method builds on existing classic CNN models: all the convolutional layers and fully connected layers of the AlexNet [20] and ResNet-50 [21] models are used. The default parameters provided in [20, 21] are adopted, and the number of outputs of the last fully connected layer is changed to the total number of different vehicles in the vehicle reidentification training set. The CNN model of the single-channel method is fine-tuned from a model pretrained on the ImageNet dataset [22], which makes the CNN model converge faster. This training strategy is especially effective when the vehicle reidentification training set is not very large, and it achieves the purpose of distinguishing different vehicles.

The network training process is as follows. The vehicle reidentification training set is denoted $S = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a vehicle image and $y_i$ is its identity (ID). Each vehicle image is first resized to a fixed size and then randomly cropped to the fixed input size of the network, which differs between AlexNet and ResNet-50. The processed vehicle image is sent to the data layer of the CNN model as the network input. The goal of network training is to obtain a deep learning model M, which is equivalent to a mapping $f(x; \theta) \rightarrow y$, where $\theta$ represents the parameters of each layer in the CNN model. In each minibatch iteration, the parameters are updated using the stochastic gradient descent (SGD) algorithm. In iteration $t$, the current parameters $\theta_t$ are updated as follows:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t; B_t),$$

where $\eta$ is the learning rate, $B_t$ is a set of minibatch samples drawn randomly from $S$, $\nabla_{\theta}$ is the gradient operator, and $L$ is the loss function, here the softmax loss. The softmax loss acts as a supervisory signal that guides the network training process. As training progresses, the value of the loss function gradually decreases until the network converges.
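
The minibatch update described above can be illustrated with a minimal sketch. It is not the authors' Caffe implementation: a small linear classifier stands in for the CNN, and the toy features, the learning rate of 0.5, the batch size of 2, and all function names are assumptions chosen only to show how a softmax loss drives an SGD step.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scores(weights, x):
    """Class scores of a linear layer (a stand-in for the CNN forward pass)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mean_softmax_loss(weights, samples):
    """Average softmax (cross-entropy) loss over (feature, label) samples."""
    total = sum(-math.log(softmax(scores(weights, x))[y]) for x, y in samples)
    return total / len(samples)

def sgd_step(weights, batch, lr):
    """Update the weights in place with one minibatch gradient step."""
    grad = [[0.0] * len(weights[0]) for _ in weights]
    for x, y in batch:
        probs = softmax(scores(weights, x))
        for c, p in enumerate(probs):
            delta = p - (1.0 if c == y else 0.0)  # d(loss)/d(score_c)
            for j, v in enumerate(x):
                grad[c][j] += delta * v
    for c, row in enumerate(weights):
        for j in range(len(row)):
            row[j] -= lr * grad[c][j] / len(batch)

# Toy data (assumed): 2-D features for two vehicle IDs, 0 and 1.
data = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
weights = [[0.0, 0.0], [0.0, 0.0]]
rng = random.Random(0)

initial_loss = mean_softmax_loss(weights, data)
for _ in range(50):
    sgd_step(weights, rng.sample(data, 2), lr=0.5)  # random minibatch of 2
final_loss = mean_softmax_loss(weights, data)
```

With all weights at zero the classifier is uniform over the two IDs, so the initial loss equals ln 2; the loss on the toy data falls as the iterations proceed, mirroring the convergence behaviour described in the text.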

In the vehicle reidentification stage, the deep learning model M obtained from network training is used as the feature extractor. Each vehicle image in the probe set and the gallery set is passed through the network, and the response of an intermediate layer is extracted as its feature: the FC7 layer for AlexNet and the Pool5 layer for ResNet-50. Cross-camera retrieval is then performed on these image features, that is, the distance between the features of each sample in the probe set and the samples in the gallery set is calculated. The distances are sorted, and the final vehicle reidentification performance is evaluated on the sorted list.
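
The retrieval step can be sketched as follows. The text does not state which distance is used, so the Euclidean distance here is an assumption, as are the function names and the toy 2-D features (real FC7 or Pool5 responses are much larger vectors).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(probe_feature, gallery):
    """Return gallery image ids sorted by increasing distance to the probe.

    gallery is a list of (image_id, feature) pairs.
    """
    scored = [(euclidean(probe_feature, feat), image_id) for image_id, feat in gallery]
    scored.sort()
    return [image_id for _, image_id in scored]

# Toy gallery (assumed values).
gallery = [("g1", [0.0, 1.0]), ("g2", [1.0, 0.0]), ("g3", [0.9, 0.2])]
ranking = rank_gallery([1.0, 0.1], gallery)
```

The resulting ranked list is what the evaluation metrics are computed on: the closer the true match is to the front of the list, the better the score.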

4. Proposed Double-Channel Symmetric CNN Structure

The vehicle reidentification method based on the proposed double-channel symmetric CNN structure is described in this section. The overall structure of the model is shown in Figure 1 (taking the AlexNet model as an example). Compared with the existing single-channel CNN model, the proposed double-channel symmetric CNN model takes two samples as input at the same time, so the input combinations are more diverse. The two channels have the same structure in every intermediate layer and can therefore be considered symmetric, but they do not share parameters. By connecting the last fully connected layers of the two channels, the layers of the double-channel model interact with and reinforce each other, which can be considered complementary.

The goal of training the identification model is to learn an optimal mapping for a given training set, so that the predictions for vehicles are as close as possible to their real identities (IDs). On the one hand, the richer the samples in the training set, the stronger the generalization ability of the resulting model. On the other hand, the images of a particular vehicle differ markedly in appearance because they are collected across cameras. By combining different images of the same vehicle, the samples complement each other and narrow these differences in appearance. The designed structure is therefore better suited to learning a deep model with stronger discriminative ability, which improves the performance of vehicle reidentification.

In the proposed double-channel symmetric CNN structure, two vehicle images are input at a time, and the two images belong to the same vehicle. The sample pairs are formed by combining all samples of the same vehicle pairwise, over all permutations. The preprocessing of a vehicle image before it is sent to the network data layer is the same as in the single-channel method. The convolutional layers and fully connected layers of the two channels have the same structure and settings, and each CNN channel is fine-tuned from a model pretrained on the ImageNet dataset. An example using the AlexNet model is shown in Figure 1. The fully connected layers FC6 and FC7 in each channel are connected to that channel's convolutional layers. The FC7 layers of the two channels are concatenated, and the result is denoted FC7_concat. Each FC7 layer has 4096 dimensions; FC7_concat has 8192 dimensions. The three fully connected layers (the two FC7 layers and the FC7_concat layer) are each connected to a fully connected layer FC8. The number of outputs of the FC8 layer, N3, equals the total number of vehicles in the training set.
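
The pairing scheme can be illustrated with a short sketch. It assumes that "full permutation" means every ordered pair of two distinct images of the same vehicle; whether an image is also paired with itself is not stated in the text, and self-pairs are excluded here. The file names and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def build_pairs(samples):
    """Build ordered same-vehicle image pairs.

    samples is a list of (image_name, vehicle_id); the result is a list of
    (image_a, image_b, vehicle_id) with image_a != image_b.
    """
    by_id = defaultdict(list)
    for name, vehicle_id in samples:
        by_id[vehicle_id].append(name)
    pairs = []
    for vehicle_id, names in by_id.items():
        for a, b in permutations(names, 2):  # ordered pairs, no self-pairs
            pairs.append((a, b, vehicle_id))
    return pairs

# Toy training set (assumed): 3 images of vehicle 0 and 2 images of vehicle 1.
samples = [("a1.jpg", 0), ("a2.jpg", 0), ("a3.jpg", 0), ("b1.jpg", 1), ("b2.jpg", 1)]
pairs = build_pairs(samples)
```

Under this assumption a vehicle with n images contributes n(n-1) input pairs instead of n single inputs, which is the sense in which the double-channel input is more diverse: the five toy images above yield 3·2 + 2·1 = 8 pairs.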

Three softmax loss functions are used as the supervisory signals that guide the network training process, and the sum of the three losses is used as the network loss. If the backbone of the proposed double-channel symmetric CNN structure is replaced by the ResNet-50 network, the last layer of each channel is the pooling layer Pool5 rather than the fully connected layer FC7, so Pool5 is used instead of FC7, and the concatenated Pool5 layers are denoted Pool5_concat. Each Pool5 layer has 2048 dimensions, and Pool5_concat has 4096 dimensions. The network training strategy and process of the proposed double-channel symmetric CNN structure are the same as in the single-channel method.
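
The combined loss can be sketched as below. The sketch assumes that each of the three feature layers (the two per-channel features and their concatenation) feeds its own FC8-style linear classifier; whether these classifiers share weights is not stated in the text. Toy 2-D features replace the real 4096-D or 2048-D responses, and all names are hypothetical.

```python
import math

def softmax_loss(logits, label):
    """Cross-entropy of softmax(logits) against an integer label."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def linear(weights, x):
    """Linear classifier: one score per class (one row of weights per class)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def two_channel_loss(feat_a, feat_b, head_a, head_b, head_concat, label):
    """Sum of three softmax losses: channel A, channel B, and their concatenation."""
    feat_concat = feat_a + feat_b  # list concatenation, e.g. 4096 + 4096 = 8192
    loss_a = softmax_loss(linear(head_a, feat_a), label)
    loss_b = softmax_loss(linear(head_b, feat_b), label)
    loss_c = softmax_loss(linear(head_concat, feat_concat), label)
    return loss_a + loss_b + loss_c, feat_concat

# Toy setup (assumed): 3 vehicle IDs, 2-D features per channel, zero-initialised heads.
num_ids = 3
head_a = [[0.0, 0.0] for _ in range(num_ids)]
head_b = [[0.0, 0.0] for _ in range(num_ids)]
head_concat = [[0.0, 0.0, 0.0, 0.0] for _ in range(num_ids)]
total_loss, feat_concat = two_channel_loss([0.2, 0.8], [0.3, 0.7],
                                           head_a, head_b, head_concat, label=1)
```

With zero-initialised heads each classifier is uniform over the three IDs, so each term equals ln 3 and the total is 3 ln 3. At test time only the concatenated feature is kept as the image representation.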

In the vehicle reidentification stage, the deep learning model obtained from network training is used as a feature extractor. The response of an intermediate layer (the FC7_concat layer for AlexNet and the Pool5_concat layer for ResNet-50) is extracted as the feature representation of each vehicle image in the probe set and the gallery set. Cross-camera search is then performed on these image features: the distances between the image features in the probe set and the gallery set are calculated and sorted. The final vehicle reidentification performance is evaluated on the ranked list.

5. Experimental Results and Analysis

5.1. Dataset Construction

The test vehicle dataset is collected by 4 different intersection monitoring platforms, whose installation locations are shown in Figure 2. Video is recorded every 2 hours at viewing angles spaced 30° apart, yielding MP4 videos at a total of 7 angles from front to back. A set T of 20,160 complex-scene multivehicle images is then extracted from the videos at intervals of 10 seconds. To avoid the problem, encountered in most datasets, of vehicles that have no positive samples, the cameras are installed on each exit section of the loop. As shown in Figure 2, each vehicle is captured twice regardless of which of the intersections a to d it passes through (except for vehicles that repeatedly enter the road segment). A total of 45,742 identifiable vehicles larger than 128 pixels are extracted from T and denoted as D. Of these, 80% are randomly selected to form the training set, and the remaining 20% form the test set.
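
A random 80/20 split of this kind can be sketched as follows. The text does not say whether the split is made per vehicle crop or per vehicle identity, so the sketch simply splits a list of items; the seed and the function name are assumptions.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Integer indices stand in for the 45,742 extracted vehicles in D.
train_set, test_set = split_dataset(range(45742))
```

For 45,742 items this gives 36,593 training items and 9,149 test items, with no item appearing in both parts.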

5.2. Experimental Setup and Evaluation Criteria

The deep learning framework Caffe [23] is used to implement the proposed method. The hardware configuration used in the experiment is as follows: a GTX 1080 GPU with 8 GB of video memory, 128 GB of main memory, and an 8-core Intel Core i7 CPU with a clock frequency of 3.60 GHz.

The cumulative matching characteristic (CMC) curve, rank-1 accuracy, and mean average precision (MAP) were selected to evaluate the performance of the vehicle reidentification method. The CMC curve represents the probability that the ground-truth image for a query appears in candidate lists of different lengths. The rank-1 accuracy represents the probability that the ground-truth image appears in the first position of the candidate list. The MAP is the mean, over all query samples, of the area under the precision-recall curve, and it reflects the overall performance of the vehicle reidentification method.
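
The three metrics can be computed from ranked lists as in the sketch below. Each ranked list holds the vehicle IDs of the gallery images in order of increasing distance; the function names and the toy rankings are assumptions, and average precision is computed in the usual way as the mean of the precision values at the ranks where a correct match occurs.

```python
def average_precision(ranked_ids, query_id):
    """Mean of precision@k over the ranks k at which a correct match appears."""
    hits, precisions = 0, []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def cmc_curve(rankings, max_rank):
    """cmc[k-1] is the fraction of queries whose first match is within the top k."""
    counts = [0] * max_rank
    for query_id, ranked_ids in rankings:
        for position, gallery_id in enumerate(ranked_ids[:max_rank]):
            if gallery_id == query_id:
                for k in range(position, max_rank):
                    counts[k] += 1
                break
    return [c / len(rankings) for c in counts]

def mean_average_precision(rankings):
    """Average of the per-query average precision."""
    return sum(average_precision(r, q) for q, r in rankings) / len(rankings)

# Toy rankings (assumed): (query vehicle ID, ranked gallery vehicle IDs).
rankings = [
    ("v1", ["v1", "v2", "v1"]),  # matches at ranks 1 and 3: AP = (1/1 + 2/3) / 2
    ("v2", ["v3", "v2", "v1"]),  # match at rank 2: AP = 1/2
]
cmc = cmc_curve(rankings, max_rank=3)
rank1 = cmc[0]
m_ap = mean_average_precision(rankings)
```

In this toy case the rank-1 accuracy is 0.5, the CMC curve reaches 1.0 at rank 2, and the MAP is (5/6 + 1/2) / 2 = 2/3.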

5.3. Experimental Result

Using the AlexNet and ResNet-50 model frameworks, the experimental results of the single-channel CNN method and the double-channel symmetric CNN method are compared in Table 1.

The results show that the double-channel symmetric method consistently improves on the single-channel method. With the AlexNet model, the rank-1 accuracy increased by 5.13% and the MAP increased by 4.26%. With the ResNet-50 model, the rank-1 accuracy improved by 0.71% and the MAP improved by 2.44%.

In addition, with the ResNet-50 model, the proposed method achieves a rank-1 accuracy of 74.36% and a MAP of 49.55%. At this point, the vehicle reidentification performance of the proposed method reaches a higher level.

The proposed method is also compared with some existing vehicle reidentification methods, including traditional handcrafted methods and deep learning-based methods. The detailed comparison results are shown in Table 2. The results show that the proposed method achieves competitive vehicle reidentification performance, which is better than that of some existing vehicle reidentification methods.

To further verify the effectiveness of the algorithm, the existing VeRi-776 dataset [28] is used for validation. The VeRi-776 dataset was captured by 20 cameras in an urban area over 24 hours and contains 49,357 images of 776 vehicles. The images are captured in a real-world unconstrained surveillance scenario and are tagged with different attributes, such as type, color, and brand. Each vehicle is photographed by 2–18 cameras under different viewpoints, lighting, resolution, and occlusion. In the experiment of this study, the images from 2 cameras were selected as experimental data. The results are shown in Table 3. According to the data, the proposed method achieves better vehicle reidentification performance than the other algorithms compared.

6. Conclusion

To further improve the performance of vehicle reidentification, this study proposes a vehicle reidentification method based on a double-channel symmetric CNN structure. Using the original training samples, the algorithm takes two samples with complementary characteristics as input at the same time. With limited training samples, the input combinations become more diverse, which enriches the training process of the CNN model. The CNN model can therefore be trained more fully, and a deep learning model with stronger recognition ability can be obtained. The vehicle training image library was extracted from the monitoring video of different intersections, and the proposed algorithm was then compared with other algorithms. Experimental results show that the vehicle recognition accuracy of the proposed algorithm is higher than that of the other existing algorithms compared, which verifies the effectiveness of the proposed method.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Hunan Natural Science Foundation (no. 2019JJ40097), in part by the Youth Research Foundation of Hunan Education Department (nos. 20B247 and 17B107), in part by the Outstanding Youth Research Foundation of Hunan Province (no. 2020JJ2015), in part by the Hunan Natural Science Foundation (no. 2019JJ40096), in part by the Hunan Natural Science Foundation (no. 2020JJ4327), in part by the Research Foundation of Science and Technology Bureau of Yongzhou City, China (nos. 2019YZKJ08 and 2019YZKJ10), and in part by the Construct Program of Applied Characteristic Discipline in Hunan University of Science and Engineering.