Abstract

With image analysis as the core of multitarget detection and intelligent tracking, mostly applying the Faster R-CNN or YOLO framework, the MOTA score for multitarget tracking is low in complex working environments. Therefore, further research into computer vision techniques is carried out to design a new multitarget detection and intelligent tracking method. Based on the pinhole (small-aperture) imaging model, the principle of lens distortion is analysed, and a camera calibration and image calibration scheme is designed to obtain effective environmental images. An attention mechanism is introduced to optimise the structure of the deep learning network, and a computer vision detection algorithm based on it is applied to complete regional multitarget detection. The distance between each target and the vehicle body is then measured using binocular vision principles. Finally, the spatiotemporal context algorithm is applied in simulation to obtain the multitarget intelligent tracking results. The experimental results show that the mean MOTA score of the proposed technique is 0.87 in the night environment, 24.14% and 28.374% higher than the neural network-based and machine vision-based tracking methods, respectively; in the daytime environment, the mean MOTA score is 0.94, 28.72% and 22.34% higher than the two comparison methods.

1. Introduction

The progress of modern technology has pushed the automotive industry towards intelligence [1], and it can be said that the level of development of multitarget detection and intelligent tracking technology directly determines the degree of intelligence of a vehicle. As urban traffic scenarios become increasingly complex [2], the technical requirements for intelligent driving grow accordingly. To ensure the stability and safety of automated vehicle driving, it is necessary to first detect targets in the driving area and track their movement trends in order to generate highly intelligent driving decisions [3]. At the same time, a key aspect of smart vehicle operation is environmental perception: only with sufficiently clear knowledge of the road environment around the driving area can smart driving vehicles be safely integrated into the traffic environment. However, existing multitarget detection and intelligent tracking technologies are sensitive to lighting and weather conditions, so their detection and tracking performance often fails to meet intelligent driving requirements.

With in-depth research on computer vision, techniques with deep learning at their core have begun to be applied in various fields. The paper takes this as its research direction and introduces an attention mechanism to further optimise current computer vision technology, adding a subregion feature library and an aspect ratio feature library to the original detection model to improve the feature representation capability of the computer vision-based multitarget detection model and obtain accurate localisation and recognition results. On this basis, a new multitarget detection and intelligent tracking technique is established to stably and quickly track the movement trends of multiple target objects in the vehicle's surrounding environment. The experimental validation results show that applying the proposed technique yields more accurate multitarget tracking.

This paper consists of four chapters. The first chapter is the introduction. The second chapter introduces the design of multitarget detection and intelligent tracking based on computer vision. The third chapter is the empirical analysis, in which the designed multitarget detection and intelligent tracking algorithm is tested. The fourth chapter is the conclusion.

2. Multitarget Detection and Intelligent Tracking Technology

2.1. Camera Calibration and Image Calibration Programme

The first step in tracking is identification. A binocular camera with a wide-angle lens is mounted on the front of the smart driving vehicle as the main device for sensing the smart driving environment. Considering that radial and tangential distortions exist when the camera captures images [4], a camera calibration and image calibration scheme is established with the objective of reducing the position errors caused by lens distortion, as a fundamental part of computer vision-based multitarget detection and intelligent tracking [5]. Camera imaging largely follows the pinhole (small-aperture) imaging model: a target point is identified in three-dimensional space and on the camera imaging plane, respectively, and the correspondence between the two can be expressed as follows:

$$I\begin{bmatrix} g \\ h \\ 1 \end{bmatrix} = K\begin{bmatrix} R & T \end{bmatrix}\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix} \tag{1}$$

In the formula, $(a, b, c)$ represents the coordinates of the target point in the world coordinate system, $(a', b')$ represents the coordinates of the target point in the camera coordinate system, $(g, h)$ represents the coordinates of the target point in the image pixel coordinate system, $I$ represents the scaling factor, $R$ represents the rotation matrix, $T$ represents the translation matrix, and $K$ represents the internal parameter matrix, defined as follows:

$$K = \begin{bmatrix} d_1 & s & u_0 \\ 0 & d_2 & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$

In the formula, $d_1$ and $d_2$ represent the scale factors on the $u$ and $v$ axes, respectively, $(u_0, v_0)$ represents the coordinates of the principal point (the origin of the pixel coordinate system), and $s$ represents the axis inclination (skew) parameter. Equations (1) and (2) describe the ideal imaging of the camera directly, but in practice, a nonlinear distortion model needs to be added to describe the imaging point shift:

$$\begin{aligned} a'_d &= a'\left(1 + p_1 r^2 + p_2 r^4\right) + 2k_1 a'b' + k_2\left(r^2 + 2a'^2\right) \\ b'_d &= b'\left(1 + p_1 r^2 + p_2 r^4\right) + k_1\left(r^2 + 2b'^2\right) + 2k_2 a'b' \end{aligned} \tag{3}$$

In the equation, $(a', b')$ represents the ideal coordinate value of the target point, $(a'_d, b'_d)$ represents the actual distorted coordinate value of the target point, $p_1$ and $p_2$ represent the radial distortion factors, $r$ represents the radius ($r^2 = a'^2 + b'^2$), and $k_1$ and $k_2$ represent the tangential distortion factors. Based on the ideal and distorted coordinates, the camera calibration process is completed and reasonable parameter values are obtained.
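For concreteness, the following is a minimal sketch of the projection model (1)-(2) and the distortion model (3) in NumPy; the intrinsic values and the test point are illustrative placeholders, not calibrated data.

```python
import numpy as np

def project_point(world_pt, K, R, T):
    """Map a world point (a, b, c) to ideal pixel coordinates (g, h), per (1)."""
    cam = R @ world_pt + T            # world -> camera coordinates
    uvw = K @ cam                     # camera -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]           # divide out the scaling factor I

def distort(x, y, p1, p2, k1, k2):
    """Apply radial (p1, p2) and tangential (k1, k2) distortion, per (3),
    to ideal normalized image coordinates (x, y)."""
    r2 = x * x + y * y
    radial = 1 + p1 * r2 + p2 * r2 ** 2
    xd = x * radial + 2 * k1 * x * y + k2 * (r2 + 2 * x * x)
    yd = y * radial + k1 * (r2 + 2 * y * y) + 2 * k2 * x * y
    return xd, yd

# Illustrative intrinsics: scale factors d1, d2, zero skew, principal point (u0, v0).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R, T = np.eye(3), np.zeros(3)
print(project_point(np.array([0.1, 0.2, 2.0]), K, R, T))  # -> (360., 320.)
```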

In practice, the Zhang Zhengyou calibration method is combined with a calibration plate composed of black and white squares: images of the plate are taken to build a photo library for calibration, an optimisation method is applied to iteratively solve for the camera parameters with minimum reprojection error as the goal [6], and the optimal internal and external parameters and distortion parameters of the camera are determined. After camera calibration is completed, the distorted pixel coordinates of the actual photos taken are mapped to their ideal coordinates to complete the image calibration.
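As a hedged illustration of this step, the sketch below uses OpenCV's standard calibration pipeline; the checkerboard geometry, file-path pattern, and termination criteria are assumptions for the example rather than the exact settings used in the paper.

```python
import glob
import cv2
import numpy as np

CORNERS = (12, 13)      # inner corners of a 13 x 14-square board (assumed)
SQUARE_MM = 20.0        # square size from the experiment description

# Ideal 3D corner positions on the planar calibration board, z = 0.
objp = np.zeros((CORNERS[0] * CORNERS[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:CORNERS[0], 0:CORNERS[1]].T.reshape(-1, 2) * SQUARE_MM

obj_pts, img_pts = [], []
for path in glob.glob("calib/left_*.png"):   # assumed file layout
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, CORNERS)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

# Iteratively solves for intrinsics, distortion, and per-view extrinsics by
# minimizing reprojection error, then undistorts a captured frame.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
undistorted = cv2.undistort(cv2.imread("frame.png"), K, dist)
```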

2.2. Computer Vision Multitarget Detection Algorithms

A calibrated camera is applied to capture images of the smart driving surroundings, and then computer vision techniques are applied for multitarget detection. Deep-learning-based computer vision detection can be influenced by external factors in practical applications, which biases the detection results. The paper therefore applies computer vision principles and introduces attention mechanisms into a conventional deep learning network [7] to establish the multitarget detection framework shown in Figure 1 for feature extraction, feature pooling, and classification regression of camera-acquired images.

According to Figure 1, applying a convolutional neural network for computer vision multitarget detection first requires acquiring image convolutional features for classification and regression analysis. In the paper, a subregion feature attention module and an aspect ratio feature attention module [8] are introduced into the original multitarget detection model to obtain the computer vision target detection block diagram shown in Figure 2. Introducing these two modules helps distinguish individual targets within a multitarget scene more effectively, so higher recognition accuracy can be obtained for target recognition.

According to Figure 2, the updated computer vision target detection model contains the attention module, which mainly operates in the ROI feature extraction process: it extracts regular-pattern features for further processing and combines them with the original ROI pooled features to generate high-quality ROI classification features:

$$F_p = B_p \oplus \left(B_p \otimes M_1\right) \oplus \left(B_p \otimes M_2\right) \tag{4}$$

In the formula, $p$ represents the ROI, $B_p$ represents the ROI pooling features, $F_p$ represents the ROI classification features, $M_1$ represents the subregion attention feature map, $M_2$ represents the aspect ratio attention feature map, $\otimes$ denotes element-wise multiplication, and $\oplus$ denotes element-wise addition.
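A minimal sketch of this fusion, assuming the attention maps share the spatial size of the pooled ROI features; the tensor shapes (and the use of PyTorch) are illustrative assumptions.

```python
import torch

def fuse_roi_features(B_p: torch.Tensor,
                      M1: torch.Tensor,
                      M2: torch.Tensor) -> torch.Tensor:
    """B_p: (N, C, 7, 7) pooled ROI features; M1, M2: (N, 1, 7, 7) maps.

    The element-wise products re-weight the pooled features by each
    attention map; summing them back onto B_p keeps the original signal."""
    return B_p + B_p * M1 + B_p * M2

B_p = torch.randn(8, 256, 7, 7)            # 8 ROIs, 256 channels, 7x7 pooling
M1 = torch.sigmoid(torch.randn(8, 1, 7, 7))  # subregion attention map
M2 = torch.sigmoid(torch.randn(8, 1, 7, 7))  # aspect ratio attention map
F_p = fuse_roi_features(B_p, M1, M2)       # (8, 256, 7, 7) classification features
```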

The new attention module contains two attention feature libraries, each of which holds the corresponding attention feature activation relationships. Among them, the features held in the subregion attention feature library are associated with spatial location information [9], and based on the location of each feature point in the ROI subregion, the formula for calculating the subregion attention salience value is expressed as follows:

$$s^{\beta}_{i,j} = f_{\beta}\left(U_{i,j}\right) \tag{5}$$

In the formula, $(i, j)$ denotes a point in the convolutional feature map, $\beta$ denotes a subregion, $s^{\beta}_{i,j}$ denotes the attention salience value, $U_{i,j}$ denotes the feature vector at that point, and $f_{\beta}$ denotes the attention feature extractor for subregion $\beta$.
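As an illustration, the sketch below realises $f_\beta$ as one learned 1x1 convolution per subregion of a grid partition of the ROI; the grid size and channel count are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SubregionAttention(nn.Module):
    """Each ROI subregion beta gets its own attention feature extractor
    f_beta (here a 1x1 convolution), applied to the feature vector U at
    every point (i, j) of that subregion, per (5)."""

    def __init__(self, channels: int = 256, grid: int = 3):
        super().__init__()
        self.grid = grid
        # One extractor per subregion of a grid x grid partition of the ROI.
        self.extractors = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(grid * grid)])

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        n, _, h, w = U.shape
        M1 = U.new_zeros(n, 1, h, w)
        sh, sw = h // self.grid, w // self.grid
        for idx, f_beta in enumerate(self.extractors):
            r, c = divmod(idx, self.grid)
            ys, xs = slice(r * sh, (r + 1) * sh), slice(c * sw, (c + 1) * sw)
            M1[:, :, ys, xs] = torch.sigmoid(f_beta(U[:, :, ys, xs]))
        return M1  # subregion attention salience map

M1 = SubregionAttention()(torch.randn(2, 256, 9, 9))  # 3x3 grid of subregions
```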

The features saved within the aspect ratio attention feature library describe actual feature attributes of the observed viewpoint and pose morphology of the target object, extracting horizontal and vertical scale differences in the target detection framework and better separating target class differences. A deep learning network structure incorporating these attention mechanisms is applied to run the computer vision technique and obtain multitarget detection results during intelligent driving.

2.3. Target Distance Measurement Programme

After target detection is complete, the target distance measurement method is designed based on the binocular vision principle to determine the distance between each target point and the vehicle body. Applying the pinhole model imaging principle [10], image acquisition is carried out during intelligent driving, and the coordinate systems of the pinhole model are shown in Figure 3.

In Figure 3, $Oabc$ represents the 3D world coordinate system, $O'a'b'$ represents the image plane coordinate system, and the camera coordinate system, the image plane coordinate points, and the corresponding 3D coordinate points are also marked in the figure.

Suppose a target point exists in space and the coordinates of its image points in the two calibrated camera coordinate systems are known. Combined with the projection matrices, the perspective projection transformation relationships can be expressed as follows:

$$I_L\begin{bmatrix} g_L \\ h_L \\ 1 \end{bmatrix} = P_L\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix} \tag{7}$$

$$I_A\begin{bmatrix} g_A \\ h_A \\ 1 \end{bmatrix} = P_A\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix} \tag{8}$$

In the formula, the subscript $L$ denotes the left camera, the subscript $A$ denotes the right camera, $P_L$ denotes the projection matrix of the left camera, and $P_A$ denotes the projection matrix of the right camera. Based on the left and right camera perspective projection transformation relationships shown in (7) and (8), the spatial coordinates of the target can be deduced from the known image point coordinates, and the distance measurement results can be obtained by comparing the coordinate information. It is important to note that camera images in complex environments can contain considerable noise, which affects the accuracy of the distance measurement results. In this case, the least squares method can be applied to obtain more accurate spatial coordinates of the measured point.
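The following sketch shows the least-squares (DLT-style) triangulation this implies, assuming $P_L$ and $P_A$ are the 3x4 projection matrices obtained from calibration; the function names are illustrative.

```python
import numpy as np

def triangulate(P_L, P_A, uv_L, uv_A):
    """Recover the target's spatial coordinates (a, b, c) from (7)-(8).

    Each camera contributes two linear equations in the homogeneous point;
    with noise, the stacked 4x4 system is solved in the least-squares
    sense via SVD, as the text suggests."""
    rows = []
    for P, (u, v) in ((P_L, uv_L), (P_A, uv_A)):
        rows.append(u * P[2] - P[0])   # u * (row3 . X) = row1 . X
        rows.append(v * P[2] - P[1])   # v * (row3 . X) = row2 . X
    A = np.stack(rows)                 # homogeneous system A X = 0
    # Least-squares solution: right singular vector of the smallest
    # singular value, then de-homogenize.
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

# Distance from the vehicle body (camera origin) to the target point:
# dist = np.linalg.norm(triangulate(P_L, P_A, uv_L, uv_A))
```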

2.4. Multitarget Intelligent Tracking

Relying on computer vision technology, the correlation between the target object and the local scene needs to be analysed for intelligent tracking of the multiple detected and localised targets [11]. The paper applies the spatiotemporal context (STC) algorithm based on the Bayesian framework for simulation and calculation, clarifying the intensity and location correlation between the target region and its local context in the image, and then applies maximisation of a confidence function [12] to track the target location in real time. The confidence function can be expressed as follows:

$$c(\mathbf{x}) = b\,e^{-\left|\frac{\mathbf{x}-\mathbf{x}^{*}}{\alpha}\right|^{\beta}} \tag{9}$$

In the formula, $\mathbf{x}$ is the target position, $\mathbf{x}^{*}$ is the target region centre position, $c(\mathbf{x})$ is the confidence level, $b$ is the normalisation factor, $\alpha$ is the scale parameter, and $\beta$ is the shape parameter.
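A small sketch of this confidence map on an image grid; the parameter values are illustrative (the STC literature commonly uses $\beta = 1$).

```python
import numpy as np

def confidence_map(shape, center, b=1.0, alpha=2.25, beta=1.0):
    """Evaluate (9) at every pixel: peaks at the target centre x*."""
    ys, xs = np.indices(shape)
    dist = np.hypot(ys - center[0], xs - center[1])   # |x - x*|
    return b * np.exp(-np.power(dist / alpha, beta))

c = confidence_map((64, 64), (32, 32))   # maximum confidence at (32, 32)
```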

For an image frame acquired in real time, the local contextual feature set of the target region it contains can be represented as follows:

$$D = \left\{\left(G(z), z\right) \mid z \in \Omega_c\left(\mathbf{x}^{*}\right)\right\} \tag{10}$$

In the formula, $D$ represents the local context feature set, $z$ represents a selected location, $G(z)$ represents the image intensity at $z$, and $\Omega_c(\mathbf{x}^{*})$ represents the local context region around the target centre. Based on the concept of local context analysis [13], the multitarget intelligent tracking model can be described as in Figure 4.

According to Figure 4, the STC algorithm-based multitarget intelligent tracking model essentially performs the multitarget intelligent tracking task through a maximum confidence function search. The target image frame is analysed to obtain its corresponding spatial context model, and the spatiotemporal context model of the next image frame is represented as follows:

$$H^{stc}_{t+1} = (1 - \rho)H^{stc}_{t} + \rho\,h^{sc}_{t} \tag{11}$$

In the formula, $t$ denotes the image frame number, $H^{stc}_{t}$ denotes the spatiotemporal context model, $\rho$ denotes the learning rate factor, and $h^{sc}_{t}$ denotes the spatial context model. According to (11), the spatiotemporal context model of the next frame is derived by weighting the spatiotemporal context model of the current frame together with its spatial context model [14]. Therefore, the confidence function calculation formula can be updated as follows:

$$c_{t+1}(\mathbf{x}) = \mathcal{F}^{-1}\left(\mathcal{F}\left(H^{stc}_{t+1}(\mathbf{x})\right) \odot \mathcal{F}\left(G_{t+1}(\mathbf{x})\,w_{\sigma_t}\left(\mathbf{x} - \mathbf{x}^{*}_{t}\right)\right)\right) \tag{12}$$

In the formula, $\mathcal{F}$ denotes the Fourier transform, $\mathcal{F}^{-1}$ denotes the inverse Fourier transform, $\odot$ denotes element-wise multiplication in the frequency domain (equivalent to convolution in the spatial domain), and $w_{\sigma}$ denotes the weighted Gaussian function. Based on the results of the maximum confidence calculation, the position of the target point within each frame is determined. The STC algorithm on its own, however, only describes the target position change at the whole-pixel level. To obtain more intuitive target tracking results, optical flow algorithms [15] are fused in to resolve the subpixel displacement across consecutive image frames and obtain the multitarget intelligent tracking results.
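A hedged sketch of one tracking step under these formulas, with NumPy's FFT standing in for $\mathcal{F}$; the learning rate value and the way the spatial context model $h^{sc}$ is supplied are illustrative assumptions.

```python
import numpy as np

def gaussian_weight(shape, center, sigma):
    """Weighted Gaussian w_sigma centred on the current target position."""
    ys, xs = np.indices(shape)
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
                  / (2.0 * sigma ** 2))

def stc_step(frame, H_stc, h_sc, center, sigma, rho=0.075):
    """One STC update: evaluate (12), locate the confidence peak, then
    blend the spatial context model into H_stc per (11)."""
    # Element-wise product in the frequency domain = spatial convolution.
    context = frame * gaussian_weight(frame.shape, center, sigma)
    conf = np.real(np.fft.ifft2(np.fft.fft2(H_stc) * np.fft.fft2(context)))
    new_center = np.unravel_index(np.argmax(conf), conf.shape)
    H_stc = (1.0 - rho) * H_stc + rho * h_sc   # temporal update (11)
    return new_center, H_stc
```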

3. Experiment

A multitarget detection and tracking technique has been designed based on computer vision technology, and an experimental analysis is carried out to verify its effectiveness in practice. A binocular camera is mounted on an ordinary car to perform multitarget detection and tracking in night and day scenes, and the detection and tracking results are used to assess the validity of the research content in the paper.

3.1. Camera Calibration

The implementation of computer vision technology needs to be image-based. During the experiments, a camera calibration process is carried out before the camera is fixed to the car in order to capture more faithful images. First, a 13 × 14 black-and-white checkerboard calibration grid is created, with each square measuring 20 mm × 20 mm, as shown in Figure 5. The grid is printed out and pasted onto a flat horizontal board to form a visual calibration board.

When applying the binocular camera to capture images, it is necessary to constantly adjust the angle of the visual calibration plate to obtain multiple calibration plate images, as shown in Figure 6.

The calibration plate images captured by the cameras shown in Figure 6 were loaded simultaneously into MATLAB and processed manually through the Calibration Toolbox to extract the corner points contained within each calibration plate image. The optimal internal and external parameters for the left and right cameras were then calculated from the image corner point information, as shown in Tables 1 and 2.

Among them, $f$ indicates the camera focal length parameter, $(u_0, v_0)$ indicates the camera principal point position, $(l_1, l_2)$ indicates the radial distortion coefficients, and $(s_1, s_2)$ indicates the tangential distortion coefficients. Through the camera calibration process, in addition to the internal camera parameters shown in Table 1, the external parameters of the right camera relative to the left camera were also obtained, as shown in Table 2.

After the parameters inside and outside the camera have been adjusted, it is fixed to the vehicle and continuous image acquisition is carried out while the vehicle is in motion as the data required for multitarget detection and intelligent tracking.

3.2. Evaluation Indicators

To demonstrate the reliability of the designed technique, the multiple object tracking accuracy (MOTA) is selected in this experiment to assess the consistency between the intelligent tracking results and the actual trajectory of each target. The intelligent tracking result output by the proposed technique is first obtained to form a tracking path containing multiple nodes; the true position of each target is then surveyed to generate an actual running path containing multiple nodes. By comparing the degree of matching between the two paths and analysing the false detections, missed detections, and incorrect matches that occur during tracking, the MOTA score is calculated as follows:

$$\mathrm{MOTA} = 1 - \frac{m + n + e}{N} \tag{13}$$

In the formula, $m$ represents the number of false targets detected during tracking, $n$ represents the number of missed targets, $e$ represents the number of mismatched targets, and $N$ represents the number of all targets that appear in the image frames.
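A small sketch of the MOTA computation in (13), summed over all frames; the per-frame counts below are made-up example values.

```python
def mota(frames):
    """frames: iterable of (false_positives m, misses n, mismatches e,
    ground-truth targets N) per frame."""
    errors = sum(m + n + e for m, n, e, _ in frames)
    total = sum(N for *_, N in frames)
    return 1.0 - errors / total

print(mota([(1, 0, 0, 12), (0, 2, 1, 12)]))  # 1 - 4/24 = 0.8333...
```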

3.3. Visual Analysis of Tracking Results

Firstly, a complex scene with high footfall was selected as the experimental scene. The experimental vehicle was driven through the scene while recording video in a night environment, and several images were captured from the video sequence to form a multitarget tracking dataset. Using the multitarget detection and intelligent tracking technique proposed in the paper, the dataset was analysed to obtain the multitarget tracking results shown in Figure 7.

Figure 7 shows frames 3, 10, 15, and 25. It can be seen that the scene contains a large number of pedestrians and that their trajectories differ, yet the proposed tracking technique detects essentially all the pedestrians, except in cases where occlusion is too severe. Comparing the four frames shows that the position of each pedestrian changes from frame to frame while the colour of its detection box does not, which indicates a consistent multitarget tracking result. These experiments show that the tracking technique designed in the paper can detect and track multiple targets accurately in complex scenarios.

Afterwards, a day with good lighting conditions was selected to conduct a multitarget detection and intelligent tracking experiment on a relatively secluded street. The vehicle was set to drive through the street at an even speed while the binocular camera captured images of the surrounding scene, forming a second experimental dataset. This dataset was processed by applying the proposed technique to obtain the multitarget tracking results shown in Figure 8.

As can be seen from Figure 8, only a small number of vehicles are moving in the area and no pedestrians pass by, so the cars can be taken as the detection targets; the illustration captures frames 18, 26, 32, and 40 to visualise the tracking results for a moving vehicle. Overall, the proposed technique quickly detects other vehicles from the moment they enter the camera's range, marks them with detection boxes of different colours, and then keeps tracking each vehicle's movement until it leaves the camera's range. In addition, although the experiment was conducted during daylight hours with few targets, the lighting conditions were complex, containing both shaded and brightly lit areas. Nevertheless, the detection and tracking results were not affected by the lighting, showing that a deep learning network with the introduced attention mechanisms ensures the reliability of computer vision-based multitarget detection and tracking.

3.4. Tracking Performance Comparison

To make the experimental results easier to compare, experimental analyses were carried out in the same scenarios by applying the proposed technique, a neural network-based approach, and a binocular vision-based approach, respectively. The number of targets to be detected and tracked was set to keep increasing, and the detection and tracking results of the different methods were recorded. The variation of MOTA scores for the different methods in the night environment is shown in Figure 9.

According to Figure 9, the MOTA scores of the proposed method fluctuate little as the number of targets increases, with an average value of 0.87. The MOTA scores of the other two methods, however, keep decreasing as the number of targets increases: the neural network-based approach drops from a MOTA score of 0.76 to 0.52, with an average of 0.66, while the binocular vision-based approach achieves a maximum MOTA score of 0.78 and a minimum of 0.43, with an average of 0.62. In summary, the MOTA scores of the intelligent tracking results of the proposed technique are 24.14% and 28.374% higher than those of the other two methods.

The change in MOTA scores for the different methods in the daytime environment was then analysed to form the comparative results of MOTA scores shown in Figure 10.

According to Figure 10, the mean MOTA score for the multitarget tracking results of the proposed technique in the daytime environment is 0.94, while the mean MOTA scores of the other two methods are 0.67 and 0.73, respectively. As a result, the proposed technique improves MOTA scores by 28.72% and 22.34% compared to neural network-based and binocular vision-based methods, respectively.

4. Conclusion

Intelligent driving is the development trend of the future automobile industry, and its application in urban scenes depends on the development level of dynamic object tracking technology.

In this paper, traditional deep-learning-based computer vision technology is optimised. By adding an attention mechanism, a new computer vision multitarget detection algorithm is constructed to achieve more accurate and faster multitarget detection. Moreover, using the STC model, the intelligent target tracking algorithm established in this paper also produces more accurate results.

The intelligent tracking method designed in this paper can maintain accurate detection and tracking of multiple targets in complex traffic environments. It performs better than traditional intelligent tracking methods in environments with many targets and changing illumination conditions. Applying the method to the field of intelligent driving is beneficial for enhancing the stability of vehicle driving.

Data Availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.