Abstract

The navigation management systems in autonomous vehicles should be able to gather reliable information about the immediate environment of the vehicle, discern an ambulance from a delivery truck, and react properly to handle any difficult situation. Extracting such information with a vision-based system is computationally demanding in heavy traffic under real-world environmental conditions. Such a scenario calls for a robust moving object detection and tracking system. To achieve this, we make use of stereo vision-based moving object detection and tracking, utilizing the symmetric mask-based discrete wavelet transform (SMDWT) to handle illumination changes, keep the memory requirement low, and avoid fake motions. Accurate motion detection in complex dynamic scenes is done by a combined background subtraction and frame differencing technique. For fast motion tracking, we employ a dense disparity-variance method. This SMDWT-based object detection has a maximum and minimum accuracy of 99.62% and 94.95%, respectively. The motion track has the highest accuracy of 79.47% within the time frame of 28.03 seconds. The lowest accuracy of the system is 62.01% within the time frame of 34.46 seconds. The analysis shows that the proposed method outperforms the existing monocular and dense stereo object tracking approaches in terms of computational cost, accuracy, and the handling of dynamic environments.

1. Introduction

Object tracking now plays a vital role in most computer vision applications, such as man-machine interfaces, robotic vision, and intelligent security systems. In medical imaging, object tracking is used for the diagnosis of diseases [1]. Even though it has been heavily investigated for decades, an accurate and high-performance motion tracking approach is still far from meeting most real-time requirements. The degree of difficulty of this task depends strongly on the dynamic changes in the environment as well as on the type of object to be detected and tracked. In complex dynamic traffic situations, the driver assistance system (DAS) must accurately interpret the scene and react properly to handle any difficult situation. The navigation management system in autonomous vehicles normally depends on various range measurement techniques such as GPS, laser rangefinders, video cameras, and several subsystems to discern where the vehicle is going and to take crucial decisions according to the data coming from these sensors [2].

Similar to human eyes, cameras and their related subsystems can recognize scenes in a way that no other sensor can. Conventional approaches for moving object detection are background subtraction [3, 4], temporal differencing [5], statistical methods [6], and optical flow methods [7, 8]. In real-world scenarios, these methods are sensitive to radiometric variations, sensor noise, and fake motions. Often, the discrete wavelet transform (DWT) [9] and low-resolution techniques are used to handle these situations [10]. The conventional DWT schemes can be performed only at a high computational cost, and increasing the level of decomposition to meet low memory requirements often ends in incomplete object detection. The symmetric mask-based discrete wavelet transform (SMDWT) is an apt choice to meet these demands, as it offers attractive characteristics such as a shorter critical path, independent subband coding, and fast computation [11, 12]. Generally, these approaches are applied to monocular videos, and they often face challenges such as multiple object occlusion, shadow interference, and radiometric changes. Also, the field of view of a single camera should be wide enough to cover the objects in motion. A multicamera system can accurately locate the moving objects without any ambiguity [13]. Accurate localization and segmentation of the objects of interest can be done with the help of depth information extracted from the stereo views [14–16]. This study presents a fast and reliable moving object tracking system using a dense disparity-variance method. Precise motion detection is accomplished by combining the background subtraction and the frame differencing methods, incorporating the advantages of SMDWT.

With recent developments in video acquisition technology, multiple-view systems dominate over monocular ones. With the use of stereo vision technology, 3D information of the objects can be estimated accurately at a subpixel level, which can be utilized for more reliable object tracking. Stereo systems can be either wide-baseline or short-baseline. Feature-based correspondence is usually done on wide-baseline systems [17, 18], which results in sparse disparity. The primary challenge associated with wide-baseline systems is the large viewpoint-related occlusions. The dense depth map obtained from short-baseline systems [19–21] can be used for precise segmentation of the objects of interest.

The stereo vision system is used in multiobject tracking [22]. Single camera and multiple sensor fusion are used to track humans [23]. For object tracking at traffic roundabouts, a stereo vision system is employed [24]. The dense stereo vision system is used for pedestrian tracking [25]. Stereo inputs are utilized for identifying and segmenting moving objects [26]. The Stixel World representation of traffic scenes is used for autonomous systems [27]. By incorporating the epipolar constraints, Andre et al. [28] used wide-baseline stereo vision for object tracking. The Kalman filter is used for stereo video-based object tracking [29]. The adaptive particle filter is employed for stereo-based motion tracking [30]. Object detection based on the classification of feature points in 3D space is suggested for self-driving cars [31]. To manage vehicle tracking in urban traffic, cuboids estimated from stereo views are used [32]. Stereo vision and LiDAR technology for autonomous driving vehicles are presented in [33]. Dense stereo is used for vehicle tracking at urban road intersections [34]. RANSAC algorithm-based tracking [35] and moving object detection [36] used stereo pairs for motion estimation. The vehicle tracking suggested in [37] uses AdaBoost detectors and optical flow. In most of the existing tracking methods, illumination variations, fake motions, dynamic scenes, accuracy, and computational cost are the main challenges.

3. Features and Contributions

This is the first stereo vision-based fast motion object detection and tracking method for dynamic environments. It incorporates the advantages of symmetric mask-based discrete wavelet transform, combined background subtraction and frame differencing technique, and dense disparity-variance method. The main contributions of this study include the following:
(i) The use of SMDWT to deal with illumination changes, low memory requirement, and fake motion avoidance
(ii) The accuracy and the reliability of motion detection are increased by combining background subtraction and frame differencing methods. This combined method can handle dynamic scene changes, and it does not need to frequently update the background frame. Instead, the background frame is updated only on the basis of a threshold distance covered by the sensor
(iii) The dense stereo disparity is applied for removing unwanted outliers in the region of interest, thus enhancing the accuracy of object detection. Finally, the dense stereo disparity-variance method is used for fast object tracking, and the use of the dense depth map increases the accuracy of this distance-based object tracking

Apart from these, the SMDWT-based object detection attains an accuracy of 99.62%. The lowest detection accuracy of the system is 94.95% for the compressed input image size of 80 × 110. It gives tracking results in less time than the lifting-based discrete wavelet transform (LDWT) method. The dense disparity-variance method uses only the maximum variance value of the pixels within a window, instead of pixel-by-pixel values, which gives faster motion tracking results than the current approaches. This system integrates the advantages of the combined background subtraction and frame differencing technique, SMDWT, and dense stereo disparity-variance object tracking. It has a highest tracking accuracy of 86.4% within a time frame of 35.23 seconds, and the lowest tracking accuracy of the system is 62.01% within a time frame of 34.46 seconds. The results indicate that the entire system performs better, in terms of accuracy, computational cost, and the handling of dynamic scenes, than the usual DWT-based approaches and the existing dense stereo object tracking approaches.

4. Motion Detection and Tracking Using SMDWT and a Dense Disparity-Variance Method

This stereo vision-based object tracking method incorporates the features of SMDWT in dealing with radiometric invariance, low memory requirement, and fake motion avoidance. The salient features of the system include accurate object detection without ambiguity, fast tracking of moving objects, and accurate region of interest (ROI) selection. The accuracy of this distance-based tracking is improved by the use of the stereo vision-based dense depth map. Further, this system does not need to frequently update the reference frame. The reference frame is updated only if the depth value of the ROI becomes greater than a threshold value.

The stereo information is utilized in two modules: (i) in ROI outlier removal to improve accuracy and (ii) in the disparity-variance-based motion tracking. For the disparity computation, the sum of absolute difference (SAD) algorithm is used. Robust object detection in dynamic traffic scenes is done using combined background subtraction and frame differencing technique. Absolute object detection and fast motion tracking are accomplished by the dense disparity-variance method. The entire 3D visual tracking module consists of moving object detection, depth estimation using SAD stereo matching, and the tracking of multiple moving objects. Figure 1 shows the system overview.

4.1. Symmetric Mask-Based Discrete Wavelet Transform (SMDWT)

The convolutional or FIR filter bank structure of the discrete wavelet transform (DWT), used to detect moving parts in real-world environments, comes at a high computational cost and hence is not an apt choice for high-speed video processing applications [38]. Compared with the convolution-based transform, the lifting-based DWT (LDWT) reduces the computational complexity by about 50% in video compression methods [39, 40]. Though LDWT has an excellent reconstruction property, it requires more transpose memory. To reduce this requirement, the mask-dependent processing of SMDWT can be used for motion detection and tracking [41, 42]. In this work, the symmetric mask-based discrete wavelet transform (SMDWT) is applied to both stereo frames using the coefficients derived from the 2-D 5/3 integer LDWT. In order to conduct spatial filtering at a lower computational cost, four masks, 3 × 3, 5 × 3, 3 × 5, and 5 × 5, are used. On further optimization, processing these four subbands independently leads to a reduced temporal memory requirement and faster results. Using these masks, the LL1, LH1, HL1, and HH1 subbands are obtained. Of these, the LL1 subband is devoid of high-frequency components such as fake motions and fast illumination changes. Therefore, in this stereo object tracking system, the LL1 subband is utilized for spatial filtering. Table 1 lists the coefficients of the LL1 mask. Figure 2 shows the schematic of the two-dimensional SMDWT. The LL1 subband can be calculated using equation (1).
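As a rough illustration of mask-based filtering, the sketch below builds a 5 × 5 low-pass mask as the outer product of the standard LeGall 5/3 analysis low-pass taps and applies it at every second pixel to obtain an LL-like subband. It is an assumption that this floating-point mask corresponds to the LL1 mask of Table 1; the integer rounding of the lifting scheme is ignored, and the function names and the mirrored border handling are illustrative only.

```python
# 1-D analysis low-pass taps of the LeGall 5/3 wavelet.
LOWPASS_53 = [-1 / 8, 2 / 8, 6 / 8, 2 / 8, -1 / 8]


def ll_mask():
    """Return a 5x5 mask as the outer product of the 1-D low-pass taps."""
    return [[a * b for b in LOWPASS_53] for a in LOWPASS_53]


def _reflect(index, size):
    """Mirror an out-of-range index back into [0, size) (symmetric extension)."""
    if index < 0:
        return -index
    if index >= size:
        return 2 * (size - 1) - index
    return index


def ll_subband(image):
    """Apply the 5x5 mask at every second pixel; image needs >= 3 rows and columns."""
    rows, cols = len(image), len(image[0])
    mask = ll_mask()
    out = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            acc = 0.0
            for dr in range(-2, 3):
                for dc in range(-2, 3):
                    pixel = image[_reflect(r + dr, rows)][_reflect(c + dc, cols)]
                    acc += mask[dr + 2][dc + 2] * pixel
            out_row.append(acc)
        out.append(out_row)
    return out
```

Because the taps sum to one, a constant image stays constant after filtering, which is a quick sanity check for any mask of this kind.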

4.2. Motion Detection

The steps involved in motion detection are
(a) background subtraction and frame differencing
(b) binarization and OR operation
(c) morphological operations

Motion detection is followed by motion disparity computation which can be used to increase the accuracy of motion tracking.

4.2.1. Combined Background Subtraction and Frame Differencing Method

The background subtraction (BS) technique detects the moving objects in the present frame with respect to a reference frame. Though the motion mask can be estimated from background subtraction alone, it is susceptible to illumination changes and other radiometric changes. This can be overcome by incorporating the frame differencing (FD) method, which can handle dynamic scene changes. Thus, in the proposed method, the motion mask data is obtained from the combined BS and FD method. Only the steady-state objects in the background are considered for the reference frame, and the current frame is subtracted from the reference to obtain the objects in motion. In FD, the moving parts are separated out from successive frames in a video stream. The background subtraction and frame differencing can be done by using the following equations:

$$BS_k(i,j) = \left| I_k(i,j) - B(i,j) \right|, \qquad FD_k(i,j) = \left| I_k(i,j) - I_{k-1}(i,j) \right|, \tag{2}$$

where $1 \le i \le M$, $1 \le j \le N$, and $1 \le k \le K$. $M$ and $N$ represent the number of rows and columns in a frame, and $K$ is the number of frames in the video. $I_k$ is the $k$th frame, $B$ is the background frame, and $FD_k$ is the frame difference.
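The two difference images can be sketched in a few lines. This is a minimal illustration, assuming grayscale frames stored as lists of rows and absolute differences for both operations; the function names are hypothetical.

```python
def abs_diff(frame_a, frame_b):
    """Pixel-wise absolute difference of two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def background_subtraction(current, background):
    """Difference between the current frame and the reference (background) frame."""
    return abs_diff(current, background)


def frame_difference(current, previous):
    """Difference between the current frame and the preceding frame."""
    return abs_diff(current, previous)


# Toy 2x3 frames: an object leaves pixel (1, 1) and appears at pixel (1, 2).
background = [[10, 10, 10], [10, 10, 10]]
previous = [[10, 10, 10], [10, 50, 10]]
current = [[10, 10, 10], [10, 10, 60]]

bs = background_subtraction(current, background)
fd = frame_difference(current, previous)
```

In this toy case BS responds only where the object currently is, while FD also responds where it was in the previous frame, which is why the two results are combined in the next step.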

4.2.2. Binarization and Logical OR Operation

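The data obtained from the BS and FD methods are binarized and then combined by a logical OR operation. The threshold value for binarization may change from one frame to another. Equation (3) is used to calculate the mean $\mu$ and standard deviation $\sigma$ of the frame difference:

$$\mu = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} FD_k(i,j), \qquad \sigma = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(FD_k(i,j)-\mu\right)^2}, \tag{3}$$

where $FD_k$ is the frame difference of the $k$th frame, which has $M$ rows and $N$ columns.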

From these values, the global threshold $T$ for a particular frame can be calculated using equation (4).

Here, $FD_k$ is the result of frame differencing, $\mu$ is the mean value of $FD_k$, $\sigma$ is the standard deviation, $T$ is the threshold value, and 0.085 is a constant obtained by trial and error. $OR_k$, the result of the OR operation, can be obtained using the following equation:

$$OR_k(i,j) = BS_k^{b}(i,j) \lor FD_k^{b}(i,j), \tag{5}$$

where $BS_k^{b}$ and $FD_k^{b}$ denote the binarized BS and FD results.
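The binarization and OR step can be sketched as follows. This is only an illustration: the paper's threshold rule (which involves the constant 0.085) is not reproduced here, so the threshold is passed in as a parameter, and the example simply uses the frame mean as a placeholder. All function names are hypothetical.

```python
import math


def mean_std(image):
    """Mean and (population) standard deviation of all pixels in an image."""
    values = [v for row in image for v in row]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma


def binarize(image, threshold):
    """Set pixels above the threshold to 1 and all others to 0."""
    return [[1 if v > threshold else 0 for v in row] for row in image]


def logical_or(mask_a, mask_b):
    """Pixel-wise OR of two binary masks."""
    return [[a | b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Toy BS and FD results for one frame.
bs = [[0, 0, 0], [0, 0, 50]]
fd = [[0, 0, 0], [0, 40, 50]]

mu, sigma = mean_std(fd)
threshold = mu  # placeholder only; not the paper's threshold rule
motion_mask = logical_or(binarize(bs, threshold), binarize(fd, threshold))
```

Keeping the threshold as an argument makes it easy to substitute whichever per-frame rule is actually used.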

4.2.3. Morphological Operations

Undesirable features in a binary image can be eliminated by morphological operations. Some motions are too feeble and may be taken as nonmotion pixels, which creates holes in the motion mask. To generate an accurate motion mask, morphological closing is applied.
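A minimal sketch of morphological closing (dilation followed by erosion) on a binary mask is given below. The 3 × 3 square structuring element and the choice to ignore neighbours outside the image are assumptions made for illustration; the source does not state which structuring element is used.

```python
def _neighbourhood(mask, r, c):
    """Values of the in-bounds pixels in the 3x3 neighbourhood of (r, c)."""
    rows, cols = len(mask), len(mask[0])
    return [mask[rr][cc]
            for rr in range(max(0, r - 1), min(rows, r + 2))
            for cc in range(max(0, c - 1), min(cols, c + 2))]


def dilate(mask):
    """A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    return [[1 if any(_neighbourhood(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def erode(mask):
    """A pixel stays 1 only if every pixel in its 3x3 neighbourhood is 1."""
    return [[1 if all(_neighbourhood(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def closing(mask):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(mask))


# A 3x3 blob with a one-pixel hole in the middle of a 7x7 mask.
holed = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        holed[r][c] = 1
holed[3][3] = 0

closed = closing(holed)
```

After closing, the hole at the centre of the blob is filled while the outline of the blob is unchanged.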

4.3. Motion-Based Disparity Computation

The motion mask obtained after the morphological operation may contain the outlier parts from its surroundings. Accordingly, these outlier portions lead to the incorrect estimation of the motion parameters. The following steps are involved in the motion-related region of interest (ROI) generation and the motion-based disparity computation.

4.3.1. ROI Extraction

From the motion mask and input images, the ROI can be generated using the following equations:

$$ROI_L(i,j) = MM(i,j)\, I_L(i,j), \qquad ROI_R(i,j) = MM(i,j)\, I_R(i,j). \tag{6}$$

Here, $MM$ is the motion mask obtained from the morphological operation. $I_L$ and $I_R$ are the left and right input images. $ROI_L$ and $ROI_R$ are the left and right ROI images. The unwanted outliers in the misaligned ROI can be removed on the basis of disparity; i.e., the estimated motion vectors are converted into disparity vectors.
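ROI generation can be sketched as masking each input image with the binary motion mask. This is a minimal illustration that assumes the same mask is applied to both the left and right views; the function name is hypothetical.

```python
def extract_roi(image, motion_mask):
    """Keep pixels where the motion mask is 1 and zero out everything else."""
    return [[pixel if keep else 0 for pixel, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, motion_mask)]


motion_mask = [[0, 1, 1], [0, 0, 1]]
left_image = [[12, 80, 90], [15, 20, 95]]
right_image = [[11, 78, 88], [14, 22, 97]]

roi_left = extract_roi(left_image, motion_mask)
roi_right = extract_roi(right_image, motion_mask)
```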

4.4. Sum of Absolute Differences (SAD) Stereo Correspondence

In the SAD-based correlation approach, the intensities of the pixels within a window taken from the reference image are matched with those of the corresponding window in the target image [42]. The 2-D correspondence search can be reduced to a 1-D search (along the $x$ direction only) if the input image pairs are rectified [43].

Here, all the epipolar lines in the two views are parallel to the baseline, and the search is done along the horizontal ($x$) direction only. Let $w$ represent the number of pixels in either direction from the centre pixel within a window of size $(2w+1) \times (2w+1)$. The cost function SAD used for similarity measurement is given by equation (7). The disparity ($d$), the shift in the $x$ coordinate of the conjugate pair, can be obtained using equation (8).

$$SAD(x,y,d) = \sum_{i=-w}^{w}\sum_{j=-w}^{w}\left| I_L(x+i,\,y+j) - I_R(x+i-d,\,y+j) \right| \tag{7}$$

$$d(x,y) = \arg\min_{0 \le d \le d_{max}} SAD(x,y,d) \tag{8}$$

where $I_L(x,y)$ and $I_R(x,y)$ are the intensities of the pixel at $(x,y)$ in the left frame and the right frame, and $d_{max}$ is the maximum disparity searched. SAD values are computed over all pixels within the window for each candidate disparity, and the “winner” is the disparity corresponding to the SAD minimum.
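A winner-takes-all SAD search for a single pixel can be sketched as follows. It is a minimal illustration that assumes rectified grayscale images indexed as `image[y][x]`, the left image as reference, and a match in the right image shifted to the left by the disparity; the function names and the toy data are hypothetical.

```python
def sad_cost(left, right, x, y, d, w):
    """Sum of absolute differences between the left window centred at (x, y)
    and the right window centred at (x - d, y); window half-width is w."""
    cost = 0
    for j in range(-w, w + 1):
        for i in range(-w, w + 1):
            cost += abs(left[y + j][x + i] - right[y + j][x + i - d])
    return cost


def best_disparity(left, right, x, y, w, d_max):
    """Disparity in [0, d_max] with the minimum SAD cost (winner-takes-all)."""
    # Limit the search so the shifted window stays inside the right image.
    candidates = range(0, min(d_max, x - w) + 1)
    return min(candidates, key=lambda d: sad_cost(left, right, x, y, d, w))


# Toy pair: the left view equals the right view shifted two pixels to the right.
right = [[(x * x + 3 * y) % 17 for x in range(10)] for y in range(5)]
left = [[row[x - 2] if x >= 2 else 0 for x in range(10)] for row in right]

d = best_disparity(left, right, x=5, y=2, w=1, d_max=3)
```

Repeating this search for every pixel of the ROI gives the dense disparity map; the nested loops are kept for clarity rather than speed.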

The disparity value of the pixel window around the centre pixel ranges from zero to the maximum disparity value $d_{max}$. The search is carried out by sliding the reference window over the right frame. This process repeats for each block until the disparity value is calculated for the entire frame; the resultant image is called the disparity space image (DSI). Here, the DSI corresponding to the moving object is represented as $DSI_m$. The images obtained after SMDWT are used as the input sequence for this correspondence search. The depth ($Z$) can be computed by triangulation based on the following equation:

$$Z = \frac{f\,b}{d}, \tag{9}$$

where $f$, $b$, and $d$ represent the camera focal length, the baseline distance between the camera centres, and the computed disparity value, respectively. This depth value can be used to identify moving objects that come below the threshold distance. In this work, this threshold distance is taken as 5 m.
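The triangulation step amounts to one division, sketched below together with the 5 m distance check. The focal length and baseline values are made-up numbers for illustration, not the calibration of any data set used here, and the function names are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres from focal length (pixels), baseline (metres),
    and disparity (pixels); zero disparity maps to infinite depth."""
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


def is_within_threshold(depth_m, limit_m=5.0):
    """True when the object is closer than the threshold distance."""
    return depth_m < limit_m


# Hypothetical calibration: 700-pixel focal length, 0.5 m baseline.
far_object = depth_from_disparity(700, 0.5, 70)    # 5.0 m
near_object = depth_from_disparity(700, 0.5, 100)  # 3.5 m
```

Since depth is inversely proportional to disparity, nearby objects produce large disparities and are resolved more finely than distant ones.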

4.4.1. Removal of Outlier in ROI

The unwanted outliers in the ROI can be removed based on equation (10), where $MM_k$ represents the motion mask of the $k$th frame. The disparity of the motion vector is used to eliminate the outliers, and $T_m$ is the threshold level of the motion mask vector. The mean value and variance of each frame are updated periodically.

4.5. Variance-Based Motion Tracking

In this variance-based motion tracking, the variance of the pixels is calculated for a window of size five taken from the ROI. The variance value can be calculated based on the following equation:

$$\sigma_W^2 = \frac{1}{n}\sum_{p \in W}\left(R(p) - \bar{R}_W\right)^2, \tag{11}$$

where $R$ is the motion mask obtained after removing the unwanted outliers, $W$ is the current window, $n$ is the number of pixels in $W$, and $\bar{R}_W$ is the mean of $R$ over $W$. The window slides horizontally in order to cover the entire ROI. Here, only the maximum variance value of the pixels within a window is considered, instead of pixel-by-pixel values. This window-based approach gives faster motion tracking results, and the aggregation of this “variance-max” is considered as a moving object. A boundary is fixed around the moving object to locate it in the present input frame. Finally, the depth of this object from the camera is calculated using the disparity value obtained, and a decision can be taken depending on this depth value for driver assistance.
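The sliding-window variance search can be sketched for a single ROI row. This is only one reading of the description: it assumes a 1 × 5 window moved one pixel at a time and returns the largest variance together with the window position; how the per-window maxima are aggregated into an object is not shown, and the function names are hypothetical.

```python
def window_variance(values):
    """Population variance of the values in one window."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def max_variance_window(row, size=5):
    """Return (maximum variance, start index) over all horizontal windows.

    The row must contain at least `size` pixels."""
    starts = range(len(row) - size + 1)
    best = max(starts, key=lambda s: window_variance(row[s:s + size]))
    return window_variance(row[best:best + size]), best


# A flat row with one strong response at index 6.
roi_row = [0, 0, 0, 0, 0, 0, 10, 0, 0, 0]
variance, start = max_variance_window(roi_row)
```

Flat regions give zero variance, so the maximum singles out windows that straddle a change in the ROI values.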

5. Results and Analysis

In this work, the algorithm is tested and evaluated using the data sets available in the KITTI Vision Benchmark Suite [44] and the DAIMLER pedestrian segmentation benchmark [45], using an Intel® 2.10 GHz Core™ i3 2310M CPU, 4 GB RAM, and MATLAB. The original image frames are in RGB format. Table 2 details the video frames used for the analysis. By applying SMDWT, every input image is compressed to a size of 80 × 110. For example, for an input video of size 320 × 440, this work uses the 2nd level decomposition, and for a video size of 640 × 880, the 3rd level decomposition is used. Even though increasing the level of decomposition gives faster results, it decreases the accuracy. There always exists a trade-off between accuracy and speed.
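The relation between input size and decomposition level is simple arithmetic: each level halves both dimensions. The helper below is a hypothetical illustration; rounding up for odd sizes is an assumption.

```python
def decomposed_size(height, width, levels):
    """Size of the LL subband after the given number of decomposition levels."""
    for _ in range(levels):
        height, width = (height + 1) // 2, (width + 1) // 2
    return height, width


size_from_320 = decomposed_size(320, 440, 2)  # (80, 110)
size_from_640 = decomposed_size(640, 880, 3)  # (80, 110)
```

Both input sizes therefore reach the same 80 × 110 working resolution, at the 2nd and 3rd levels respectively.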

5.1. Result and Analysis of SMDWT

By applying SMDWT, the noise and fake motions in the input video sequence are removed. Figure 3 shows the motion mask generated using the combined BS and FD method and the LL1 band of SMDWT. From Figure 3(c), it is clear that the generated motion mask fails to retain the shape of the object. This can be overcome by using the LL3 subband instead of LL1, since the LL3 subband can retain the moving objects more effectively. Figure 4 shows the third level decomposition result of SMDWT. Figure 4(a) shows the left view of the input video frame, Figure 4(b) shows the 3rd level decomposition result of SMDWT, and Figure 4(c) shows the LL3 subband image. Object detection performance, expressed in terms of precision rate, can be calculated using the following equation:

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\%, \tag{12}$$

where $TP$ and $FP$ are the numbers of true positive and false positive detections.

Table 3 illustrates the accuracy comparison of moving object detection using SMDWT, LDWT, and a median filter. It can be seen that the system attains a maximum detection accuracy of 99.62% and minimum accuracy of 94.95% for the decomposed input image size of 80 × 110, which is higher than the median filter.

5.2. Results of SMDWT-SAD Correspondence

The accuracy of SMDWT-SAD correspondence is tested by using the rectified stereo image pairs provided by Middlebury stereo data [46]. The images are first preprocessed by SMDWT, and then, the SAD correspondence is done on images taken under different exposure times and different illumination conditions. Here, the maximum disparity level is taken as one-third of the maximum value of ground truth disparity [47].

5.3. Performance Evaluation

The performance of the stereo algorithm and the quality of its results can be evaluated by varying certain parameters [48]. The root-mean-squared error (RMSE) can be calculated using the following equation:

$$RMSE = \sqrt{\frac{1}{N_p}\sum_{(x,y)}\left(d_C(x,y) - d_T(x,y)\right)^2}, \tag{13}$$

where $d_C$ is the obtained disparity, $d_T$ is its ground truth value, and $N_p$ represents the total number of pixels in the disparity map.
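The RMSE computation can be sketched directly from its definition. This is a minimal illustration assuming both disparity maps are lists of rows of equal size and that every pixel has a valid ground truth value; the function name is hypothetical.

```python
import math


def rmse(computed, truth):
    """Root-mean-squared error between a computed disparity map and ground truth."""
    squared = [(c - t) ** 2
               for row_c, row_t in zip(computed, truth)
               for c, t in zip(row_c, row_t)]
    return math.sqrt(sum(squared) / len(squared))


computed_map = [[1, 2], [3, 4]]
truth_map = [[1, 2], [3, 6]]
error = rmse(computed_map, truth_map)  # one pixel off by 2 out of 4 pixels
```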

Figure 5 shows the comparison of RMS error in the disparity map obtained from SMDWT-SAD and SAD, using the input images taken under different exposure times and different illumination conditions. Figure 6 shows their corresponding comparison between the computational time requirements for disparity computation. From the comparison, it is evident that SMDWT-SAD is more accurate when compared to the conventional SAD correspondence. Also, the time taken for the SMDWT-SAD method is much less compared to SAD. Thus, this novel SMDWT-SAD method is more accurate and fast for illumination invariant disparity computation in a stereo-based object tracking approach.

5.4. Results of Object Tracking

The proposed system is tested and analyzed qualitatively and quantitatively using various input video frames. Figures 7(a)–7(i) show the generation of the motion mask using the combined BS and FD method and SMDWT.

Figure 8 demonstrates motion tracking results. Here, Figures 8(a)–8(c) show the tracking results of three consecutive frames of video 1. Figures 8(d)–8(f) show the tracking results of different frames in video 2. The value indicated at the centre of each bounding box is the disparity value at that point.

Figure 9 shows the results of ROI-based motion tracking in video 1. Here, the red circle indicates the moving ROI. Table 4 shows the comparison of SMDWT, LDWT, and a median filter based on speed and tracking accuracy. Video 5 attains the maximum precision rate of 86.4% within a time frame of 35.23 seconds. The lowest accuracy of the system is 62.01% within a time frame of 34.46 seconds. From this, it is evident that even though the tracking accuracy of SMDWT and LDWT is the same, SMDWT is faster than LDWT.

5.5. Qualitative Analysis

Without SMDWT, there are many false detections due to close and occluded vehicles, as shown in Figure 10(a). By incorporating SMDWT, these issues are eliminated, as shown in Figure 10(b). In Figure 10(c), false detection errors occur due to fake motions and the bad illumination conditions of video 2; i.e., even though the objects in motion are tracked, they are fragmented. Incorporating the LL2 subband image avoids this false detection, as shown in Figure 10(d). The many false detections and erroneous tracking results in Figure 10(e) are eliminated using the SMDWT-LL2 subband and the variance method, as shown in Figure 10(f).

5.6. Quantitative Analysis

Table 5 demonstrates the results obtained from the quantitative analysis of our system. The motion tracking accuracy of SMDWT variance and SMDWT mean is computed and compared for various video sequences. The system attains a tracking accuracy (TA) of 79.47% by using the SMDWT variance method. Compared with the SMDWT mean tracking method, the SMDWT variance method is more accurate.

5.7. Comparison with the Existing Method

Table 6 shows the tracking accuracy comparison of the proposed system with other existing monocular and stereo methods. The monocular system detects moving objects based on a colour histogram and a local steering kernel (LSK) for object tracking [49]. Stereo LSK [29] uses a disparity histogram for object tracking. The particle filter method incorporating appearance-based models has been used for visual tracking [50]. In the ℓ1 method [51], the particle filter approach is used for visual tracking. From the comparison table, it is evident that our SMDWT dense disparity-variance method is more accurate than the other methods. The tracking accuracy of the proposed method is higher for all the input video sequences compared with the monocular and stereo approaches.

6. Conclusion and Future Works

This work presents a novel stereo vision-based moving object tracking system for dynamic environments. This fast motion tracking approach employs SMDWT to deal with illumination changes and fake motions. Reliable motion detection is achieved with a combined background subtraction and frame differencing technique. The incorporation of the threshold level in this combined method makes it possible to handle dynamic scene changes without frequently updating the reference frame. The dense disparity-based ROI selection increases the accuracy of the object detection. In addition, the dense disparity-variance-based motion tracking improves the tracking accuracy and speed. The performance analysis demonstrated the effectiveness of the system in terms of accurate object detection in dynamic scenes, fake motion avoidance, fast motion tracking, and accurate ROI selection. Thus, this fast illumination-invariant system can be used for real-time motion tracking of dynamic scenes, even under bad environmental conditions.

Most of the motion tracking systems in real-time applications such as autonomous vehicles and video surveillance systems demand accuracy and speed simultaneously. As the demand for the driver assistance system in autonomous technology is increasing day by day, the incorporation of cognitive approaches in the vision technology will help to tackle the uncertainties and technical challenges in this area. An improvement over this system looks forward to a fast deep learning approach in dealing with object discrimination and occlusion handling. Another extended approach can be object tracking with the appearance-based object model using multiple stereo cameras. Such approaches will provide immense growth in the field of autonomous vehicle technology.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.