Abstract

Background modeling plays an important role in intelligent video surveillance. Researchers have presented diverse approaches to dynamic background modeling. However, in pumping unit surveillance, traditional background modeling methods often mistakenly detect the periodically rotating pumping unit as a foreground object. To address this problem, we propose a novel background modeling method for foreground segmentation, particularly in dynamic scenes that include a rotating pumping unit. In the proposed method, the ViBe method is employed to extract possible foreground pixels from the frame sequence and then to segment the video image into dynamic and static regions. Subsequently, the kernel density estimation (KDE) method is used to build a background model from the dynamic samples of each pixel. The bandwidth and threshold of the KDE model are calculated according to the sample distribution and extrema of each dynamic pixel. In addition, the sample update strategy combines regular and real-time updates. The performance of the proposed method is evaluated against several state-of-the-art methods on complex dynamic scenes containing a rotating pumping unit. Experimental results show that the proposed method is suitable for monitoring applications involving periodic object motion.

1. Introduction

Background modeling plays a fundamental role in foreground detection for video analysis applications, especially intelligent video surveillance, which includes real-time tracking [1] and event analysis [2–4]. The purpose of background modeling is to remove interference from the background and to detect and extract the moving foreground target, to meet the needs of video surveillance. Classic foreground detection algorithms, such as frame difference [5], optical flow [6], and background subtraction [7], are implemented mainly through extracting interframe motion information, detecting optical flow changes, or background modeling. However, in a dynamic background scene, which includes dynamic noise such as swaying branches, moving water, and a shaking camera, accurate and robust background modeling is an efficient way to obtain foreground objects. Over the past couple of decades, many researchers have presented diverse background modeling methods to identify foreground objects in a video. The general steps [8] of these background modeling methods are as follows: First, set up a background model by using the first frame or the first few frames of the video. Second, compare the background model to the current frame to obtain the foreground object. Finally, update the background model. Background modeling methods can be categorized into two groups: parametric and nonparametric methods. The most widely used parametric methods are GMM [9] and its modified versions. Nonparametric methods, such as ViBe [10], KDE [11], SACON [12], and PBAS [13], adapt well to dynamic backgrounds. These methods perform well under dynamic backgrounds, changing illumination, and sensor-induced jitter.
However, for scenes with periodic recurring changes that have the same scale as the foreground object, traditional background modeling methods easily mistake the periodically recurring background objects for foreground objects, which leads to a high false alarm rate. As shown in Figure 1, the movement of the pumping unit is a special kind of background interference. In this case, the moving pumping unit should be regarded as part of the background, but existing background modeling methods mistakenly detect it as foreground, which defeats the purpose of intelligent video surveillance.

This paper proposes a novel background modeling method for foreground segmentation that is different from all previous works, particularly for dynamic scenes having the same scale movement of the background object as the foreground object. Our algorithm models the background with extracted dynamic pixels (BOWED). In the proposed framework, the ViBe method is employed to extract both dynamic and static pixels. The object of periodic motion in the background is detected as dynamic pixels, and KDE is used to build a new background model for these dynamic pixels after they are extracted. For a current frame, each pixel has two background models: a dynamic background and a static background. This modeling method has high efficiency while ensuring accuracy in detecting objects. To our knowledge, this is the first method that uses both ViBe and KDE for the purpose of pumping unit surveillance.

The remainder of this paper is structured as follows. Section 2 provides an overview of foreground detection and background modeling methods. Section 3 presents a detailed account of BOWED. Section 4 describes extensive experiments on surveillance videos of a pumping unit to verify the validity and feasibility of the proposed method. Finally, Section 5 provides conclusions.

Over the past decades, many algorithms have been proposed to improve the accuracy of moving object detection [14]. Frame difference is one of the most common methods for moving object detection. It subtracts two or three sequential frames to detect moving objects based on pixel differences and extracts the objects with a threshold value. The method does not require background modeling because the previous frame is used as the background model, so it is easy to implement and fast to execute. However, although it is not sensitive to illumination changes, it is sensitive to image noise. Additionally, precision in selecting the threshold value is critical, as an unsuitable threshold will result in an unsatisfactory outcome [15, 16]. Optical flow is also a classic method. The main task of optical flow is to calculate the optical flow field, under appropriate conditions, by estimating the motion region through the spatiotemporal gradient of image sequences and then detecting and segmenting the moving target from the background by analyzing changes in the moving region. This method can detect the moving target without prior information about the scene. It can obtain the moving object completely and performs well in cases where the camera is moving. However, this method suffers from high complexity and long computation time, so it does not meet real-time requirements [17].
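The frame difference method described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists of integers; the function name `frame_difference` and the default threshold of 25 are illustrative choices, not values taken from the cited works.

```python
def frame_difference(prev_frame, curr_frame, threshold=25):
    """Return a binary mask that is 1 where two frames differ by more than threshold.

    Frames are nested lists of grayscale values with identical shapes.
    The threshold default is an illustrative assumption.
    """
    return [
        [1 if abs(curr - prev) > threshold else 0
         for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Two tiny 2x3 frames: two pixels change strongly, one changes only slightly.
prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 200, 12], [10, 10, 90]]
mask = frame_difference(prev_frame, curr_frame)
```

Raising the threshold suppresses the weaker change at the last pixel, which shows why the choice of threshold is critical for this method.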

The main idea of the background subtraction method is to build the background model, compare the current frame against the background model, and then detect moving objects according to their differences in value. Over the last few years, various background subtraction methods have been developed to improve processing speed, reliability, and real-time performance [18–20]. GMM is one of the most classic methods. Stauffer and Grimson [9] proposed a Gaussian mixture model (GMM) for background modeling in cases of dynamic scenes, illumination changes, slow-moving objects, shaking trees, and so on. In addition, an adaptive mixture of Gaussians can respond better to changes in lighting conditions. This method can cope with multimodal background objects such as waving trees and shadows [9, 21]. However, each pixel is modeled by a fixed number of Gaussian distributions, which requires a large amount of system resources to process. The temporal median filter (TMF) is a nonlinear digital filter that is usually used to remove noise from images and signals. TMF assumes that the background values of an image sequence are repeated frequently in a time series, which is useful in image processing, and it can be used for noise filtering, background modeling, and so on. In this method, a fixed scene is required to avoid noise when modeling the background. Compared with linear filters, it has higher computational complexity, and its processing speed is slower [22–24].

Barnich et al. [10] proposed the ViBe method, which was the first to apply random aggregation to background extraction. To build a sample-based estimate of the background and to update the background models, it uses a novel random selection strategy that allows information to propagate between neighboring pixels [25, 26]. In this method, the background model is initialized from the first frame, and the model maintenance strategy combines spatial propagation with random updating. ViBe is superior to many methods because it uses samples as background models, which can effectively represent changes to the background in recent frames. Moreover, it has good robustness and effectiveness for dynamic backgrounds. However, foreground detection can be easily influenced by noise and illumination variation. In the ViBe method, the first frame of the video is used to initialize the background model. Let $v_t(x)$ denote the value of the pixel $x$ at time $t$. Then $t=0$ indexes the first frame, and for each pixel in the first frame, a set of $N$ samples $\{v_1, v_2, \ldots, v_N\}$ is randomly selected from its 8-connected neighborhood. Then, a sphere $S_R(v_t(x))$ of radius $R$ centered on $v_t(x)$ is defined. $\#$ denotes the number of samples that fall inside the sphere, and it is compared to $\#_{\min}$, as (1) shows [25]:

$$\# = \#\left\{ S_R(v_t(x)) \cap \{v_1, v_2, \ldots, v_N\} \right\} \ge \#_{\min}, \tag{1}$$

where $\#_{\min}$ denotes the threshold. If $\#$ is at least $\#_{\min}$, the pixel is regarded as background; otherwise, it is foreground. The parameters are $N=20$, $R=20$, and $\#_{\min}=2$ [10]. In this method, the pixel model does not need to be updated for each new frame. When a pixel value is deemed as background, it has a 1 in 16 chance of being selected to update its pixel model.
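The per-pixel decision and the random update of ViBe can be sketched as follows. This is a simplified single-channel illustration rather than the reference implementation: the function names are hypothetical, the match threshold of 2 is the commonly quoted ViBe default and is assumed here, and the spatial propagation to neighboring pixels is omitted.

```python
import random

N_SAMPLES = 20    # samples stored per pixel (N in the text)
RADIUS = 20       # matching radius R
MIN_MATCHES = 2   # assumed match threshold (commonly quoted ViBe default)
SUBSAMPLING = 16  # a background pixel refreshes its model with probability 1/16


def is_background(value, samples, radius=RADIUS, min_matches=MIN_MATCHES):
    """Classify a pixel value by counting stored samples within the radius."""
    matches = 0
    for sample in samples:
        if abs(value - sample) < radius:
            matches += 1
            if matches >= min_matches:
                return True  # enough close samples: background
    return False


def maybe_update(value, samples, rng, subsampling=SUBSAMPLING):
    """With probability 1/subsampling, overwrite one randomly chosen sample."""
    if rng.randrange(subsampling) == 0:
        samples[rng.randrange(len(samples))] = value
        return True
    return False


rng = random.Random(0)
model = [100] * N_SAMPLES
if is_background(105, model):
    maybe_update(105, model, rng)
```

Stopping the count as soon as the threshold is reached keeps the classification cheap, which is one reason sample-based models of this kind are fast.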

The pixel-based adaptive segmenter (PBAS) was proposed by Hofmann et al. in 2012 [13]. This algorithm, a nonparametric pixel-based model, combines the advantages of SACON and ViBe while making some improvements. In PBAS, the threshold of each pixel is pixel dependent and is determined by the average of the minimum distance values obtained from the most recent N frames. PBAS achieves nonparametric moving object detection and is robust to slow illumination variations. However, its heavy computational load leads to slow processing speeds. In recent years, St-Charles et al. [27] proposed a high-performing method that introduces a new strategy focused on balancing the inner workings of a nonparametric model based on pixel-level feedback loops. Moreover, in the same year, they presented an adaptive background subtraction method that was derived from the low-cost and highly efficient ViBe method and includes a spatiotemporal binary similarity metric [28].

In 2000, Elgammal et al. presented a novel method based on kernel density estimation (KDE) [11]. It is a nonparametric method for estimating the probability density function. This method eliminates the step of estimating the parameter because it depends on previously observed pixel values, and it is not necessary to store the complete data. Moreover, KDE can address multimodal, fast-changing backgrounds. It is very common and has often been applied to vision processing, especially in cases where the underlying density is unknown, due to its property of converging to any probability density function (pdf). Because KDE can make a histogram smooth and continuous, Elgammal et al. applied KDE to build statistical representations of the background and foreground, to address the problem of a varying pdf [29]. Tavakkoli et al. trained KDE with a full kernel bandwidth matrix for each pixel to build up a unary classification [30]. Wu Wang et al. applied nonparametric KDE on the MoGLOH descriptor to eliminate feature noise [31]. Jeisung Lee and Mignon Park used the first frame to initialize KDE and subsequently updated it at every frame by controlling the learning rate on the basis of image storage reduction [32]. In KDE, it is difficult to choose the proper bandwidth in order to achieve the best probability density function. Furthermore, it incurs high computation costs during foreground detection [33]. In this method, the background model is built from the last $M$ samples of each pixel, so $M$ is the number of sample data points. The estimate of the density function can be obtained by

$$\hat{f}(x) = \frac{1}{Mh}\sum_{i=1}^{M} K\!\left(\frac{x - x_i}{h}\right), \tag{2}$$

where $K$ is a kernel function and its bandwidth is $h$. $x_i$ denotes the data points of the samples, $i = 1, 2, \ldots, M$. A Gaussian kernel function is adopted in this paper; hence, (2) becomes

$$\hat{f}(x_t) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{\sqrt{2\pi h^2}}\exp\!\left(-\frac{(x_t - x_i)^2}{2h^2}\right), \tag{3}$$

where $x_t$ and $x_i$ are the pixel value and sample value of the observed pixel point, respectively.

$Th$ is the global threshold, where the pixel is considered a foreground pixel if

$$\hat{f}(x_t) < Th; \tag{4}$$

otherwise, the pixel is a background pixel.
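To make the KDE decision rule concrete, the following minimal sketch evaluates a one-dimensional Gaussian kernel density over a pixel's stored samples and compares it with a threshold. The function names and the numeric values in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kde_density(x, samples, bandwidth):
    """Average of Gaussian kernels centered on each sample, evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def is_foreground(x, samples, bandwidth, threshold):
    """A pixel is foreground when its estimated density falls below the threshold."""
    return gaussian_kde_density(x, samples, bandwidth) < threshold


# Illustrative luminance samples for one pixel and an assumed bandwidth.
samples = [100, 102, 98, 101, 99]
near = is_foreground(100, samples, bandwidth=2.0, threshold=1e-3)
far = is_foreground(130, samples, bandwidth=2.0, threshold=1e-3)
```

A value close to the stored samples has a high density and is kept as background, while a value far from every sample has a density near zero and is flagged as foreground.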

3. The Proposed Method

In the proposed method, ViBe is used to segment each image into static background and dynamic background regions, so that each pixel has two background models: a static model and a dynamic model. The color of each pixel is represented in the YUV space, and only the luminance (Y) channel is modeled. The structure of the framework is illustrated in Figure 2.

In the process of background modeling, ViBe is used to extract the foreground pixels of each input video frame, and then a pixel-wise OR operation is used to segment the static background and dynamic background regions. In the dynamic region, the values of each pixel that ViBe determines to be a foreground point are used as samples for KDE background modeling. This not only reduces the amount of calculation but can also avoid the deadlock caused by false detection in the process of selectively updating the dynamic background model. Because two background models are adopted, the ViBe-based static background model can judge whether a pixel is background, thereby solving the deadlock problem. KDE is utilized to segment the target foreground object from the dynamic background region. A dynamic threshold is used to detect moving objects. Since the threshold is computed from a global probability density function, it has higher accuracy and adaptive ability and can also accurately identify moving targets in conditions where a pixel contains complex information. For background updates, the static background model adopts the classic ViBe update strategy, and the dynamic background model uses a strategy of combined regular and real-time updates: the background is updated periodically under normal conditions, but if a moving object is detected, the background is updated immediately. The process of choosing which old sample to replace is fair and random. To our knowledge, this is the first method to detect a moving foreground object under the disturbance of a background object moving at the same scale.

3.1. Dynamic Background and Static Background Segmentation

Figure 3 illustrates the procedure for segmenting the dynamic background and static background, which is also the procedure of background modeling. $I_t(x, y)$ is a pixel of an input image, where $I_t$ represents the frame, $(x, y)$ is the coordinate of the pixel, and $t$ is the frame number. Static pixels and dynamic pixels can be obtained after each frame is processed by the ViBe method. Then, the static background and dynamic background can be obtained through $N$ processed images,

$$B(x, y) = \bigvee_{t=1}^{N} D_t(x, y), \tag{5}$$

where

$$D_t(x, y) = \begin{cases} 1, & \text{if ViBe detects } I_t(x, y) \text{ as a dynamic pixel}, \\ 0, & \text{otherwise}. \end{cases} \tag{6}$$

Obviously, $B$ is a binary image and is also called the image mask. In the matrix $B$, the pixel value is 1 for the dynamic background region and 0 for the static background region.
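The segmentation step can be sketched as a pixel-wise OR over the per-frame ViBe detections. The sketch assumes that each frame has already been reduced to a binary foreground mask stored as nested lists; the function name `dynamic_region_mask` is hypothetical.

```python
def dynamic_region_mask(foreground_masks):
    """OR together per-frame binary masks; 1 marks the dynamic background region."""
    rows = len(foreground_masks[0])
    cols = len(foreground_masks[0][0])
    mask = [[0] * cols for _ in range(rows)]
    for frame_mask in foreground_masks:
        for r in range(rows):
            for c in range(cols):
                mask[r][c] |= frame_mask[r][c]
    return mask


# Three training frames of a 2x2 image: two pixels move at least once.
training_masks = [
    [[0, 1], [0, 0]],
    [[0, 0], [0, 1]],
    [[0, 1], [0, 0]],
]
region = dynamic_region_mask(training_masks)
```

A pixel that was flagged in even a single training frame ends up in the dynamic region, so the sweep of a rotating part is covered once a full period has been observed.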

3.2. Dynamic Background Modeling with Kernel Density Estimation

The dynamic pixels of video frames fall into two categories: One is the moving background caused by wind, swaying grass, water ripples, or a regularly moving object; these pixels are real background. The other is changes in the dynamic background caused by foreground movement, which is what the proposed method segments. To separate foreground movement from a dynamic background, the area of dynamic pixels needs to be further processed. In the proposed method, the KDE method is used to model the dynamic background in order to extract the real moving foreground region. The procedure of modeling pixels in a dynamic background area based on KDE is shown in Figure 4. For the pixel at coordinates $(x, y)$, the foreground detection method of ViBe is used to extract a collection of its dynamic pixel samples. The sample set of each pixel is $S(x, y) = \{x_1, x_2, \ldots, x_M\}$, where $M$ is the number of dynamic pixel values among the $N$ training frames. Then, the probability density curve of the dynamic background samples can be derived from (2). The set of dynamic pixel samples is as follows:

$$S(x, y) = \left\{ I_t(x, y) \mid I_t(x, y) \text{ is detected as dynamic by ViBe},\ t = 1, 2, \ldots, N \right\}, \tag{7}$$

where $I_t(x, y)$ is the value of the pixel at $(x, y)$ in frame $t$.

If a new pixel lies in the dynamic background region and its probability density is less than the set threshold, then the pixel is considered a foreground target point.

3.2.1. Determination of Optimal Bandwidth

Bandwidth refers to the range over which an individual sample contributes to the overall probability density; therefore, it is critical to choose a proper bandwidth [34]. Elgammal et al. proposed a classic method to calculate bandwidth: For a pixel, the median absolute deviation (MAD) of its adjacent sample values is used to calculate bandwidth dynamically [11]:

$$h = \frac{m}{0.68\sqrt{2}}, \tag{8}$$

where $m$ is the median of the absolute differences $|x_{i+1} - x_i|$ between adjacent samples.

However, in this method, the absolute differences of adjacent sample points need to be ranked, which is time-consuming and does not fully utilize each sample point. Therefore, the mean of the absolute differences between pixel values of adjacent sample points is used in place of the median in (8) to calculate the bandwidth. The formula for calculating bandwidth is

$$h = \frac{1}{0.68\sqrt{2}\,(M-1)}\sum_{i=1}^{M-1} m_i, \tag{9}$$

where $m_i = |x_{i+1} - x_i|$ is the absolute difference between adjacent samples. In addition, the standard deviation of the sample points can be regarded as the bandwidth, as shown below:

$$h = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(x_i - \bar{x}\right)^2}, \tag{10}$$

where $\bar{x}$ is the sample average and $M$ is the number of sample points. This approach has strengths such as eliminating the need for sorting, saving storage space, fast computation, higher accuracy, and use of the information in every sample.

In most cases, the distribution of the pixel along the time axis is a normal distribution, for which (10) is the optimal estimate of the bandwidth; otherwise, (9) has stronger adaptability. These two methods are combined in the proposed method: At the initialization stage of the algorithm, the skewness and kurtosis of the pixel samples are inspected to estimate whether they match a Gaussian distribution. Equation (9) is adopted if the distribution of the pixel along the time axis is not normal; otherwise, (10) is used, which increases the accuracy and robustness of the bandwidth estimation.
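The selection between the two bandwidth estimates can be sketched as below. Several details are assumptions made for illustration only: the mean of adjacent differences is scaled by the same 0.68√2 factor as the median-based estimate of Elgammal et al., the standard deviation is the population form, and the normality check uses hypothetical tolerances of 1.0 on the absolute skewness and excess kurtosis, since the text does not state the exact test.

```python
import math


def mean_abs_diff_bandwidth(samples):
    """Bandwidth from the mean absolute difference of adjacent samples (assumed scaling)."""
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    return (sum(diffs) / len(diffs)) / (0.68 * math.sqrt(2.0))


def std_bandwidth(samples):
    """Bandwidth as the population standard deviation of the samples."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))


def skewness_and_excess_kurtosis(samples):
    """Moment-based skewness and excess kurtosis; both are 0 for constant data."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    if m2 == 0:
        return 0.0, 0.0
    m3 = sum((s - mean) ** 3 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def choose_bandwidth(samples, skew_tol=1.0, kurt_tol=1.0):
    """Use the standard deviation for near-normal samples, else the mean difference."""
    skew, kurt = skewness_and_excess_kurtosis(samples)
    if abs(skew) <= skew_tol and abs(kurt) <= kurt_tol:
        return std_bandwidth(samples)
    return mean_abs_diff_bandwidth(samples)


near_normal = [97, 99, 99, 100, 100, 100, 100, 101, 101, 103]
skewed = [100] * 9 + [200]
h_normal = choose_bandwidth(near_normal)
h_skewed = choose_bandwidth(skewed)
```

Neither estimate needs the samples to be sorted, which matches the motivation given above; a caller would still need to guard against a zero bandwidth when all samples are identical.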

3.2.2. Selection of Dynamic Threshold

Suppose that the dynamic background model of the pixel contains $M$ sample points; then the global probability density can be obtained in the following way. The kernel density probability is composed of the probabilities contributed by each sample. In the proposed method, the Gaussian function is used as the kernel function in kernel density estimation; so, for each sample, the farther an observed value lies from the sample's center point, the smaller the sample's contribution. In the same way, the whole probability density function, which is composed of all samples, follows the same rule. The probability density curve is shown in Figure 5.

The distribution model of dynamic pixel points can be considered as the weighted sum of all samples, according to (3). When a dynamic pixel appears, its probability can be calculated from (3), and (4) is used to determine whether it belongs to the foreground. The threshold value to be determined can be regarded as a critical value of the kernel density probability function. If such a critical value can be found from the total probability density function, then the dynamic threshold value can also be obtained. As shown in Figure 5, at both ends of the probability density curve, points whose value is small enough can always be found, such as $Th_l$ and $Th_r$, which are the total probability density values at $x_l$ and $x_r$ in Figure 5. Therefore, the values of these two points can be used as the threshold. The pixel values of the $M$ samples are sorted from small to large, as $x_1$ to $x_M$ in the figure; then $x_l$ and $x_r$ can be obtained as follows:

$$x_l = x_1 - D h, \qquad x_r = x_M + D h,$$

where $h$ is the bandwidth. The total contribution of the samples is considered as a sum of Gaussian functions, and all Gaussian functions have the same standard deviation and the same weight. We set $D = 2.5$: for a sample $x_i$, if the distance between the observed value and $x_i$ is larger than $2.5h$, then, according to the standard normal table, the sample at $x_i$ has little effect on the total probability density.

The sample values in the background model are sorted from small to large, and the weights of $x_1$ and $x_M$ are the smallest of all the samples. Subsequently, the total probability density at the point obtained by extending $x_1$ or $x_M$ outward by 2.5 times the bandwidth is regarded as the threshold; that is, $x_l$ or $x_r$ is plugged into (3) to calculate the dynamic threshold $Th$. Note that, in fact, the probability densities at $x_l$ and $x_r$ may not be equal, because the population distribution is unknown, but their difference is not significant, so either $x_l$ or $x_r$ can be used to calculate the threshold. Either choice simply requires picking out the minimum or the maximum value of the samples, so it is not necessary to sort all of the samples. For example, the dynamic threshold can be obtained by plugging $x_l$ into (3):

$$Th = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{\sqrt{2\pi h^2}}\exp\!\left(-\frac{(x_l - x_i)^2}{2h^2}\right).$$
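The threshold selection can be sketched as evaluating the Gaussian kernel density at a point 2.5 bandwidths below the smallest sample. The helper names are hypothetical, and the sample values and bandwidth in the usage lines are illustrative assumptions.

```python
import math

D = 2.5  # extension factor, in units of the bandwidth, as set in the text


def gaussian_kde_density(x, samples, bandwidth):
    """Average of Gaussian kernels centered on each sample, evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def dynamic_threshold(samples, bandwidth, d=D):
    """Density at the point d bandwidths below the minimum sample."""
    x_left = min(samples) - d * bandwidth  # only the minimum is needed, no sort
    return gaussian_kde_density(x_left, samples, bandwidth)


samples = [98, 99, 100, 101, 102]
threshold = dynamic_threshold(samples, bandwidth=2.0)
inside = gaussian_kde_density(100, samples, 2.0) >= threshold
outside = gaussian_kde_density(80, samples, 2.0) < threshold
```

Because the threshold is derived from the same samples and bandwidth as the density itself, it adapts automatically when the sample set of a pixel is updated.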

3.3. Updating the Background Model

In this paper, there are two types of background model updating methods due to each pixel containing two kinds of background models. The static background model is used to segment the dynamic background region from the nondynamic background region. It can follow the changes of environment over time and prevent false detection of a moving object as static, such as in the case of a background object disappearing. The ViBe update policy is used in updating the static model. The pixels in the dynamic background area are the updated target of the dynamic background model and are processed by the method that combines regular updates and real-time updates.

After initialization, the regular update of dynamic background model begins, and the interval is preestablished. Once moving pixels are detected in the dynamic background model, their values are updated to the sample set of the dynamic background model. At this time, real-time updates can make up for the background model lags of regular updates. Real-time updates can help the system update the dynamic background model in time and eliminate false detection. Without real-time updates, the robustness of the dynamic background model is poor, and the kernel density algorithm may cause false detection, such as foreground points being detected as background points.

Instead of replacing background model points, a sample is taken randomly from M sample points and then replaced by the new pixel value of the dynamic background. Each point can be treated fairly in this method, as it eliminates the interference of subjective and time factors.

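The method for updating the dynamic background model at a dynamic pixel point is as follows:

$$S \leftarrow \begin{cases} \left(S \setminus \{\mathrm{rand}(S)\}\right) \cup \{x_t\}, & \text{if } T \ge \Delta T \text{ or a moving pixel is detected}, \\ S, & \text{otherwise}, \end{cases}$$

where $T$ denotes a timer, $\Delta T$ is the time interval of regular updates, $x_t$ is the new pixel value, and $\mathrm{rand}(S)$ indicates that a sample is deleted randomly from the sample set $S$. The timer will be reset to zero after the update is completed.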
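The combined regular and real-time update can be sketched with a small per-pixel class. This reflects one reading of the description above: the timer counts frames, an update fires when the interval elapses or when motion is detected at the pixel, and the new value overwrites a uniformly random sample. The class name, method names, and the frame-based timer are hypothetical.

```python
import random


class DynamicPixelModel:
    """Sample set of one dynamic pixel with regular and real-time updates."""

    def __init__(self, samples, interval, rng=None):
        self.samples = list(samples)
        self.interval = interval  # frames between regular updates (assumed unit)
        self.timer = 0
        self.rng = rng or random.Random()

    def _replace_random(self, value):
        # Every stored sample is equally likely to be replaced.
        self.samples[self.rng.randrange(len(self.samples))] = value

    def step(self, value, motion_detected):
        """Advance one frame; return True if the model was updated."""
        self.timer += 1
        if motion_detected or self.timer >= self.interval:
            self._replace_random(value)
            self.timer = 0  # reset after any update
            return True
        return False


model = DynamicPixelModel([100] * 5, interval=3, rng=random.Random(0))
history = [
    model.step(101, motion_detected=False),
    model.step(102, motion_detected=False),
    model.step(103, motion_detected=False),  # regular update fires here
]
```

Random replacement keeps the sample count fixed and avoids favoring either old or recent samples, which is the fairness property the text emphasizes.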

4. Experiments

In this section, we compare BOWED with the most classic or the newest methods including ViBe [10], GMM [35], KDE [11], PBAS [13], SuBSENSE [27], GMG [36], LOBSTER [28], T2-FGMM with MRF [37], and IMBS [38]. The results of different methods are achieved by running available codes or software. The test video sequences came from a live monitoring video of a pumping unit with complex outdoor environments, including waving trees or crops and video noise, particularly those consisting of the rotational pumping unit in the background scene. The characteristics of the video test sequences are shown in Table 1. Furthermore, the ground truth of each frame in the videos was annotated manually.

The performance of each background modeling method was evaluated at the pixel level. Therefore, foreground detection can be regarded as a binary classification of each pixel. The framework of the CDnet 2014 challenge [39] was applied to quantitatively measure the performance of the proposed method.

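Six metrics were used for evaluation: Precision, Recall, F-measure, false positive rate (FPR), false negative rate (FNR), and percentage of wrong classifications (PWC). The metrics used to quantify performance are as follows:

$$\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN}, \\
\text{F-measure} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, &
\text{FPR} &= \frac{FP}{FP + TN}, \\
\text{FNR} &= \frac{FN}{TP + FN}, &
\text{PWC} &= 100 \times \frac{FN + FP}{TP + FN + FP + TN},
\end{aligned}$$

where true positive (TP) is the number of correctly detected foreground pixels and true negative (TN) is the number of correctly detected background pixels. In contrast, false positive (FP) is the number of background pixels that were incorrectly marked as foreground pixels, and false negative (FN) is the number of foreground pixels that were incorrectly marked as background pixels by the background subtraction method [40, 41].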
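The six metrics can be computed from the four pixel counts as in the following sketch, which uses the standard CDnet definitions. The function name `segmentation_metrics` and the counts in the usage lines are illustrative, and ratios with an empty denominator are returned as 0.0 by assumption.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, fp, fn, tn):
    """Return the six pixel-level metrics as a dictionary."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(fp, fp + tn),
        "fnr": _ratio(fn, tp + fn),
        "pwc": 100.0 * _ratio(fn + fp, tp + fn + fp + tn),
    }


# Illustrative counts for a 1000-pixel frame.
scores = segmentation_metrics(tp=80, fp=20, fn=20, tn=880)
```

For a frame that contains no foreground object, TP and FN are both zero, so only FPR and PWC remain informative, which is consistent with FPR being the metric used for the scenes without foreground objects.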

To evaluate our proposal, a subset of state-of-the-art foreground detection methods was selected for comparison, which included ViBe, GMM, and KDE. For the ViBe method, we used the program which was provided by the author and the parameter setting was also suggested by the author. For other methods, we used the implementation available in the BGSlibrary [42]. The scenes with and without foreground objects contain the same scale dynamic disturbance and were detected separately.

4.1. Scene Detection without Foreground Object

In the surveillance of practical mining, live monitoring video is usually without any foreground objects. Therefore, the ability to adapt to the situation without foreground objects plays an important role in a foreground video detection algorithm. Low false alarm rate is one of the important indicators of an actual monitoring system, so FPR is adopted to evaluate the proposed foreground detection algorithm.

Background disturbance of video scenes in this paper is complex. Background modeling not only needs to address the interference of shaking leaves and crops and video noises, but must also address the rotation of the pumping unit, which has the same scale as foreground objects. Background disturbance in Video 1 contains pumping unit rotation and trees swaying. Video 2 has pumping unit rotation and video noises caused by the system device. Video 3 has pumping unit rotation and plants shaking. The results for detecting the dynamic background without foreground objects are presented in Figure 6. It is clear that the proposed algorithm considerably outperformed the others. In the background modeling stage, shaking of trees and crops, noises interference of video system, and rotation of pumping unit should be considered as the background of video and should not be detected as the foreground. The video frames are processed in Figure 6, where the 520th frame was used in Video 1, the 850th frame was used in Video 2, and the 800th frame was used in Video 3. Figure 6(a) is the scene graph which contains only a complex dynamic background. Figure 6(b) expresses the segmentation image of dynamic background region and static background region, which were obtained by the proposed method. Figure 6(c) is the result of the proposed method. Figures 6(d)–6(l) are the results of the other nine methods. Because the proposed method adopts a dynamic pixel background modeling strategy, it can eliminate the influence of the background texture caused by waves on the trees or plants shaken by the wind, and it can also eliminate the influence of the pumping unit rotation. Although other background modeling methods can eliminate the disturbance of small amplitude motion in the background, it is difficult for them to eliminate background motion interference of the same scale, such as the pumping unit rotation.

The FPR is shown in Table 2. To verify the performance of the algorithm in the detection of the dynamic background without a foreground object, morphology was not used to process the detection results. The FPR of the proposed method is no more than 0.1%, which is almost negligible, but the FPR of other mainstream background modeling methods is higher. When KDE is used for background modeling, FPR sometimes reaches as high as 20%, which indicates that the original KDE can detect dynamic pixels easily, but its ability to suppress the interference of dynamic pixels is weak. Obviously, the higher the FPR is, the easier it is to cause misinformation, which loses the original purpose of video anomaly detection. Table 2 presents results that show BOWED outperformed the others.

4.2. Detection of Foreground Object

Figure 7 shows the foreground segmentation results of the pumping unit monitoring videos. Figure 7(a) is the scene which has foreground objects entering the complex dynamic background. Figure 7(b) is ground truth. Figure 7(c) presents the result of BOWED method. Figures 7(d)–7(l) present the results of other state-of-the-art foreground detection methods. It is clear that the proposed algorithm outperformed the others. In particular, the ViBe algorithm was used to extract the dynamic pixels of the video first, so the integrity of foreground object extracted using BOWED was equivalent to that of ViBe. However, BOWED was able to filter the perturbation of the dynamic background better, and the simple ViBe still considered the rotation of the pumping unit as a foreground target. Notable results in the other state-of-the-art algorithms were as follows: Although KDE had the most complete extraction of foreground target, it mistook many of the dynamic background pixels as foreground target pixels, so its FPR was very high. T2-FGMM with MRF had a low FPR but could not detect the object correctly.

The results of the quantitative comparison are listed in Table 3. The larger the Precision, Recall, and F-measure values and the smaller the FPR, FNR, and PWC values, the better the segmentation. The best scores are highlighted in bold. It can be seen that the proposed method obtained lower FPR and PWC scores and higher Precision scores. This shows that the proposed method misclassified fewer pixels and achieved higher accuracy among the pixels that were detected as foreground. For the FPR scores, the T2-FGMM with MRF method and BOWED had nearly the same score; however, with the T2-FGMM with MRF method, it was difficult to obtain the foreground object. For the F-measure scores, the proposed method performed as well as the PBAS and LOBSTER methods and better than the other methods. These results indicate that BOWED achieves a better overall effect, because our goal is to reduce the false alarm rate as much as possible under the premise of ensuring accuracy in foreground object detection. Although BOWED did not give the best Recall and FNR scores, the methods with better Recall and FNR scores inaccurately detected a large quantity of background pixels as foreground pixels. In that case, the pseudo foreground pixels were almost superimposed together, so that even morphological processing could not be used to separate the foreground object from the background. BOWED not only separates the foreground object from the background with ease but also has a very low false alarm rate, which achieves the main purpose of intelligent video surveillance.

5. Conclusions

This paper proposed a novel dynamic background modeling approach for foreground detection under interference from a background object that moves at the same scale as the foreground. In the proposed method, the background pixels were divided into two types: dynamic background pixels and static background pixels. First, ViBe was used to extract the value of each pixel that was detected as foreground, and then KDE was used to model the detected foreground pixels to obtain the real foreground target pixels. The selection of bandwidth for KDE combines the advantages of the mean of the adjacent sample differences and the standard deviation of the samples, and a random extraction strategy was adopted when replacing samples. With the strategy that combined regular and real-time updates, only the kernel density needed to be calculated within a given time interval; recalculating the threshold was not necessary because the background model was updated in real time when a foreground object was present. The proposed method was evaluated on highly dynamic backgrounds of complex industrial scenes that contained the periodic motion of a rotating pumping unit. The experimental results showed that our method can obtain satisfactory performance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China 12th Five-Year Plan (no. 2011ZX05039).