High-Performance Computing Strategies for Complex Engineering Optimization Problems
A TBB-CUDA Implementation for Background Removal in a Video-Based Fire Detection System
This paper presents a parallel TBB-CUDA implementation that accelerates the single-Gaussian distribution model, which is effective for background removal in a video-based fire detection system. In this framework, TBB handles the initialization of the estimated Gaussian model on the CPU, and CUDA performs background removal and adaptation of the model on the GPU. The implementation exploits the combined computing power of TBB and CUDA and can be applied in real-time environments. Over 220 video sequences are used in the experiments. The experimental results show that TBB+CUDA achieves a higher speedup than either TBB or CUDA alone. The proposed framework overcomes the limited memory bandwidth and the few execution units of the CPU, and it reduces the data transfer latency and memory latency between the CPU and GPU.
1. Introduction
Video-based fire detection systems play an important role in existing surveillance systems. Compared with conventional fire detection methods based on particle sensors, visual fire detection is more suitable for open or large spaces, and it provides abundant and intuitive information. For video-based fire detection, motion and color are the most common features. Several methods find moving and flame-colored pixels by integrating background removal algorithms with Gaussian distribution models [2–4]. In addition to motion and color cues, flame and fire flicker can be detected by analyzing the video in the wavelet domain [5–7]. These methods have been applied successfully in surveillance systems and have proven effective. However, real-time processing requires that fire detection be accelerated, and parallel processing is a suitable way to provide satisfactory performance for realistic applications.
The GPU (graphics processing unit) has recently become a popular parallel platform for large-scale data computing. CUDA (Compute Unified Device Architecture), created by NVIDIA, provides a data-parallel programming framework and enables parallel execution of C function kernels. For this reason, many developers have used CUDA to accelerate computation across various problem domains, such as signal processing, computer vision, computational geometry, and scientific computing [10–13]. However, CUDA focuses on compute-intensive calculations, and the memory latency and the data transfer latency between the CPU and GPU still need further consideration. TBB (Intel Threading Building Blocks) is a runtime-based parallel library that offers a rich methodology for expressing parallelism in C++ programs [14, 15]. As a typical fine-grained parallel model, TBB supports parallel tasks that run on threads. In addition, TBB implements task stealing to balance the parallel workload across the available processing cores, which reduces load imbalance, increases core utilization, and improves adaptability to dynamic environments. Several studies have used TBB to improve algorithms such as the Floyd-Warshall algorithm, a three-tier parallel genetic algorithm (TPGA), and the solution of large dense linear equations.
In our work, a parallel programming framework combining CUDA and TBB is provided. We apply TBB to the initialization work on the CPU and CUDA to background removal and adaptation of the model on the GPU. The hybrid parallel mode overcomes the limited memory bandwidth and the fewer execution units available to TBB on the CPU. CUDA+TBB also mitigates the major drawback of CUDA by reducing unnecessary data transfer latency and memory latency between the GPU and CPU, which accelerates the computation. The rest of this paper is organized as follows. Section 2 presents the background modeling technique based on the single-Gaussian model and the adaptation of the model parameters. Parallelization to accelerate background removal is discussed in Section 3. Section 4 presents experiments in which the parallel implementations are applied to video-based fire detection. We end this paper in Section 5 with conclusions and future work.
2. Background Modeling
The Gaussian distribution is a common probability model that is widely used in pattern recognition and image processing to describe random variables such as pixel values and noise. In digital image processing, the single-Gaussian model is used for foreground extraction in images whose background is uniform and stable. In the fire detection system, natural fire flames are seen as dynamic objects in the video images observed by a fixed camera. The color of the fire regions is also a very important feature for distinguishing fire from other objects, so color information helps us to obtain fire regions more precisely. In this paper, we use a single-Gaussian model with mean and covariance matrix extracted from the video frames, which is based on the RGB color space.
2.1. Background Removal Algorithm
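The first step is to initialize the background model. In image processing, all operations are based on pixels, and each pixel is modeled as $x_t \sim \mathcal{N}(\mu_t, \sigma_t^2)$, where $x_t$ is the value of the pixel, $\mu_t$ is the mean value of the pixel, and $\sigma_t$ is the standard deviation of the pixel. The subscript $t$ denotes time. In this project, we use $N$ frames to initialize the background. The mean and the standard deviation are computed as follows:

$$\mu_0 = \frac{1}{N}\sum_{t=1}^{N} x_t, \qquad \sigma_0 = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(x_t - \mu_0\right)^2}, \tag{1}$$

where $\mu_0$ is the mean value of $x_t$ and $\sigma_0$ is the standard deviation of $x_t$, $t = 1, \ldots, N$.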
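The initialization step can be sketched in plain C++ for a single color channel. This is a minimal illustration rather than the authors' implementation: the names `GaussianModel` and `init_background`, the frame layout (one flat array of pixel values per frame), and the choice of dividing by the number of frames in the standard deviation are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-pixel parameters of a single-Gaussian background model (one channel).
struct GaussianModel {
    std::vector<float> mean;    // mean value of each pixel
    std::vector<float> stddev;  // standard deviation of each pixel
};

// Estimate the mean and standard deviation of every pixel over the given
// initialization frames. Each frame is a flat array with one value per pixel
// (hypothetical layout chosen for this sketch).
GaussianModel init_background(const std::vector<std::vector<float>>& frames) {
    assert(!frames.empty());
    const std::size_t n_pixels = frames[0].size();
    const float n = static_cast<float>(frames.size());

    GaussianModel model{std::vector<float>(n_pixels, 0.0f),
                        std::vector<float>(n_pixels, 0.0f)};

    // First pass: accumulate the per-pixel mean.
    for (const auto& frame : frames) {
        assert(frame.size() == n_pixels);
        for (std::size_t p = 0; p < n_pixels; ++p) {
            model.mean[p] += frame[p] / n;
        }
    }
    // Second pass: accumulate the mean squared deviation (assumed 1/N form).
    for (const auto& frame : frames) {
        for (std::size_t p = 0; p < n_pixels; ++p) {
            const float d = frame[p] - model.mean[p];
            model.stddev[p] += d * d / n;
        }
    }
    // Convert the variance of each pixel into a standard deviation.
    for (std::size_t p = 0; p < n_pixels; ++p) {
        model.stddev[p] = std::sqrt(model.stddev[p]);
    }
    return model;
}
```

Because every pixel is processed independently, the loops over `p` are the natural place to split the work across threads.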
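The second step is to classify the pixel. To reduce the computation, every channel of the color space is assumed to be independent, so for each pixel the color probability is calculated by

$$p\big(x(i,j)\big) = \prod_{c \in \{R,G,B\}} p_c\big(x_c(i,j)\big), \tag{2}$$

where $p_R$, $p_G$, and $p_B$ are, respectively, the distribution models for the red, green, and blue channels, $x(i,j)$ is the pixel value at coordinate $(i,j)$ with channel components $x_c(i,j)$, and $p(x(i,j))$ is the probability density of $x(i,j)$. For a given color channel $c$ with mean $\mu_c$ and standard deviation $\sigma_c$, each distribution is given as follows:

$$p_c\big(x_c(i,j)\big) = \frac{1}{\sqrt{2\pi}\,\sigma_c}\exp\!\left(-\frac{\big(x_c(i,j) - \mu_c\big)^2}{2\sigma_c^2}\right). \tag{3}$$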
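Using the model parameters in (1), the following formula shows whether a pixel is foreground or background. Let $x_c(i,j)$, $\mu_c$, and $\sigma_c$ be the pixel value, mean, and standard deviation in color channel $c$ at coordinate $(i,j)$. Then

$$D_c(i,j) = \begin{cases} 1, & \left|x_c(i,j) - \mu_c\right| > k\,\sigma_c, \\ 0, & \text{otherwise}, \end{cases} \qquad D(i,j) = \begin{cases} 1, & \sum_{c} D_c(i,j) \ge 2, \\ 0, & \text{otherwise}, \end{cases}$$

where $D_c(i,j)$ denotes the detection result for channel $c$: it is set to 1 if the spatial location $(i,j)$ has changed in that channel and to 0 otherwise. $k$ is a constant, fixed in this experiment, that affects the final change detection. $D(i,j)$ denotes the detection result for the pixel. If changes occur in at least two color channels, $D(i,j) = 1$ and the pixel is marked as a foreground pixel, which is regarded as a fire-suspected area; otherwise it is considered background.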
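The classification rule can be sketched in C++ as follows. The sketch assumes that a channel counts as changed when the pixel value deviates from the channel mean by more than `k` standard deviations; the exact form of the threshold test, the names `Rgb`, `gaussian_pdf`, `color_probability`, and `is_foreground`, and the value of `k` used in the usage note are illustrative assumptions.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Rgb = std::array<double, 3>;  // one value per channel: R, G, B

// Gaussian density of x for one channel with mean mu and std. deviation sigma.
double gaussian_pdf(double x, double mu, double sigma) {
    assert(sigma > 0.0);
    const double pi = 3.14159265358979323846;
    const double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}

// Joint color probability under the channel-independence assumption:
// the product of the three per-channel densities.
double color_probability(const Rgb& x, const Rgb& mu, const Rgb& sigma) {
    double p = 1.0;
    for (std::size_t c = 0; c < 3; ++c) {
        p *= gaussian_pdf(x[c], mu[c], sigma[c]);
    }
    return p;
}

// A pixel is foreground when at least two of its three channels changed.
// A channel is treated as changed when |x - mu| > k * sigma (assumed test).
bool is_foreground(const Rgb& x, const Rgb& mu, const Rgb& sigma, double k) {
    int changed = 0;
    for (std::size_t c = 0; c < 3; ++c) {
        if (std::fabs(x[c] - mu[c]) > k * sigma[c]) {
            ++changed;
        }
    }
    return changed >= 2;
}
```

For example, with a hypothetical `k = 2.0`, means `{100, 100, 50}`, and standard deviations `{10, 10, 10}`, the pixel `{200, 180, 50}` changes in the red and green channels and is marked as foreground, whereas `{200, 100, 50}` changes in only one channel and stays background.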
2.2. Adaptation of Model Parameters
This step updates the background model by adapting its parameters. In general, the observed scene can change with lighting or other natural effects. To respond to such environmental changes, the mean and standard deviation of each pixel are updated recursively from their previous values and the current pixel value. In this update, $x_{c,t}(i,j)$ denotes the value of the pixel at coordinate $(i,j)$ in the $c$th color channel at time $t$, $\mu_{c,t}$ and $\mu_{c,t+1}$ denote the mean values at times $t$ and $t+1$, respectively, and $\sigma_{c,t}$ and $\sigma_{c,t+1}$ are the standard deviations at times $t$ and $t+1$, respectively. $\alpha_c$ is a constant for updating the model parameters in the $c$th color channel, which ranges from 0 to 1.
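One common form of such an update is exponential smoothing, sketched below in C++ for a single pixel and channel: the new mean is `(1 - alpha) * mean + alpha * x`, and the variance is smoothed in the same way using the squared deviation of `x` from the new mean. Placing the weight `alpha` on the current observation, using the updated mean in the deviation, and the names `PixelModel` and `update_model` are assumptions made for this sketch, not details confirmed by the text.

```cpp
#include <cassert>
#include <cmath>

// Mean and standard deviation of one pixel in one color channel.
struct PixelModel {
    double mean;
    double stddev;
};

// One adaptation step with update constant alpha in [0, 1].
// alpha = 0 keeps the old model; alpha = 1 replaces it with the observation.
// (Assumed convention: alpha weights the new observation x.)
PixelModel update_model(PixelModel m, double x, double alpha) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    const double new_mean = (1.0 - alpha) * m.mean + alpha * x;
    const double d = x - new_mean;  // deviation from the updated mean
    const double new_var =
        (1.0 - alpha) * m.stddev * m.stddev + alpha * d * d;
    return PixelModel{new_mean, std::sqrt(new_var)};
}
```

A small `alpha` makes the background adapt slowly, which suppresses noise but delays the response to lighting changes; a large `alpha` does the opposite.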
3. Parallelization of the Background Removal Model
In this section, we apply TBB+CUDA to the background removal model. The hybrid architecture of CPU and GPU is shown in Figure 1. First, the video is decoded from the AVI format on the CPU. From the decoded video frames, the mean and standard deviation are initialized on the CPU, and the initialized result is transferred to global memory on the GPU. Second, the kernel function reads the parameters from global memory to perform the multithreaded computing tasks. After all the calculations are finished, the result is transferred from the GPU back to the CPU.
3.1. Realization of Parallel Algorithm Based on TBB
TBB supports scalable parallel programming using standard ISO C++ code, and it focuses on parallel computation without requiring the programmer to deal with threads explicitly. In TBB, we specify tasks instead of threads. Tasks are mapped and scheduled to physical threads by the TBB scheduler. Moreover, TBB abstracts platform details and simplifies parallel programming. In addition, TBB provides a template-based runtime library that contains a series of data structures and algorithms, and it enables developers to concentrate on identifying concurrency rather than on managing it. Through the relevant TBB template class, the computation of the mean value and standard deviation is distributed over different threads, which makes full use of the multicore resources. In particular, the algorithm based on TBB includes the following steps.
Step 1. Installation of TBB parallel computing platform and setting environment for the preprocessing.
Step 2. Initialization of a TBB task scheduler. The task scheduler object, task_scheduler_init, is responsible for allocating the worker threads.
Step 3. Development of parallel computing template class. The template of parallel_for is selected to obtain the mean and the standard deviation of frames in the body object.
Step 4. Invocation of the parallel template function “parallel_for.” Once the loop body is written as a body object, the general form of the call is parallel_for(blocked_range<T>(begin, end, grainsize), body). parallel_for breaks this iteration space into chunks and runs each chunk on a separate thread. The grainsize specifies a reasonable number of iterations for a chunk handed to one thread. In this project, the grainsize is set to 1000.
Step 5. End the TBB task scheduler and get results.
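The role of the grainsize in Step 4 can be sketched without TBB itself. The following standard-C++ code emulates how an iteration range is halved recursively until no chunk exceeds the grainsize. `split_range` and `for_each_chunk` are hypothetical helpers, and the chunks are processed serially here; TBB's actual scheduler additionally distributes chunks across worker threads with task stealing and, depending on the partitioner, may stop splitting earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Chunk = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

// Recursively halve [begin, end) until each piece holds at most `grainsize`
// iterations, appending the resulting chunks to `out` in ascending order.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<Chunk>& out) {
    assert(begin <= end && grainsize > 0);
    if (end - begin > grainsize) {
        const std::size_t mid = begin + (end - begin) / 2;
        split_range(begin, mid, grainsize, out);
        split_range(mid, end, grainsize, out);
    } else if (begin < end) {
        out.emplace_back(begin, end);
    }
}

// Apply `body(chunk_begin, chunk_end)` to every chunk of [begin, end).
// A parallel runtime would hand these chunks to worker threads; this
// emulation simply visits them one after another.
template <typename Body>
void for_each_chunk(std::size_t begin, std::size_t end, std::size_t grainsize,
                    Body body) {
    std::vector<Chunk> chunks;
    split_range(begin, end, grainsize, chunks);
    for (const Chunk& c : chunks) {
        body(c.first, c.second);
    }
}
```

With a grainsize of 1000, a range of 4000 pixels is split into four chunks of 1000 pixels each. A smaller grainsize yields more, smaller chunks, which improves load balance at the cost of more scheduling overhead.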
3.2. Realization of Parallel Algorithm Based on CUDA
In this section, we present the CUDA-based background removal model. Details can be described as follows.
Step 1. Declare shared memory. Each block has 16 KB of shared memory, which stores the mean value and standard deviation of 256 pixels. Shared memory is divided into 16 banks. Each bank is 32 bits wide, and successive 32-bit words are assigned to successive banks. The SM executes shared-memory instructions one half-warp at a time, so the threads of a half-warp read the data linearly from the banks in shared memory.
Step 2. Assign the number of threads and the number of blocks. We set the number of threads per block to 256. The number of blocks in the grid is width × height / 256.
Step 3. Compute the position of the first pixel. In this project, each thread handles one pixel. For each thread, CUDA provides the thread index within its block and the block index within the grid, and each thread determines the location of its data in the source from these two indices together with the number of threads per block.
The resulting offset is the start of that thread's data in the source. Each thread then executes the kernel program, in accordance with the parallel execution model of CUDA.
Step 4. Copy data from global memory to shared memory. Because the threads of a block share the same shared memory, and the access latency of shared memory is lower than that of global memory, we copy the data used by the block (the mean values and standard deviations declared in Step 1) from global memory to shared memory.
Step 5. Transfer the updated data from GPU to CPU and release the memory of GPU.
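The index arithmetic behind Steps 1 to 3 can be emulated on the CPU in C++. This is not a CUDA kernel: in a kernel the two indices would come from the built-in variables `blockIdx.x` and `threadIdx.x`. The helper names `num_blocks`, `pixel_index`, and `bank_of`, the one-dimensional grid layout, and the rounding-up of the block count for images whose pixel count is not a multiple of 256 are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kThreadsPerBlock = 256;  // threads per block (Step 2)
constexpr std::size_t kBanks = 16;             // shared-memory banks (Step 1)

// Number of blocks needed so that every pixel is handled by one thread.
// Rounds up so that a partially filled last block covers the remainder.
std::size_t num_blocks(std::size_t width, std::size_t height) {
    return (width * height + kThreadsPerBlock - 1) / kThreadsPerBlock;
}

// Global pixel index handled by thread `tid` of block `bid`
// (one-dimensional grid, one pixel per thread).
std::size_t pixel_index(std::size_t bid, std::size_t tid) {
    assert(tid < kThreadsPerBlock);
    return bid * kThreadsPerBlock + tid;
}

// Bank holding the 32-bit word at position `word_index` in shared memory:
// successive words map to successive banks, wrapping around after 16.
std::size_t bank_of(std::size_t word_index) {
    return word_index % kBanks;
}
```

For a hypothetical 320 × 240 frame this gives 300 blocks. When the 16 threads of a half-warp read 16 consecutive words, `bank_of` returns 16 distinct banks, so the accesses do not conflict. With rounding up, a thread whose `pixel_index` is not smaller than width × height must skip its work.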
4. Experiment Results
The experiments were conducted on a Pentium (R) Dual-Core E6500 2.94 GHz personal computer equipped with NVIDIA’s GeForce 210 GTX graphics card. The software is Intel TBB and CUDA 4.0. We applied three different parallel methods: TBB, CUDA, and TBB+CUDA. The videos used in the experiments are real-world image sequences taken from a random selection of commercial video clips. The fires shown are wildfires in the mountains north of Athens. Matlab was used to convert each video sequence to 250 frames at several image sizes.
The runtimes measured for different image sizes are given in Table 1. The runtime of TBB is clearly lower than that of the serial algorithm. The TBB task scheduler automatically decides the number of threads to use and then distributes them over the cores according to the work-stealing algorithm. The benefit of TBB is evident when its results are compared with those of the serial approach.
The speedups of the three methods for different image sizes are listed in Table 2. Speedup is the ratio of sequential runtime to parallel runtime for the same task. As can be seen, TBB achieves a lower speedup than CUDA, because the required number of threads is much larger than the two cores of the CPU, whereas the GPU has many execution cores and a larger number of registers for data processing. The speedup of CUDA is roughly 9 or more times that of TBB, and TBB+CUDA achieves a higher speedup than CUDA alone.
Compared with the TBB-only and CUDA-only methods, TBB+CUDA significantly accelerates the single-Gaussian distribution model. It reduces the communication overhead caused by data transfer latency and memory latency between the CPU and GPU. The total runtime includes the data transfer time from CPU to GPU, the runtime of the kernel, and the data transfer time from GPU to CPU, so it is desirable to cut unnecessary latency and to reduce the proportion of CPU-GPU data transfer time in the total GPU runtime. As shown in Table 3, the latency of CUDA is about 10–19 times that of CUDA+TBB. Because TBB is used to obtain the initial parameters, the proportion of latency with CUDA+TBB shows a 2%–17% improvement over CUDA.
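The two metrics used in this section, speedup and the proportion of transfer time in the total runtime, can be written down as a short C++ sketch. The names `speedup`, `GpuTiming`, `total_runtime`, and `transfer_fraction` are hypothetical, and the numbers in the usage note are made up for illustration; they are not measurements from Tables 1 to 3.

```cpp
#include <cassert>

// Speedup: sequential runtime divided by parallel runtime for the same task.
double speedup(double t_serial, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

// The three components of the total GPU runtime, in the same time unit.
struct GpuTiming {
    double host_to_device;  // data transfer time from CPU to GPU
    double kernel;          // runtime of the kernel
    double device_to_host;  // data transfer time from GPU to CPU
};

// Total runtime: both transfers plus the kernel.
double total_runtime(const GpuTiming& t) {
    return t.host_to_device + t.kernel + t.device_to_host;
}

// Fraction of the total runtime spent on CPU-GPU data transfer.
double transfer_fraction(const GpuTiming& t) {
    const double total = total_runtime(t);
    assert(total > 0.0);
    return (t.host_to_device + t.device_to_host) / total;
}
```

For instance, a run with 2 ms of upload, 6 ms of kernel time, and 2 ms of download has a total runtime of 10 ms and spends a fraction of 0.4 of it on transfers; reducing either transfer lowers that fraction without touching the kernel.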
In this experiment, images were captured at nighttime. Experimental results are presented in Figure 2 and Figure 3, where the true positive (TP) rate of fire detection is demonstrated. An average TP rate of 94% was obtained over the 220 test video sequences in the experiment. Figure 3 shows the extent of the flame and the effect of background removal in the video frames.
5. Conclusions
To accelerate background removal, this paper proposes a hybrid parallel mode. This parallel mode consists of two phases: the initialization phase, performed by TBB on the CPU, and the parallel computing phase for background removal and adaptation of the model, performed by CUDA on the GPU. The experimental results indicate that our solution makes full use of the computing resources of both the GPU and the CPU, leading to a higher speedup than TBB or CUDA alone. The hybrid parallel mode is generic to a certain extent and can also be applied to other areas such as traffic routing and logistics location.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by “National Natural Science Foundation of China” (no. 61272523) and “the National Key Project of Science and Technology of China” (no. 2011ZX05039-003-4).
References
W. W. Jones, “An algorithm for fast and reliable fire detection,” in Proceedings of the 8th Fire Suppression and Detection Research Application Symposium, 2004.
T. X. Truong and J. M. Kim, “Fire flame detection in video sequences using multi-stage pattern recognition techniques,” Engineering Applications of Artificial Intelligence, vol. 25, no. 7, pp. 1365–1372, 2012.
D. C. Wang, X. Cui, E. Park et al., “Adaptive flame detection using randomness testing and robust features,” Fire Safety Journal, vol. 55, pp. 116–125, 2013.
B. U. Toreyin, Y. Dedeoglu, and A. E. Cetin, “Wavelet based real-time smoke detection in video,” in Proceedings of the European Signal Processing Conference, 2005.
NVIDIA, CUDA C Programming Guide, v. 3.2, NVIDIA Corp., 2010.
S. Zhang, Y. Zhu, and K. Zhao, GPU High Performance Computation-CUDA, Water Power Press, Beijing, China, 2009.
M. Czapiński, “An effective parallel multistart Tabu search for quadratic assignment problem on CUDA platform,” Journal of Parallel and Distributed Computing, vol. 73, no. 11, pp. 1461–1468, 2013.
H. Hamzaçebi, Cuda Based Implementation of Flame Detection Algorithms in Day and Infrared Camera Videos [Ph.D. dissertation], Bilkent University, 2011.
J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism, O'Reilly Media, 2010.
Electronic Publication: Digital Object Identifiers (DOIs), “Intel threading building blocks tutorial,” 2007, http://www.threadingbuildingblocks.org/documentation.
S. Zhang, Z. Wei, and W. Xuben, “Implementation of multi-core parallel computation for solving large dense linear equations based on TBB,” in Proceedings of the IEEE International Conference on Control Engineering and Communication Technology (ICCECT '12), 2012.
D. Serfass and T. Peiyi, “Comparing parallel performance of Go and C++ TBB on a direct acyclic task graph using a dynamic programming problem,” in Proceedings of the 50th Annual Southeast Regional Conference, ACM, 2012.
X. Chen, W. Chen, J. Li, Z. Zheng, L. Shen, and Z. Wang, “Characterizing fine-grain parallelism on modern multicore platform,” in Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS '11), pp. 941–946, December 2011.