Hardware Implementation of Digital Signal Processing AlgorithmsView this Special Issue
FPGA Implementation of Gaussian Mixture Model Algorithm for 47 fps Segmentation of 1080p Video
Circuits and systems able to process high quality video in real time are fundamental in nowadays imaging systems. The circuit proposed in the paper, aimed at the robust identification of the background in video streams, implements the improved formulation of the Gaussian Mixture Model (GMM) algorithm that is included in the OpenCV library. An innovative, hardware oriented, formulation of the GMM equations, the use of truncated binary multipliers, and ROM compression techniques allow reduced hardware complexity and increased processing capability. The proposed circuit has been designed having commercial FPGA devices as target and provides speed and logic resources occupation that overcome previously proposed implementations. The circuit, when implemented on Virtex6 or StratixIV, processes more than 45 frame per second in 1080p format and uses few percent of FPGA logic resources.
Real time background identification in high definition video streams is an important task in smart video systems [1–4]. Various algorithms that address the problem have been proposed in the scientific literature throughout the years [5–16].
Foreground detection algorithms based on temporal difference techniques [5–7] allow the detection of relevant objects (Foreground, Fg) by comparing two or more consecutive frames but are unreliable in presence of illumination changes.
More robust algorithms, named background subtraction algorithms [8–15] are based on the creation of a model that includes the static parts of the scene (Background, Bg). The Fg is detected by analyzing the difference between a frame and the model of the Bg. The Bg subtraction algorithms regularly update the Bg model and are resilient to slow changes of the scene. They differ for the considered model of the Bg and for the algorithm that takes care of model updating.
In  the Bg is represented with a statistical model based on a single Gaussian distribution per pixel. This allows a pixel by pixel model of the Bg with larger variance for pixels that experience wide changes of lighting.
The Gaussian Mixture Model (GMM) algorithm has been proposed in . It uses a statistical model in which the probability distribution of the luminosity of each pixel is modeled with a mixture of Gaussian distributions. The GMM algorithm provides good performances in both presence of illumination changes and multimodal Bg. A multimodal Bg is characterized by objects showing repetitive motions, for example, waves, moving leaves, or flickering light. When a pixel lies in a region where a repetitive motion occurs, its brightness oscillates between two or more values. This results in false Fg detection in most algorithms. On the contrary, when the GMM algorithm is used, the intensity distribution of the pixel is modeled using two or more Gaussian distributions and the problem of false Fg detection is mitigated.
The OpenCV (Open source Computer Vision software library developed by Intel ) provides a common base of computer vision instruments that is becoming the de facto standard for video/image processing. The Bg detection algorithm included in the OpenCV is an optimized version of the GMM algorithm of . It changes the initial learning phase for the Bg model, that is improved with respect to , and calculates the learning rate in a simplified yet efficient way. As a consequence, the OpenCV version of the GMM algorithm is both widely used and provides good performances.
1.1. Previous Hardware Implementations
In order to generate the updated Bg model, the GMM algorithm processes the video stream by computing a great number of parameters for each pixel of each frame, with a computational burden unfeasible for computers in real time applications. As an example, in , only a frame rates of 11–13 frame per second (fps) is obtained for frame size on an SGI O2 workstation.
Real time video applications with larger frame size require a dedicated hardware architecture. Hardware processors have been proposed in [17–22]. Reference  proposes a circuit able to process video sequences with frame size at 38 fps when implemented on a VirtexII FPGA platform. Processing capabilities of  are 39.8 Mega pixel per second (Mps). In  the design of an automated digital surveillance system running in real time on an embedded platform is presented. The segmentation unit of the circuit proposed in  is able to run at 83 MHz on VirtexII xc2pro30. In , the research group of  improves the memory throughput with respect to  by employing a memory reduction scheme. The improvement in memory throughput is paid with a reduction of the processing capability that is limited to 7.68 Mps.
Reference  proposes an OpenCV GMM algorithm implementation able to process 22 HD ( pixels) frames per second when implemented on Virtex5 xc5vlx50 FPGA. The circuit of  is used in  in conjunction with a denoising block that implements erosion, dilation, opening, and closing.
A circuit that implements the OpenCV GMM algorithm and provides increased performances has been presented in . The improvement in circuit performance derives from the optimization of the bit width of the signals and from a better formulation of the GMM equations.
1.2. Proposed Circuit
In this paper a hardware implementation of the GMM algorithm that allows real time processing of HD videos and comply with the OpenCV algorithm is proposed. To the best of author's knowledge, the proposed circuit is the one with highest performance proposed to date. The circuit processes gray scale videos. When color videos need to be processed, the circuit, as reported in , can be fed with the luminance channel of the YCrCb color space.
The proposed circuit is optimized for an FPGA implementation and, with respect to the best performing previous work, , increases the maximum working frequency (+77.4%) and reduces the area utilization (−3.8% of Slices and 0 versus 10 DSP).
The improved performances are obtained by (i)replacing full-width binary multipliers with truncated binary multipliers [23–26]; (ii)approximating nonlinear functions with piecewise linear approximations [27–31]; (iii)exploiting an innovative formulation of the updating equations of the GMM algorithm.
It is worth noting that the implemented design techniques (mainly the use of truncated multipliers and the approximation of nonlinear functions with piecewise linear functions) while allowing improved performances in terms of circuit speed and logic resources occupation are a modification of the reference OpenCV GMM algorithm and introduce a computation error that could result in a worsening of the quality of the processed video. The tests presented in Section 5, however, demonstrate that the proposed circuit is able to obtain very high performances while providing a good quality of the processed video sequences.
2. GMM Algorithm
The GMM algorithm has been proposed by Stauffer and Grimson, , with the target of efficiently dealing with multimodal Bg by using a statistical model composed by a mixture of Gaussian distributions. The GMM algorithm has been modified and included in the OpenCV libraries. A short description of the OpenCV GMM algorithm is given in the following. The differences with respect to the algorithm of  are indicated in the text.
2.1. Statistical Model
The statistical model for each pixel of the video sequence is composed by a mixture of Gaussian distributions, each one represented by four parameters: weight (), mean (), variance (), and matchsum, a counter introduced in the OpenCV algorithm.
Gaussian parameters differ for each Gaussian of each pixel and change for every frame of the video sequence. They are therefore defined by three indexes , where “” is the index for the pixel, “” is the index for the Gaussian distribution, and “” is for the frame. In the following the pixel index is omitted since the same operations are repeated for every pixel.
2.2. Parameters Update
When a frame is acquired, for each pixel, the Gaussian distributions are sorted in decreasing order of a parameter named Fitness (): A match condition is checked with the Gaussian distributions that model the pixel. The match condition is Equation (2) establishes if the pixel can be considered part of the Bg. A pixel can verify (2) for more than one Gaussian. The Gaussian that matches with the pixel () and has the highest value is considered as the “matched distribution” and its parameters are updated as follows: The parameter is the learning rate for the weight while is the learning rate for mean and variance. It is derived from as follows: Equation (4) is characteristic of the OpenCV algorithm and differs from what is proposed in  where is calculated, being the Gaussian probability density function as follows: The approach of  results, as shown in , in a slow convergence if compared with (4).
For the unmatched Gaussian distributions mean and variance are unchanged while the weights are updated as follows:
When the pixel does not match any Gaussians, a specific “no_match” procedure is executed and the Gaussian distribution with smallest is updated as follows: where vinit is a fixed initialization value and msumtot is the sum of the values of the matchsum of the Gaussians with highest . The weights of the Gaussians with highest are decremented as in (6) while their means and variances are unchanged.
2.3. Background Identification
The Bg identification uses the following algorithm: Equation (8) adds in succession the weights of the first Gaussian distributions, sorted beginning from the one with highest value, until their sum is greater than , a fixed threshold belonging to the interval. The set of Gaussian distributions that verify (8) represents the Bg and a pixel that matches one of these Gaussians is classified as a Bg pixel. The algorithm entails that if the “no_match” condition occurs, the pixel is classified as Fg.
3. Background Identification Circuit
Figure 1 shows a conceptual overview of a Bg identification system. A CMOS sensor captures each frame of the video sequence and a CMOS interface gives the pixel values to the Bg identification circuit. The Bg identification circuit processes the luminance value of the input “Pixel,” and the “Statistical Model” of the pixel for the given frame. The output data are the “Updated Statistical Model” and the “Fg/Bg” tag. For each pixel, the Gaussian parameters are read from an external memory and the updated parameters are stored in the memory. For each frame of the input video sequence, the Fg/Bg tags produce a binary image that can be displayed on a monitor.
In this paper, the “background identification circuit” of Figure 1 has been implemented on FPGA devices. Figure 2 shows the block diagram of the proposed circuit. The circuit is described in VHDL code and implements the GMM algorithm using, as suggested in , three Gaussian distributions for each pixel (). As a consequence, the proposed circuit processes 13 parameters per pixel (the luminance value and 12 Gaussians parameters for the statistical model of the pixel).
The software algorithm of the OpenCV library represents the Gaussians parameters as double precision (64 bit) floating point numbers. A double precision hardware implementation would be impractically large and slow and would require a very high bandwidth towards the memory. The proposed circuit represents the signals as fixed point numbers on a limited number of bits. The word length and the representation of the Gaussians parameters are determined by using the word length optimization algorithm proposed in  that minimizes the word length while constraining the Peak Signal-to-Noise Ratio (PSNR) of a set of test video streams. The input bits of the resulting circuit are the 99 bits for the parameters of the statistical model and the 8 bits of the luminance of the input pixel. The output bits are the updated statistical model, that requires 99 bits, and the Fg/Bg tag.
The optimized fixed point representation, an innovative formulation of the GMM algorithm equations (described in Section 4), the use of truncated multipliers, and the application of ROM optimization techniques (details in Section 5), allow the design of a Bg identification circuit with high working frequency and low logic resources utilization.
4. Hardware Oriented Formulation of the GMM Equations
The equations of the GMM algorithm, detailed in Section 2, if straightforwardly implemented, would be too complex for the realization of a fast and compact circuit. The proposed circuit is therefore based on the manipulation and simplification of the above-cited equations. A first step towards the simplification is the use of an optimized fixed point representation for circuit signals . A second step is the smart implementation of the nonlinear operations as will be shown in Section 5.
In addition to the above-cited techniques, in this paper, in order to reduce the logic occupation, a new order parameter for Gaussian distributions is introduced. The parameter is named IFitness (indicated in the following as ) and is defined as the inverse of the factor: With simple manipulations, taking into account (1) and (4), can be written as
Equation (10) shows that the factor can be computed as a function of the learning rate .
Moreover, in the proposed implementation, the learning rates and are quantized as power of two values: As a consequence, the factor can be computed as follows: and the multiplication between and can be performed using a shifter instead of a binary multiplier.
It is worth noting that ordering the Gaussian distributions as a function of , a process needed in the GMM algorithm in order to determine the Bg model according to (8), is equivalent to ordering versus the value.
Equations (4) and (12) can be implemented using only a shifter and a square root circuit that calculates the standard deviation . The implementation does not require the multiplier that would be needed to implement (1). The removal of the multiplier is not only beneficial for the implied reduction of hardware usage but, being the multiplier on the critical path of the Bg identification circuit, is also effective in increasing overall circuit speed. More details regarding the HW implementation of the above-described equations are given in the Section 5.
5. FPGA Implementation of the Background Identification Circuit
The schematic of the FPGA implementation of the GMM algorithm is shown in Figure 2. A detailed explanation of the functionalities and of the hardware implementation of the circuital blocks of Figure 2 is given in the following.
5.1. Learning Rate
Computes following (4).
The “learning rate” unit is implemented with three ROM, each of them stores the power of two values that better approximates as a function of ((4), (11)). The equation implemented by the ROM is in which the function “” is the approximation to the nearest integer value. The values are then used by the “IFitness” unit to shift the in order to calculate the factor according to (12).
5.2. Standard Deviation
The formulation of the GMM equations described in Section 4 requires the standard deviation values, , for the three Gaussians while the input data for the circuit include the variance values, .
The “Standard Deviation” block determines the variance values by implementing a square root circuit.
In the proposed implementation the signals are represented on 11 bits and range in [8 : 16376]. Due to the reduced number of bits of the variance signal, a ROM implementation for the square root circuit could be feasible but not optimal. In this paper the square root function has instead been implemented by using a piecewise linear approximation of the nonlinear function, a technique that has been previously proposed as an effective method to reduce hardware complexity [27–31].
The interval [8 : 16376] is divided, as shown in Figure 3, into four nonuniform intervals, . In each interval, a linear function approximates as follows: with and . In order to simplify the hardware, the coefficients are quantized as powers of two and the multiplication between and is performed by using a shifter resulting in lower area utilization and higher working frequency.
It computes according to (12).
The circuit is actually a combinatorial shifter that shifts as a function of the value and obtains the factor.
The proposed formulation of the GMM equations, that use (4) and (12) computed by the “learning rate” and “IFitness” block, implemented on Virtex5 vlx50 FPGA requires 64 Slices. This result can be compared with the implementation of (4) and (1) (the technique used in ) that, on the same FPGA, requires 123 Slices and 3 DSP. This demonstrates that the new formulation of the GMM algorithm equations is effective in reducing circuit complexity.
It verifies the match condition for the three Gaussian distributions and is composed by three identical units (one for each Gaussian). Each unit implements (15) and employs a subtractor and one comparator as follows:
5.5. Control Logic
5.6. Parameter Update
5.6.1. “Weight” and “Mean”
If the match condition is verified, these blocks update the parameters according to (3) and (6). The quantization of the learning rates as power of two allows to implement (3) by using shifters instead of multipliers.
Truncated binary multipliers, for which the optimality in terms of mean square error is analytically demonstrated, have been proposed in [23–26]. Truncated multipliers have been used with good results in a variety of applications [33–35]. A full-width digital bits multiplier computes the bits output as a weighted sum of partial products. A truncated multiplier is an multiplier with an bits output.
In the proposed circuit the inputs of the multipliers are represented on 12 bits. The output of a full-width multiplier is hence on 24 bits. In order to reduce HW complexity and improve circuit speed, the proposed circuit implements a truncated binary multiplier, based on the architecture proposed in , in which the output is directly produced on 15 bits (9 less-significant bits are discarded with an almost 40% reduction of the required HW for the multiplier).
5.6.3. “No Match”
When no Gaussians match the pixel, the “No Match” block updates the parameters of the Gaussian with highest IFitness value according to (7). Equation (7) requires the calculation of the inverse of the msumtot signal. In  a ROM-based approach is used to implement the required function. Since the msumtot signal is represented on 6 bits and its inverse is on 8 bits, the ROM size is equal to .
In the proposed paper the ROM is eliminated by using a piecewise linear approximation technique similar to those used for the calculation of the signals. The msumtot signal range [1 : 63] is divided into four nonuniform intervals and in each interval, a function approximates 1/msumtot as follows: with , , and quantized as power of two so that the multiplication between and msumtot is performed using a shifter.
This is the circuit that updates the matchsum signal according to (3) and (7). The matchsum signal is a counter, associated to every Gaussian, introduced in the OpenCV with the aim to improve the initial learning phase of the GMM algorithm (as demonstrated in ). The matchsum signal counts the number of times that a given Gaussian is matched according to (2). The introduction of the matchsum entails the synthesis of three counters that are not on the critical path. The only drawback is an increase of circuit area.
5.7. Output Selection
It establishes the values of the updated parameters depending on whether the match condition is verified or not.
5.8. Bg Identification
It verifies the background identification condition shown in (8) determining, also as a function of the “No Match” condition, if the input pixel belongs to the background or not.
6.1. CAD Tools for Design and Test
The proposed FPGA implementation of the OpenCV GMM algorithm is synthesized and implemented on Virtex6 (xc6vlx75t), Virtex5 (xc5vlx50), and Spartan3 (xc3s1000) Xilinx FPGA and on Stratix IV (SGX230HF35C2) Altera FPGA. The performances of the proposed circuit on FPGA devices and the comparison with previous art are shown in Table 1.
For Virtex6, Virtex5, and Spartan3 the synthesis has been carried out using XST and fitting and Place&Route have been carried out using ISE. The implementation on StratixIV has been conducted with Quartus II. ModelSim PE has been used to perform the circuit simulations.
6.2. Quality of the Processed Video Streams
The implementation of (12), the utilization of truncated multipliers, and the implementation of ROM approximation techniques could worsen the accuracy of the processed video streams. Figure 4 shows a selection of frames with resolution extracted from two video sequences. The original frames of Figures 4(a) and 4(d) have been processed by using the proposed implementation (Figures 4(b) and 4(e)) or by using the circuit of  that uses full-width multipliers and no ROM compression techniques (Figures 4(c) and 4(f)). The Bg masks of Figures 4(b), 4(e), and 4(c), 4(f) differ in few pixels (in both the video sequences less than the 1% of the total number of the pixels is differently classified by the two circuits). This demonstrates that the proposed implementation provides negligible effect on the quality of the processed video streams.
6.3. Circuit Performances
Table 1 shows that the circuit of Figure 2, implemented on high-performances FPGA devices, is able to process more than 43 HD fps by using less than the 5% of the FPGA resources. Implemented on Virtex6 processes 46 HD fps by using the 2% of the total number of Slice of the FPGA. On Virtex5 the frame rate is 43 HD fps while the logic utilization is equal to 301 of 7200 Slices, when implemented on StratixIV processes 47 HD fps by using less than the 1% of Logic Adaptive Blocks (LAB).
The performances analysis has been conducted by including input and output registers that synchronize the circuits and provide timing performances that are not dependent on the I/O pads. The reported frame rate values are obtained by considering the maximum work frequency of the circuits (Frequency in Table 1), obtained by imposing a constraint on the minimum period of the clock signal. For each implementation of Table 1, the synthesis of the circuit has been carried out by repeatedly reducing the constraint. The process has been stopped when a 0.1 ns reduction made the implementation impossible. The last synthesized clock period has determined the maximum work frequency of the circuit. The timing analysis of the obtained implementations has allowed to identify the critical path of the circuit that includes “Match,” “Control Logic,” “Variance,” and “Output Selection” units of Figure 2. However, the very closed constraint on the minimum clock period entails that all the paths between inputs and outputs of the circuit have delay values almost equal to the maximum delay.
The proposed FPGA implementation, as shown in Table 1 outperforms the circuit proposed in . Since that the proposed circuit improves many blocks that in  was on the critical path, when the proposed circuit is implemented on Virtex5 xc5vlx50, the maximum working frequency almost doubles (+77.4%), while the area utilization is strongly reduced (−3.8% in number of used Slices and zero DSP used versus 10 DSP).
Table 1 also shows the performances of the circuit on low-cost Spartan3 FPGA. Implemented on Spartan3, the proposed circuit is able to run at 35.30 MHz by using the 13% of the total number of Slice of the FPGA.
An OpenCV compatible hardware implementation of the GMM algorithm for Bg identification is presented in the paper. The circuit has been designed having commercial FPGA devices as a target and is able to process 1080p video at 47 frames per second when implemented on Statix IV FPGA.
The circuit exploits an innovative formulation of the equations of the GMM algorithm, approximates some nonlinear functions with piecewise linear polynomials, and uses truncated binary multipliers. The circuit is implemented on both Altera and Xilinx FPGA and, when compared with previously proposed Bg identification circuits, provides improved speed and reduced logic utilization.
I. Haritaoglu, D. Harwood, and L. S. Davis, “A fast background scene modeling and maintenance for outdoor surveillance,” in Proceedings of the 15th International Conference on Pattern Recognition, no. 4, pp. 179–183, Barcelona, Spain, 2000.View at: Google Scholar
H. Jiemnez and J. Salas, “Temporal templates for detecting the trajectories of moving vehicles,” in Advanced Concepts for Intelligent Vision Systems, vol. 5807 of Lecture Notes in Computer Science, pp. 485–493, 2009.View at: Google Scholar
M. Ren and H. Sun, “A practical method for moving target detection under complex background,” Computer Engineering, vol. 31, no. 20, pp. 33–40, 2005.View at: Google Scholar
C. Ridder, O. Munkelt, and H. Kirchner, “Adaptive background estimation and foreground detection using Kalman filtering,” in Proceedings of the 10th International Congress for Applied Mineralogy (ICAM '95), pp. 193–199, 1995.View at: Google Scholar
K. P. Karmann and A. Brandt, “Moving object recognition using an adaptive background memory,” in Time-Varying Image Processing and Moving Object Recognition, V. Capellini, Ed., pp. 289–307, Elsevier, 2nd edition, 1990.View at: Google Scholar
K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: principles and practice of background maintenance,” in Proceedings of the 1999 7th IEEE International Conference on Computer Vision (ICCV '99), pp. 255–261, September 1999.View at: Google Scholar
C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: real-time tracking of the human body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.View at: Google Scholar
C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), pp. 246–252, Fort Collins, Colo, USA, June 1999.View at: Google Scholar
OpenCV library on source forge, http://sourceforge.net/projects/opencvlibrary/.
M. Genovese and E. Napoli, “FPGA implementation of OpenCV compatible background identification circuit,” in Proceedings of the 3rd International Symposium on Computational Modelling of Objects Represented in Images: Fundamentals, Methods and Applications (CompIMAGE '12), pp. 75–80, Rome, Italy, September 2012.View at: Google Scholar
D. De Caro, E. Napoli, and A. G. M. Strollo, “Direct digital frequency synthesizers using high-order polynomial approximation,” in Proceedings of the IEEE International Solid-State Circuits Conference, vol. 1, pp. 134–135, San Francisco, Calif, USA, February 2002.View at: Google Scholar
P. KaewTraKulPong and R. Bowden, “An improved adaptive background mixture model for real-time tracking with shadow detection,” in Proceedings of the 2nd European Workshop Advanced Video Based Surveillance Systems, 2001.View at: Google Scholar
S. S. Kidambi and P. El-Guibaly, “Area-efficient multipliers for digital signal processing applications,” IEEE Transactions on Circuits and Systems II, vol. 43, no. 2, pp. 90–95, 1996.View at: Google Scholar