Abstract

We propose a time-consistent video segmentation algorithm designed for real-time implementation. Our algorithm is based on a region-merging process that combines both spatial and motion information. The spatial segmentation benefits from an adaptive decision rule and a specific merging order. Our method has proven to be efficient for the segmentation of natural images with few parameters to set. Temporal consistency of the segmentation is ensured by incorporating motion information through an improved change-detection mask. This mask is designed using both illumination differences between frames and the region segmentation of the previous frame. By considering both pixel and region levels, we obtain a particularly efficient algorithm at a low computational cost, allowing its real-time implementation on the TriMedia processor for CIF image sequences.

1. Introduction

The segmentation of each frame of a video into homogeneous regions is an important issue for many video applications such as region-based motion estimation, image enhancement (since different processing may be applied to different regions), and 2D-to-3D conversion. These applications require two main features from segmentation: accurate region boundaries in the spatial segmentation and temporal stability of the segmentation from frame to frame.

As far as spatial segmentation is concerned, methods can be classified into two main categories, namely, contour-based and region-based methods. In the first category, edges are computed and connected components are extracted [1]. One of the drawbacks of such an approach is that the computation of the gradient is prone to large errors, especially on noisy images. Moreover, the closure of the edges in order to create connected regions is a difficult task, and an efficient resolution of such a problem may induce cumbersome computations. Finally, such an approach cannot exploit statistical properties of the considered image regions. Region-based segmentation methods avoid these drawbacks by considering regions as basic elements. Among region-based segmentation methods [2–6], we are interested here in a bottom-up segmentation approach where regions are grown using a merging process. In such approaches, similar neighbouring regions are merged according to a decision rule [7, 8]. The initial regions can be the pixels or an over-segmentation of the image, which can be obtained by a watershed algorithm [9, 10]. As mentioned in [11], bottom-up algorithms rely on three notions: a model for the description of a region, a merging predicate, and a merging order. This gives rise to numerous heuristics according to the different choices made for these three steps [4, 7, 12–14]. Compared to other classical approaches, for example, [7, 12, 13], the authors of [4] have recently proposed an adaptive threshold justified by statistical inequalities. They obtain good results with few parameters to tune. However, in the context of a real-time implementation, their merging predicate still requires too many computations. Moreover, their algorithm is dedicated to the segmentation of still images and so does not take into account the temporal dimension of video sequences.

When dealing with video segmentation, various algorithms have been proposed in the literature. The first class of approaches performs a 3D segmentation by considering the spatiotemporal data as a volume. We can cite the work of [15], which exploits the 3D structure tensor for segmentation. Other recent works propose 3D approaches using a mean-shift-based analysis [16, 17]. Let us note that if each shot is segmented as a 3D volume, the number of frames to store for each segmentation may be unbounded. On the other hand, if the number of stored frames is artificially limited by the available memory, some 3D regions may be artificially split on long shots. Therefore, 3D approaches require the storage of several frames in memory and necessitate a high bandwidth, which is a drawback for the design of electronic devices.

The second class of methods concerns frame-by-frame algorithms. In these approaches, the spatial segmentation of the second frame is deduced from the spatial segmentation of the first frame using motion estimation [13, 18–20]. Regions from adjacent frames are then merged according to motion similarity, colour similarity, or localisation similarity. In such approaches, a matching is performed between regions of the different frames. All the regions are then linked, and video object tracking algorithms [20] may then take benefit of such a correspondence between regions.

On the other hand, some applications, such as image enhancement or video compression, may need a coherent segmentation between frames without requiring an exact tracking of each region from frame to frame. In this paper, we propose a segmentation algorithm devoted to such applications. The first aim of our algorithm is thus not to match the regions of two consecutive frames but only to take benefit of the spatial segmentation of the first frame in order to construct a coherent spatial segmentation of the second one.

Our contributions may be divided into three points.

(i) Spatial segmentation: our spatial segmentation benefits from both an adaptive decision rule and an original merging order. As in [4], the adaptive threshold is computed using a statistical modelling of the region combined with the statistical inequality of McDiarmid [21]. However, in our approach, each pixel is modelled as a single random variable (in [4], the authors model each pixel as a sum of random variables). This method gives a simpler predicate that is better suited to real-time implementation. Good results are obtained for spatial segmentation with few parameters to be set.

(ii) Temporal consistency: another contribution is the design of a region segmentation that does not undergo strong variations over time. We propose to simply take benefit of scene-change detection, which is widely used in video segmentation [22–24], rather than motion estimation, which remains a real bottleneck for real-time implementation. We construct a coherent segmentation from frame to frame by combining both pixel and region information through the use of an improved change detection mask (CDM) that exploits the region segmentation of the previous frame. Experimental results conducted on real video sequences demonstrate a good temporal consistency.

(iii) Hardware implementation: as far as the implementation is concerned, we exploit the data level parallelism (DLP) by processing some basic treatments in parallel. Moreover, the classical union-find data structure [25] is improved by using local registers to reduce the access time of find operations. We obtain an efficient algorithm for video segmentation at a low computational cost. Our method runs in real time on the TriMedia processor for CIF image sequences.

The paper is organised as follows. The spatial segmentation method is detailed in Section 2. The temporal consistency improvement is explained in Section 3. In Section 4, we discuss the implementation of the algorithm. Experimental results and measures are given in Section 5.

2. Spatial Segmentation

Let us consider an image $I$. The notation $|R|$ represents the cardinal of a region $R$, and $I(p)$ denotes the pixel intensity at position $p$ in the frame $I$.

A region-based segmentation problem aims at finding a relevant partition of the image domain into regions $\{R_1, \dots, R_N\}$. We focus here on region-merging algorithms where a decision criterion determines whether two regions must be merged or not. In this paper, we first introduce a statistical model for the regions. We then detail how these statistical tools are used for the computation of the merging predicate. We finally explain the whole merging algorithm and especially the order of merging.

2.1. Statistical Model

Images are corrupted by noise, which gives a random part to pixel intensities. Due to this random part in image acquisition systems, an image $I$ is classically considered to be an observation of a perfect statistical image $I^*$. The intensity of a pixel is then modelled as the observation of a random variable (r.v.) whose values belong to the interval $[0, g]$ (e.g., $g = 255$ for 8-bit images). An ideal region $R^*$ is then represented by a vector of independent r.v. $R^* = (X_1, \dots, X_{|R^*|})$. Let us denote by $R$ the real region associated to $R^*$, that is, composed of the same set of pixels as $R^*$. The intensity of the $i$th pixel of $R$ within $I$ is then considered as an observation of the r.v. $X_i$. Following [4], we define a partition of $I^*$ into homogeneous regions by the following requirements:

(1) all the pixels of any statistical region should have the same expectation;
(2) two adjacent pixels belonging to different statistical regions should have different expectations.

Such a definition may be easily extended to multichannel images [4] by requiring that the pixel expectations are equal on each channel within one region and that the expectation of at least one channel differs between pixels belonging to different regions.

Note that according to our definition, all the pixels of one region should have the same expectation. The regions extracted by a segmentation algorithm based on this definition should thus be composed of pixels with a nearly constant intensity (we thus assume an underlying flat facet model). This criterion may be justified by the reflective properties of surfaces. Indeed, the reflection of light on a surface is determined by a Lambertian and a specular component [26]. The specular component produces specular spikes, often characterised by regions with a nearly maximal intensity. The specular component decreases abruptly and may be neglected, within a segmentation scheme, outside the specular spikes. The intensity of a Lambertian surface varies slowly according to its normals. A region of the image with a nearly constant value thus corresponds either to a specular spike or to a Lambertian surface with an almost constant normal. Such a segmentation scheme thus provides a partition which summarises the main physical and geometrical properties of a 3D scene. Higher-level processes, such as the segmentation of the image into objects or the segmentation of textured objects [27], would require providing the algorithm with a priori knowledge about the expected objects of the scene or about what a textured area is.

In order to be self-contained, let us now introduce the very useful statistical inequality proposed by [21] and introduced within the region segmentation framework by [4]. We use this inequality for the computation of the merging predicate.

Theorem 1 (McDiarmid's inequality). Let $X_1, \dots, X_n$ be independent random variables whose observations take their values in a measurable space $\mathcal{A}$, and let $f : \mathcal{A}^n \to \mathbb{R}$ be a function that satisfies the following constraint for every $k \in \{1, \dots, n\}$: $$\left| f(x_1, \dots, x_k, \dots, x_n) - f(x_1, \dots, x_k', \dots, x_n) \right| \le c_k,$$ where $x_k$ and $x_k'$ are two different possibilities for the $k$th component of an observation vector. Then for every $\varepsilon > 0$, $$P\left( \left| f(X_1, \dots, X_n) - \mathbb{E}\left[ f(X_1, \dots, X_n) \right] \right| \ge \varepsilon \right) \le 2 \exp\left( \frac{-2\varepsilon^2}{\sum_{k=1}^{n} c_k^2} \right).$$

2.2. Merging Predicate

In order to compute a merging predicate, we consider two regions $R_1$ and $R_2$ of a current partition. The associated vectors of r.v. in the ideal image $I^*$ are respectively denoted by $R_1^*$ and $R_2^*$. The r.v. $\overline{R_1^*}$ and $\overline{R_2^*}$ denote respectively the means of $R_1^*$ and $R_2^*$. We suppose that $R_1^*$ and $R_2^*$ belong to the same homogeneous region of $I^*$. Our default decision rule thus consists in merging the two regions $R_1$ and $R_2$, respectively associated to $R_1^*$ and $R_2^*$. However, under the hypothesis that $R_1^*$ and $R_2^*$ are included in the same homogeneous region of $I^*$, the probability that $|\overline{R_1^*} - \overline{R_2^*}|$ is greater than a given value is bounded by Theorem 1. If this probability falls under a given threshold, we refuse the hypothesis and thus do not merge the two regions $R_1$ and $R_2$.

More precisely, let us consider the vector $R^* = (R_1^*, R_2^*) = (X_1, \dots, X_{|R_1^*|}, X_{|R_1^*|+1}, \dots, X_{|R_1^*|+|R_2^*|})$ and the mean functions $$\overline{R_1^*} = \frac{1}{|R_1^*|}\sum_{i=1}^{|R_1^*|} X_i, \qquad \overline{R_2^*} = \frac{1}{|R_2^*|}\sum_{i=|R_1^*|+1}^{|R_1^*|+|R_2^*|} X_i.$$ Our merging decision rule is based on the following theorem.

Theorem 2. Let one consider two vectors of r.v. $R_1^*$ and $R_2^*$ encoding the intensities of two connected regions of an ideal image $I^*$. Under the hypothesis that $R_1^*$ and $R_2^*$ are included in the same homogeneous region of $I^*$ and using the previously defined notations, one has, for every $\varepsilon > 0$, $$P\left( \left| \overline{R_1^*} - \overline{R_2^*} \right| \ge \varepsilon \right) \le 2 \exp\left( \frac{-2\varepsilon^2}{g^2 \left( 1/|R_1^*| + 1/|R_2^*| \right)} \right),$$ where $|R_i^*|$ denotes the size of vector $R_i^*$ (i.e., the cardinal of the associated region $R_i$).

Proof. Let us consider a vector $r = (r_1, \dots, r_{|R_1^*|+|R_2^*|})$ in $[0, g]^{|R_1^*|+|R_2^*|}$. This vector may be considered as an outcome of the r.v. $R^* = (R_1^*, R_2^*)$. In order to apply the McDiarmid theorem, we define the following function: $$f(r) = \frac{1}{|R_1^*|}\sum_{i=1}^{|R_1^*|} r_i - \frac{1}{|R_2^*|}\sum_{i=|R_1^*|+1}^{|R_1^*|+|R_2^*|} r_i.$$
Let us compute the variation of the function. If we make a variation on the intensity of one component $r_k$ with $k \le |R_1^*|$, replacing $r_k$ by $r_k'$, we have $$\left| f(r_1, \dots, r_k, \dots) - f(r_1, \dots, r_k', \dots) \right| = \frac{|r_k - r_k'|}{|R_1^*|} \le \frac{g}{|R_1^*|} = c_k.$$
This gives us the value of the bounding coefficients $c_k$ for the first $|R_1^*|$ variables. Similarly, if we make a variation on the intensity of one component $r_k$ with $k > |R_1^*|$, we obtain $c_k = g/|R_2^*|$. We then compute the sum over all the variables: $$\sum_{k=1}^{|R_1^*|+|R_2^*|} c_k^2 = |R_1^*|\frac{g^2}{|R_1^*|^2} + |R_2^*|\frac{g^2}{|R_2^*|^2} = g^2\left( \frac{1}{|R_1^*|} + \frac{1}{|R_2^*|} \right).$$
Moreover, according to our hypothesis, if $R_1^*$ and $R_2^*$ belong to the same homogeneous region of $I^*$, all the pixels of $R_1^*$ and $R_2^*$ have the same expectation. We have thus $\mathbb{E}[f(R^*)] = \mathbb{E}[\overline{R_1^*}] - \mathbb{E}[\overline{R_2^*}] = 0$, and we obtain the expected result by using Theorem 1 conjointly with the above sum.

Note that the bound on the probability provided by Theorem 2 may be equivalently represented as follows. After some basic calculus, we find that, under the assumption that $R_1^*$ and $R_2^*$ are included in the same homogeneous region of $I^*$, we have $|\overline{R_1^*} - \overline{R_2^*}| \ge \varepsilon(R_1, R_2)$ with a probability at most $\delta$, with $$\varepsilon(R_1, R_2) = g\sqrt{\frac{1}{2}\ln\left(\frac{2}{\delta}\right)\left( \frac{1}{|R_1|} + \frac{1}{|R_2|} \right)}.$$

Below the probability $\delta$, which is supposed to be low, we consider that the event is not probable. In this case, we refuse the initial hypothesis stating that $R_1^*$ and $R_2^*$ belong to the same homogeneous region of $I^*$ and thus do not merge the two regions. Our merging predicate may thus be stated as follows: merge $R_1$ and $R_2$ if $$\left| \overline{R_1} - \overline{R_2} \right| \le g\sqrt{\frac{1}{2}\ln\left(\frac{2}{\delta}\right)\left( \frac{1}{|R_1|} + \frac{1}{|R_2|} \right)},$$ where $\overline{R_1}$ and $\overline{R_2}$ denote respectively the values of $\overline{R_1^*}$ and $\overline{R_2^*}$ for the observation $I$. These two terms represent the mean value of the two regions $R_1$ and $R_2$. The term $g$ denotes the maximum intensity level ($g = 255$ for 8-bit gray-scale images).
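As an illustration, the following C++ fragment sketches how this predicate, as reconstructed above, can be evaluated from the running statistics of two regions. It is only a sketch, assuming 8-bit grayscale data and a user-chosen probability delta; the structure RegionStats and the functions shouldMerge and merge are hypothetical names, not part of the original implementation.

#include <cmath>

// Running statistics kept for each region during the merging process.
struct RegionStats {
    double sum;   // sum of the pixel intensities of the region
    int    card;  // number of pixels (cardinal) of the region
    double mean() const { return sum / card; }
};

// Merging predicate derived from McDiarmid's inequality:
// merge if |mean1 - mean2| <= g * sqrt(0.5 * ln(2/delta) * (1/|R1| + 1/|R2|)).
bool shouldMerge(const RegionStats& r1, const RegionStats& r2,
                 double delta, double g = 255.0)
{
    const double threshold =
        g * std::sqrt(0.5 * std::log(2.0 / delta)
                      * (1.0 / r1.card + 1.0 / r2.card));
    return std::fabs(r1.mean() - r2.mean()) <= threshold;
}

// When two regions are merged, their statistics are simply accumulated.
RegionStats merge(const RegionStats& r1, const RegionStats& r2)
{
    return { r1.sum + r2.sum, r1.card + r2.card };
}

With this form, the threshold is larger for small regions and tightens as the regions grow, which reflects the adaptive nature of the decision rule.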

Note that our merge criterion is equivalent to $$\mathrm{SE}(R_1 \cup R_2) - \left( \mathrm{SE}(R_1) + \mathrm{SE}(R_2) \right) \le \frac{g^2}{2}\ln\left(\frac{2}{\delta}\right),$$ where $\mathrm{SE}(R) = \sum_{p \in R} (I(p) - \overline{R})^2$ denotes the squared error of region $R$. The left member of this last equation corresponds to the difference between the squared error of $R_1 \cup R_2$ and the sum of the squared errors of $R_1$ and $R_2$ [28]. Our merge criterion may thus also be interpreted as a bound on the increase of the squared errors of the regions.

Our criterion may be adapted to multichannel images as follows: merge $R_1$ and $R_2$ if $$\max_{c \in C} \left( \left| \overline{R_1^c} - \overline{R_2^c} \right| - g_c\sqrt{\frac{1}{2}\ln\left(\frac{2}{\delta}\right)\left( \frac{1}{|R_1|} + \frac{1}{|R_2|} \right)} \right) \le 0,$$ where $\overline{R_i^c}$ represents the mean value of the region $R_i$ for the channel $c$ taken in the set of channels $C$, and $g_c$ denotes the maximum value on channel $c$. We take the maximum of the values obtained for each channel as a criterion. Indeed, if the predicate is true, it is true for all the channels and so the merge hypothesis is accepted. In this paper, we have chosen the YUV space, which is the native colour space of video sequences.

Both our method and the one of Nock [4] are based on the McDiarmid inequality. However, Nock models each pixel of the ideal image as a sum of $Q$ random variables, whereas our method only uses one r.v. per pixel. The approach proposed by Nock consists in fixing the probability and in using $Q$ in order to vary the merge threshold. From our point of view, the probability $\delta$ below which we refuse the merge hypothesis has a more straightforward interpretation than the variable $Q$. The resulting criteria are slightly different: our criterion differs by a multiplicative factor from the one first proposed by Nock. Our criterion is also significantly different from the second Nock criterion, which uses an estimate of the number of final regions whose cardinal is equal to a given value. However, both our criterion and the final Nock criterion may be related, ours being stricter than the one of Nock [4] for a given probability $\delta$.

Let us note finally that the way we derived our criterion provides an alternative explanation of the possible over-merging produced both by our algorithm and by the one of Nock. Indeed, our basic hypothesis consists in supposing that $R_1^*$ and $R_2^*$ belong to the same homogeneous region of $I^*$. As in the a contrario approaches first introduced by [29], we refuse this hypothesis only when we observe an event which has a low probability (according to $\delta$) of occurring under this hypothesis. We may thus merge regions corresponding to different homogeneous regions of $I^*$ if our observation does not contradict our hypothesis.

2.3. Merging Order

An edge denotes a couple of adjacent pixels in a 4-connectivity scheme. The set of edges of an image is denoted by $E$ and the number of edges by $N_e$. The order of merging is built on the edge weights as in [4, 12]. The idea behind this order of merging is to merge similar regions first rather than different ones. The similarity between pixels is measured by computing the distance between two pixel intensities as follows: $$w\big((p_1, p_2)\big) = \left| I(p_1) - I(p_2) \right|.$$ For colour images, the edge weight becomes $$w\big((p_1, p_2)\big) = \max_{c \in \{c_1, c_2, c_3\}} \left| I^c(p_1) - I^c(p_2) \right|,$$ where $c_1$, $c_2$, $c_3$ denote the three channels of a particular colour space.

Note that alternative weights may be designed. For example, one may balance the distance along each axis of a colour space by some weight (or, equivalently, scale each axis according to its weight). Numerous colour spaces with different properties may be chosen for the edge weight. For our algorithm, we consider the YUV colour space, which is the native colour space of CIF sequences. Other colour spaces may provide partitions with a slightly greater subjective quality but at a higher computational cost.

The edges are sorted in increasing order of their weights, and the corresponding couples of pixels are processed in this order for merging. This sorting step only requires two traversals of the image: the first traversal computes the histogram of edge weights; the second traversal stores each edge in an array associated to its weight. The amount of memory required for each array is deduced from the histogram of edge weights. This sorting step is similar to the one usually used within the watershed algorithm [9].
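The following C++ sketch illustrates this two-pass bucket sort, assuming 8-bit edge weights (values in 0..255); the type Edge and the function sortEdgesByWeight are illustrative names, not part of the original implementation.

#include <cstdint>
#include <vector>

struct Edge { int p1, p2; };  // indices of the two adjacent pixels

// Sort edges by increasing weight with a two-pass bucket sort.
// weights[i] is the (0..255) weight of edges[i].
std::vector<Edge> sortEdgesByWeight(const std::vector<Edge>& edges,
                                    const std::vector<uint8_t>& weights)
{
    // First traversal: histogram of edge weights.
    std::vector<int> hist(256, 0);
    for (uint8_t w : weights) ++hist[w];

    // Prefix sum gives the start index of each weight bucket in the output.
    std::vector<int> start(256, 0);
    for (int w = 1; w < 256; ++w) start[w] = start[w - 1] + hist[w - 1];

    // Second traversal: place each edge in the bucket of its weight.
    std::vector<Edge> sorted(edges.size());
    for (size_t i = 0; i < edges.size(); ++i)
        sorted[start[weights[i]]++] = edges[i];

    return sorted;  // edges in increasing order of weight, obtained in linear time
}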

2.4. Merging Algorithm

Our spatial segmentation can be divided into three steps. In the first one, we compute the weights of the edges and their histogram. In the second step, we sort the edges in increasing order of their weights. In the last step, we merge pixels or regions connected by edges, following this order. Algorithm 1 describes more particularly the merging loop.

for i = 1 to Ne do
    Read the ith edge: (p1, p2)
    S1 = FIND(p1)
    S2 = FIND(p2)
    if P(S1, S2) = True then
        UNION(S1, S2)
    end if
end for

The term Ne represents the number of edges within the image in 4-connectivity. In the merging process, we use the union-find data structure [25]. The union function merges two disjoint regions into one region, and the find function identifies the region to which a given pixel belongs. Implementation details are given in Section 4.
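For completeness, the following C++ sketch shows a classical array-based union-find with path compression, in the spirit of [25]; it is a generic illustration rather than the exact data organisation used on TriMedia, which is described in Section 4.

#include <vector>

// parent[p] == p means that p is the root (representative) of its region.
std::vector<int> parent;

void init(int numPixels) {
    parent.resize(numPixels);
    for (int p = 0; p < numPixels; ++p) parent[p] = p;  // each pixel starts as its own region
}

// FIND with path compression: every traversed node is re-attached to the root.
int find(int p) {
    int root = p;
    while (parent[root] != root) root = parent[root];
    while (parent[p] != root) { int next = parent[p]; parent[p] = root; p = next; }
    return root;
}

// UNION of the two disjoint regions containing p1 and p2.
void unite(int p1, int p2) {
    int r1 = find(p1), r2 = find(p2);
    if (r1 != r2) parent[r2] = r1;  // region statistics would be accumulated here as well
}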

3. Time Consistency Improvement

In video segmentation, the quality of the spatial segmentation is not the only requirement; time consistency is also a very important one. If, in two successive frames, one region is segmented very differently because of noise, occlusion, or disocclusion, the segmentation results would be very difficult to exploit for applications like image enhancement, depth estimation, and motion estimation. Many works, see for example [19], use motion estimation to improve time consistency in video segmentation. However, motion estimation [30] is a real bottleneck for real-time implementation and is even sometimes unreliable. In this paper, we combine an improved change detection mask (CDM) with spatial segmentation in order to improve the temporal consistency of our segmentation.

3.1. Change Detection Mask

The CDM is designed using both illumination differences between frames and region segmentation of the previous frame.

We first detect changing pixels using the frame difference. Then, we take benefit of the region segmentation of the previous frame in order to classify the pixels not only at a pixel level but also at a region level.

Given the current frame $I_t$ and the previous one $I_{t-1}$, the frame difference is given by $$FD(p) = \left| I_t(p) - I_{t-1}(p) \right|.$$ Classically, $FD$ is thresholded in order to distinguish changing pixels from noise. The pixel label is given by $$l(p) = \begin{cases} 1 & \text{if } FD(p) > T_{noise}, \\ 0 & \text{otherwise}, \end{cases}$$ where $T_{noise}$ is a positive constant chosen according to the noise level of the image. This threshold may be set experimentally (Section 5) or estimated according to any measure of the image noise. A pixel $p$ with $l(p) = 1$ is considered as a changing pixel. We then use the previous segmentation in order to convert the CDM from the pixel level to a region level, which is more reliable [23]. For each region $R_i$ in the previous segmentation, we compute $$N_c(R_i) = \sum_{p \in R_i} l(p),$$ which denotes the number of changing pixels of the current image whose coordinates belong to $R_i$ in the previous segmentation. We then compute the ratio $\tau(R_i) = N_c(R_i)/|R_i|$, which represents the proportion of changing pixels between the previous and the current image in the region $R_i$. Pixels are then classified using three categories: pixels of static regions ($\tau(R_i) \le T_{region}$), non-changing pixels within changing regions ($\tau(R_i) > T_{region}$ and $l(p) = 0$), and changing pixels within changing regions ($\tau(R_i) > T_{region}$ and $l(p) = 1$), where $T_{region}$ is a positive constant. In the experiments, a region is declared to be a changing region as soon as it contains a small proportion of changing pixels; the value of the threshold $T_{region}$ is chosen so that we do not miss any changing region.
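The following C++ sketch illustrates this two-level classification, assuming 8-bit grayscale frames stored as flat arrays and a per-pixel label map from the previous segmentation; the thresholds tNoise and tRegion and all names are illustrative placeholders rather than the original implementation.

#include <cstdint>
#include <cstdlib>
#include <vector>

enum PixelClass : uint8_t { STATIC_REGION = 0, STATIC_IN_CHANGING = 1, CHANGING = 2 };

// prevLabel[p] is the region label of pixel p in the previous segmentation.
std::vector<PixelClass> computeCDM(const std::vector<uint8_t>& cur,
                                   const std::vector<uint8_t>& prev,
                                   const std::vector<int>& prevLabel,
                                   int numRegions, int tNoise, double tRegion)
{
    const size_t n = cur.size();
    std::vector<uint8_t> changing(n);
    std::vector<int> regionSize(numRegions, 0), regionChanging(numRegions, 0);

    // Pixel level: threshold the frame difference.
    for (size_t p = 0; p < n; ++p) {
        changing[p] = (std::abs(int(cur[p]) - int(prev[p])) > tNoise) ? 1 : 0;
        regionSize[prevLabel[p]] += 1;
        regionChanging[prevLabel[p]] += changing[p];
    }

    // Region level: a region of the previous segmentation is "changing" when
    // its proportion of changing pixels exceeds tRegion.
    std::vector<PixelClass> cdm(n);
    for (size_t p = 0; p < n; ++p) {
        const int r = prevLabel[p];
        const bool changingRegion =
            double(regionChanging[r]) / double(regionSize[r]) > tRegion;
        if (!changingRegion)   cdm[p] = STATIC_REGION;
        else if (changing[p])  cdm[p] = CHANGING;
        else                   cdm[p] = STATIC_IN_CHANGING;
    }
    return cdm;
}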

Every pixel of a region qualified as static is labelled as belonging to a static region. The two other labels concern pixels within changing regions. Depending on the value of the frame difference, the pixel is qualified either as a changing one or as a static one. Such a classification is then used to segment the current frame. An example of classification is given in Figure 1 for the video sequence “Table”.

3.2. Merging Process

The merging process is now divided into three main steps. Firstly, static regions are kept as they were segmented in the previous frame. Secondly, we apply a connected component labelling (CCL) algorithm [31] to extract connected components of non-changing pixels within changing regions. This second step builds seeds from the segmentation of the previous frame. These seeds link the current segmentation to the previous one in a time-consistent way. Thirdly, we apply the spatial segmentation only on edges connecting a changing pixel within a changing region to a pixel belonging to a changing region; this last pixel may be either changing or static. Note that the static pixels within changing regions have been connected in the second step by the CCL algorithm.

The whole process can be formalised as follows. Considering an edge between two pixels, we define a function of the CDM labels of its two endpoints.

This function allows us to classify the edges into the following three categories (a brief summary is provided in Figure 2).

(i) The first category (Figure 2(a)) corresponds to the edges which have at least one pixel belonging to a static region. These edges are not considered for the segmentation of the current image $I_t$. Static regions are then segmented in the same way in the two successive images $I_{t-1}$ and $I_t$.

(ii) The second category (Figure 2(b)) corresponds to the edges that connect two non-changing pixels in changing regions. For these edges, we simply apply a connected component labelling (CCL) algorithm [31].

(iii) The third category (Figure 2(c)) corresponds to the edges which have at least one pixel that is considered as a changing one. These edges are processed using the merging order and the merging predicate defined in Section 2.2. Edges belonging to this category are denoted by $E_3$.

A minimal sketch of this edge classification is given below.
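The following C++ fragment, which reuses the PixelClass type of the computeCDM sketch above, shows how an edge can be dispatched into the three categories; the names are again illustrative placeholders.

enum EdgeCategory { SKIP = 0, CCL = 1, MERGE = 2 };

// Classify the edge (p1, p2) from the CDM labels of its two endpoint pixels.
EdgeCategory classifyEdge(int p1, int p2, const std::vector<PixelClass>& cdm)
{
    const PixelClass a = cdm[p1], b = cdm[p2];

    // (i) At least one endpoint lies in a static region: the edge is ignored
    // and the previous segmentation is kept there.
    if (a == STATIC_REGION || b == STATIC_REGION) return SKIP;

    // (iii) At least one endpoint is a changing pixel: the edge goes through
    // the merging order and the merging predicate of Section 2.2.
    if (a == CHANGING || b == CHANGING) return MERGE;

    // (ii) Both endpoints are non-changing pixels of changing regions:
    // they are simply grouped by connected component labelling.
    return CCL;
}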

Figure 3 describes the three steps corresponding to the process of the three categories of edges.

In Section 5, we propose the computation of an objective measure for temporal consistency. The measures obtained on real video sequences demonstrate a real improvement of temporal consistency. Moreover, the way we exploit the CDM also decreases the computational cost of the algorithm, since the edges in static areas are not reconsidered and those linking the non-changing pixels in changing areas are simply processed by a CCL algorithm.

When successive images are not correlated (in the case of a scene cut, e.g.), the set $E_3$ contains most of the edges of the image, which leads to a new spatial segmentation, as shown in the example of a shot cut given in Figure 11. Our algorithm thus naturally handles shot cuts and does not need to be combined with a shot-cut detection algorithm.

4. Implementation Considerations

In this section, we describe the optimisations that have been made to allow real-time processing. The whole video segmentation algorithm is summarised in Figure 4.

Apart from the merging loop, all other functions access pixel data in a predictable way (e.g., from top to bottom and left to right). The cache memory benefits from this regularity, since it exploits the spatial and temporal locality of data, and consequently causes fewer cache misses. In the merging loop, the access pattern of the union-find data structure is unpredictable and consequently causes significant data cache stalls. To reduce the data cache stall cycles, we investigate some optimisations that are detailed in the following sections, and we take benefit of the TriMedia processor to exploit the data level parallelism (DLP) and instruction level parallelism (ILP) of our algorithm.

4.1. Organisation of Data

Our organisation of data should allow an efficient computation of both our merge criterion (Section 2.2) and our union and find operations. Let us recall that, when using a union-find merging scheme, each region of the image is encoded by a spanning tree whose vertices are the pixels of the region. These tree data structures are usually encoded by storing, for each pixel, the index of its parent within the spanning tree. The information about the region is associated to the root of the tree, and both the roots and the region information are updated during a union operation.

Since our merge criterion only uses the mean colour and the cardinal of the regions, one simple organisation of our data would consist in associating each pixel $p$ with the fields $(\mathrm{father}(p), |R|, \overline{R})$, where $\mathrm{father}(p)$ denotes the father of $p$ within the tree, and $|R|$ and $\overline{R}$ denote the cardinal and the mean colour of its region.

However, grouping the region data and the father index would require manipulating the whole vector within find operations. Since only the father field is required by the find operation, such an organisation of the data would induce the storage of useless data within the cache memory.

We thus decided to store in two separate arrays the data required for the merge operations (namely the fields $(|R|, \overline{R})$) and the encoding of the trees. More precisely, our organisation of data is as follows:

(1) one array Data which stores, for each created region, its fields $(|R|, \overline{R})$;
(2) one array Father which encodes our sequence of union operations;
(3) one array Label, of the size of the image, initialised to a special flag indicating that each pixel is initially its own father.

If a region is reduced to a single pixel $p$, Label[$p$] is set to a special flag and the data of the region are retrieved directly from the image $I$. We thus decide to create a new entry within the array Data only if the associated region is composed of at least 2 pixels. More precisely, if a merge of two pixels $p_1$ and $p_2$ is decided by our merge criterion,

(1) a new entry with label $n$ is created within the array Data and initialised according to $I(p_1)$ and $I(p_2)$;
(2) Label[$p_1$] and Label[$p_2$] are set to $n$;
(3) Father[$n$] is set to a special flag indicating that $n$ has yet no father.

Our data structure is further updated in the two following cases.

(1) One pixel $p$ is aggregated to an already created region labelled by $n$. In this case, Label[$p$] is set to $n$ and Data[$n$] is updated according to $I(p)$. The array Father remains unchanged.
(2) Two already created regions with respective labels $n_1$ and $n_2$ are merged. In this case, one of the labels (say $n_1$) survives, Data[$n_1$] is updated according to Data[$n_2$], and Father[$n_2$] is set to $n_1$.

Figure 5(b) illustrates the state of our different data structures after the segmentation of Figure 5(a). Two pixels in Figure 5(a) are merged if they have the same label. In this example, we first considered horizontal edges between pixels and then vertical ones. Both horizontal and vertical edges have been considered using a scan line order. Note that the array Data is completely filled by the four regions created during the union operations. We only get three final regions as encoded by the array Father where all labels, except label 2, are their own father.

Since all regions encoded by the array Data are composed of at least 2 pixels, the maximal number of entries within this array is equal to $|I|/2$, where $|I|$ denotes the number of pixels of the image. Moreover, the vertices of the trees encoded by the array Father correspond to regions composed of at least 2 pixels. The maximal size of the array Father is thus also equal to $|I|/2$. Note that this upper bound may be reached if we first decompose the image into regions made of 2 adjacent pixels and then order the merges in such a way that the tree encoding the union of all these elementary regions is linear.

Note that when using such an organisation of data, all the required memory is allocated before union and find operations. We thus avoid the risk of a memory overflow.
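A compact C++ sketch of this organisation is given below. It is a simplified, hypothetical rendering of the three arrays (Label, Father, Data) for grayscale data, not the exact TriMedia implementation; in particular, the preallocation of Data and Father to $|I|/2$ entries is omitted for brevity.

#include <cstdint>
#include <vector>

constexpr int NO_REGION = -1;  // pixel has not yet been attached to a created region
constexpr int NO_FATHER = -1;  // region label is a root of its tree

struct RegionData { uint32_t sum; uint32_t card; };  // fields needed by the predicate

std::vector<int>        label;   // one entry per pixel, initialised to NO_REGION
std::vector<int>        father;  // one entry per created region (at most |I|/2)
std::vector<RegionData> data;    // one entry per created region (at most |I|/2)

// Find the surviving label of the region containing the created region n.
int findLabel(int n) {
    while (father[n] != NO_FATHER) n = father[n];
    return n;
}

// Merge the regions containing pixels p1 and p2 (of intensities i1 and i2).
void unionPixels(int p1, int p2, uint8_t i1, uint8_t i2) {
    if (label[p1] == NO_REGION && label[p2] == NO_REGION) {
        // Two isolated pixels: create a new region entry.
        data.push_back({uint32_t(i1) + i2, 2});
        father.push_back(NO_FATHER);
        label[p1] = label[p2] = int(data.size()) - 1;
    } else if (label[p2] == NO_REGION) {          // aggregate pixel p2 to the region of p1
        int n = findLabel(label[p1]);
        data[n].sum += i2; data[n].card += 1;
        label[p2] = n;
    } else if (label[p1] == NO_REGION) {          // aggregate pixel p1 to the region of p2
        int n = findLabel(label[p2]);
        data[n].sum += i1; data[n].card += 1;
        label[p1] = n;
    } else {                                      // merge two already created regions
        int n1 = findLabel(label[p1]), n2 = findLabel(label[p2]);
        if (n1 == n2) return;
        data[n1].sum += data[n2].sum; data[n1].card += data[n2].card;
        father[n2] = n1;
    }
}

With this layout, find operations only traverse the compact Father array, while the larger Data entries are touched only when a merge is actually performed.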

4.2. TriMedia Processor

We experimented with this data organisation on the TriMedia processor [32]. The cache memory of this particular TriMedia is 128 Kbyte, 4-way set associative, with 128-byte blocks. The replacement algorithm used is .

In order to increase the computational efficiency, we propose to take benefit of the data level parallelism (DLP) provided by our algorithm (computation of the edge weights, frame difference, classification of pixels in the CDM). This allows us to increase the throughput (i.e., the amount of pixels processed per unit time) by processing data in parallel when it is possible. The core of TriMedia is a VLIW architecture with 5 issue slots. Each slot has some functional units, and each functional unit can process 4 bytes in parallel (SIMD mode). The instruction level parallelism (ILP) is extracted by the compiler, while the DLP can be exploited through the use of custom operations, loop unrolling, and grafting. So we use these optimisations to exploit the DLP available in our algorithm.

5. Experimental Results

In this section, we present experimental results of our algorithm run on TriMedia with several well-known video sequences.

5.1. Spatial Results

The probability $\delta$ tunes the coarseness of the segmentation. In Figure 6, we show the influence of this parameter on the level of details obtained. This parameter is highly correlated to the number of segmented regions. A suitable value of this parameter provides a sufficient level of details for most of the video sequences we have considered. However, the chosen value and the associated level of details are highly dependent on the application. We can remark that this algorithm is able to segment very precisely small regions of interest such as the mouth or the eyes of “Akiyo”. It can also segment the different numbers of the calendar in the sequence “Mobile”. However, we can observe an over-segmentation of some textured regions such as the wall in the sequence “Table”. This is mainly due to the fact that assumption (1) is more adapted to the segmentation of flat regions. Our ongoing research is directed towards the design of a new merging criterion for the segmentation of textured regions.

As a comparison, we propose here some results obtained with two other well-known algorithms: the EGBIS algorithm [12] and the statistical region merging (SRM) algorithm of Nock and Nielsen [4]. These two algorithms are based on region-merging schemes with the same merging order as our method. The main difference between the three methods lies in the merging predicate. The results are displayed in Figure 7. For each algorithm, we have tuned the parameters in order to reach a segmentation that allows a good subjective representation of the elements of the image (numbers of the calendar, eyes of the woman, etc.). We can see on these examples that our real-time algorithm gives results comparable to those of the two other algorithms. This last point has been confirmed by other experiments that are not reported here. Our real-time implementation is thus achieved without detriment to the subjective quality of the results.

5.2. Spatiotemporal Results

In the experiments, the thresholds $T_{noise}$ and $T_{region}$ are fixed so that a region is declared to be a changing region as soon as it contains a small proportion of changing pixels. The values of these thresholds are the same for all the video sequences.

In order to see the influence of our temporal process, we show an example of segmentation results with and without the time-consistency improvement in Figures 8(c) and 8(b). We can see that the segmentation of the wall is the same for the two displayed frames of the video sequence “Table” when we use the time-consistency improvement.

We then propose to display the segmentation results along the video sequence “Akiyo” in Figure 9 and the video sequence “Paris” in Figure 10. We can observe that the method gives satisfying and stable results for these sequences.

We have also tested the robustness of our method in the case of a shot cut. The video sequence “Football” is followed by the video “BBC Disc”. Experimental results are given in Figure 11. We can observe that the spatial segmentation of the first frame of the video “BBC Disc” is not influenced by the spatial segmentation of the previous frame that belongs to the video “Football”. Indeed, in this case, most of the edges belong to the third category of edges ($E_3$), for which the predicate is recomputed.

5.3. Evaluation of Time Consistency

We use a classical measure to evaluate time consistency. Given the segmentation $S_{t-1}$ of the previous frame and the segmentation $S_t$ of the current one, we find a correspondence between regions in $S_{t-1}$ and $S_t$. For each region $R_i \in S_{t-1}$, we choose the region $R_j \in S_t$ that produces the most overlapping area $|R_i \cap R_j|$. We then sum the overlap measures for all the regions in $S_{t-1}$. The consistency measure is the percentage of this number with respect to the size of the image. The results for this measure are given in Table 1 for the video sequences “Akiyo”, “Table Tennis”, “Paris”, and “Mobile”. When enforcing consistency through the CDM, time consistency is higher, and visually, the segmentation is more stable from frame to frame and still fits region boundaries very well, as shown in Figures 9 and 10. We can also see that the time consistency of the spatial segmentation algorithm SRM [4] is roughly equivalent to the time consistency of our spatial algorithm without computation of the CDM.
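A straightforward C++ sketch of this overlap-based measure is given below, assuming the two segmentations are given as per-pixel label maps of the same size; the function name temporalConsistency is an illustrative placeholder.

#include <map>
#include <utility>
#include <vector>

// Temporal consistency: for each region of the previous segmentation, keep the
// largest overlap with a region of the current one, sum these overlaps, and
// normalise by the image size (result in [0, 1], reported as a percentage).
double temporalConsistency(const std::vector<int>& prevLabels,
                           const std::vector<int>& curLabels)
{
    // overlap[{i, j}] = number of pixels labelled i in S_{t-1} and j in S_t.
    std::map<std::pair<int, int>, int> overlap;
    for (size_t p = 0; p < prevLabels.size(); ++p)
        ++overlap[{prevLabels[p], curLabels[p]}];

    // For each previous region, keep the best-matching current region.
    std::map<int, int> bestOverlap;
    for (const auto& kv : overlap) {
        int& best = bestOverlap[kv.first.first];
        if (kv.second > best) best = kv.second;
    }

    long long matched = 0;
    for (const auto& kv : bestOverlap) matched += kv.second;
    return double(matched) / double(prevLabels.size());
}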

5.4. Evaluation of the Computational Cost

In this section, we give the number of Mcycles the algorithm takes on TriMedia for different resolutions and different versions of our algorithm. We compare the spatial computational cost with the one obtained using the Nock algorithm [4].

The computational cost has been evaluated as a function of the image size in Figure 12. In this figure, the computational cost (in Mcycles) has been computed for one image of the video “Akiyo” at different resolutions (QCIF, CIF, SD, and two other resolutions). This computation has been performed with and without the optimisations described in Section 4.2. First, the results given in Figure 12 show that the complexity is approximately linear with respect to the image size. Indeed, the spatial computational cost is principally induced by the union-find algorithm and the edge sorting. As explained in Section 2.3, the sorting step is performed in linear time. As far as the union-find algorithm is concerned, the complexity is $O\big((U + F)\,\alpha(U)\big)$, where $U$ is the number of union operations, $F$ is the number of find operations, and $\alpha$ is a very slowly growing function [25]. Since the number of find operations can be upper-bounded by $k|I|$, where $k$ is a constant and $|I|$ the number of pixels, the complexity at worst can be approximated by $O\big(|I|\,\alpha(|I|)\big)$, which gives an almost linear complexity. This assessment is confirmed by the experimental results given in Figure 12.

We then compare the computational cost of our algorithm to that of the SRM algorithm [4]. The main difference between the two spatial algorithms lies in the computation of the predicate. The predicate of SRM leads to a higher computational cost, as demonstrated in Table 2. Our algorithm gives a lower computational cost even without optimisations. When including these improvements, the computational cost decreases further. In Table 2, we also give the number of Mcycles the algorithm takes on TriMedia when enforcing the temporal consistency. The exploitation of the CDM reduces the computational cost. This reduction depends on the correlation between two successive frames.

With a 450 MHz TriMedia, the obtained throughput exceeds the sequence frame rate. We can then conclude that our algorithm runs in real time for QCIF or CIF sequences.

6. Discussion

Designing usable algorithms for video processing requires methods with a low computational cost. Directed by this constraint, we have proposed an efficient time-consistent algorithm for video segmentation. Let us discuss the strengths and limitations of our algorithm regarding the three main points of this work.

(i) Spatial segmentation: we propose here an alternative statistical modelling to the work of Nock and Nielsen [4]. This leads to a simpler merging predicate that is more adapted to a real-time implementation and gives good results for the spatial segmentation. However, as in [4], such a statistical model is dedicated to the segmentation of flat regions and may produce an over-segmentation on textured areas of an image.

(ii) Temporal consistency: the proposed algorithm allows us to obtain both stable segmentation results and a reduction of the computational cost. This method is based on the use of a CDM and of region information deduced from the previous frame. Regions are not linked from one frame to another, leading to a video segmentation algorithm that is robust to scene cuts and occlusions. However, if this algorithm had to be exploited for video object tracking, region matching would be useful. It could be obtained by comparing regions of two consecutive frames using statistical inequalities.

(iii) Hardware implementation: our algorithm runs in real time for CIF sequences. For standard definition (SD) or high definition (HD) sequences, some further efforts are needed. In order to obtain a real-time implementation, we have directed our attention to the parallelisation by blocks of the spatial segmentation. However, we are still investigating this part, notably the merging of the different spatial segmentations obtained for the different blocks. This last step remains delicate.

We finally want to point out that such a real-time video segmentation algorithm would help many video algorithms by leading to a better comprehension of the image content. Among applications, we can think of temporal conversion, peaking (also named unsharp masking), video compression, or deinterlacing. The region segmentation algorithm can be exploited directly, using region boundaries and region colour properties, or as a source of information on the image content (level of noise, complexity of the scene, main colours), which can be exploited to better design existing algorithms [33]. Our ongoing research is also directed to the design of such region-based algorithms for electronic devices (e.g., set-top boxes).

Acknowledgments

The authors would like to thank Patrick Meuwissen, O. P. Gangwal, and Zbigniew Chamski for their constructive suggestions. We also would like to thank the reviewers for their very useful comments and suggestions.