We propose a time-consistent video segmentation algorithm designed for real-time implementation.
Our algorithm is based on a region merging process that combines both spatial and motion information.
The spatial segmentation benefits from an adaptive decision rule and a specific order of merging.
Our method has proven to be efficient for the segmentation of natural images with few parameters to be set.
Temporal consistency of the segmentation is ensured by incorporating motion information through the use of an improved change-detection mask. This mask is designed using both illumination differences between frames and region segmentation of the previous frame. By considering both pixel and region levels, we obtain a particularly efficient algorithm at a low computational cost, allowing its implementation in real-time on the TriMedia processor for CIF image sequences.
1. Introduction
The segmentation of each frame of a video into
homogeneous regions is an important issue for many video applications such as
region-based motion estimation, image enhancement (since different processing
may be applied to different regions), and 2D to 3D conversion. These applications require two main features from segmentation: accuracy of region boundaries in
the spatial segmentation and temporal stability of the segmentation from frame
to frame.
As far as spatial segmentation is concerned, it can be
classified into two main categories, namely, contour-based and region-based
methods. In the first category, edges are computed and connected components are
extracted [1]. One of
the drawbacks of such an approach is that the computation of the gradient is
prone to large errors especially on noisy images. Moreover, the closure of the
edges in order to create connected regions is a difficult task and an efficient
resolution of such a problem may induce cumbersome computations. Finally, such
an approach cannot exploit the statistical properties of the considered image regions. The region-based segmentation methods avoid these
drawbacks by considering regions as basic elements. Among region-based
segmentation methods [2–6], we are interested here in a
bottom-up segmentation approach where regions are grown using a merging
process. In such approaches, similar neighbouring regions are merged according
to a decision rule [7, 8]. The initial regions can be the pixels or an
over-segmentation of the image which can be obtained by a watershed algorithm
[9, 10]. As mentioned by [11], bottom-up algorithms rely
on three notions: a model for the description of a region, a merging predicate,
and a merging order. This gives rise to numerous heuristics according to the
different choices performed on these three steps [4, 7, 12–14]. Compared to other classical approaches, for example,
[7, 12, 13], the authors of [4] have recently proposed an adaptive threshold justified
by statistical inequalities. They obtain good results with few parameters to
tune. However, in the context of a real-time implementation, their merging
predicate still requires too many computations. Moreover, their algorithm is
dedicated to the segmentation of still images and so, it does not take into
account the temporal dimension of video sequences.
When dealing with video segmentation, various
algorithms have been tested in the literature. The first class of approaches
proposes to perform a 3D segmentation by considering the spatiotemporal data as
a volume. We can cite the work of [15] that exploits the 3D structure tensor for
segmentation. Some other recent works propose 3D approaches using a
mean-shift-based analysis [16, 17]. Let us note that if each shot is segmented as a 3D
volume, the number of frames to store for each segmentation may be unbounded.
On the other hand, if the number of stored frames is artificially limited by
the available memory, some 3D regions may be artificially split on long shots.
Therefore, 3D approaches require the storage of several frames in memory and
necessitate a high bandwidth which is a drawback for the design of electronic
devices.
The second class of methods concerns frame-by-frame
algorithms. In these approaches, the spatial segmentation of the second frame
is deduced from the spatial segmentation of the first frame using motion
estimation [13, 18–20]. Regions from adjacent frames are then merged
according to motion similarity, colour similarity, or localisation similarity.
In such approaches, a matching is performed between regions of the different
frames. All the regions are then linked and video object tracking algorithms
[20] may then benefit from such a correspondence between regions.
On the other hand, some applications, such as image
enhancement or video compression, may need a coherent segmentation between
frames without requiring an exact tracking of each region from frame to frame.
In this paper, we propose a segmentation algorithm devoted to such
applications. The primary aim of our algorithm is thus not to match the regions
of two consecutive frames but only to exploit the spatial segmentation
of the first frame in order to construct a coherent spatial segmentation of the
second one.
Our contributions may be divided into three points.
(i) Spatial segmentation: our spatial segmentation benefits from both an adaptive decision rule and an original order of merging. As in [4], the adaptive threshold is computed using a statistical model of the region combined with the statistical inequality of McDiarmid [21]. However, in our approach, each pixel is modelled as a single random variable (in [4], the authors model each pixel as a sum of random variables). This gives a simpler predicate that is better suited to real-time implementation. Good results are obtained for spatial segmentation with few parameters to be set.
(ii) Temporal consistency: another contribution is the design of a region segmentation that does not undergo strong variations over time. We propose to simply use scene-change detection, which is widely used in video segmentation [22–24], rather than motion estimation, which remains a real bottleneck for real-time implementation. We construct a coherent segmentation from frame to frame by combining both pixel and region information through the use of an improved change detection mask (CDM) that exploits the region segmentation of the previous frame. Experimental results conducted on real video sequences demonstrate a good temporal consistency.
(iii) Hardware implementation: as far as the implementation is concerned, we exploit the data level parallelism (DLP) by performing some basic operations in parallel. Moreover, the classical union-find data structure [25] is improved by using local registers to reduce the access time of find operations. We obtain an efficient algorithm for video segmentation at a low computational cost. Our method runs in real time on the TriMedia processor for CIF image sequences.
The paper is organised as follows. The spatial
segmentation method is detailed in Section 2. The temporal consistency
improvement is explained in Section 3. In Section 4, we discuss the
implementation of the algorithm. Experimental results and measures are given in
Section 5.
2. Spatial Segmentation
Let us consider an image $I$; the notation $|\cdot|$ represents the
cardinal and $I(p, n)$ the pixel
intensity at position $p$ in the frame $n$.
A region-based segmentation problem aims at finding a
relevant partition of the image domain into regions $R_1, \dots, R_k$. We focus here on region-merging algorithms where a
decision criterion determines whether two regions must be merged or not. In
this paper, we first introduce a statistical model for the regions. We then
detail how these statistical tools are used for the computation of the merging
predicate. We finally explain the whole merging algorithm and especially the
order of merging.
2.1. Statistical Model
Images are corrupted by noise, which makes pixel intensities random. Due
to this random part in image acquisition systems, an image $I$ is classically
considered to be an observation of a perfect statistical image $I^*$. The intensity of a pixel is then
modelled as the observation of a random variable (r.v.) whose values
belong to the interval $[0, g]$ (e.g., $g = 255$ for 8-bit images).
An ideal region $R^*$ of $I^*$ is then represented
by a vector of independent r.v. $X = (X^1, \dots, X^{|R|})$. Let us denote by $R$ the real region
associated to $R^*$, that is, composed of the same set of pixels as $R^*$. The intensity of the $i$th pixel of $I$ within $R$ is then
considered as an observation of the r.v. $X^i$. Following [4], we define a partition of $I^*$ into
homogeneous regions by the
following requirements:
(1) all the pixels of any statistical region should have the same expectation;
(2) two adjacent pixels belonging to different statistical regions should have different expectations.
Such a definition may be easily extended to
multichannel images [4] by requiring that the pixel expectations are equal on
each channel within one region and that the expectation of at least one channel
differs between pixels belonging to different regions.
Note that according to our definition, all the pixels
of one region should have the same expectation. The regions extracted by a
segmentation algorithm based on this definition should thus be composed of
pixels with a nearly constant intensity (we thus assume an underlying flat
facet model). This criterion may be justified by the reflective properties of
surfaces. Indeed, the reflection of light on a surface is determined by a
Lambertian and a specular component [26]. The specular component produces specular spikes,
often characterised by regions with a nearly maximal intensity. The specular component
decreases abruptly and may be neglected, within a segmentation scheme, outside
the specular spikes. The intensity of a Lambertian surface varies slowly
according to its normals. A region of the image with a nearly constant value
thus corresponds either to a specular spike or to a Lambertian surface with an
almost constant normal. Such a segmentation scheme thus provides a partition
which summarises the main physical and geometrical properties of a 3D scene.
Higher-level processes such as the segmentation of the image into objects or
the segmentation of textured objects [27] would require providing the algorithm with a priori
knowledge about the expected objects of the scene or about what a textured
area is.
For the sake of completeness, let us now recall the
statistical inequality proposed by [21] and introduced within the region segmentation framework by [4]. We use this inequality for the computation of the merging predicate.
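Theorem 1 (McDiarmid's inequality). If $X^1, \dots, X^n$ are independent random variables whose observations take their
values in a measurable space $A$ and $f : A^n \to \mathbb{R}$ is a function that satisfies the following constraint for $1 \le k \le n$:
$$\sup_{x^1, \dots, x^n,\, x^{k\prime} \in A} \left| f(x^1, \dots, x^k, \dots, x^n) - f(x^1, \dots, x^{k\prime}, \dots, x^n) \right| \le c_k,$$
where $x^k$ and $x^{k\prime}$ are two
different possibilities for the $k$th component of an
observation vector $x$. Then for every $\epsilon > 0$, with $X = (X^1, \dots, X^n)$,
$$P\left( \left| f(X) - E[f(X)] \right| \ge \epsilon \right) \le 2 \exp\left( \frac{-2\epsilon^2}{\sum_{k=1}^{n} c_k^2} \right).$$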
2.2. Merging Predicate
In order to compute a merging predicate, we consider
two regions $R_1$ and $R_2$ of a current
partition. The associated vectors of r.v. in the ideal image $I^*$ are
respectively denoted by $X_1$ and $X_2$. The r.v. $\bar{X}_1$ and $\bar{X}_2$ denote
respectively the means of $X_1$ and $X_2$. We suppose that $R_1^*$ and $R_2^*$ belong to the
same homogeneous region of $I^*$. Our default decision rule is thus to merge the
two regions $R_1$ and $R_2$, respectively associated to $X_1$ and $X_2$. However, under the hypothesis that $R_1^*$ and $R_2^*$ are included in
the same homogeneous region of $I^*$, the probability that $|\bar{X}_1 - \bar{X}_2|$ is greater than
a given value is bounded by Theorem 1. If this probability falls under a
given threshold, we refuse the hypothesis and thus do not merge the two regions $R_1$ and $R_2$.
More precisely, let us consider the vector
$$X = \left( X_1^1, \dots, X_1^{|R_1|}, X_2^1, \dots, X_2^{|R_2|} \right)$$
and the mean functions
$$\bar{X}_1 = \frac{1}{|R_1|} \sum_{i=1}^{|R_1|} X_1^i, \qquad \bar{X}_2 = \frac{1}{|R_2|} \sum_{j=1}^{|R_2|} X_2^j.$$
Our merging decision rule is based on the following theorem.
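Theorem 2. Let one consider two vectors of r.v. $X_1$ and $X_2$ encoding the
intensities of two connected regions of an ideal image $I^*$. Under the hypothesis that $R_1^*$ and $R_2^*$ are included
in the same homogeneous region and using the previously defined notations, one has, for every $\epsilon > 0$,
$$P\left( |\bar{X}_1 - \bar{X}_2| \ge \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{g^2} \, \frac{|R_1|\,|R_2|}{|R_1| + |R_2|} \right),$$
where $g$ denotes the maximum intensity level and $|R_i|$ denotes the size of vector $X_i$ (i.e. the
cardinal of the associated region $R_i$).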
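Proof. Let us consider the vector $x = (x_1^1, \dots, x_1^{|R_1|}, x_2^1, \dots, x_2^{|R_2|})$ in $[0, g]^{|R_1| + |R_2|}$. This vector may be considered as an outcome of the
r.v. $X$. In order to apply the McDiarmid theorem we define
the following function:
$$f(x) = \bar{x}_1 - \bar{x}_2,$$
where $\bar{x}_1 = \frac{1}{|R_1|} \sum_{i=1}^{|R_1|} x_1^i$ and $\bar{x}_2 = \frac{1}{|R_2|} \sum_{j=1}^{|R_2|} x_2^j$.
Let us compute the variation of the function. If we change the intensity of one $x_1^k$ into $x_1^{k\prime}$, with $1 \le k \le |R_1|$, and denote by $x'$ the modified vector, we have
$$|f(x) - f(x')| = \frac{|x_1^k - x_1^{k\prime}|}{|R_1|} \le \frac{g}{|R_1|}.$$
This gives us the value $g/|R_1|$ of the bounding coefficients for the first $|R_1|$
variables. Similarly, if we change the intensity of one $x_2^k$ we obtain the coefficient $g/|R_2|$. We then compute the sum over all the
variables:
$$\sum_{k=1}^{|R_1| + |R_2|} c_k^2 = |R_1| \frac{g^2}{|R_1|^2} + |R_2| \frac{g^2}{|R_2|^2} = g^2 \, \frac{|R_1| + |R_2|}{|R_1|\,|R_2|}. \qquad (10)$$
Moreover, according to our hypothesis, if $R_1^*$ and $R_2^*$ belong to the
same homogeneous region of $I^*$, all the pixels of $R_1^*$ and $R_2^*$ have the same
expectation. We thus have $E[f(X)] = E[\bar{X}_1] - E[\bar{X}_2] = 0$, and we obtain
the expected result using conjointly Theorem 1 and (10).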
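Note that the bound on the probability provided by
Theorem 2 may be equivalently represented by
$$\delta = 2 \exp\left( -\frac{2\epsilon^2}{g^2} \, \frac{|R_1|\,|R_2|}{|R_1| + |R_2|} \right).$$
After some basic calculus we find that, under the assumption that $R_1^*$ and $R_2^*$ are included
in the same homogeneous region of $I^*$, we have $|\bar{X}_1 - \bar{X}_2| \ge \epsilon_\delta$ with a probability at most $\delta$,
with
$$\epsilon_\delta = g \sqrt{ \frac{1}{2} \ln\frac{2}{\delta} \; \frac{|R_1| + |R_2|}{|R_1|\,|R_2|} }.$$
Below the probability $\delta$, which is supposed to be low, we consider that the
event $|\bar{X}_1 - \bar{X}_2| \ge \epsilon_\delta$ is not
probable. In this case, we refuse the initial hypothesis stating that $R_1^*$ and $R_2^*$ belong to the same homogeneous region of $I^*$ and thus do not merge the two regions. Our merging predicate may thus be stated as
follows:
$$P(R_1, R_2) = \begin{cases} \text{true} & \text{if } |\bar{R}_1 - \bar{R}_2| \le g \sqrt{ \dfrac{1}{2} \ln\dfrac{2}{\delta} \; \dfrac{|R_1| + |R_2|}{|R_1|\,|R_2|} }, \\ \text{false} & \text{otherwise}, \end{cases} \qquad (13)$$
where $\bar{R}_1$ and $\bar{R}_2$ denote
respectively the values of $\bar{X}_1$ and $\bar{X}_2$ for the
observation $x$. These two terms represent the mean value of the two
regions $R_1$ and $R_2$. The term $g$ denotes the
maximum level of $I$ ($g = 255$ for gray-scale images).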
Note that our merge criterion is equivalent to
$$\frac{|R_1|\,|R_2|}{|R_1| + |R_2|} \left( \bar{R}_1 - \bar{R}_2 \right)^2 \le \frac{g^2}{2} \ln\frac{2}{\delta}.$$
The left member of this last equation corresponds to the difference between the squared error of $R_1 \cup R_2$ and the sum of the squared errors of $R_1$ and $R_2$ [28]. Our merge criterion may
thus also be interpreted as a bound on the increase of the squared errors of
the regions.
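As a minimal illustration, the merging test can be sketched in Python. The sketch assumes the threshold obtained by applying the two-sided McDiarmid bound to the difference of the region means, that is, $g\sqrt{\tfrac{1}{2}\ln(2/\delta)\,(|R_1|+|R_2|)/(|R_1||R_2|)}$. The function names and the default value $g = 255$ are illustrative choices, not names taken from the original implementation.

```python
import math


def merge_threshold(n1, n2, delta, g=255.0):
    """Largest admissible gap between two region means (assumed threshold form)."""
    return g * math.sqrt(0.5 * math.log(2.0 / delta) * (n1 + n2) / (n1 * n2))


def should_merge(mean1, n1, mean2, n2, delta, g=255.0):
    """Merging predicate: accept the merge when the means are close enough."""
    return abs(mean1 - mean2) <= merge_threshold(n1, n2, delta, g)


def squared_error_increase(mean1, n1, mean2, n2):
    """Increase of the squared error caused by merging the two regions."""
    return n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2


def squared_error_bound(delta, g=255.0):
    """Right-hand side of the equivalent squared-error form of the criterion."""
    return 0.5 * g * g * math.log(2.0 / delta)
```

Only the mean and the cardinal of each region are needed, so the test can be evaluated in constant time per edge. Comparing `squared_error_increase` with `squared_error_bound` gives the same decision as `should_merge` while avoiding the square root.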
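Our criterion may be adapted to multichannel images as
follows:
$$P(R_1, R_2) = \begin{cases} \text{true} & \text{if } \max\limits_{c \in C} \dfrac{|\bar{R}_1^c - \bar{R}_2^c|}{g_c} \le \sqrt{ \dfrac{1}{2} \ln\dfrac{2}{\delta} \; \dfrac{|R_1| + |R_2|}{|R_1|\,|R_2|} }, \\ \text{false} & \text{otherwise}, \end{cases}$$
where $\bar{R}_i^c$ represents the mean value of the region $R_i$ for the channel $c$ taken in the set of channels $C$ and $g_c$ denotes the
maximum value on channel $c$. We take the maximum of the values obtained for each
channel as a criterion. Indeed, if the predicate is true, it will be true for
all the channels and so the merge hypothesis is accepted. In this paper, we
have chosen the YUV space, which is the native colour space of video sequences.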
Both our method and the one of Nock [4] are based on the McDiarmid inequality. However, Nock models each pixel of the ideal image as a sum of random
variables whereas our method only uses one r.v. per pixel. The approach
proposed by Nock consists to fix the probability and to use in order to
vary the merge threshold. To our point of view, the probability below which we
refuse the merge hypothesis has a more straightforward interpretation than the
variable . The resulting criteria are slightly different, our
criterion differs by a factor from the one
first proposed by Nock. Our criterion is also significantly different from the
second Nock criterion which uses an estimate of the number of final regions
whose cardinal is equal to a given value. However, both our criterion and the
final Nock criterion may be related, our one being more strict than the one of
Nock [4] for a given
probability .
Let us note finally that the way we derived our
criterion provides an alternative explanation to the eventual over-merging
produced both by our algorithm and the one of Nock. Indeed, our basic
hypothesis consists to suppose that and belong to the
same homogeneous region of As in a contrario approaches first introduced by
[29], we refuse this
hypothesis only when we observe an event which has a low probability (according
to ) to occur
under this hypothesis. We may thus merge regions corresponding to different homogeneous
regions of if our
observation does not contradict our hypothesis.
2.3. Merging Order
An edge denotes a
couple of adjacent pixels in a
4-connectivity scheme. The set of edges of an image is denoted by and the number
of edges by The order of merging is built on the edges weights
as in [4, 12]. The idea behind this order
of merging is to merge first similar regions rather than different ones. The
similarity between pixels is measured by computing the distance between two
pixel colours as follows:
For colour images, the edge weight becomes
where
denote the three channels of a particular colour space.
Note that alternative weights may be designed. For
example, one may balance the distance along each axis of a colour space by some
weight (or equivalently scale each axis according to its weight). Numerous
colour space with different properties may be chosen in (17). For our algorithm, we consider the colour space which is the native colour space of CIF sequences. The colour space provides
partitions with a little greater subjective quality but with a higher
computational cost.
The edges are sorted in increasing order of their
weights and the corresponding couples of pixels are processed in this order for
merging. This sorting step only requires two traversals of the image: the first
traversal computes the histogram of edge weights, and the second traversal
stores each edge in an array associated to its weight. The amount of memory
required for each array is deduced from the histogram of edge weights. This
sorting step is similar to the one usually used within the watershed algorithm
[9].
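The two-traversal sort described above is essentially a counting sort. The following Python sketch illustrates it under the assumption that edge weights are integers between 0 and `max_weight`; instead of one array per weight, it uses a single output array in which each weight owns a contiguous slice, which is an equivalent layout. The function and parameter names are hypothetical.

```python
def sort_edges_by_weight(edges, weights, max_weight=255):
    """Return the edges ordered by increasing integer weight (stable)."""
    # First traversal: histogram of the edge weights.
    hist = [0] * (max_weight + 1)
    for w in weights:
        hist[w] += 1

    # The histogram gives the size of each bucket, hence its start offset.
    start = [0] * (max_weight + 1)
    for w in range(1, max_weight + 1):
        start[w] = start[w - 1] + hist[w - 1]

    # Second traversal: store each edge in the slice reserved for its weight.
    ordered = [None] * len(edges)
    cursor = list(start)
    for edge, w in zip(edges, weights):
        ordered[cursor[w]] = edge
        cursor[w] += 1
    return ordered
```

The cost is linear in the number of edges plus the number of possible weights, and all memory is allocated once the histogram is known.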
2.4. Merging Algorithm
Our spatial segmentation can be divided into three steps. In the first one, we compute the weights of edges and their histogram.
In the second step, we sort edges increasingly according to their weights. In
the last step, we merge pixels or regions connected by edges following their
order. Algorithm 1 describes more
particularly the merging loop.
Algorithm 1: Merging
regions algorithm.
The term represents the
number of edges within the image in the
4-connectivity. In the merging process, we use the union-find data structure
[25]. The union
function merges two disjoint regions into one region, and the find function
identifies the region to which a certain pixel belongs. Implementation details
are given in Section 4.
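The merging loop can be sketched as follows for a gray-level image given as a list of rows. This is a minimal sketch, not the original Algorithm 1: it assumes the edge weight is the absolute intensity difference, uses Python's built-in sort in place of the histogram-based sort, and assumes the threshold $g\sqrt{\tfrac{1}{2}\ln(2/\delta)\,(|R_1|+|R_2|)/(|R_1||R_2|)}$ for the merging predicate. All names are illustrative.

```python
import math


def segment(image, delta=0.1, g=255.0):
    """Region merging on a 2D gray-level image; returns a label per pixel."""
    h, w = len(image), len(image[0])
    parent = list(range(h * w))            # union-find forest
    size = [1] * (h * w)                   # cardinal of each region (at roots)
    total = [float(v) for row in image for v in row]  # intensity sums (at roots)

    def find(p):
        # Find the root of p, shortening the path on the way (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Edges of the 4-connectivity, weighted by intensity difference.
    edges = []
    for y in range(h):
        for x in range(w):
            p = y * w + x
            if x + 1 < w:
                edges.append((abs(image[y][x] - image[y][x + 1]), p, p + 1))
            if y + 1 < h:
                edges.append((abs(image[y][x] - image[y + 1][x]), p, p + w))
    edges.sort()                           # most similar pixels first

    q = 0.5 * math.log(2.0 / delta)
    for _, p1, p2 in edges:
        r1, r2 = find(p1), find(p2)
        if r1 == r2:
            continue                       # already in the same region
        n1, n2 = size[r1], size[r2]
        gap = abs(total[r1] / n1 - total[r2] / n2)
        if gap <= g * math.sqrt(q * (n1 + n2) / (n1 * n2)):
            parent[r2] = r1                # union: r1 becomes the root
            size[r1] = n1 + n2
            total[r1] += total[r2]
    return [[find(y * w + x) for x in range(w)] for y in range(h)]
```

Each edge is visited exactly once, so apart from the sort the loop performs two find operations and at most one union per edge.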
3. Time Consistency Improvement
In video segmentation, the quality of the spatial segmentation is not the only
requirement; time consistency is also very important. If, in two
successive frames, one region is segmented very differently because of noise,
occlusion or deocclusion, the segmentation results would be very difficult to exploit
for applications such as image enhancement, depth estimation, and motion
estimation. Many works, see for example [19], use motion estimation to improve time consistency in
video segmentation. However, motion estimation [30] is a real bottleneck for real-time implementation and is sometimes even unreliable. In this paper, we combine an improved change detection mask with spatial
segmentation in order to improve the temporal consistency of our segmentation.
3.1. Change Detection Mask
The CDM is designed using both illumination differences between frames and region segmentation of the previous frame.
We first detect changing pixels using the frame
difference. Then, we use the region segmentation of the previous
frame in order to classify the pixels not only at a pixel level but also at a
region level.
Given the current frame and the
previous one
the frame difference is given by
Classically, is thresholded
in order to distinguish changing pixels from noise. The pixel label is given
by
where is a positive constant chosen according to the noise level of the image. This threshold may
be set experimentally (Section 5) or estimated according to any measure of the image noise. A pixel , with , is considered as a changing pixel. We then use the
previous segmentation in order to convert the from the pixel
level to a region level which is more reliable [23]. For each region in the previous
segmentation, we compute :
which denotes the number of changing pixels of the current image whose coordinates
belong to in the previous
segmentation. We then compute which
represents the ratio of changing pixels between the previous and the current
image in the region Pixels are then classified using three categories:
where is a positive
constant. In the experiments, we take (i.e. a region is a changing region when it contains at least of changing pixels). The value of the threshold is chosen so that we do not miss any changing region.
Every pixel of regions qualified as static is labelled
using . The two other labels concern pixels within changing
regions. Depending on the value of the frame difference, the pixel is qualified
as a changing one or as a one . Such a classification is then used to segment the current frame. An example of
classification is given in Figure 1 for the video sequence “Table”.
Figure 1: Computation of the using the
difference between the current image and the previous one and the region
segmentation of the previous frame.
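The construction of the mask can be sketched in Python as follows. This is a sketch under stated assumptions: frames and the previous label map are lists of rows, the frame difference is the absolute intensity difference, and a region is declared changing when its ratio of changing pixels reaches `region_thr`. The three label constants, the function name and the parameter names are hypothetical placeholders for the symbols used in the text.

```python
STATIC_REGION = 0    # pixel of a region qualified as static
STATIC_PIXEL = 1     # non-changing pixel inside a changing region
CHANGING_PIXEL = 2   # changing pixel inside a changing region


def change_detection_mask(current, previous, prev_labels, pixel_thr, region_thr):
    """Classify every pixel using the frame difference and the previous regions."""
    h, w = len(current), len(current[0])

    # Pixel level: threshold the absolute frame difference.
    changing = [[abs(current[y][x] - previous[y][x]) > pixel_thr
                 for x in range(w)] for y in range(h)]

    # Region level: count changing pixels inside each previous region.
    count, area = {}, {}
    for y in range(h):
        for x in range(w):
            r = prev_labels[y][x]
            area[r] = area.get(r, 0) + 1
            count[r] = count.get(r, 0) + int(changing[y][x])

    mask = [[STATIC_REGION] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            r = prev_labels[y][x]
            if count[r] / area[r] >= region_thr:      # changing region
                mask[y][x] = CHANGING_PIXEL if changing[y][x] else STATIC_PIXEL
    return mask
```

Only frame differences and per-region counters are involved, so no motion estimation is required at this stage.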
3.2. Merging Process
The merging process is now divided in three main
steps. Firstly, static regions are kept as they were segmented in the previous
frame. Secondly, we apply a connected component labelling (CCL) algorithm
[31] to extract
connected components of pixels with . This second step builds seeds from the segmentation
of the previous frame. These seeds link the current segmentation to the
previous one in a time-consistent way. Thirdly, we apply the spatial
segmentation only on edges connecting a
changing pixel within a changing region to a pixel
belonging to a changing region. This last pixel may be either changing or
static. Note that
static pixels within changing regions have been connected in the second step by
a CCL algorithm.
The whole process can be formalised as follows.
Considering an edge between two pixels, we define the following function:
The function allows
us to classify the edges in the following three categories (a brief summary is
provided by Figure 2).
Figure 2: The figure
gives the different combinations of pixels available for each category. The
pixels are designed as follows : black pixel , gray pixel , white pixel
.
(i) The first category (Figure 2(a)) corresponds to the edges which have at least one pixel belonging to a static region. These edges are not considered for the segmentation of the current image . Static regions are then segmented in the same way between two successive images and .
(ii) The second category (Figure 2(b)) corresponds to the edges that connect two non-changing pixels in changing regions. For these edges, we simply apply a connected component labelling (CCL) algorithm [31].
(iii) The third category (Figure 2(c)) corresponds to the edges which have at least one pixel that is considered as a changing one (i.e. ). These edges are processed using the merging order and the merging predicate defined in Section 2.2. Edges belonging to this category are
denoted by
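The three categories can be illustrated by a small Python function that classifies an edge from the mask labels of its two pixels. The label constants and the category numbers 1 to 3 are hypothetical placeholders; the priority given to the first category follows the rule that the spatial segmentation is only applied to edges whose two pixels both belong to changing regions.

```python
STATIC_REGION = 0    # pixel of a region qualified as static
STATIC_PIXEL = 1     # non-changing pixel inside a changing region
CHANGING_PIXEL = 2   # changing pixel inside a changing region


def edge_category(label_p, label_q):
    """Return 1, 2 or 3 according to the mask labels of the two edge pixels."""
    if label_p == STATIC_REGION or label_q == STATIC_REGION:
        return 1     # edge skipped: the previous segmentation is kept
    if label_p == CHANGING_PIXEL or label_q == CHANGING_PIXEL:
        return 3     # edge processed with the merging order and predicate
    return 2         # two non-changing pixels: connected component labelling


def split_edges(edges, mask_labels):
    """Group (p, q) edges by category; mask_labels maps a pixel to its label."""
    groups = {1: [], 2: [], 3: []}
    for p, q in edges:
        groups[edge_category(mask_labels[p], mask_labels[q])].append((p, q))
    return groups
```

Only the edges of the third group need to be sorted and tested with the merging predicate, which is where the reduction in computational cost comes from.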
Figure 3 describes the three steps corresponding to the
process of the three categories of edges.
Figure 3: Description of the three steps of the segmentation
process for the video “Table". (a) Gives the different values of the
CDM. (b), (c), and (d) describe the evolution of the process of the edges with
respectively and In these three last figures, black pixels are pixels
that have not yet been classified, whereas white pixels correspond to region
boundaries found at each step.
In Section 5, we propose the computation of an objective measure for temporal consistency. The measures obtained on real video sequences demonstrate a real
improvement of temporal consistency. Moreover, the way we exploit the CDM also decreases
the computational cost of the algorithm, since the edges in static areas are not
reconsidered and those linking non-changing pixels in changing areas are
simply processed by a CCL algorithm.
When successive images are not correlated (e.g., in the case
of a scene cut), the set of third-category edges contains most of the edges of the image, which leads to a new spatial segmentation as shown in the example of a shot cut given in Figure 11. Our algorithm thus
naturally handles shot cuts and does not need to be combined with a shot cut
detection algorithm.
4. Implementation Considerations
In this section, we describe the optimisations
that have been made to allow real-time processing. The whole video
segmentation algorithm is summarised in Figure 4.
Figure 4: The general diagram of video segmentation.
Apart from the merging loop, all other functions
access pixel data in a predictable way (e.g., from top to bottom and left to
right). The cache memory benefits from this regularity, since it exploits
spatial and temporal locality of data, and consequently causes fewer cache
misses. In the merging loop, the access pattern of the union-find data structure is unpredictable,
and consequently causes many data cache stalls. To reduce the data
cache stall cycles, we investigate some optimisations that are detailed in the
following sections and we use the TriMedia processor to exploit the
data level parallelism (DLP) and instruction level parallelism (ILP) of our
algorithm.
4.1. Organisation of Data
Our organisation of data should allow an efficient
computation of both our merge criterion (13) and our union and find
operations. Let us recall that when using a union-find merging scheme, each
region of the image is encoded by a spanning tree whose vertices are the pixels
of the region. These tree data structures are usually encoded by storing for
each pixel the index of its parent within the spanning tree. The information
about the region is associated with the root of the tree, and both the roots and
the region information are updated during a union operation.
Since our merge criterion only uses the mean color and the
cardinal of the regions,
one simple organisation of our data would consist in associating each pixel with the fields , where denotes the
father of within the tree.
However, grouping the region data and the father index
would require manipulating the whole vector within find
operations. Since only the father field is required by the find operation, such
an organisation of the data would induce the storage of useless data within the
cache memory.
We thus decided to store in two separate arrays the
data required for the merge operations (namely the region data) and the
encoding of the trees. More precisely, our organisation of data is as
follows:
(1) one array Data which stores for each created region its fields;
(2) one array Father which encodes our sequence of union operations;
(3) one array Label of size $|I|$ initialised to a special flag indicating that each pixel is initially its own father.
If a region is reduced to a single pixel $p$, Label[$p$] is set to a special flag and the data of the region is retrieved from the image $I$. We thus decide to create a new entry within the
array Data only if the
associated region is composed of at least 2 pixels. More
precisely, if a merge of two pixels $p$ and $q$ is decided by
our merge criterion,
(1) a new entry $\ell$ is created within the array Data and initialised according to $I(p)$ and $I(q)$;
(2) Label[$p$] and Label[$q$] are set to $\ell$;
(3) Father[$\ell$] is set to a special flag indicating that $\ell$ has no father yet.
Our data structure is further updated in the two following cases.
(1) One pixel $p$ is aggregated to an already created region labelled by $\ell$. In this case, Label[$p$] is set to $\ell$ and Data[$\ell$] is updated according to $I(p)$. The array Father remains unchanged.
(2) Two already created regions with respective labels $\ell_1$ and $\ell_2$ are merged. In this case, one of the labels (say $\ell_1$) survives, Data[$\ell_1$] is updated according to Data[$\ell_2$] and Father[$\ell_2$] is set to $\ell_1$.
Figure 5(b) illustrates the state of our different data
structures after the segmentation of Figure 5(a). Two pixels in Figure 5(a) are merged if they have the same label. In this example, we first considered
horizontal edges between pixels and then vertical ones. Both horizontal and
vertical edges have been considered using a scan line order. Note that the
array Data is completely filled by the four regions created during the union operations.
We only get three final regions as encoded by the array Father where all
labels, except label 2, are their own father.
Figure 5: The data
structures used to compute union-find operations and our merge criterion.
Since all regions encoded by the array Data are composed of
at least 2 pixels, the
maximal number of entries within this array is equal to $|I|/2$. Moreover, the vertices of the trees encoded by the
array Father correspond to
regions composed of at least 2 pixels. The
maximal size of the array Father is thus also
equal to $|I|/2$. Note that this upper bound may be reached if we
first decompose the image into regions made of 2 adjacent pixels
and then order the merges in such a way that the tree encoding the union of all
these elementary regions is linear.
Note that when using such an organisation of data, all
the required memory is allocated before union and find operations. We thus
avoid the risk of a memory overflow.
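The organisation in three arrays can be sketched in Python as follows. This is a simplified sketch for a gray-level image stored as a flat list: the region fields are assumed to be an intensity sum and a pixel count, the two special flags are assumed to be the value -1, and the class and method names are hypothetical. The find operation is deliberately kept minimal and does not model the register-based optimisation.

```python
SINGLE = -1      # flag: the pixel is still a one-pixel region
NO_FATHER = -1   # flag: the region label is the root of its tree


class RegionStore:
    def __init__(self, pixels):
        self.pixels = pixels
        capacity = len(pixels) // 2          # every entry covers at least 2 pixels
        self.label = [SINGLE] * len(pixels)  # one label per pixel
        self.data = [[0, 0] for _ in range(capacity)]   # [sum, count] per region
        self.father = [NO_FATHER] * capacity # trees over region labels only
        self.created = 0                     # number of entries in use

    def find(self, lab):
        # Only the Father array is traversed; region data stay out of the way.
        while self.father[lab] != NO_FATHER:
            lab = self.father[lab]
        return lab

    def merge_pixels(self, p, q):
        # Two single pixels: create a new entry and label both pixels.
        lab = self.created
        self.created += 1
        self.data[lab] = [self.pixels[p] + self.pixels[q], 2]
        self.label[p] = self.label[q] = lab
        return lab

    def add_pixel(self, p, lab):
        # One single pixel joins an existing region; Father is unchanged.
        lab = self.find(lab)
        self.data[lab][0] += self.pixels[p]
        self.data[lab][1] += 1
        self.label[p] = lab

    def merge_regions(self, lab1, lab2):
        # Two existing regions: the first root survives.
        r1, r2 = self.find(lab1), self.find(lab2)
        if r1 != r2:
            self.data[r1][0] += self.data[r2][0]
            self.data[r1][1] += self.data[r2][1]
            self.father[r2] = r1
        return r1
```

Because all three arrays are sized before the first merge, no allocation takes place inside the merging loop.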
4.2. TriMedia Processor
We tested this data organisation on the TriMedia
processor [32]. The
cache memory of this particular TriMedia is 128 KByte, 4 way associative, with block of 128 Byte. The replacement algorithm used is .
In order to increase the computational efficiency, we
exploit the data level parallelism (DLP) provided by our
algorithm (computation of edge weights, frame difference, classification of
pixels in the CDM). This increases
the throughput (i.e., the number of pixels processed per unit time) by
processing data in parallel whenever possible. The core of TriMedia is a VLIW
architecture with 5 issue slots. Each slot has some functional units, and each
functional unit can process 4 bytes in parallel (SIMD mode). The instruction
level parallelism (ILP) is extracted by the compiler, while the DLP can be
exploited through the use of custom operations, loop unrolling, and grafting.
We therefore use these optimisations to exploit the DLP available in our algorithm.
5. Experimental Results
In this section, we present experimental results of
our algorithm run on TriMedia on several well-known video
sequences.
5.1. Spatial Results
The probability tunes the coarseness
of the segmentation. In Figure 6, we show the influence of this parameter on the
level of details obtained. This parameter is highly correlated to the number of
segmented regions. A value of this parameter around provides a
sufficient level of details for most of the video sequences we have considered.
However, the chosen value and the associated level of details are highly
dependent on the application. We can remark that this algorithm is able to
segment very precisely small regions of interest such as the mouth or the eyes
of “Akiyo”. It can also segment the different numbers of the calendar in the
sequence “Mobile”. However, we can observe an over-segmentation of some textured
regions such as the wall in the sequence “Table”. This is mainly due to the
fact that assumption (1) is more adapted to the segmentation of flat regions.
Our ongoing research is directed towards the design of a new merging criterion
for the segmentation of textured regions.
Figure 6: Segmentation of
one frame of the video sequences “Akiyo", “Table",
“Mobile" with , , .
As a comparison, we give here some results obtained
with two other well-known algorithms: the EGBIS algorithm [12] and the statistical region
merging (SRM) algorithm of Nock and Nielsen [4]. These two algorithms are based on region merging
schemes with the same merging order as our method. The main difference
between the three methods lies in the merging predicate. The results are
displayed in Figure 7. For each algorithm, we have tuned the parameters in order
to reach a segmentation that allows a good subjective representation of the
elements of the image (numbers of the calendar, eyes of the woman, etc.). We can
see on these examples that our real-time algorithm gives results comparable
to those of the two other algorithms. This last point has been confirmed by other
experiments that are not reported here. Our real-time implementation is thus
achieved without detriment to the subjective quality of the results.
Figure 7: Comparison of our segmentation results with those
obtained using the algorithms EGBIS [12] and SRM [4].
5.2. Spatiotemporal Results
In the experiments, we take and (i.e., a region
is a changing region when it contains at least of changing
pixels). The values of these thresholds are the same for all the video
sequences.
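The region-level rule stated above (a region is a changing region when a sufficient share of its pixels are changing pixels) can be sketched as follows. This is a minimal illustration, not the TriMedia implementation: the function name `changing_regions`, the parameter `min_fraction`, and the value 0.5 in the usage example are hypothetical, and the sketch assumes that the previous-frame segmentation is available as an integer label map and the change-detection mask as a boolean map of the same size.

```python
import numpy as np

def changing_regions(labels, cdm, min_fraction):
    """Return the set of region labels flagged as changing.

    labels: 2D array of non-negative ints, one region label per pixel
            (assumed to come from the segmentation of the previous frame).
    cdm: 2D boolean array, True where a pixel is marked as changing.
    min_fraction: hypothetical threshold on the share of changing pixels.
    """
    labels = np.asarray(labels)
    cdm = np.asarray(cdm, dtype=bool)
    n_labels = int(labels.max()) + 1
    flat = labels.ravel()
    # Number of pixels per region, and number of changing pixels per region.
    sizes = np.bincount(flat, minlength=n_labels)
    changed = np.bincount(flat, weights=cdm.ravel().astype(float),
                          minlength=n_labels)
    # Labels that never occur keep a fraction of zero (no division by zero).
    fraction = np.divide(changed, sizes, out=np.zeros(n_labels),
                         where=sizes > 0)
    mask = (sizes > 0) & (fraction >= min_fraction)
    return {int(r) for r in np.flatnonzero(mask)}

# Usage: region 0 has 3 of 4 changing pixels, region 1 has 1 of 4.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
cdm = np.array([[1, 1, 0, 0],
                [1, 0, 0, 1]], dtype=bool)
print(changing_regions(labels, cdm, min_fraction=0.5))
```

Working at the region level in this way makes the decision less sensitive to isolated changing pixels than a purely pixel-wise test would be.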
In order to see the influence of our temporal process,
we show here an example of segmentation results with and without time
consistency in Figures 8(c) and 8(b). We can see that the segmentation of the wall is the same for the two frames and of the video
sequence “Table” when we use the time-consistency improvement.
Figure 8: Comparison of the segmentation results obtained with
and without time consistency on the video sequence “Table”.
We then display the segmentation results
along the video sequence “Akiyo” in Figure 9 and along the video sequence “Paris” in
Figure 10. The method gives satisfactory and stable results for
these sequences.
Figure 9: Results for the spatiotemporal segmentation of the
video sequence “Akiyo”.
Figure 10: Results for the spatiotemporal segmentation
of the video sequence “Paris”.
Figure 11: Experimental results in the presence of a video scene cut. (a) Segmentation of the last frame of the video “Football”. (b) Segmentation of the first frame of the video “BBC Disc”.
We have also tested the robustness of our method in
the case of a shot cut. The video sequence “Football” is followed by
the video “BBC Disc”. Experimental results are given in Figure 11. We
can observe that the spatial segmentation of the first frame of the video
“BBC Disc” is not influenced by the spatial segmentation of the previous frame,
which belongs to the video “Football”. Indeed, in this case, most
edges belong to the third category of edges (), for which the
predicate is recomputed.
5.3. Evaluation of Time Consistency
We use a classical measure to evaluate time
consistency. Given the segmentation of the previous frame and the
segmentation of the current one , we find a correspondence between regions in and . For each region , we choose the region that produces
the most overlapping area
We then sum the overlap measures for all the regions in . The consistency measure is this
sum expressed as a percentage of the image size. The results for this measure are given in
Table 1 for the video sequences “Akiyo”, “Table Tennis”, “Paris”, and
“Mobile”. When consistency is enforced through the CDM, time consistency is higher and, visually, the segmentation
is more stable from frame to frame while still fitting region boundaries very well,
as shown in Figures 9 and 10. We can also see that the time consistency of the
spatial segmentation algorithm SRM [4] is roughly equivalent to the time consistency of our spatial algorithm without computation of the CDM.
Table 1: Experimental measures of time consistency.
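The overlap-based consistency measure described above can be sketched as follows. This is an illustrative reading rather than the exact evaluation code: the function name `time_consistency` is hypothetical, both segmentations are assumed to be integer label maps of the same size, and the sketch assumes that the best-overlap search runs over the regions of the current frame (the direction of the correspondence is an assumption).

```python
import numpy as np

def time_consistency(prev_labels, curr_labels):
    """Percentage of pixels covered by the best region-to-region overlaps.

    prev_labels, curr_labels: arrays of non-negative int region labels
    for the previous and the current frame (same shape).
    """
    prev = np.asarray(prev_labels).ravel()
    curr = np.asarray(curr_labels).ravel()
    n_prev = int(prev.max()) + 1
    n_curr = int(curr.max()) + 1
    # overlap[i, j] = number of pixels labelled i in the current frame
    # and j in the previous frame.
    pair_index = curr * n_prev + prev
    overlap = np.bincount(pair_index, minlength=n_curr * n_prev)
    overlap = overlap.reshape(n_curr, n_prev)
    # For each current region, keep the largest overlap with a previous region.
    best = overlap.max(axis=1)
    return 100.0 * best.sum() / prev.size

# Usage: one pixel changes region between the two frames.
prev_seg = np.array([[0, 0, 1, 1]])
curr_seg = np.array([[0, 1, 1, 1]])
print(time_consistency(prev_seg, curr_seg))
```

With this definition, two identical segmentations give 100, and the value decreases as region boundaries move between frames.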
5.4. Evaluation of the Computational Cost
In this section, we report the number of
Mcycles the algorithm takes on TriMedia for different resolutions and different
versions of our algorithm. We also compare the spatial computational cost
with the one obtained using the SRM algorithm of Nock and Nielsen [4].
The computational cost has been evaluated as a
function of the image size in Figure 12. In this figure, the computational cost
(in ) has been
computed for one image of the video “Akiyo” at different resolutions (QCIF,
CIF, SD, and two other resolutions). This computation has been performed with
and without the optimisations described in Section 4.2. First, the results given in Figure 12 show that the complexity is
approximately linear in the image size. Indeed, the spatial
computational cost is mainly induced by the union-find algorithm and the
edge sorting. As explained in Section 2.3, the sorting step is performed in linear time. As far as the
union-find algorithm is concerned, the complexity is given by where is the number of union operations and is the number of find operations. The function is a very slowly growing function [25]. Since the number of find operations can be upper-bounded by where is a constant,
the worst-case complexity can be approximated by , which gives an almost linear complexity. This assessment is confirmed by the experimental results given in Figure 12.
Figure 12: Evaluation of the computational cost as a function of the image size (with one image of the video “Akiyo”, ).
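The two steps that dominate the spatial cost, linear-time edge sorting and union-find merging, can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' implementation: edge weights are assumed to be small non-negative integers (so that a bucket sort applies), the names `counting_sort_edges`, `UnionFind`, and `merge_regions` are hypothetical, and the `predicate` argument is a placeholder threshold test rather than the statistical predicate of the paper. With union by rank and path compression, the textbook bound for a sequence of m operations on n elements is O(m α(n)), where α is the inverse Ackermann function, which is consistent with the almost linear behaviour discussed above.

```python
def counting_sort_edges(edges, max_weight):
    """Sort (weight, a, b) edges by integer weight in O(len(edges) + max_weight)."""
    buckets = [[] for _ in range(max_weight + 1)]
    for edge in edges:
        buckets[edge[0]].append(edge)
    return [edge for bucket in buckets for edge in bucket]

class UnionFind:
    """Disjoint sets with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # First pass: locate the root of the set containing x.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every visited node directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

def merge_regions(n_pixels, edges, max_weight, predicate):
    """Visit edges in increasing weight order and merge when predicate holds."""
    uf = UnionFind(n_pixels)
    for weight, a, b in counting_sort_edges(edges, max_weight):
        if predicate(weight):  # placeholder for a real merging predicate
            uf.union(a, b)
    return [uf.find(p) for p in range(n_pixels)]

# Usage: four pixels in a row with intensities 10, 12, 200, 203;
# each edge weight is the absolute intensity difference of its endpoints.
edges = [(188, 1, 2), (2, 0, 1), (3, 2, 3)]
labels = merge_regions(4, edges, 255, lambda w: w <= 10)
print(labels)
```

In this toy example the first two pixels end up in one region and the last two in another, since only the low-weight edges satisfy the placeholder predicate.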
We then compare the computational cost of
our algorithm to that of the SRM algorithm [4]. The main difference between the two spatial
algorithms lies in the computation of the predicate. The predicate of SRM leads
to a higher computational cost, as shown in Table 2. Our algorithm gives a
lower computational cost even without optimisations. When these
improvements are included, the computational cost decreases further. In Table 2, we also give the number of Mcycles the algorithm takes on TriMedia when enforcing the temporal
consistency. The exploitation of the CDM reduces the
computational cost. This reduction depends on the correlation between two
successive frames.
Table 2: Computational cost.
With a 450 MHz TriMedia, we are able to process more than frames per
second. We can then conclude that our algorithm runs in real time for
QCIF or CIF sequences.
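The relation between a per-frame cycle count and the achievable frame rate is simple arithmetic, sketched below. The function name `frames_per_second` and the per-frame cost of 15 Mcycles used in the example are hypothetical and are not taken from Table 2; only the 450 MHz clock comes from the text.

```python
def frames_per_second(clock_mhz, mcycles_per_frame):
    """Frame rate reachable when one frame costs mcycles_per_frame Mcycles.

    A clock of clock_mhz MHz delivers clock_mhz Mcycles per second, so the
    frame rate is the ratio of the two quantities.
    """
    return clock_mhz / mcycles_per_frame

# Usage with a hypothetical cost of 15 Mcycles per frame on a 450 MHz clock.
rate = frames_per_second(450.0, 15.0)
print(rate)
```

A sequence can be processed in real time when this rate is at least the frame rate of the video.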
6. Discussion
Designing usable algorithms for video processing requires methods with a low computational cost.
Guided by this constraint, we propose here an efficient time-consistent
algorithm for video segmentation. Let us discuss the strengths and limitations
of our algorithm regarding the three main points of this work.
(i) Spatial segmentation: we propose here an alternative statistical model to the
work of Nock and Nielsen [4]. This leads to a simpler merging predicate that is
better suited to a real-time implementation and gives good results for the
spatial segmentation. However, as in [4], such a statistical model is dedicated to the
segmentation of flat regions and may produce an over-segmentation of textured
areas of an image.
(ii) Temporal consistency: the proposed algorithm yields both stable segmentation
results and a reduction of the computational cost. This method is based on the
use of a CDM and of region information deduced from the first frame. Regions
are not linked from one frame to another, which leads to a video segmentation
algorithm that is robust to scene cuts and occlusions. However, if this
algorithm is ever to be used for video object tracking, region matching will
be needed. It can be obtained by comparing regions of two consecutive frames
using statistical inequalities.
(iii) Hardware implementation: our algorithm runs in real time for CIF sequences. For
standard definition (SD) or high definition (HD) sequences, some further efforts
are needed. In order to obtain a real-time implementation, we have directed our
attention to the parallelisation by blocks of the spatial segmentation.
However, we are still investigating this part, notably the merging of the
different spatial segmentations obtained for the different blocks. This last
step remains delicate.
Finally, we want to underline that such a real-time video
segmentation algorithm would help many video algorithms by leading to a better
understanding of the image content. Among applications, we can think of time
conversion, peaking (also named unsharp masking), video compression, or
deinterlacing. The region segmentation algorithm can be exploited directly,
using region boundaries and region color properties, or as a source of
information on the image content (level of noise, complexity of the scene, main
colors), which can be exploited to better design existing algorithms [33]. Our ongoing research is
also directed to the design of such region-based algorithms for electronic
devices (e.g., set-top boxes).
Acknowledgments
The authors would like to thank Patrick Meuwissen, O. P. Gangwal, and Zbigniew Chamski for their constructive suggestions. We would also
like to thank the reviewers for their very useful comments and suggestions.