Abstract

We propose to model a tracked object in a video sequence by locating a list of object features that are ranked according to their ability to differentiate against the image background. Bayesian inference is utilised to derive the probabilistic location of the object in the current frame, with the prior approximated from the previous frame and the posterior obtained via the current pixel distribution of the object. Consideration has also been given to a number of relevant aspects of object tracking, including multidimensional features and the mixture of colours, textures, and object motion. Experiments of the proposed method on video sequences have been conducted and have shown its effectiveness in capturing the target in a moving background and with nonrigid object motion.

1. Introduction

Tracking an object within a video sequence is an important task in computer vision and has found applications in a variety of fields such as surveillance, machine intelligence, and even medical treatment. Object modelling is the capstone of tracking technologies, and different modelling methodologies typically lead to different scopes and capabilities. The two main application categories closely related to object modelling are object detection and object tracking. In object detection, the averaging method [1] is the most straightforward way to extract the background of the frames, and thus the object, for a relatively static background scene, while other methods may resort to using a mixture of Gaussians to adaptively model each pixel [2], the kernel density [3] to statistically represent the background, or the local correlation maps [4] to exclude outliers in estimated motion vectors.

The traditional template matching [5–7] tries to match the whole object region directly with another in a new frame and to locate the new object position that achieves minimum and acceptable matching errors. The Kalman filter was utilised in [8] to make the template matching also adaptive to object occlusions. Since direct template matching is in general of high computational cost and has its own restrictions, contour approaches consequently gained considerable attention for their advantages in dealing with objects of deforming shapes and in lessening the computational complexity. In particular, the snake model first introduced in [9] has been extended into many variants [10–14] of the active contour model. They extract the object contour based on certain contour deforming criteria such as the minimisation of an energy function. More recently, kernel functions have been widely used [15–17] in estimating the likelihood of a given pixel being on an object of interest or the probabilistic differences between the candidate and the target object regions, while multihypothesis methods and Bayesian inference [10, 18, 19] have been employed to propagate object contours or the like into future frames. The condensation method in [11], for instance, propagates the contour of the tracked object in the framework of a posterior probability. In harsh environments such as tracking a camouflaged object [20], motion features [10] may have to be additionally considered.

Template matching, though intuitive, has limitations in processing objects of deformable shapes, even though certain deformable template methods have been proposed to improve the performance. The contour approach, on the other hand, has to rely heavily on the presence of strong object boundaries. The kernel approach, as its own downside, often requires a heavier computational load. For all these different algorithms, there still need to be additional sanity assumptions such as rigid movement, limited illumination change, and a static background. Since significant features such as strong edges and distinctive textures are known to be indicative of the presence of the tracked object, it is anticipated that any feature that differentiates the object from the background will be a good choice for the tracking purpose. Such features may be selected from a given pool [21] or result from enhancing the brighter or darker aspect of the object by wavelet filters [22]. In most applications, however, essentially one feature [15–20, 22] is utilised at a time, and this may be largely due to the curse of multidimensionality. In this connection, our aim is to devise a framework that would allow us to incorporate several features at the same time in locating the object from the background, simplify the modelling representation, and then direct the tracking within the framework of Bayesian inference.

The main purpose of this work is to model the tracked object of nonrigid shape by one or several object features, such as colours and textures, so as to more accurately model the object and robustly resist environmental disturbances and noise. This modelling will be based on multidimensional cubes of features, rather than one-dimensional bins of a single feature, and will estimate the object location by the probabilities of its presence. Unlike the one-dimensional case, the number of feature cubes can increase very rapidly in multiple dimensions. However, we found that nullifying "insignificant" cubes still maintains the tracking logic while reducing the computational load. The location of the object will be represented in terms of probabilities and will be estimated in a Bayesian framework. In this regard, one can first approximate the local object and background densities in terms of features such as colours and textures and then derive via Bayesian inference the object probability, that is, the probability of a pixel belonging to the object.

This work is organised as follows. In Section 2, we establish the framework for the use of multidimensional feature cubes and the Bayesian inference. Different ways of combining several features are also inspected there. We then look into a few different types of features in Section 3, including colours, textures, and motion. A simplistic approach to better synchronise the frame background is also explored. We then devise in Section 4 a scheme to select features of dominance to model the object. Section 5 then develops a method of shape consolidation and extraction so that an extracted object with background noise can be effectively enhanced. The experimental results are subsequently reported in Section 6 for a variety of video sequences. Finally, Section 7 contains a short conclusion.

2. Multidimensional Feature Space

A video sequence $\{F_i\}_{i\ge 0}$ consists of a series of frames, and $F_i$ thus represents a frame of $m\times n$ pixels. A general frame $F_i$ typically contains a tracked object of interest, $T_i$, whose contour boundary is denoted by $\Gamma_i$, within a local window $L_i$; see Figure 1. For a given single feature, its pixel histogram $h:[a,b]\cap\mathbb{N}\to\mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers, $[a,b]$ is the range of the feature values, and $h(x)$ is the frequency of the value $x$, can be normalised onto the unit interval $I=[0,1]$, or further normalised to a distribution density $\rho(x)=h(x)/\int_0^1 h(x)\,dx$. For a $d$-dimensional feature, such as a mixture of colours and textures, its pixel histogram will similarly take the form
\[
h: D\to\mathbb{N},\qquad D=[a_1,b_1]\times\cdots\times[a_d,b_d], \tag{1}
\]
where $[a_k,b_k]$ represents the scope of the $k$th feature value.

If $S$ and $T$ are two image components, we denote by $S\subseteq T$ that each image pixel of image component $S$ is also part of image component $T$, and by $T\setminus S$ the remaining part of $T$ after $S$ is taken from it. For a given $T\subseteq L\subseteq F$ with $B=L\setminus T$ and $\tilde B=F\setminus T$ like those in Figure 1, we can thus define the histograms $h_T$, $h_B$, and $h_{\tilde B}$ on them, respectively. Likewise, we can also define the densities $\rho_T(x)$ for the target object, $\rho_B(x)$ for the background within a local window $L$, and $\rho_{\tilde B}(x)$ for the background of the whole frame $F$.

In order to minimise the modelling data and reduce the complexity, we first extend the one-dimensional notion of "bins" for the colours to multidimensional "cubes" for features. Basically, all the feature values will be put into certain cubes in such a way that (i) any two feature values belonging to the same cube are "close" in terms of the physical nature with which the feature is defined and (ii) all cubes together form a disjoint partition and are completely ordered. For this purpose, we partition each feature domain $[a_k,b_k]$ into the union of intervals of width $\omega_k>0$, and the width of the $k$th dimension of a typical $d$-dimensional cube is thus $\omega_k$. The choice of the weight factor $\omega_k$ allows different feature values to impact the modelling at different scales. For any two feature vectors $x,y\in D$, the union of all the cubes, we define their distance via the weighted infinity norm
\[
\|x-y\| = \max_{1\le i\le d}\frac{|x_i-y_i|}{\omega_i}. \tag{2}
\]
We note that the common Euclidean metric $\ell_2$ would not allow us to easily partition the feature space. If the $k$th cube has a centre feature vector $c=c_k$, then that cube is defined by
\[
I_k = \Bigl\{(x_1,\ldots,x_d) : -\frac{\omega_i}{2}\le x_i-c_i<\frac{\omega_i}{2},\ 1\le i\le d\Bigr\}, \tag{3}
\]
and obviously $\|x-c\|\le 1/2$ for $x\in I_k$ and $\bigcup_{1\le k\le N} I_k\supseteq D$, where $N$ is the total number of the cubes. For any image component $T$, we can determine the histogram $h_T$ and likewise determine $h_B$ for the background. These histograms essentially represent the probabilities of a given pixel falling on the object $T$ or the background $B$ according to their feature values. If the features are chosen to be sufficiently discriminative, then they may be used to locate or track the given object in different video frames through the Bayesian inference which will be explained in the following.
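As an illustration, the following is a minimal sketch, in Python, of mapping feature vectors into the cubes of (3) and accumulating a normalised cube histogram; the function names, feature ranges, and cube widths are our own illustrative assumptions rather than part of the proposed method.

```python
import numpy as np

def cube_index(x, a, omega):
    """Return the cube index tuple of feature vector x.

    x     : (d,) feature vector
    a     : (d,) lower bounds a_k of each feature domain
    omega : (d,) cube widths omega_k
    """
    return tuple(np.floor((np.asarray(x) - a) / omega).astype(int))

def cube_histogram(features, a, omega):
    """Accumulate a normalised histogram over cubes for an image component.

    features : (num_pixels, d) array of feature vectors of the component
    """
    hist = {}
    for x in features:
        k = cube_index(x, a, omega)
        hist[k] = hist.get(k, 0) + 1
    total = float(len(features))
    return {k: v / total for k, v in hist.items()}

if __name__ == "__main__":
    # toy example: 2-dimensional features (e.g. hue and saturation) in [0,1]
    rng = np.random.default_rng(0)
    a = np.array([0.0, 0.0])
    omega = np.array([1 / 80, 1 / 80])        # 80 bins per axis, as in Section 6
    object_feats = rng.normal([0.57, 0.10], 0.02, size=(500, 2)).clip(0, 1)
    h_T = cube_histogram(object_feats, a, omega)
    print(len(h_T), "occupied cubes out of", 80 * 80)
```

Only the occupied cubes are stored, which is exactly what makes the nullification of insignificant cubes computationally cheap.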

2.1. Bayesian Inference

The Bayesian inference is a standard statistical technique that involves collecting evidence $E$ that is meant to be consistent or inconsistent with a given hypothesis $H$; as evidence accumulates, the degree of belief in the hypothesis changes. Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed and calculates a numerical estimate of the degree of belief in the hypothesis after evidence has been observed. Bayes' theorem thus states $p(H\mid E)=p(E\mid H)\,p(H)/p(E)$, where $p(H\mid E)$ is the posterior probability of $H$ given $E$, an improvement on the originally estimated prior probability $p(H)$. In the more general case of having $n+1$ mutually exclusive hypotheses $H_i$, the prior probabilities $p(H_i)$ can be improved into the posterior probability $p(H_0\mid E)$ based on a set of additionally observed evidence via
\[
p(H_0\mid E)=\frac{p(E\mid H_0)\,p(H_0)}{\sum_{i=0}^{n} p(E\mid H_i)\,p(H_i)}, \tag{4}
\]
where $p(E\mid H_i)$ is the likelihood of the hypothesis $H_i$ under the observed evidence $E$. For our specific problem, (4) takes the form
\[
p_t(T\mid x)=\frac{p_t(x\mid T)\,p_t(T)}{p_t(x\mid T)\,p_t(T)+p_t(x\mid B)\,p_t(B)}, \tag{5}
\]
where $p_t(T)$ and $p_t(B)$ are the prior probabilities estimated prior to observing the actual pixel value, $p_t(x\mid B)$ denotes the probability of having the value $x$ for a pixel on the background $B$, and $p_t(T\mid x)$ denotes the probability of being part of the object $T$ for a pixel of value $x$. Since the total object area remains somewhat constant across the nearby frames, we will assume a constant ratio $p_t(T)/p_t(B)$ and use the initial $p_0(T)$ and $p_0(B)$ for all the neighbouring frames. Since the object and background densities are very close across consecutive frames, $p_{t-1}(x\mid T)$ and $p_{t-1}(x\mid B)$ or the like can be used to approximate those at time $t$ in (5). Alternatively, if the feature values for the background are simple and distinctive, then one can estimate the probability of a pixel being on the background, as opposed to being on the object. Such a probability can be estimated similarly to (5) via
\[
p_t(B\mid x)=\frac{p_t(x\mid B)\,p_t(B)}{p_t(x\mid T)\,p_t(T)+p_t(x\mid B)\,p_t(B)}, \tag{6}
\]
from which $p_t(T\mid x)$ can be easily estimated.
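A hedged sketch of the update (5) is given below: the cube histograms estimated from the previous frame serve as the likelihoods $p_{t-1}(x\mid T)$ and $p_{t-1}(x\mid B)$, and the prior ratio is kept constant. Function and variable names are ours, not taken from the paper.

```python
import numpy as np

def posterior_map(features, h_T, h_B, cube_index, prior_T=0.3):
    """Return p_t(T|x) for every pixel of a local window, as in (5).

    features   : (H, W, d) feature image of the local window
    h_T, h_B   : dict cube-index -> normalised frequency (previous frame)
    cube_index : function mapping a feature vector to its cube index
    prior_T    : assumed constant prior p(T); p(B) = 1 - prior_T
    """
    H, W, _ = features.shape
    post = np.zeros((H, W))
    prior_B = 1.0 - prior_T
    for i in range(H):
        for j in range(W):
            k = cube_index(features[i, j])
            pxT = h_T.get(k, 0.0)             # p_{t-1}(x | T)
            pxB = h_B.get(k, 0.0)             # p_{t-1}(x | B)
            denom = pxT * prior_T + pxB * prior_B
            post[i, j] = pxT * prior_T / denom if denom > 0 else 0.0
    return post
```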

For the calculation of $p(x\mid T)$, for instance, it can be derived from the histograms of the pixels on the object $T$. In the case of independent feature variables $\{x_i\}$, the probability $p(x\mid T)$ becomes separable,
\[
p(x\mid T)=p(x_1\mid T)\cdots p(x_d\mid T), \tag{7}
\]
and the probabilities $p(x_i\mid T)$ can be calculated individually. Even if the feature variables are implicitly correlated to a certain extent, it is still possible to use the separable form (7) to approximate $p(x\mid T)$, and this gives more flexibility in selecting the feature variables. In most cases, all the variables $x_i$ correspond to affirming features in that a larger value indicates a higher probability of being on the object $T$. However, if a particular $x_k$ is in fact a negating feature as opposed to an affirming one, then $p(x_k\mid T)$ can be replaced by $\tilde p(x_k\mid T)=1-p(x_k\mid T)$, or even estimated somewhat differently to adjust the significance of that feature variable.
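A small sketch of the separable approximation (7), including the treatment of a negating feature, is given below; the per-feature histograms are assumed to be one-dimensional arrays indexed by bin number, and the names are illustrative only.

```python
import numpy as np

def separable_likelihood(bin_indices, per_feature_hists, negating=()):
    """Approximate p(x|T) as the product of per-feature bin probabilities.

    bin_indices       : sequence of bin indices, one per feature component
    per_feature_hists : list of 1-D normalised histograms, one per component
    negating          : indices of components treated as negating features
    """
    p = 1.0
    for j, (k, hist) in enumerate(zip(bin_indices, per_feature_hists)):
        pj = float(hist[k])
        p *= (1.0 - pj) if j in negating else pj   # negating feature uses 1 - p
    return p
```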

2.2. Incorporation with Other Features

When an object is being tracked in a video sequence, it is natural to expect that the more features one monitors the better and more robust the tracking outcome, particularly when the prime features of colours and textures happen to be relatively weak in a particular shot. These additional features could be in the form of colours and textures, motion information, the physical restriction of a rigid body, or a domain mask on the object shape. These will obviously depend on the individual application scenarios.

In the case of extracting additional motion features, one can represent [10] the probability density function $p_D(d)$ of the observed interframe difference $D_t(s)=I_t(s)-I_{t-1}(s)$ in terms of a static background density $p_S(d)$ and a conditional mobile density $p_M(d)$, where $S$ and $M$ refer to the static and motion components, respectively, and $d$ denotes the difference value in intensity. In other words, one has the mixed model $p_D(d)=P_S\,p_S(d)+P_M\,p_M(d)$, and the model parameters $P_S$, $P_M$, and $\Theta$ can be estimated by maximising the joint likelihood $\prod_{d\in D}p_D(d\mid P,\Theta)$ through the maximum likelihood principle or the method of expectation maximisation [23], where $d\in D$ means $d$ runs through $D_t(s)$ for all the pixel positions $s$ there. We note that a formula similar to (5) can also be applied to the motion frames $D_t$ for the estimation. For any pixel of intensity $x$, if the corresponding intensity difference in $D_t$ is $d$, then the motion data can improve [10] the estimation $p_t(T\mid x)$ via
\[
\tilde p_t(T\mid x)=p_t(T\mid x)\,\frac{p_t(d\mid M)\,p_t(M)}{p_t(d\mid M)\,p_t(M)+p_t(d\mid S)\,p_t(S)}. \tag{8}
\]
In a particular tracking application, additional features or geometric restrictions may also be utilised to refine the tracking accuracy. We note that although $\mathrm{abs}(D_t(s))$ does not fit the mixed Gaussian models mentioned above, the absolute frame difference often works equally well directly.
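A brief sketch of (8) is shown below: an existing colour-feature posterior map is multiplied by a per-pixel motion posterior derived from the frame difference. The densities $p(d\mid M)$ and $p(d\mid S)$ are assumed to come from a two-component mixture fit of the frame difference; all names here are illustrative and not from the paper.

```python
import numpy as np

def motion_augmented_posterior(colour_post, frame_diff, p_dM, p_dS,
                               prior_M=0.2):
    """Combine a colour posterior with a motion posterior as in (8).

    colour_post : (H, W) map of p_t(T|x)
    frame_diff  : (H, W) interframe difference D_t
    p_dM, p_dS  : callables returning densities p(d|M), p(d|S) elementwise
    prior_M     : assumed prior probability of the motion component
    """
    prior_S = 1.0 - prior_M
    num = p_dM(frame_diff) * prior_M
    den = num + p_dS(frame_diff) * prior_S
    motion_post = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return colour_post * motion_post
```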

In general, when there are two separate probability maps, $p(x)$ and $q(x)$, derived for different features or from different methodologies, one can also combine the two maps in different ways. Suppose that, after thresholding and enhancement, $p(x)$ and $q(x)$ lead to the object areas $P$ and $Q$, respectively; we can then also incorporate the relative physical distances to synthesise them via
\[
r(s)=\frac{\alpha\, p(x)}{1+d_a(s,Q)}+\frac{\beta\, q(x)}{1+d_a(s,P)}+\frac{\gamma\, p(x)\,q(x)}{1+d_a(s,P\cup Q)}, \tag{9}
\]
where $\alpha$, $\beta$, and $\gamma$ are the coupling constants, $x=x(s)$ is the pixel value at position $s$, $d_a(s,P)=\eta\bigl[\sum_{s'\in P}\|s-s'\|\bigr]/|P|$ for a nonempty set $P$, and $d_a(s,\emptyset)=\infty$ for the empty set $\emptyset$. For calculational simplicity, one may sometimes replace $d_a(s,P)$, the average distance of $s$ to $P$, simply by $\eta\|s-c(P)\|$, where $c(P)$ is the centre of $P$ and $\eta$ is a parameter that adjusts the coupling strength. We note that the usual spatially independent combinations $\alpha p(x)+\beta q(x)$, $p(x)q(x)$, and $1-(1-p(x))(1-q(x))$, each corresponding to a different approach to combining the probability maps, are all special cases of the extended $r(s)$ above.
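A hedged sketch of (9), using the simplified distance $d_a(s,P)\approx\eta\|s-c(P)\|$, is given below; the parameter values are illustrative assumptions.

```python
import numpy as np

def combine_maps(p, q, thresh=0.5, alpha=1.0, beta=1.0, gamma=1.0, eta=0.05):
    """Combine probability maps p and q as in (9), with centre-based distances."""
    def centre(mask):
        idx = np.argwhere(mask)
        return idx.mean(axis=0) if idx.size else None

    def dist_to_centre(shape, c):
        if c is None:                       # empty set: infinite distance
            return np.full(shape, np.inf)
        rows, cols = np.indices(shape)
        return eta * np.hypot(rows - c[0], cols - c[1])

    P, Q = p >= thresh, q >= thresh         # thresholded object areas
    dQ = dist_to_centre(p.shape, centre(Q))
    dP = dist_to_centre(p.shape, centre(P))
    dPQ = dist_to_centre(p.shape, centre(P | Q))
    return alpha * p / (1 + dQ) + beta * q / (1 + dP) + gamma * p * q / (1 + dPQ)
```

Setting the distances to zero (or $\eta=0$) recovers the spatially independent combination $\alpha p(x)+\beta q(x)+\gamma p(x)q(x)$ as a special case.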

3. Some Variety of Features

Colours and textures are obviously the most prominent local features one usually encounters when locating an object in a background environment. There are different colour spaces, offering a variety of possible colour features. For simplicity, we will consider only the RGB colour space and the HSV colour space, as the others are very similar. While most colour-based approaches are template based and subimages are searched or matched blockwise [24], our proposed scheme is largely pixel based and probabilistic.

The literature on the use of texture in object tracking is quite scarce compared with that on the use of colours. One relatively recent work [25] presented an efficient approach that is based on the authors' custom-made texture of local binary patterns (LBP). Their work is typically applicable to video shots of a static camera, and the LBPs there served as the bases for the multimodal structure for the identification of the object motion.

At a given image pixel, the texture around the pixel is determined by the pixel values in the neighbourhood. There can be many different ways to define a local texture, ranging from the mean and standard deviation to the LBP, or
\[
\Phi=\Biggl[\sum_{1\le k\le m} r_k^{\,q}\sum_{x\in R_k}\frac{|x-\mu_k|^{p}}{|R_k|}\Biggr]^{1/p}, \tag{10}
\]
where $\bigcup_k R_k$ is a partition of the neighbourhood of the pixel position, $R_k$ is a concentric annulus, $\mu_k$ is the mean of the disk $\hat R_k=\bigcup_{i\le k}R_i$, and $r_k$ is the radius of the annulus $R_k$. In the simplest case of $p=2$, $q=0$, and $m=1$, $\Phi$ is just the standard deviation.
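A minimal sketch of the texture measure (10) in its simplest case ($p=2$, $q=0$, $m=1$), namely the local standard deviation over a square neighbourhood, is given below; the window size is an illustrative choice.

```python
import numpy as np

def local_std(image, radius=2):
    """Per-pixel standard deviation over a (2*radius+1)^2 neighbourhood."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, radius, mode="edge")
    h, w = img.shape
    k = 2 * radius + 1
    # gather the k*k shifted copies of the image and take the std over them
    stack = np.stack([padded[i:i + h, j:j + w]
                      for i in range(k) for j in range(k)], axis=0)
    return stack.std(axis=0)
```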

To avoid the participation of noise in the calculation of textures, we can use the correlated texture, which is defined through a set of correlated feature values such as a set of selected colour values. For instance, if $\{I_k\}$ is a set of ranges of leading colours for the object, then we calculate the texture property, such as the mean, the standard deviation, and the skewness, based only on these selected colours, correlated through the common object, by ignoring all the other colours. The determination of the feature values and their significance may vary slightly with the size of the local window $L_i$. Ideally, $L_i$ should be large enough to contain the object in the next frame, and a more representative local window can reduce noise for pixels at a distance from the object.
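The following is a short sketch of such a correlated texture: the local statistic is computed only over pixels whose colour falls inside the selected leading-colour ranges, and all other pixels are ignored. The colour range used here (blue in 0.55 to 1.0, as in Section 6.3) is only an example.

```python
import numpy as np

def correlated_std(channel, radius=2, lo=0.55, hi=1.0):
    """Local standard deviation using only pixels with value in [lo, hi]."""
    img = np.asarray(channel, dtype=float)
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = img[max(0, i - radius):i + radius + 1,
                      max(0, j - radius):j + radius + 1]
            sel = win[(win >= lo) & (win <= hi)]   # correlated colours only
            out[i, j] = sel.std() if sel.size > 1 else 0.0
    return out
```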

Although motion is generally considered a very strong feature, if present, the extraction of object motion is not easy when the camera is not steady, let alone when the camera is meant to move around. Hence, we limit ourselves here to extracting motion features only from frames that exhibit a steady or very slowly varying background. Suppose that there are two frames of $m\times n$ pixels, $F=\{f(s): s\in\mathbb{N}^2,\ s=(s_1,s_2),\ 0\le s_1\le m,\ 0\le s_2\le n\}$ and likewise $F'=\{f'(s)\}$; their difference will highlight the moving object if $F$ and $F'$ share the same background. By extracting the mixed models for the difference frame, an additional feature of motion can be incorporated as in (8). It is also possible to make use of the difference frame directly, at the expense of involving the physical pixel locations explicitly. In that case it serves more like a probabilistic mask for the object.

In the case of an unsteady camera, one can improve the quality of the difference frame by shifting pixel positions slightly. By shifting the row and column positions of the frame $F$ by $a$ and $b$, respectively, so that it becomes the frame $F_\Delta=\{f(s+\Delta)\}$ for $\Delta=(a,b)$, the difference frame between $F_\Delta$ and $F'$ will minimise locally $\bigl[\sum_{s\in D_o}|f(s+\Delta)-f'(s)|\bigr]/\bigl[(m-|a|)(n-|b|)\bigr]$, where $|a|\le m$, $|b|\le n$, and $D_o$ denotes the overlapped area of $F_\Delta$ and $F'$. When the shifts $a$ and $b$ are not all integers, an interpolation on the neighbouring pixel values is needed; see Figure 2 for the case of $|a|\le 1$ and $|b|\le 1$, where the pixel value at the point $(r+a,c+b)$ marked by a small square is to be interpolated from its four neighbour pixels drawn as solid disks. The interpolation is done via the two pixels marked by empty circles, which are in turn derived from the interpolation of their respective two neighbours in solid disks, and the formula reads
\[
\begin{aligned}
f(r+a,c+b)={}&|a||b|\,f(r+\mathrm{sgn}(a),c+\mathrm{sgn}(b))+(1-|a|)|b|\,f(r,c+\mathrm{sgn}(b))\\
&+|a|(1-|b|)\,f(r+\mathrm{sgn}(a),c)+(1-|a|)(1-|b|)\,f(r,c),
\end{aligned} \tag{11}
\]
where the sign function is defined by $\mathrm{sgn}(x)=1$ if $x>0$, $-1$ if $x<0$, and $0$ if $x=0$. When the background is well preserved locally even though it changes across frames due to camera panning, it is often possible to generically synchronise the consecutive frames so that the difference of the synchronised frames can be used to extract the motion components. Given two frame images $F=\{f(s)\}$ and $F'=\{f'(s)\}$, an initial synchronisation displacement $(r,c)$, and an initial navigation step size $\chi$, we compare the synchronisation error $\bigl[\sum_{s\in D_o}|f(s)-f'(s+\Delta s)|\bigr]/(\text{intersection area of }F\text{ and }F')$ for $\Delta s=(r,c)+(\Delta r,\Delta c)$, with $\Delta r$ and $\Delta c$ being $\pm\chi$ or $0$. The displacement then updates to $(r,c)\leftarrow(r,c)+(\Delta r,\Delta c)$ for the $(\Delta r,\Delta c)$ that minimises the synchronisation error. If the updated $(r,c)$ differs from the previous one, the process repeats; otherwise, $\chi$ is reduced by a half and the process repeats again. The algorithm terminates when the synchronisation displacement reaches the required precision. Once the difference of the synchronised frames is derived, it also becomes possible to use the mixed model to extract the motion distribution or simply to use the difference frame directly. As in the case of the meerkat sequence in Figure 24, we expect that a pixel displacement within a single pixel is in general not so significant in extracting the motion component. Even though the above simplistic algorithm can stop at a local minimum, it is surprising that it works for a great majority of frames, even for the bird video in Figure 12 (red colour selected), where the camera panning shifts the background by tens of pixels.
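The following is a simplified, integer-shift sketch of this synchronisation search; the subpixel interpolation (11) is omitted, and the function names and the initial step size are our own choices.

```python
import numpy as np

def sync_error(f, f2, dr, dc):
    """Mean absolute difference between f and f2 shifted by (dr, dc)."""
    h, w = f.shape
    r0, r1 = max(0, dr), min(h, h + dr)
    c0, c1 = max(0, dc), min(w, w + dc)
    a = f[r0:r1, c0:c1]
    b = f2[r0 - dr:r1 - dr, c0 - dc:c1 - dc]
    return np.abs(a - b).mean() if a.size else np.inf

def synchronise(f, f2, chi=16):
    """Estimate the displacement (r, c) aligning f2 to f by a shrinking-step search."""
    r = c = 0
    while chi >= 1:
        moves = [(dr, dc) for dr in (-chi, 0, chi) for dc in (-chi, 0, chi)]
        best = min(moves, key=lambda m: sync_error(f, f2, r + m[0], c + m[1]))
        if best == (0, 0):
            chi //= 2            # no improvement at this step size: refine
        else:
            r, c = r + best[0], c + best[1]
    return r, c
```

The difference frame of the synchronised pair can then be thresholded or fed into the mixed model exactly as in the steady-camera case.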

4. Choice of Dominant Features

Suppose that $\{I_i\}_{i\in\mathcal{I}}$ is a partition of the feature space, with $c_i$ as the centre of $I_i$. Then, an image component $S$ is said to contain the feature vector $c_i\in I_i$ if, for a threshold $\tau_{\min}>0$, one has
\[
\int_{I_i}\rho_S(x)\,dx\ \ge\ \tau_{\min}. \tag{12}
\]
For a multidimensional feature space, there are typically a great number of feature cubes $I_i$ to consider, resulting in potential computational complexity. While for the one-dimensional feature space we can just directly use the whole set of cubes, or bins in this case, to conduct the object tracking, it is desirable to significantly reduce the number of cubes from the inspection horizon.

How can we dynamically select the right feature cubes so that an object of interest can be more effectively and efficiently tracked within a video sequence? Suppose that a general pixel has a $d$-dimensional feature vector $x=(x^{(1)},x^{(2)},\ldots,x^{(d)})$. Then for each single feature component $x^{(j)}$, we locate the feature cubes that best differentiate the tracked object $T$ against the local background $B=L\setminus T$. More precisely, we could locate the feature cubes $I_k$ such that the following regularity conditions
\[
\gamma\,\Delta_k^{(j)}\ \ge\ |I_k|\,\eta, \qquad \Delta_k^{(j)}\equiv\int_{I_k}\bigl[\rho_T(x)-\rho_B(x)\bigr]\,dx, \tag{13a}
\]
\[
\frac{\int_{I_k}\rho_T(x)\,dx}{\int_{I_k}\rho_{\tilde B}(x)\,dx}\ \ge\ \tau>0, \tag{13b}
\]
\[
\frac{\Delta_k^{(j)}}{\int_{I_k}\rho_T(x)\,dx}\ \ge\ \tau>0 \quad\text{for }\gamma=1, \tag{13c}
\]
\[
\frac{-\Delta_k^{(j)}}{\int_{I_k}\rho_B(x)\,dx}\ \ge\ \tau>0 \quad\text{for }\gamma=-1, \tag{13d}
\]
\[
\gamma\bigl(\rho_T(x)-\rho_B(x)\bigr)\ \ge\ 0, \quad x\in I_k, \tag{13e}
\]
hold for a constant $\gamma=\pm 1$; see Figure 3. Intuitively, (13a) requires that the chosen features are sufficiently discriminative between the object and the background, (13b) requires that the object area is not diminishing, and (13c)–(13e) require that the feature difference over an interval of interest should be sufficiently observable with respect to the object or to the background. We note that the threshold $\tau$ in (13b)–(13d) is to ensure that the feature cube $I_k$ does not get overwhelmed by the frame background. For the general case of feature vectors, we can apply this scheme to each feature component $x^{(j)}$ so as to collect a set of cubes of differentiating features $\{I^{(j)}\}$ with $I^{(j)}=\bigcup_{k\in K^{(j)}}I_k^{(j)}$, or
\[
\bigl\{I_k^{(j)}: k\in K^{(j)},\ j=1,\ldots,d\bigr\}, \tag{14}
\]
where $K^{(j)}$ contains the indices of the chosen cubes. For convenience, we will refer to such a collection $I^{(j)}$ of cubes as the discriminating band (DB) for the $j$th feature component and use a negative index $k$ to denote the case of a dominant background feature with $\gamma=-1$. The main reason that we explore all available features instead of just one is that a single feature component, such as red in RGB, may not suffice to tell apart all parts of the tracked object $T$ from the background. However, a direct composition of all the feature components may lead to unnecessarily duplicated cubes in terms of search effort and to missing some of the less differentiating features which are nonetheless responsible for telling apart other areas of the object from the background. Moreover, different feature spaces may exhibit different powers of distinguishing the object and the background.

Let $S$ be an image component; we denote by
\[
d(s,S)=\min_{s'\in S}\|s-s'\| \tag{15}
\]
the physical distance of the pixel position $s$ to the component $S$. Let $M$ be a pixel mask of the same size as $S$; then we denote by $S|_M$ the set of those masked pixels in $S$. That is, for any $S_{i,j}\in S$, $S_{i,j}\in(S|_M)$ if $M_{i,j}\ne 0$ and $S_{i,j}\notin(S|_M)$ if $M_{i,j}=0$. Then, we can use the following procedure to obtain those cubes of tracking features.
(1) For a given frame $F$ and an object $T\subseteq F$, select a local window $L$ such that $T\subseteq L$ and $\{s\in F: d(s,T)\le\delta\}\subseteq L\subseteq F$ for a constant distance $\delta>0$. Initialise a mask $M$ for $F$.
(2) Choose the next unprocessed feature component, the $j$th feature component $x^{(j)}$. For the masked image $S|_M$, calculate the normalised histograms and locate the corresponding discriminating band $I^{(j)}$.
(3) Repeat step (2) if the current feature component does not induce a good DB $I^{(j)}$, or repeat (2) after converting the image into another feature space. Otherwise, go to step (4).
(4) Stop, if all feature components have been considered or if the current collection of selected feature cubes already covers the object $T$ well. Otherwise, go to step (5).
(5) Update the mask $M$ by marking off those pixels contributing to the features in the DB $I^{(j)}$, and then go back to step (2).

We note that a larger distance between the density peaks of the object $T$ and the background $B$, and a larger value of quantities such as $|\Delta_k^{(j)}|$ in (13a)–(13e), typically correspond to a better tracking performance. We also note that the masking in the above can be made optional for simplicity if needed. By annihilating all the features other than those selected in $\{I_k^{(j)}\}$, we can construct a histogram based on those in $\{I_k^{(j)}\}$ for $k\ge 0$ and then normalise it to the probability density function $p_T(x)$. Likewise, we can also construct the probability density function $p_B(x)$ for the local background based on $\{I_k^{(j)}\}$ for $k<0$ via (6).
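To make the selection concrete, the following is a simplified sketch, in Python, of picking the discriminating bins of a single feature component and rebuilding a density from them; the per-bin test and the thresholds below are illustrative stand-ins in the spirit of (13a)–(13e), not the exact conditions of the paper.

```python
import numpy as np

def discriminating_band(rho_T, rho_B, eta=0.5, tau=0.05):
    """Return (object_bins, background_bins) for 1-D densities rho_T, rho_B."""
    rho_T = np.asarray(rho_T, dtype=float)
    rho_B = np.asarray(rho_B, dtype=float)
    delta = rho_T - rho_B                       # per-bin version of Delta_k
    obj_bins = np.where((delta >= eta) & (rho_T >= tau))[0]    # gamma = +1
    bkg_bins = np.where((-delta >= eta) & (rho_B >= tau))[0]   # gamma = -1
    return obj_bins, bkg_bins

def density_from_bins(hist, bins):
    """Nullify all bins outside `bins` and renormalise to a density p(x)."""
    p = np.zeros_like(np.asarray(hist, dtype=float))
    p[bins] = np.asarray(hist, dtype=float)[bins]
    s = p.sum()
    return p / s if s > 0 else p
```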

5. Shape Consolidation and Extraction

When an object is extracted from a new frame, it may come in the form of a haze of pixels, particularly when the pixels are estimated probabilistically. Even though this may not matter much if one is to estimate its rough area with a method like the ellipse fitting algorithm we will mostly use in our experiments later on, it helps if one can further enhance the object image or probability map. Given any pixel position $s\in\mathbb{N}^2$, the total weight of its neighbourhood $\Omega\subseteq\mathbb{N}^2$ centred or pivoted at $s$,
\[
w(s)=\sum_{s'\in\Omega}\frac{x(s+s')}{|\Omega|}, \tag{16}
\]
where $|\Omega|$ denotes the cardinality of the set $\Omega$, can be used to determine how the pixel value $x(s)$ at position $s$ should be enhanced to $\tilde x(s)$ according to
\[
\tilde x(s)=\sum_{s'\in\Omega'}\alpha(s')\,x(s+s'),\quad\text{if }w(s)\ge\tau, \tag{17}
\]
and $\tilde x(s)=0$ otherwise, where $\tau>0$ is a controlling threshold, the $\alpha$'s are weight constants, and $\Omega'\subseteq\mathbb{N}^2$ is another neighbourhood of $s$ which may or may not be the same as $\Omega$. A simple such $\Omega$ can be taken as $\Omega=\{(0,0),(\pm1,0),(0,\pm1),(\pm1,\pm1)\}$, which can be represented by the $3\times 3$ grid of 1's, and a simple choice of the weight constants $\alpha(s')$ can be made as $\alpha(s')=\beta/(\|s'\|+1)$ for the Euclidean distance and a given constant $\beta$. This is essentially an iterative process we first considered in [20]. In general, we choose $\beta=\gamma/\bigl[\sum_{s'\in\Omega'}1/(\|s'\|+1)\bigr]$ with $\gamma\ge 1$ and slightly larger, so that the enhanced pixel value $\tilde x$ is at about the same scale as the original pixel values, or slightly larger. As the iteration proceeds, the total number of nonzero pixels will initially decrease due to the annihilation of isolated small patches of pixels and may, however, eventually turn around and increase instead, because certain choices of $\beta$ may make the enhanced solid body slightly expansive.
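A minimal sketch of one enhancement pass is given below, assuming the $3\times3$ neighbourhood for both $\Omega$ and $\Omega'$ and the weights $\alpha(s')=\beta/(\|s'\|+1)$ discussed above; the parameter values are illustrative.

```python
import numpy as np

def enhance(x, tau, gamma=1.05):
    """One iteration of the neighbourhood enhancement (16)-(17)."""
    x = np.asarray(x, dtype=float)
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    dist = np.array([np.hypot(di, dj) for di, dj in offsets])
    # beta = gamma / sum(1/(||s'||+1)); alpha(s') = beta / (||s'|| + 1)
    alpha = gamma / np.sum(1.0 / (dist + 1.0)) / (dist + 1.0)

    padded = np.pad(x, 1, mode="constant")
    h, w = x.shape
    shifted = np.stack([padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
                        for di, dj in offsets], axis=0)
    w_map = shifted.mean(axis=0)                     # equation (16)
    x_new = np.tensordot(alpha, shifted, axes=1)     # equation (17)
    x_new[w_map < tau] = 0.0                         # annihilate weak pixels
    return x_new
```

Calling `enhance` repeatedly reproduces the iterative shrink-then-expand behaviour described above.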

The choice of $\tau$ controls how far or how fast the enhancement shrinks or expands the image. If the average intensity $\mu$ of the object is estimated by the average of the nonzero pixel weights $w(s)$, then $\tau$ should be of the scale $\mu/2$. In particular, if $\Omega$ is a $(2k+1)\times(2k+1)$ grid of pixels, as the dotted box in Figure 4, then a pixel near the object border, whose intensity $w(s)$ is about $\mu k/(2k+1)<\mu/2$, will be left at $0$ intensity if $\tau=\mu/2$. For the extraction of the object borderline, one can essentially use the known erosion technique [26]. One can first convert all nonzero pixels $x(s)$ of the image into the binary $1$, then calculate the averaged image $w(s)$ via (16) for the $\Omega$ of the $3\times 3$ grid and again convert the nonzero $w(s)$ into $1$, and finally derive the borderline as the nonzero pixels of the subtraction of these two images.
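A short sketch of this borderline extraction follows: binarise the image, average over the $3\times3$ neighbourhood as in (16), binarise again, and take the nonzero pixels of the difference of the two binary images.

```python
import numpy as np

def borderline(image):
    """Extract a thin borderline of the nonzero region of an image."""
    binary = (np.asarray(image) != 0).astype(float)
    padded = np.pad(binary, 1, mode="constant")
    h, w = binary.shape
    avg = np.mean([padded[i:i + h, j:j + w]
                   for i in range(3) for j in range(3)], axis=0)   # (16)
    grown = (avg > 0).astype(float)
    return (grown - binary) != 0
```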

When noise is somewhat extensive in an image or map to be enhanced, a more effective way of removing the noise is to set the pixel threshold according to the percentage of the "target" pixels in its neighbourhood. For a given image to be enhanced, assumed to be gray scaled for simplicity and without loss of generality, the intensity of the target object is expected to fall within a certain domain $\Psi$. For instance, if the difference image of two consecutive frames is to be enhanced, the brighter the pixel intensity, the more likely it is a pixel of a moving object. The domain $\Psi$ in such a case could be chosen as, for instance, the top 1% of the nonzero pixels. Let such a domain $\Psi(\xi)$ be controlled by a parameter $\xi$. Then, for any pixel $x$ at location $s$, if less than $\zeta$ percent of the pixels in its neighbourhood $\Omega$ belong to $\Psi$, we increase the threshold $\tau$ in (17) by a factor $\tau'$, which raises the threshold significantly because that pixel position is deemed far away from the target object. In other words, the $\tau$ in (17) is to be dynamically determined by the $\tau(s)$ below:
\[
\tau(s)=\begin{cases}\tau, & \text{if } \dfrac{|\omega|}{|\Omega|}\ge\zeta,\\[2mm] \tau'\,\tau, & \text{if } \dfrac{|\omega|}{|\Omega|}<\zeta,\end{cases} \tag{18}
\]
where $\omega=\{s' : x(s')\in\Psi,\ s'\in\Omega+s\}$ represents those relatively sure pixels on the target in the neighbourhood $\Omega$ of $s$. Figure 5 shows the effectiveness of this method of discriminative threshold for a feature frame from the meerkat video sequence in Figure 24. The original (half frame) is depicted in subimage 1, the two consecutive enhancements via (17) and (18) are shown as subimages 2 and 5, the two consecutive enhancements via (17) without using (18) are displayed as subimages 3 and 6, subimage 4 shows a single step of enhancement without (18), and finally subimage 7 illustrates the outcome of applying four consecutive steps of enhancement without (18). We note that subimages 2–7 contain only the cropped middle area of the enhanced subimage 1. We thus observe that subimage 5 of the discriminative threshold retains the object body better than subimages 4, 6, and 7.
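A hedged sketch of the discriminative threshold (18) is given below: the base threshold $\tau$ is inflated by a factor $\tau'$ wherever fewer than a fraction $\zeta$ of the $3\times3$ neighbours belong to the "sure target" domain $\Psi$, taken here as the top 1% of nonzero intensities as suggested above; the numerical values are illustrative.

```python
import numpy as np

def dynamic_threshold(x, tau, tau_prime=5.0, zeta=0.2, top_fraction=0.01):
    """Return a per-pixel threshold map tau(s) as in (18)."""
    x = np.asarray(x, dtype=float)
    nz = x[x > 0]
    cutoff = np.quantile(nz, 1.0 - top_fraction) if nz.size else np.inf
    sure = (x >= cutoff).astype(float)            # membership of Psi

    padded = np.pad(sure, 1, mode="constant")
    h, w = x.shape
    frac = np.mean([padded[i:i + h, j:j + w]
                    for i in range(3) for j in range(3)], axis=0)
    return np.where(frac >= zeta, tau, tau_prime * tau)
```

The resulting map can be passed in place of the constant $\tau$ in the enhancement pass sketched after (17).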

6. Experiments

We first illustrate for a video sequence that a single feature may often be utilised on its own to track a certain type of object if the feature is distinctive enough, and that different features may be able to trace or highlight different aspects of the tracked objects. We then show how we can improve the tracking accuracy by combining several features at the same time, that is, by utilising a multidimensional feature vector.

6.1. Single Feature in Full Range

First, we observe that the 1st frame in Figure 6 has the images in HSV colours shown respectively in Figure 7. If we pick hue as the base feature, then the posterior probabilities of the object for HSV and RGB are shown sequentially in the 6 image plots in Figure 8. If we use the hue colour as the feature for tracking, then a threshold will easily extract the object from the probabilistic map in the above figure. When the probability map is somewhat "foggy", certain enhancement techniques may need to be utilised. This could be in the form of an incremental threshold followed by incremental region growth, or in the form of (16) and (17). The incremental property is to ensure a good trade-off between removing noise and keeping the dominant part of interest. Such regional shrinkage and growth are typically controlled by the intensity at a given pixel and the average in the local neighbourhood. Another technique is to use region masking, based on the idea that if one knows the rough whereabouts of the object, then one can simply nullify directly the probabilities of those "far away" pixels. The procedure, sketched in code below, is as follows: (i) select a large radius for the object; (ii) choose the centre of the object in the previous frame as the approximate centre of the object in the current frame; (iii) nullify directly the probabilities of those pixels exceeding the mask radius from the approximate centre; (iv) update the centre and repeat the above steps several times if necessary. The probability map then indicates where the object is, as in Figure 9. We note that one can also make use of the method of dynamic threshold, as in (18), to determine whether a pixel is far away from the object or not.
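The following is a minimal sketch of the region-masking procedure (i)–(iv); the radius, the iteration count, and the use of a probability-weighted centroid for the centre update are illustrative choices of ours.

```python
import numpy as np

def region_mask(prob, centre, radius=60, iterations=3):
    """Nullify probabilities far from the approximate object centre."""
    prob = np.asarray(prob, dtype=float).copy()
    rows, cols = np.indices(prob.shape)
    for _ in range(iterations):
        dist = np.hypot(rows - centre[0], cols - centre[1])
        prob[dist > radius] = 0.0                       # step (iii)
        if prob.sum() > 0:                              # step (iv): new centre
            centre = (np.average(rows, weights=prob),
                      np.average(cols, weights=prob))
    return prob, centre
```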

For an object of free form like a bird, tracking is largely about locating where the object is rather than where precisely the object border lies. In this regard, one can use a preselected base shape to fit the object. Such a base shape could be a rectangle or an ellipse, for instance. For this purpose, we first locate the centre of the object and draw a straight line through the centre. Then, we locate the centres of the two separate halves and draw a straight line linking them; see Figure 10. The fitting is done when these two straight lines become perpendicular, which resembles the diagram in Figure 11, where the two heavy dots denote the centres of the corresponding halves. Since the area of the right half of the ellipse in Figure 11 is $\pi ab/2$ and the moment of that area with respect to the $y$ axis is $\tfrac{2}{3}a^2b$, we have $\bar R\times\pi ab/2=\tfrac{2}{3}a^2b$ and thus $a=3\pi\bar R/4$, where $\bar R$ is the distance from the ellipse centre to the centre of the half. By evening out the potential imbalance between the left and right halves through an average, the radius on the axis connecting the two centres of the halves can be calculated based on the ellipse shape and thus reads
\[
a=\frac{3\pi}{8}\Bigl[\sqrt{(r_1-\bar r)^2+(c_1-\bar c)^2}+\sqrt{(r_2-\bar r)^2+(c_2-\bar c)^2}\Bigr], \tag{19}
\]
where $(r_1,c_1)$ and $(r_2,c_2)$ are the rows and columns of the two half centres, respectively, and $(\bar r,\bar c)$ is the centre of the object. Swapping the roles of the centre lines, one can then directly calculate the radius $b$ on the other axis. For clarity, we may also enlarge such $a$ and $b$ uniformly by a few pixels, resulting in the outer ellipse in Figure 9, for instance.
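A hedged sketch of this base-shape fitting is given below; for simplicity it splits the object mask along the image axes rather than iterating the perpendicular centre lines, and all names are ours.

```python
import numpy as np

def ellipse_radii(mask):
    """Estimate ellipse radii (a, b) of a binary object mask via (19)."""
    pts = np.argwhere(mask)                     # (row, col) of object pixels
    centre = pts.mean(axis=0)

    def radius(axis):
        # split the object into two halves by the centre along `axis`
        left = pts[pts[:, axis] < centre[axis]]
        right = pts[pts[:, axis] >= centre[axis]]
        if len(left) == 0 or len(right) == 0:
            return 0.0
        d1 = np.linalg.norm(left.mean(axis=0) - centre)
        d2 = np.linalg.norm(right.mean(axis=0) - centre)
        return 3 * np.pi / 8 * (d1 + d2)        # equation (19)

    return radius(1), radius(0)                 # a along columns, b along rows

if __name__ == "__main__":
    yy, xx = np.mgrid[0:100, 0:100]
    mask = ((xx - 50) / 30) ** 2 + ((yy - 50) / 15) ** 2 <= 1
    print(ellipse_radii(mask))                  # roughly (30, 15)
```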

As a result of all these employed methodology and techniques, the typical tracked objects are exhibited in Figure 12 in which the tracked parts are circled by the red ellipses. If one wishes to follow the steps in Section 5, one can further improve the object shapes from the ellipses in Figure 12 to their closer form; see the first few frames in Figure 13.

We note that if one wishes to select the motion feature for this tracking, synchronisation of the frame background is needed, because the background view shifts by tens of pixels across each frame. We applied the background synchronisation described at the end of Section 3 to the 20 consecutive frames, mainly on the red colour space. The synchronisation succeeded for all frames apart from frames 11, 14, and 17, which in turn succeeded in hue for frame 11 and in green for frames 14 and 17. This shows that the motion component can also serve well to track the bird in this sequence. In Figure 14, we depict a typical improvement by the background synchronisation on the right over the direct frame difference on the left.

6.2. Single Feature of Selected Values

If we plot the histograms for both the object (in red) and the background (in blue), as in Figure 15 for HSV and RGB, we observe that some colours differentiate the object and the background better than others.

For the hue histograms, we observe that the object histogram (in red) is most discriminative at 0.57, which corresponds to the 46th bin of the 80-bin histogram, while the background peaks at 0.29, corresponding to the 24th bin. In this experiment, we show that by selecting just several discriminative feature values and nullifying the histogram for the other values, the tracking is often still possible. This shows that one may merely opt for several feature values to simplify the histograms and consequently the whole calculation. In this example, we push this idea to the extreme and nullify the whole histogram apart from the 46th bin for the hue colour. The histogram becomes trivial, as in Figure 16, but the tracking has not been much affected; see Figure 17.

We can also select the dominant values for the background to serve the purpose, and we just need to make use of (6) instead of (5). Figure 18(a) displays the posterior probability map for the background, with its reverse displayed in (b) for better visibility, while (c) shows the corresponding histograms. We note that we enlarged a little the size of the local window to make it more representative of the background.

6.3. Feature Varieties

Since a distinctive feature could be a particular colour or texture, or any contrived mathematical quantity, we illustrate below a few different types of features for our tracking purpose. First, we examine the use of the local standard deviation. For each pixel point of a chosen image space, we calculate the standard deviation of the neighbouring, say 9 or 21, pixel values. For the video sequence in Figure 19(a), for instance, its red colour image in standard deviation of 21 neighbours is shown in (b), while its posterior probability image is depicted in (c). The corresponding histograms are plotted in (d), while the tracked object base shape is outlined in green in (b).

We also applied the LBP as the feature value to this video. The LBP image is shown in Figure 19(e), its posterior probability map is depicted in (f), and the corresponding histograms are plotted in (g). The LBP seems to somewhat randomise the original image, as shown in the histograms, and fails to serve as an acceptable feature for the object tracking. For the texture of standard deviation, it served fine for a few frames until similar textures become abundant in the neighbourhood of the object. It is overall less robust than the colour features.

However, if we utilise the correlated texture as the feature for tracking, then the outcome is much improved. In this connection, we experimented on the video sequence in Figure 6 with the textures defined by LBP, while the correlated blue colour ranges are chosen as 0.55 to 1. The tracking is fine, and the result is shown in Figure 20 in which brighter ellipses indicate later object traces and the slight overall colour change is due to the neglect of those unselected blue colours as well as the LBP application to those colours.

6.4. Combination of Several Features

There are two different ways to utilise multiple features for the tracking purpose. They differ according to the stage at which the different features are combined. One method is to track simultaneously and separately with each individual feature and then unify their separate results using some kind of confidence voting system, thus leading to an improved tracking performance. The walking dog in the sequence in Figure 21, displayed every 4 frames, is tracked this way in terms of the use of the hue and saturation colours.

The other way is to track with the combined features. In the next example of a camouflaged meerkat, if we use the red-circled parts in Figure 22 as the object and background samples, the tracking via the saturation colour is possible during the initial steps of a "clean" environment, as is indicated in the histogram in Figure 23(a). However, it is less stable when the meerkat is close to other camouflaging objects like the tree stumps and the rock. If we add these (within green circles) to the background samples, then they represent the blue spikes in the middle of Figure 23(c) for the hue colour, in comparison with the corresponding flat part in (b) when these are not added. This indicates that the use of an additional feature, the hue colour, will differentiate the background better and make the tracking more robust. In fact, the hue colour of the rock is quite dominant, as is shown also in Figure 23(d), which plots the case of the rock (within the green circle) against the background of tree stumps (within the green ellipse) and the meerkat (within the larger red ellipse). To improve the robustness, we adopt the saturation colour as the affirming feature with range selection 0.05 to 0.15 and adopt the hue colour as the negating feature with ranges 0.55 to 0.75 and 0 to 0.1. The $\omega_i$ in (2) are set equal to each other and to 1/160, and a combination form similar to (7) is also utilised. The tracking results are shown in Figure 24.

Alternatively, we can select the rock's feature to consolidate its background status via (6), or simply treat it as a background object. Figure 25(a) shows that the rock is sampled in a circle while the nonrock environment is sampled in two ellipses. The rock feature here is exemplified with the use of the LBP texture on the saturation, followed by the mean of the 21 neighbourhood cells. Figure 25(b) shows the rock probabilities, which can be incorporated with the normal tracking in Figure 22.

Since the meerkat is moving in this video sequence, the motion feature can make the object stand out from the cluttered background. If we use the frame differences to extract the motion feature according to Sections 2.2 and 3, choosing in this case simply $\alpha=\beta=\eta=0$ in (9) and with no extra frame synchronisation as the frame background is already static enough, then we can combine just the features for the saturation colour and for the motion. Figure 26 illustrates a clear separation of the object meerkat from the camouflaging rocks. Figures 26(c) and 26(d) display the enhanced motion feature for frames 11 and 12, while Figures 26(a) and 26(b) show the consequent tracking effect. We note that the motion feature also detects the tiny head of a second meerkat coming into the scene, which was not detected by the previous schemes.

Further experiments can be done with other types of features, such as other types of textures. It is also possible to apply the method to several objects at the same time, with all but one such object treated as essentially the background. However, these are beyond our current scope.

7. Conclusion

We proposed to represent the object in terms of dominant feature vectors of colours and textures, and possibly motion, in the local environment and use them to track the object in the video frames. Such effective feature elements can be extracted dynamically for the object modelling. The feature elements are determined by their collective power to distinguish the object from its background. It is also noted that the impact of multidimensionality can be significantly reduced if insignificant feature cubes are directly nullified.

Acknowledgments

The author thanks Zhuan Qing Huang for making available some useful Matlab-coded functions and video clips of her own and Xiling Guo for some programming assistance.