Abstract

Object tracking is one of the fundamental problems in computer vision, but existing efficient methods may not be suitable for spatial object tracking. Therefore, it is necessary to propose a more intelligent mathematical model. In this paper, we present an intelligent modeling method using an enhanced mean shift method based on a perceptual spatial-space generation model. We use a series of basic and composite graphic operators to complete the perceptual transformation of the signal. The proposed Monte Carlo contour detection method overcomes the dimensionality problem of existing local filters. We also propose an enhanced mean shift method with estimation of spatial shape parameters, which adaptively adjusts tracking areas and eliminates spatial background interference. Extensive experiments on a variety of spatial video sequences, with comparisons to several state-of-the-art methods, demonstrate that our method achieves reliable and accurate spatial object tracking.

1. Introduction

Mathematical formalism is probably the most precise and logical language in scientific research. Researchers in the pure natural sciences typically attempt to describe observed phenomena using mathematical correlations. However, because of the complexity of real-world scenarios, it is often very difficult to construct a perfect and permanent mathematical model for a specific issue in engineering fields [1]. In recent years, efforts concentrated on self-optimization and self-adaptation have led to a new field between mathematics and applications, called intelligent modeling [2]. In this paper, we propose a new intelligent modeling method using an enhanced mean shift method based on a perceptual spatial-space generation model for spatial object tracking. Object tracking has been applied to many fields, such as video surveillance [3], robot recognition [4], and traffic control [5]. In spatial on-orbit docking, object tracking can be used to track spatial aircraft and assist with ground control. Because spatial images are mainly generated from low-rate videos [6] or airborne spectral imagery [7] captured by aircraft sensors [8, 9], their resolution and spatial-temporal coverage are not ideal. In addition, because of differences in sensor spectral bands, acquisition position, and contrast gradient settings, there are shifts in relative position and scale between multisource images of the same scene. All of these factors degrade spatial object tracking results.

Recently, multidimensional decomposition and multiscale representation methods have been widely applied to image processing and computer vision. Mumford and Gidas proposed a stochastic model [10] in which truncation errors and noise interference can be isolated in the discrete domain. Witkin and Koenderink proposed the image scale-space model [11, 12], which suppresses noise interference at fine scales and reduces analysis errors at coarse scales. Burt and Lindeberg proposed the coarse-to-fine model [13, 14], which reduces useless gradients in gradient entropy calculation. In the object tracking field, the traditional method is rectangular block region tagging [15, 16]. In [17], Isard and Blake applied a local filter to the object tracking field. Sun and Liu proposed combining local description and global representation in object tracking [18]. Recently, a graphics model based on a Bayesian neural network was also applied to continuous object tracking [19]. However, when these methods are applied to spatial object tracking, spatial background clutter and moving-object overlap appear in different scenes, and the resulting multiscale deviation seriously reduces spatial object tracking accuracy.

In this paper, we propose a spatial object tracking method using an enhanced mean shift method based on a perceptual spatial-space generation model. We detect spatial object continuity and saliency between different scales in the perceptual spatial-space generation model. The enhanced mean shift considers the relevance between the motion area and the static background, which yields more robust object tracking. Our proposed method is shown in Figure 1. This paper is organized as follows. Section 2 describes the perceptual spatial-space generation model. Section 3 proposes the enhanced mean shift method. Section 4 shows experimental results. Section 5 concludes the paper.

2. Perceptual Spatial-Space Generation Model

Spatial images contain a large amount of structured data, and their processing may not obey the assumptions of existing image processing models. In this paper, we propose a perceptual spatial-space generation model. It consists of two parts: a prototype pyramid and a set of perceptual transform rules. The latter set is the Gaussian transform pyramid. Our goal is to set a common variable and a maximized posterior probability that can be used to calculate the priority prototype pyramid and the perceptual transform rules.

2.1. Prototype Pyramid Generation

The generation model is a joint probability function of the prototype and the image, together with a dictionary that includes image primitives such as blobs, edges, crosses, and bars. It can be expressed as

The decomposition probability can be divided into primitive and texture components.

Consider the following:

where the three terms denote the attribute graph, the collection of primitives drawn from the dictionary, and the variance of the corresponding primitive features, respectively. The prior model is an inhomogeneous Gibbs model that defines the graphical attributes. It focuses on continuity properties in the perceptual model, such as smoothness, continuity, and typical functions.

Consider the following:

where the first term is the primitive mark with its connection degree, and the potential term captures the association between two correlated functions. Because there is uncertainty in the inner perception posterior probability, the prototype pyramid may not be continuous across layer calculations. In order to ensure transition and consistency of the single-frame diagram, we define a set of graphical operation factors.

Graphical operation factors synthesize detected graphic edges into pairs of characteristic bridges. Each bridge is associated with the properties of the probability function. The conversion between adjacent scales is realized by a series of conversion rules, and the rule order directly determines the conversion efficiency. The generative rule graphic path between scales can be expressed as

where the result is the optimally calculated prototype. Under the condition that there is no loss in perception model accuracy, we assume that the prototype begins to decay through a single operation factor, which gradually reduces the resolution and carries a related complexity. The posterior probability can be expressed as

A layered, reduced perception model will not completely fit the complex model, so the first logarithmic ratio is often negative. A balance parameter is used to trade off model fitness against complexity. Under this condition, we can derive a simplified form, decided in the following range:

The transform between adjacent layers is achieved using a group of greedy detections. The appropriate scale of the graphical operation factor differs depending on the subjective goal. We suppose that the graphical operation factor lies between the following bounds:

Based on formulas (6) and (7), we determine that the corresponding interval is

2.2. Perceptual Transform of Prototype Pyramid

In this section, our goal is to determine the optimum conversion path and deduce the hidden graphics prototype. Our method scans the prototype pyramid from top to bottom based on the learned decision rule for each primitive. It can be divided into three steps.

Step 1 (independent calculation of the prototype pyramid). We apply a pyramid algorithm to the bottom image and calculate the Gaussian pyramid. Because each prototype layer is calculated using a MAP estimation [20], there is a certain loss in the continuity of the prototype pyramid. The specific formula is as follows:
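The formula itself did not survive extraction. As a stand-in illustration of the Gaussian pyramid computed in this step, a minimal sketch might look as follows (NumPy and SciPy are assumed; the level count and smoothing scale are illustrative, not the paper's values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=4, sigma=1.0):
    """Build a Gaussian pyramid: smooth, then downsample by 2 per level."""
    pyramid = [image.astype(np.float64)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(smoothed[::2, ::2])  # decimate rows and columns
    return pyramid
```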

Step 2 (pattern matching from bottom to top). We match the image prototype attributes across adjacent layers using an image registration algorithm from a previous study [21]. We use a set of judgment functions to process each node and obtain the related image characteristics at each step. Specifically, the matching degree between a node at one scale and a node at the adjacent scale can be expressed as
where the denominator is the variance of the related features. For similarity matching between adjacent layers, this formulation allows an empty variable prototype to appear. We multiply by the related variance and obtain a homologous prototype with a subsidiary value. The pattern matching results are used as the initial parameters for the Markov chain matching in the next step.
Consider the following:
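The referenced equation did not survive extraction. As an illustrative stand-in, a matching degree based on a Gaussian similarity of node features — one common choice, and an assumption here rather than the paper's confirmed formula — could be computed as:

```python
import numpy as np

def matching_degree(feat_a, feat_b, variance):
    """Gaussian similarity between feature vectors of two nodes at
    adjacent scales; variance is the related feature variance."""
    diff = np.asarray(feat_a) - np.asarray(feat_b)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * variance)))
```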

Step 3 (Markov chain matching). Because of the uncertainty of the initial perception in the posterior probability model and the relative complexity of dynamic graphic structure mining, we use the reversibility of Markov chain matching to match the perceptual transform. The Markov chain includes 25 pairs of reversible jumps. In each path of the Markov chain matching, there is a reversible jump between adjacent states [22]. These reversible jumps are related to their corresponding grammars, and each pair of rules is selected based on probability. We use this mechanism to optimize the perception conversion path, which leads to cross-scale continuous perception prediction.
Consider the following:

2.3. Object Contour Evolution

The problem with such a matching strategy is that it contains no contour-level matching and may split a long contour into short edges. We therefore propose a new Monte Carlo contour detection method, which chooses the appropriate criterion at a given scale using spatial-space domain knowledge. Our method uses a weight set to estimate the posterior probability density. According to resampling theory [23], it is feasible to draw samples with appropriate weights from the normal density distribution.

Consider the following:

For the sequence sample, the importance probability can be chosen in the following way:

The entire probability can be approximated with a simplified formulation:

We use only one part of the whole sample to simulate the interaction value between two adjacent areas. The local probability is the weight of the interactive area. In this paper, we use a sequence of Monte Carlo simulations to estimate the interactive part with its related estimation weight. The importance density function can be defined as follows:

Based on the resampling theory, the sampling weight can be updated as follows:

where the three quantities are the sample retrieval, the sample length, and the sample set of the corresponding part, respectively. Because we use the Markov property, the density function can be further approximated by the product of local observation similarity units.

Consider the following:

In our proposed framework, all parts of the detected object are tracked at the same time. There is no need to calculate every unit probability, so we can use the local likelihood to directly estimate the weights of all units.
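As a minimal sketch of the sequential Monte Carlo machinery described above, assuming generic `propagate` and `likelihood` functions (all names are illustrative; the paper's exact weight update is given by the equations above):

```python
import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    """Systematic resampling: draw particles proportionally to weights."""
    n = len(particles)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    indices = np.searchsorted(cumulative, positions)
    return particles[indices], np.full(n, 1.0 / n)

def smc_step(particles, weights, propagate, likelihood):
    """One sequential Monte Carlo step: propagate, reweight, resample."""
    particles = propagate(particles)           # sample from importance density
    weights = weights * likelihood(particles)  # update with local observation
    weights /= weights.sum()
    return resample(particles, weights)
```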

2.4. Markov Random Field Representation

We propose a Markov random field representation for the perception generation model. We define the pixel perception set, in which any two arbitrary pixels can be adjacent. The adjacency relation is symmetric: in an adjacent pixel system, if one pixel is an adjacent point of another, then the latter is also adjacent to the former, and the field is a Markov random field with respect to the perception set.

We can define the neighborhood as the set that contains all pixels adjacent to a given pixel. The Markov property mentioned in formula (20) relies on the density distribution of the adjacent pixels. According to S. Geman and D. Geman [24], the Markov random field over the pixel system can be rewritten as the following Gibbs distribution:
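The equation itself did not survive extraction. The standard Gibbs form from [24], with clique potentials $V_c$ over the clique set $\mathcal{C}$ and normalizing constant $Z$ (symbols introduced here for illustration), reads

\[
P(x) = \frac{1}{Z}\exp\Big(-\sum_{c \in \mathcal{C}} V_c(x)\Big), \qquad
Z = \sum_{x}\exp\Big(-\sum_{c \in \mathcal{C}} V_c(x)\Big),
\]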

where the potential function is defined on the cliques, and the normalizing constant keeps the total probability equal to 1. For spatial object tracking, the perception latent variables that appear in pairs are difficult to describe accurately. If the set contains a large number of pixels, the distribution becomes a high-dimensional function. Here, we use a topological model to solve this problem. Assuming a set of spatial perceptual rules, each rule extracts the local characteristics near a pixel. We transform the image with the Gibbs distribution

where the first term is the low-dimensional image characteristic function set, and the second is the conventional constant that depends on it. Assuming a normal distribution as reference, the transformed distribution is the normal distribution under the rule set, which guarantees that, for each rule, the corresponding marginal distribution is preserved. For any given rule set, the specific calculation of the axiom is as follows:

In this formulation, the result can be seen as the best approximation of the probability, and it can also be regarded as the "maximum likelihood." Moving from the minimized divergence to the maximized likelihood, the marginal probability is preserved for any rule under the rule set.

Consider the following:

In order to model the observed image, we assume that the perceptual rules do not depend on the pixel location. We can then parameterize or normalize the characteristic functions at a low-dimensional scale. After normalization, we can rewrite the Gibbs distribution as follows:

where the first factor represents the number of effective points that fall into the interval, and the second is the related marginal matrix. To find the maximum estimated coefficient, we need to calculate the spatial-scale statistical coefficient:

In other words, we need to match the spatial data to the related model. The most suitable model is determined jointly by an average parameter and a natural parameter. In the perceptual model, there is a global balance variable, so we make a minor adjustment:

where the adjustment is the approximation of the global observation estimated from the observed spatial images. The local equilibrium parameter defined at any pixel obeys the specific distribution:

where the parameter is the local approximation for the pixel. This is the only distribution that has effects similar to the perceptual model.
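To make the MRF machinery concrete, the following is a minimal Gibbs-sampling sweep for a binary field with symmetric 4-neighbor couplings — a generic illustration rather than the paper's exact model; the coupling strength `beta` is an assumed parameter:

```python
import numpy as np

def gibbs_sweep(field, beta, rng=np.random.default_rng()):
    """One Gibbs sampling sweep over a binary {-1, +1} MRF with
    4-neighbor couplings of strength beta (periodic boundaries)."""
    h, w = field.shape
    for i in range(h):
        for j in range(w):
            # sum over the symmetric 4-neighborhood
            s = (field[(i - 1) % h, j] + field[(i + 1) % h, j] +
                 field[i, (j - 1) % w] + field[i, (j + 1) % w])
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
            field[i, j] = 1 if rng.random() < p_plus else -1
    return field
```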

3. Enhanced Mean Shift Method

3.1. Mean Shift

In this section, we propose an enhanced mean shift method based on mathematical recursion [25]. The spatial tracking object is represented by a spatial histogram consisting of a weighted evaluation, in which probability estimation functions represent the potential motion probability in consecutive images. The histogram variables can be expressed as follows:

where the first terms are representatives of the weight function, together with motion estimations for the two positions, the box parameter, and a normalizing constant; the bin set contains all the binned pixels, and the final terms are the centers of the related kernel functions. The Bhattacharyya coefficient is used to detect the similarity between the tracking object area and the potential background.

Consider the following:
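The equation did not survive extraction; the standard Bhattacharyya coefficient between two discrete histograms $p$ and $q$ with $m$ bins (notation introduced here for illustration) is

\[
\rho(p, q) = \sum_{u=1}^{m} \sqrt{p_u \, q_u},
\]

with larger values indicating higher similarity between the two distributions.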

We apply a first-order Taylor series expansion around the coordinates of the center position in the previous frame and obtain the following extended formulas:

The center of the kernel function can be determined by the estimate of

In order to estimate the kernel function, the normalized bandwidth is applied to a similarity judgment. The normalized bandwidth can be obtained by estimating

where the scale term accurately determines the bandwidth. Equations (32) and (33) are calculated in alternating iterations until the estimated parameters converge over all the variables.
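As an illustration of the iteration, the following sketch performs one mean shift location update using the standard histogram-ratio weights (an assumption; the paper's exact weight form is given in the equations above):

```python
import numpy as np

def mean_shift_step(pixels, bins, p_target, q_candidate):
    """One mean shift location update: weight each pixel by the square
    root of the target-to-candidate histogram ratio of its bin, then
    move the window center to the weighted mean of pixel coordinates.

    pixels: (N, 2) array of pixel coordinates in the current window
    bins:   (N,) array of histogram bin indices, one per pixel
    p_target, q_candidate: (m,) target and candidate histograms
    """
    ratio = np.sqrt(p_target / np.maximum(q_candidate, 1e-12))
    weights = ratio[bins]  # look up each pixel's weight via its bin
    return (weights[:, None] * pixels).sum(axis=0) / weights.sum()
```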

3.2. Estimation of Spatial Shape Parameters

We also use the iterated function to determine the boundary parameter, which contains five fully adjustable affine box parameters: the width, the height, the orientation, and the two coordinates of the center location. The orientation is defined as the angle between the horizontal axis and the width axis. The box with the given width and height bounds the ellipse area spanned by the long and short axes. The relationship between the width, the height, and the bandwidth matrix can be expressed as follows:
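The equation is missing from the extracted text. One plausible reconstruction, assuming the bandwidth matrix $H$ is the rotated diagonal matrix of squared semi-axes for a box of width $w$, height $h$, and orientation $\theta$ (an assumption, not the paper's confirmed form), is

\[
H = R(\theta)\begin{pmatrix}(w/2)^2 & 0\\ 0 & (h/2)^2\end{pmatrix}R(\theta)^{\top}, \qquad
R(\theta) = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}.
\]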

These parameters can be calculated using the octave decomposition method [26]; the specific formula is as follows:

where the two terms are the octave decomposition components of the previous frame and the current frame, respectively.

3.3. Spatial Object Tracking Using Enhanced Mean Shift

In order to limit the possibility that background pixels appear in the tracking object, we use a relatively small elliptical area whose contracted domain is determined by a scale factor, fixed empirically in our experiments. The elliptical area can be defined as follows:

In our proposed method, we must determine whether the motion area of the previous frame (the previous object) should guide tracking of the object in the next frame. If both the number of continuous characteristic pixels and the Bhattacharyya coefficient are sufficiently high, the initial rectangular tracking area of the new object refers to the previous object's settings, and the tracking area is determined by the mean shift of the previous frame.

Consider the following:

where the two parameters determine the threshold values. In order to limit drift and error propagation, we use the enhanced mean shift for frame resampling. The object tracking resampling operation can be summarized as follows:

where the first pair are the four-dimensional motion areas in the two frames, and the second pair are the threshold values determined by the graphic distance and the shape similarity.
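A sketch of the frame-to-frame guidance and resampling test described above; all threshold names are illustrative assumptions:

```python
def guide_next_frame(num_feature_pixels, bhattacharyya, box_distance,
                     shape_similarity, tau_pixels, tau_bhat,
                     tau_dist, tau_shape):
    """Decide whether the previous frame's motion area should initialize
    the next frame's tracking window, and whether to trigger resampling."""
    use_previous = (num_feature_pixels > tau_pixels and
                    bhattacharyya > tau_bhat)
    resample = box_distance > tau_dist or shape_similarity < tau_shape
    return use_previous, resample
```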

4. Experimental Result

We conducted experiments on four different spatial video sequences, in which the tracking objects are spatial satellites and aircraft. We resized the sequence images to the same spatial resolution, so every frame has the same size. The superiority of our proposed algorithm is validated by both intuitive performance and objective evaluation.

4.1. Contour Evolution in Prototype Pyramid

Contour evolution experimental results are shown in Figure 2. Each prototype pyramid layer is calculated independently. The results show that the contour evolution has better continuity in the layer-by-layer prototype pyramid and that the approximation effects derived from the perceptual spatial-space generation model are closer to the human visual perception system.

4.2. Markov Random Field Representation

The Markov random field representation is shown in Figure 3. The significant representative region derived from the perceptual spatial-space generation model contains salient feature information, which is closely related to the different color and texture distributions in the spatial area. The experimental results show that the smoothed region representation highlights the sensitivity of the motion area better than the initial representation.

4.3. Enhanced Mean Shift Object Tracking

We conducted experiments on ten video sequences, in which the tracking objects include spatial satellites and aircraft, highway and park surveillance, and the human body. The enhanced mean shift performed 20 iterations on each video sequence. The normalized matrix bandwidth parameters were determined from different experimental samples: 0.63 for video sequences 1, 2, and 3; 0.42 for sequences 4 and 6; 0.56 for sequences 5 and 8; 0.21 for sequence 7; and 0.60 for sequences 9 and 10.

4.3.1. Satellite-1, 2, and 3

Tracker-1: the particle filter has a negative impact in the horizontal direction, and the motion estimation becomes noncontinuous. Tracker-2: the distance metric learning affects motion area determination, and the rectangular window tracking produces some deviation. Our proposed algorithm tracks the spatial objects better: satellites and aircraft are completely contained in the rectangular window despite a similar color distribution and background jitter. The edge deviation variance is well controlled using spatial object detection, with no offset or blurring.

4.3.2. Automobile-1 and 2

In a nighttime environment with weak light and low brightness, Tracker-1 and Tracker-2 obtain vague tracking results, and the confusion area between the background and the tracking object becomes larger. In the Automobile-2 sequence in particular, owing to illumination from other vehicles' headlamps, Tracker-1 produces a particularly serious deviation. Our proposed enhanced mean shift method distinguishes the tracking object from background confusion, and the tracking results show no obvious deviation.

4.3.3. Highway

Tracker-1: this method loses the tracking center in some frames, with crossed rectangular windows between the far- and close-vision sequences and shape errors in the rectangle window estimations. Tracker-2: the tracking results also deviate in the far vision. Our proposed method maintains consistency between different views and shows no large deviations.

4.3.4. Running, Walking

There are no obvious differences between the different methods, except that for some slow motion (the walking body) Tracker-2 shows some partial deviations.

4.3.5. Automobile-3, Walking-2

These cases further test the robustness of our proposed method in scenes containing two or more moving objects. We observe that Tracker-2 produces less accurate boxes, probably because of its poor estimates of the center positions of different moving objects in the same scene. The performance of Tracker-1 is somewhat better; however, it also produces partial deviations. Our proposed method is more robust when dealing with scenarios of two or more moving objects.

4.4. Objective Evaluation

We use three objective measures to evaluate the object tracking performance (Figure 4).

4.4.1. Euclidean Distance

The Euclidean distance is the distance between the rectangular windows obtained by the tracking methods and the artificially marked ones. The specific calculation is as follows:

where the coordinates are the corner coordinates of the rectangular windows calculated by the tracking methods and marked artificially. Figure 5 shows the Euclidean distance between the tracked and artificially marked areas for our proposed method and the two other trackers on the videos.
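For concreteness, a corner-based distance of this kind can be computed as follows (a sketch that averages over the four corners — an assumption the truncated formula does not confirm):

```python
import numpy as np

def corner_distance(box_a, box_b):
    """Mean Euclidean distance between corresponding corners of two
    rectangular windows, each given as a (4, 2) array of (x, y) corners."""
    return float(np.linalg.norm(box_a - box_b, axis=1).mean())
```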

The averaged Euclidean distance calculated on all videos is shown in Table 1. Compared to the distance values of Tracker-1 and Tracker-2, our proposed method clearly shows smaller and bounded Euclidean distances for the tested videos.

4.4.2. Mean Square Errors (MSEs)

We have
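The formula itself is missing from the extracted text; written in the standard form implied by the description below (tracked center $c_t$, artificially marked center $\hat{c}_t$, and $N$ frames, all notation introduced here), it reads

\[
\mathrm{MSE} = \frac{1}{N}\sum_{t=1}^{N}\left\lVert c_t - \hat{c}_t \right\rVert^2 .
\]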

where the two centers are those of the tracking areas obtained by the tracking methods and by artificial marking, respectively, and the sum runs over the total number of video sequence frames. The experimental results are shown in Table 2; it can be seen that our proposed method has the minimum MSE, which means it has the lowest tracking deviation. Our method has obvious advantages compared to the other methods.

4.4.3. Bhattacharyya Distance

The Bhattacharyya distance is used to judge the degree of deviation between the tracking area and the actual motion area. The specific calculation method is as follows: the mean vectors and the covariance matrices of the tracking area are calculated both by our method and from the artificial marking.

Consider the following:
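The equation did not survive extraction; the standard Bhattacharyya distance between two Gaussian distributions with means $\mu_1, \mu_2$ and covariances $\Sigma_1, \Sigma_2$ (notation introduced here), which matches the quantities defined above, is

\[
D_B = \frac{1}{8}(\mu_1 - \mu_2)^{\top}\Sigma^{-1}(\mu_1 - \mu_2)
      + \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma_1 \det\Sigma_2}},
\qquad \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}.
\]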

The Bhattacharyya distance between the tracked object area and the artificially marked region is shown in Figure 6. Among the nine case studies, our proposed method shows a marked improvement in tracking accuracy compared with the two existing trackers (Tracker-1 and Tracker-2). This is mainly due to our combination of the perceptual spatial-space generation model and the enhanced mean shift. The averaged Bhattacharyya distance on the different videos is shown in Table 3. Our proposed method has the smallest average tracking deviation among the different methods.

5. Conclusions

In this paper, we propose a new intelligent modeling method using an enhanced mean shift method based on a perceptual spatial-space model for spatial object tracking. The perceptual spatial-space model obtains a continuous spatial object contour and highlights tracking object saliency. The enhanced mean shift method focuses on the estimation of spatial shape parameters and can effectively cope with severe spatial interference. Comparisons between our method and other state-of-the-art methods demonstrate that our proposed method has higher tracking accuracy and precision. In future research, we can incorporate more spatial object information, such as spatial textures and aircraft shapes, into our intelligent model to build a more robust spatial object tracking method.

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) (2012CB821206), the National Natural Science Foundation of China (no. 91024001 and no. 61070142), and the Beijing Natural Science Foundation (no. 4111002).