This paper presents a SIFT-based method for rigid video object recognition (hereafter called RVO-SIFT). Much as the human visual system does, the method unifies object recognition and feature updating, using both trajectory matching and feature matching. It can therefore learn new features not only in the training stage but also in the recognition stage, which automatically improves the completeness of the video object’s features and, in turn, substantially increases the ratio of correct recognition. Experimental results on real video sequences demonstrate its robustness and efficiency.

1. Introduction

In recent years, security surveillance systems (the so-called “sky-eye” systems) and hand-held video cameras have grown in popularity, and the demand for applications such as video-based object recognition, tracking, and retrieval [1, 2] has risen rapidly. However, identifying the same 3D object across videos or image sequences remains challenging, mainly because a 3D object’s visual appearance changes with viewpoint and lighting.

For example, Figure 1 shows a series of frame images captured from a video clip in which a vehicle is turning. If only (e) is used as the training image, then the vehicles in (h) or (t) would not be correctly recognized because the viewpoint has changed. Even a conscientious human observer who is given only (d) as the training image cannot confirm that the cars in (h) or (t) are identical to the one in (d). However, anyone with normal cognitive ability who browses the video clip from which Figure 1 is taken can easily tell that the cars in (a) ~ (t) are all the same. Why?

With the help of the selected regions in Figure 1, let us briefly describe what happens when a human browses the video; this serves as an introduction to the method proposed in this paper.

Before we browse the video clip, someone who already knows the target object must tell us which object it is, for example, by pointing out the selected region in frame image (a). We then focus on the target object and try to extract and store its typical features; this step corresponds to the feature initialization of the target object. After that, we move on to the next frame. First, we judge whether the target object appears in it. Second, if the target object is recognized in the frame, we judge whether any new features of the target have appeared. How do we do that? Do we use only the incomplete target features extracted so far? The answer is no. Humans do it by matching both features and motion trajectories. For example, comparing with the selected region in frame (a), we can see that although frames (b), (c), and (d) are not consecutive, the cars in them not only share many features but also continue along the same turning trajectory, so we can conclude with high probability that they are the same car. Meanwhile, the regions outlined in yellow in these frames look similar and also move along a similar trajectory, so we can conclude, again with high probability, that they are part of the target car. New features can then be extracted from these regions and used for further recognition.

Object recognition and feature updating are executed alternately and iteratively; eventually all the vehicles are recognized and most of the new distinctive features are extracted and saved.

Inspired by this human process, a novel method for rigid video object recognition is proposed in this paper. Its main contributions are as follows.
(1) Modeled on the human recognition system, it unifies object recognition and feature updating, so that feature extraction and updating are performed not only in the training stage but also in the recognition stage. This greatly improves the completeness of the target video object’s features and in turn substantially increases the ratio of correct recognition.
(2) Its object recognition is based on both feature matching and trajectory matching, which greatly improves the accuracy of identification.
(3) Even when provided with only a single training image, it can automatically build a relatively complete model of the target 3D object from multiple 2D views.

This paper is organized as follows. Section 2 reviews related research. Section 3 describes the initialization of the target video object’s feature database. Section 4 discusses feature point trajectories and describes the iterative object recognition and feature updating process. Section 5 presents the experimental setup and analyzes the results. Finally, Section 6 draws conclusions.

2. Related Work

There is a large body of research on 3D object recognition, most of which models 3D objects using multiple 2D views. For example, in [3, 4], a method is proposed for combining multiple images of a 3D object into a single model representation; however, this approach requires the target object to occupy the majority of each training view image, which limits its practical use. Other primary approaches extract and describe the object’s stable surface features from 2D view images, such as color features [5], texture features, shape features [6], and contour features [7].

Another direction of research uses automatic background segmentation [2, 8], which first separates the moving objects from the scene and then performs recognition. However, this works well only for videos without dramatic background changes.

Like this paper, a number of approaches have been motivated by biological vision systems. In [9], a scene is analyzed to produce multiple feature maps, which are combined into a saliency map that biases attention toward the regions of highest activity. Itti et al. [10, 11] later suggested adjustments and improvements to [9].

Our work shares some themes with [12, 13] in that features are learned automatically from image sequences or video frames. However, there are major differences. In [12], the goal is to optimize the parameters of known features in the tracked patches, not to learn new features of the targets. For car detection, for example, it can only extract and track features in a fixed, manually labeled car image area and optimize the parameters of the known features in each corresponding frame; it cannot learn new features of the target car from the correlations between video frames. In [13], the goal is to use offline feature tracking to observe feature transformations under large camera motions and to construct the database accordingly, keeping fewer images of each scene than would otherwise be needed. Its essence is therefore to use feature transformations to reduce storage costs, not to learn features.

3. Feature Database Initialization

In this paper, the target object’s feature database is organized as a set of feature models that represent different views of the target object. Each model consists of all the stable features extracted from the corresponding object view, which is contained in the corresponding frame image. The relationship between frame image, object view, and feature model is shown in Figure 2. Furthermore, feature models are linked with each other through the features they share. For example, in Figure 1, the vehicle views in frame images (a) ~ (d) are similar and share many features, so four feature models may be built in the feature database and linked through their shared features. Since (b) and (c) appear nearly identical, a single feature model may suffice for them. How the similarity between object views is estimated, and hence how many feature models are created, is discussed in Section 4.
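The organization described above can be pictured as a small graph of feature models. The following Python sketch is only an illustration under simplifying assumptions: features are reduced to integer ids, and the names `FeatureModel` and `link_models` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureModel:
    """One view of the target object, stored as a set of feature ids (assumed)."""
    name: str
    features: set = field(default_factory=set)
    links: dict = field(default_factory=dict)  # other model name -> shared feature ids


def link_models(models):
    """Link every pair of models that share at least one feature."""
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            shared = a.features & b.features
            if shared:
                a.links[b.name] = shared
                b.links[a.name] = shared
    return models


# Hypothetical feature ids for three views of the same vehicle.
view_a = FeatureModel("view_a", {1, 2, 3, 4})
view_b = FeatureModel("view_b", {3, 4, 5, 6})
view_t = FeatureModel("view_t", {9, 10})
link_models([view_a, view_b, view_t])
```

In this toy example, `view_a` and `view_b` become linked through features 3 and 4, while `view_t` stays unlinked because it shares nothing with the other views.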

To initialize the feature database, the target object must first be specified; that is, a representative region of the target object is selected, and the object’s feature database is then initialized with it.

3.1. To Specify the Target Object

As shown in Figure 3, there are six modes for selecting the region. Which one is best?

Before drawing a conclusion, we should understand what the selected region means to the recognition method. First, the region represents the target object, and the feature points in it serve as feature seeds for the next stage, so all selected pixels should belong to the target object; moreover, the more pixels are selected and the more distinctive they are (which means more potential feature points), the better. Second, all feature points in the selected region are used to recognize the object views in subsequent frames, so the longer the region remains visible in subsequent frames, the better.

Based on these considerations, all the selection modes are acceptable except (b). Among them, modes (e) and (f) are better, and mode (c) is the best. However, mode (c) is too complicated to operate, and extracting point features in modes (c), (d), and (f) involves padding with zero pixels. Taking the convenience of selection and the computational complexity into account as well, we prefer mode (e).

3.2. To Initialize Feature Database

After the representative region of the target object is specified, the selected region and the video frames should each be represented as a set of stable features, using some feature descriptor, to allow efficient matching between the region and the frames in the video clips (namely, the training clips or the clips being searched).

In this paper, RGB-SIFT [15] is adopted to compute features. For the RGB-SIFT descriptor, a SIFT descriptor is computed independently for each of the three color channels (R, G, and B). See [15] for the details of the transformation.

As Figure 4 shows, after RGB-SIFT processing, the selected region is transformed into stable features, which are saved directly as a feature model in the target object’s feature database. Meanwhile, each video frame is transformed into a feature view and saved in the temporal feature database.

Each SIFT feature, along with a record of its location, orientation, and scale, represents a vector of local image measurements in a manner that is invariant to scaling, translation, changes in scene illumination, and limited rotation of the camera’s viewpoint. The size of the frame image region sampled for each feature can vary, but the experiments described in this paper all use a vector of 3 × 128 samples for each feature, sampling 8 gradient orientations over a 4 × 4 sampling region in each color channel image. A typical frame image may produce several thousand overlapping features at a wide range of scales, which form a redundant representation across adjacent frame images.
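The descriptor size quoted above follows from simple arithmetic: 4 × 4 cells with 8 orientation bins each give 128 values per channel, and three channels give 384. The sketch below only illustrates that layout; the function names are hypothetical, and the per-channel descriptors are assumed to be computed elsewhere.

```python
def sift_length(grid=4, orientations=8):
    """Length of a single-channel SIFT descriptor: grid x grid cells, one histogram each."""
    return grid * grid * orientations


def rgb_sift(desc_r, desc_g, desc_b):
    """Concatenate three per-channel descriptors into one RGB-SIFT vector."""
    expected = sift_length()
    for desc in (desc_r, desc_g, desc_b):
        if len(desc) != expected:
            raise ValueError(f"expected {expected} values per channel, got {len(desc)}")
    return list(desc_r) + list(desc_g) + list(desc_b)


# Placeholder channel descriptors (all zeros, ones, and twos) for illustration only.
vector = rgb_sift([0.0] * 128, [1.0] * 128, [2.0] * 128)
```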

4. Video Object Recognition Accompanied by Feature Database Updating

After the initialization stage described in Section 3, the object’s feature database contains only one feature model, which consists of all the feature vectors extracted from the selected region. Obviously, one feature model is not enough for video object recognition, and the feature database must be enriched. As in human vision, the video object recognition process proposed in this paper is accompanied by object feature updating. Its main tasks are as follows.
(1) Video object recognition: to recognize the target object views in video frames, using both feature and trajectory matching.
(2) Feature database updating: to enrich the feature models, and the features in them, with the different views of the target object contained in the corresponding frames, also using both feature and trajectory matching.

The initial feature model and the features in it serve as seeds for both tasks.

4.1. Feature Point Trajectory Matching

According to the affine camera model [16, 17], each feature point of the target object can serve as a trajectory point; that is, the motion trajectory of the target object is reflected in the trajectories of its feature points.

That is, suppose that feature points have been tracked between two frame views. Then the corresponding affine transform matrix can be worked out, and the average trajectory-matching error between the two views can be estimated.

The point is that this average error can be used to evaluate the similarity between object views and models, which is the foundation for object recognition and for updating the feature database with the corresponding feature views in this paper.
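As an illustration of this idea, the sketch below fits a 2D affine transform to tracked point pairs by ordinary least squares and reports the average residual. It is a minimal stand-in under stated assumptions, not the paper's exact formulation: the names `fit_affine` and `average_residual` are hypothetical, and the residual is taken as the mean Euclidean distance between projected and observed points.

```python
import math


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("points are collinear or not distinct")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares affine map: u = a*x + b*y + c, v = d*x + e*y + f."""
    if len(src) != len(dst) or len(src) < 3:
        raise ValueError("need at least 3 point pairs")
    rows = [(x, y, 1.0) for x, y in src]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs_u = [sum(r[i] * u for r, (u, _) in zip(rows, dst)) for i in range(3)]
    rhs_v = [sum(r[i] * v for r, (_, v) in zip(rows, dst)) for i in range(3)]
    return tuple(solve3(normal, rhs_u)), tuple(solve3(normal, rhs_v))


def average_residual(src, dst, affine):
    """Mean distance between the projected source points and the observed points."""
    (a, b, c), (d, e, f) = affine
    total = 0.0
    for (x, y), (u, v) in zip(src, dst):
        total += math.hypot(a * x + b * y + c - u, d * x + e * y + f - v)
    return total / len(src)


# Tracked points that are scaled by 2 and shifted by (2, 3) between two views.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(2.0, 3.0), (4.0, 3.0), (2.0, 5.0), (4.0, 5.0)]
affine = fit_affine(src, dst)
error = average_residual(src, dst, affine)
```

A small average residual indicates that the tracked points move consistently with a single affine motion, which is the sense in which trajectory agreement supports a similarity judgment.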

4.2. Feature Keypoint Matching

A set of matching feature keypoint pairs can be obtained by efficient feature matching between feature models and feature views. This is the basis for recognizing candidate video object views in the corresponding frame images. According to [18], the best candidate match for each feature keypoint is found by identifying its nearest neighbor, defined as the keypoint with the minimum Euclidean distance between invariant feature vectors.

No algorithm is known that is more efficient than exhaustive search for identifying the exact nearest neighbors of points in high-dimensional spaces. Because our RGB-SIFT keypoint descriptor is a 3 × 128-dimensional feature vector, we use an approximate algorithm called Best-Bin-First (BBF) [19]. It is approximate in the sense that it returns the closest neighbor with high probability. See [19] for the details of keypoint matching.

In this paper, this feature matching process between feature models and feature views is expressed as the function FeaturepointMatch, which takes the feature models and a feature view as inputs.
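To make the nearest-neighbor criterion concrete, the sketch below performs the exhaustive search that BBF approximates. It is deliberately not BBF: it compares every model descriptor with every view descriptor, which is practical only for small sets. The function name and the optional `max_distance` cutoff are hypothetical additions for illustration.

```python
import math


def nearest_neighbor_matches(model_descs, view_descs, max_distance=None):
    """Match each model descriptor to the view descriptor at minimum Euclidean distance.

    Returns a list of (model_index, view_index, distance) tuples.
    """
    matches = []
    for i, model_desc in enumerate(model_descs):
        best_j, best_d = None, math.inf
        for j, view_desc in enumerate(view_descs):
            d = math.dist(model_desc, view_desc)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and (max_distance is None or best_d <= max_distance):
            matches.append((i, best_j, best_d))
    return matches


# Toy 3-dimensional descriptors; real RGB-SIFT vectors would have 384 values.
model = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
view = [(0.9, 1.0, 1.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
pairs = nearest_neighbor_matches(model, view, max_distance=0.5)
```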

4.3. Notation

For the sake of describing the procedure of video object recognition and feature updating conveniently, we adopt the following notations:
(1): the th frame image in video;
(2): the feature database of the target object in which all the known feature models are saved;
(3): the temporal feature database in which all the feature views are saved;
(4): the th feature view in ;
(5)Model: the th feature model in ;
(6)SSW: the scalable sliding window in which feature views being recognized with an identical feature model are saved temporally;
(7), : feature sets in which the matching features in the matched feature model and feature view are saved, respectively;
(8): the dimension of the area occupied in the frame view by the known feature points in . The value of it is approximated by the diameter of the minimum circle which covers all the feature points in . The circle can be gained by using the Hough transform method [20] (see Figure 5);
(9): the distance of one candidate feature point to the known region of the target object in frame . Its value is approximated by its Euclidean distance to the nearest keypoint in (see Figure 5).
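Items (8) and (9) describe two geometric quantities: the diameter of the smallest circle covering the known feature points, and the distance from a candidate point to the nearest known keypoint. The sketch below computes both for small point sets. Note the substitution: instead of the Hough transform mentioned in the text, it uses a brute-force search over circles defined by two or three points, which is simple but scales poorly. All function names are hypothetical.

```python
import itertools
import math


def circle_from_two(p, q):
    """Circle whose diameter is the segment pq."""
    center = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    return center, math.dist(p, q) / 2.0


def circle_from_three(p, q, r):
    """Circumcircle of three points, or None if they are collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), p)


def covering_diameter(points):
    """Diameter of the smallest circle that covers all points (brute force)."""
    pts = list(points)
    if len(pts) < 2:
        return 0.0
    candidates = [circle_from_two(p, q) for p, q in itertools.combinations(pts, 2)]
    for triple in itertools.combinations(pts, 3):
        circle = circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    radii = [radius for center, radius in candidates
             if all(math.dist(center, p) <= radius + 1e-9 for p in pts)]
    return 2.0 * min(radii)


def distance_to_known_region(candidate, known_points):
    """Euclidean distance from a candidate point to the nearest known keypoint."""
    return min(math.dist(candidate, p) for p in known_points)


# Known keypoints at the corners of a 4 x 3 rectangle; the covering circle spans its diagonal.
known = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
diameter = covering_diameter(known)
gap = distance_to_known_region((5.0, 0.0), known)
```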

4.4. Video Object Recognition in One Frame

Feature models in the object’s feature database represent corresponding views of the target video object. According to [3], they can be used to recognize similar views of the object over a range of rotations in depth of at least 20 degrees in any direction.

Now suppose that a feature view has been matched with a feature model using the feature matching procedure described in Section 4.2 and that the matching features between them (at least 3) have already been saved in the two corresponding matched feature sets. The procedure for recognizing the target video object in that feature view is then described in Algorithm 1.

(1)   If then
(2)   {
(3)   Set //to eliminate feature points without matching with any feature
     points in to ensure the similarity property between the recognized object views next
(4)   Set
(5)   Set recognized false
(6)   Set a threshold value //for example, 0.8
(7)   Set a threshold value //for example, 0.8
(8)   Calculate with feature points   //to estimate the dimension of the target
      object in , using the Hough Transform method with feature points in
(9)    the number of feature models linked by features in
(10) For each of feature models linked with features in do
(11) {
(12)  Calculate with feature points in   //to estimate the dimension of the target
      object in the feature model.
(13)   the number of feature keypoints in the minimum circle determined by
(14)  Calculate //to estimate the residual error which shows the degree of the similarity
      between views that and implying, denoted by and respectively
(15)  If and then
(16)  {
(18)  }
(19) }
(20) If then recognized true
(21) Output
(22) }
(23) Return ,

4.5. Feature Database Updating

Once object views have been recognized in the corresponding frame images, the feature database can be updated with them immediately. The video object feature database consists of feature models whose features are extracted from the corresponding object views, so feature database updating operates at two levels.
(1) Feature model updating: to enrich the feature models with the different views of the target object contained in the corresponding feature views.
(2) Model features updating: to enrich the features in the feature models with new features found in the corresponding similar feature views.

4.5.1. Feature Model Updating

To update the feature models, at least one similar view of the target object must have been recognized in a feature view with a specific feature model.

Suppose now that a feature view has been recognized with a feature model and that the matching feature points between them have already been saved in the two corresponding matched feature sets. The procedure for enriching the feature models with this feature view is then described in Algorithm 2.

(1)   Calculate   //to estimate the residual error which shows the degree of the similarity
   between the object feature models that and implying, denoted by and
(2)   set   //to set a lower limit to the degree of their similarity between
(3)   if then
(4)   {
(5)     New(model( ), ) //to create a new feature model in   and save features in
      into it
(6)     Link(model( ), model( )) //to link to with all matching
      features between them
(7)   }
(8)   Else
(9)   {
(10)   Combine(model( ), model( )) //to combine with , which means
       the new features from should be added to the existing model
(11) }
(12) Endif
(13) return
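The branch structure of Algorithm 2 (create and link a new model, or combine with the existing one) can be sketched as follows. Several points are assumptions rather than facts from the paper: the database is a plain dictionary, features are integer ids, the threshold is taken as 0.05 times an estimated object dimension, and a residual above the threshold is read as "different enough to deserve a new model". The exact expressions in the original algorithm did not survive in the text, so this is one plausible reading only.

```python
def update_feature_models(database, matched_model, view_name, view_features,
                          residual, model_dimension, multiplier=0.05):
    """Create and link a new model, or merge the view into the matched model.

    Returns the name of the model that now holds the view's features.
    """
    threshold = multiplier * model_dimension  # assumed form of the similarity limit
    if residual > threshold:
        # The view differs noticeably: store it as a new model and link the two.
        shared = database[matched_model]["features"] & view_features
        database[view_name] = {"features": set(view_features),
                               "links": {matched_model: shared}}
        database[matched_model]["links"][view_name] = shared
        return view_name
    # The view is close to the matched model: add its features to that model.
    database[matched_model]["features"] |= view_features
    return matched_model


# Hypothetical database with a single seed model.
db = {"m0": {"features": {1, 2, 3}, "links": {}}}
first = update_feature_models(db, "m0", "v1", {2, 3, 4}, residual=1.0, model_dimension=100.0)
second = update_feature_models(db, "m0", "v2", {3, 4, 5}, residual=10.0, model_dimension=100.0)
```

In this toy run the first view is merged into `m0`, while the second exceeds the threshold and becomes a new model linked to `m0` through the features they share.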

4.5.2. Model Features Updating

To update the features in the feature models, at least two similar views of the target video object must have been recognized.

Similarly, suppose that two feature views have been recognized with a feature model in the feature database and that the matching features between them have already been saved in the corresponding matched feature sets. The procedure for enriching the features in the corresponding models is then described in Algorithm 3.

(1)   Calculate with ,   //to calculate the column vector determined by the transform
    matrix with matching feature point pairs in
(2)   Calculate // to estimate the residual error between and , subjecting
    and feature point pairs in to equation
(3)   If then //To determine whether there is a perceptible change
    between and or not
(4)   {
(5)     FeaturepointMatch(Output: output1, output2; Input: , ) //to match features between
       and and the matching feature keypoints are all saved temporally in Output1 and Output2
       respectively except features in and
(6)     For to Num(FeaturepointMatch.Output) do //for each matching feature keypoint pair
        in the Outputs
(7)    {
(8)     Calculate , //to estimate the corresponding distance of this candidate feature points to
      the known area of the target object in and , respectively
(9)     Calculate //to evaluate the degree of agreement
      between this feature point pair. , represent locations of the candidate matching
      feature point in , respectively
(10)   If and and then
(11)  {
(12)  AddFeaturetoModel(( , model( ); , model( )) //if its distance is less than 0.25 times the
  projected diameter of the known area of the target object in corresponding frame view and the
  residual error of the projection is less than 0.85 times the average residual error, then add them into
  corresponding feature models, however, they will be discarded if already existed. By the way, the two
  models may point to a same feature model
(13)  }
(14)  }
(15) }
(16) Return

Why are the constant multipliers in line 12 of Algorithm 1, line 2 of Algorithm 2, and lines 3 and 10 of Algorithm 3 set to 0.25, 0.05, 0.25, and 0.85, respectively? According to [21], if we imagine placing a sphere around an object, then rotating the sphere by 30 degrees moves no point within the sphere by more than 0.25 times the projected diameter of the sphere, and for the typical 3D objects used as examples in [16, 21], an affine solution works well when residual errors of up to 0.25 times the maximum projected dimension of the object are allowed. In addition, the estimated dimension is generally less than the actual projected diameter of the target object in the frame, so the constant multipliers adopted in this paper would work well too. The experimental results support these choices.
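The acceptance test described in the comment on line 12 of Algorithm 3 can be written as a small predicate. The sketch below follows that comment: a candidate pair is kept only if, in both views, it lies within 0.25 times the projected diameter of the known area, and its projection residual is below 0.85 times the average residual. The function and parameter names are hypothetical.

```python
def accept_candidate(dist_a, dist_b, diameter_a, diameter_b,
                     residual, average_residual,
                     distance_multiplier=0.25, residual_multiplier=0.85):
    """Decide whether a candidate feature pair should be added to the models."""
    near_a = dist_a < distance_multiplier * diameter_a      # close to known area in view A
    near_b = dist_b < distance_multiplier * diameter_b      # close to known area in view B
    consistent = residual < residual_multiplier * average_residual
    return near_a and near_b and consistent


# With a known-area diameter of 100 pixels, candidates must lie within 25 pixels.
kept = accept_candidate(10.0, 12.0, 100.0, 100.0, residual=0.5, average_residual=1.0)
too_far = accept_candidate(30.0, 12.0, 100.0, 100.0, residual=0.5, average_residual=1.0)
inconsistent = accept_candidate(10.0, 12.0, 100.0, 100.0, residual=0.9, average_residual=1.0)
```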

4.6. General Procedure of the Video Object Recognition

So far, we have described how to recognize the target video object in one frame and how to update the feature database with one or two frames. We can now write out the integrated cyclic procedure, in which video object recognition is accompanied by feature database updating, in Algorithm 4.

(1)    //to initiate the loop variable
(2)   While Do //to begin the updating loop
(3)    {
(4)    Set ,
(5)    For to Num( ) do   represents the number of frame views in .
(6)    {
(7)     Featurepointmatch(Output: , ; Input: , ) //to gain matching feature
       keypoint pairs between the feature view and feature models
(8)     Recognizinginframe(output: recognized or not, , ; input: , )
       //to confirm whether the frame image contains the target object or not.
(9)     If recognized then
(10)  {
(11)     ,   //to save the corresponding feature views of the frame image
   into the scalable sliding window, as shown in Figure 6.
(12)    FModelUpdating(Output: ; Input: , , , ) //to update feature models
   in with recognized feature view
(13)  }
(14)  }
(15)    If Empty( ) then //to check whether the scalable sliding window is empty
(16)     {
(17)      Break //to jump out the loop
(18)     }
(19)    Else
(20)    {
(21)      For to do
(22)     {
(23)       For to do
(24)    {
(25)     FFeatureUpdating(Output: ; Input: , , )
(26)    }
(27)       Delete( , ) //to delete feature view from the temporal feature database
(28)     }
(29)   }
(30)    Dump( ) //to empty
(31)  }
(32) return

After this cyclic procedure of recognizing and then updating, most of the frames containing the target object can be recognized.
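The cyclic structure of Algorithm 4 can be illustrated with a toy loop. This sketch keeps only the control flow: frames are reduced to sets of feature ids, a frame counts as recognized when it shares at least 3 features with the known set (the minimum mentioned in Section 4.4), and every feature of a recognized frame is absorbed. Trajectory matching and the distance and residual checks of Algorithm 3 are omitted, so the sketch is more permissive than the method described above. All names are hypothetical.

```python
def recognize_and_update(seed_features, frame_views, min_matches=3):
    """Alternate recognition and feature updating until no new frame is recognized.

    frame_views maps a frame id to the set of feature ids found in that frame.
    Returns the recognized frame ids (in order) and the final known feature set.
    """
    known = set(seed_features)
    pending = dict(frame_views)  # stands in for the temporal feature database
    recognized = []
    while True:
        # Frames recognized in this pass play the role of the sliding window.
        window = [fid for fid, feats in pending.items()
                  if len(feats & known) >= min_matches]
        if not window:  # empty window: nothing more can be learned
            break
        for fid in window:
            known |= pending.pop(fid)  # learn the frame's features, then drop the view
            recognized.append(fid)
    return recognized, known


frames = {
    "b": {2, 3, 4, 5, 6},
    "c": {4, 5, 6, 7},
    "d": {5, 6, 7, 8},
    "z": {20, 21, 22},  # unrelated object, never recognized
}
order, features = recognize_and_update({1, 2, 3, 4}, frames)
```

Frame `c` shares only one feature with the seed and frame `d` shares none, yet both are recognized in later passes because earlier frames have enriched the known set. This is the effect the recognize-then-update cycle is designed to produce.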

5. Experimental Results

The RVO-SIFT method, with its recognize-then-update mechanism, performed well in our experiments. The following experimental results were obtained on a computer with an AMD Athlon 64 X2 2.6 GHz processor and 4 GB of memory.

In order to fully demonstrate the ability of RVO-SIFT to acquire new features of the target video object, which is the key contribution of this paper, we use a surveillance video clip of about 2 minutes, in which the overall performance of a Renault Megane is being tested, as the training video, and another clip of about 20 minutes, in which the test vehicle is running on the highway, as the target video in which the vehicle is to be recognized. These clips were chosen because almost every perspective of the running vehicle appears in the video.

Figure 7 shows the number of features in the feature database as a function of the number of training frame images containing the target vehicle in the surveillance video, for RVO-SIFT and classic RGB-SIFT, respectively. With RVO-SIFT, the number of feature keypoints increases with the number of training frame images. However, once the feature database has become reasonably complete, the contribution of each additional frame image diminishes. In contrast, classic RGB-SIFT extracts features only from the specified region shown in Figure 1(a), so the size of its feature database remains constant.

To show how the number of features affects the recognition outcome, we recognize the target vehicle in the target video. Figure 8 shows the ratio of correctly recognized vehicles in frames as a function of the number of feature keypoints in the feature database. The graph shows that the ratio of correctly recognized objects increases markedly with the number of feature keypoints in the database, while the growth rate slows once the feature database reaches a certain degree of completeness. Some correctly recognized view images are shown in the top row of Figure 10.

Of course, because it is accompanied by feature database updating, the recognition process of RVO-SIFT consumes more computation time. However, its average delay is affordable. The experimental results are shown in Figure 9.

To show the generality of RVO-SIFT, recognition is also performed on the micromovie “The New Year, The Same Days”, in which the face of the wife character is to be recognized, and on a trailer video for blue and white porcelain, in which a piece of chinaware is to be recognized. The results are shown in the lower two rows of Figure 10.

As shown in Figure 10, RVO-SIFT performs well on real-world videos. Even human faces without exaggerated changes in facial expression can be recognized correctly at a relatively high rate, as shown in the middle row. Furthermore, the experimental results show that RVO-SIFT can even tolerate local camouflage, a basic but remarkable ability of human beings, because the features of the camouflaged region are gradually added to the feature database by the recognize-then-update procedure and subsequently contribute to recognition.

6. Conclusion and Future Work

RVO-SIFT, which adopts the novel recognize-then-update mechanism, is both a rigid video object recognizer and an automatic feature extractor for rigid video objects. It combines object recognition and feature learning, much as humans do during recognition. It automatically improves the completeness of the target video object’s feature database and in turn substantially increases the ratio of correctly recognized objects, at the expense of an affordable, millisecond-level increase in computation time. In addition to rigid video object recognition, its potential applications include rigid video motion tracking and segmentation, as well as any other task that requires feature extraction for rigid targets in videos or image sequences.

However, RVO-SIFT is grounded, both theoretically and experimentally, in rigid video objects in this paper, so one direction for further research is to apply it to semirigid video objects, for example, video face recognition with exaggerated changes in facial expression.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grant no. 61163066.