Abstract

This paper presents a new and compact 3D representation for nonrigid objects using the motion vectors between two consecutive frames. Our method relies on an octree to recursively partition the object into smaller parts. Each part is then assigned a small number of motion parameters that can accurately represent that portion of the object. Finally, adaptive thresholding, a singular value decomposition for dealing with singularities, and quantization with arithmetic coding further enhance the proposed method by increasing the compression while maintaining a very good signal-to-noise ratio. Compared to other methods that use tri-linear interpolation, Principal Component Analysis (PCA), or non-rigid partitioning (e.g., FAMC), our algorithm combines the best attributes of most of them. For example, it can be carried out on a frame-to-frame basis, rather than over long sequences, and it is also much easier to compute. In fact, we demonstrate that the computational complexity of our method is lower than that of some of these methods. Finally, as the results section demonstrates, the proposed improvements do not sacrifice performance, since our method has a better or at least very similar performance in terms of compression ratio and PSNR.

1. Introduction

As the technology for graphics processing advances, so do the details in the 3D models used for animation. So, despite these advances, when storing, transmitting, or rendering such models, the need for fast and compact representations is the same as it was a few years ago. In that sense, two categories of systems can be found in the literature: the time-independent methods and the time-dependent methods.

The first and most traditional category is the time-independent one, which was first introduced in [1]. In that case, a 3D object is represented using its geometric properties at a given moment. That is, 3D points [2, 3], triangular meshes, surface normals, edge orientations [4–7], wavelet coefficients [8, 9], and other features of the object are analyzed within a single time instant, or frame. These methods have advantages in terms of the quality of the model, but they are not very compact. On the other hand, time-dependent methods exploit the temporal relationship of these same types of features in order to increase compression. In one of the first systems to be reported [10], but also in more recent ones [11–15], the basic idea is to encode the motion of the 3D object, that is, the difference between consecutive frames, rather than the properties of the object in the frames. In order to do that efficiently, the system must distinguish the parts of the object that are stationary from the parts that are moving, and describe only the latter.

This type of approach raises two major problems: (i) how to partition the object into its moving and stationary portions; (ii) how to describe the moving portions so that they represent the non-rigid motion as accurately as possible. Although much research has been done in this area [11–13, 16, 17], these approaches still suffer from the following: (i) inaccurate motion transformations [17], (ii) the need for extra space to store the partitioning [12], (iii) the need for prior information on the entire sequence [13–15], and so forth.

In this paper, we address the above problems by proposing a rigid partitioning of the 3D space combined with an affine transformation for motion capture. Some of the major advantages of our method are its computational efficiency, the compactness of the motion, and the space representation.

In Section 2, we discuss related work on the compression of animated sequences. Here, we distinguish animation from real 3D data, in particular because animation can rely on feature correspondence between frames. Section 3 contains the details of our approach, and in Section 4 we compare our method against other well-known methods in the literature. We show the advantages of using our representation for compression, but we also point out where it can be improved.

2. Related Work

One of the first methods proposed for time-dependent 3D data compression can be found in [10]. That paper established a new paradigm in time-dependent compression: represent successive frames by a small number of motion parameters and select a coding scheme to efficiently store the data. Even though many systems today offer completely different approaches to capture and parametrize this motion (from affine transformations [14, 15, 18], to principal components [13, 19, 20], and to wavelets that analyze the parametric coherence in the sequence [21, 22]), most time-dependent methods generally perform two basic steps: (i) partitioning of complex objects into smaller and simpler components; (ii) description of the motion of these same components.

With regard to the partitioning method, we find systems using regular or rigid spatial partitioning, where vertices are divided according to their spatial location. One such example is the Octree [11, 16, 17, 23]. In this case, the space is recursively divided into 8 equal portions until some termination criteria stop the process.

On the other side of the coin, we find systems employing irregular partitioning. In [12], for example, the system employed the Iterative Closest Point (ICP) algorithm and assumed that the underlying motion between two consecutive frames followed a rigid transformation. The rigid transformation returned from the ICP was used to reconstruct the frame, while small and irregular portions of the object with small reconstruction error were clustered together to form a single rigid component. In [24], the clustering of the vertices was based on a method similar to k-means, while the distance between clusters was defined as the Euclidean distance on the subspace defined by a principal component analysis (PCA). That means that the entire sequence had to be known beforehand in order to calculate the subspaces. The same can be said about other PCA-based methods [13, 19], whether they use irregular partitioning or not. Finally, in [20], several local coordinate frames were assigned to the object at the center of each cluster, and the vertices were assigned to clusters depending on their movements between consecutive frames. If the type of object is restricted to, for example, the human body, the body parts can be clustered using their trajectories, as was done in [25]. There is also a third kind of system, in which the spatial partitioning is not explicit at all. In [26, 27], for example, vertices are grouped regardless of their spatial location, based instead on their motion vectors.

After a partitioning is obtained, the next step is to find an efficient encoding for the motion. In that sense, some partitioning methods impose constraints on how the motion can be described. In other cases, however, the partitioning is generic enough that different motion descriptors can be used. In [11, 16, 17], all versions of the system employed a tri-linear interpolation, a regular partitioning (octree), and eight motion vectors attached to the corners of the cell. In other cases [12], the authors proposed an irregular partitioning with an affine transformation between clusters as the motion descriptor. Finally, as we mentioned earlier, PCA-based methods can achieve a good compression by storing only the principal components of the motion vectors, that is, a smaller dimension than the original one. That can be done either globally [13, 19] or locally [20, 24], but in either case, the entire sequence must be available for the calculation of the principal components. More recent approaches include the Principal Geodesic Analysis, a variant of the PCA method [28]; methods relying on prediction of the motion vectors [26]; and the replica predictor [27].

Another method relying on irregular partitioning is the recent Frame-Animated Mesh Compression (FAMC) [14, 15, 18], which was standardized within the MPEG as part of MPEG-4 AFX Amendment 2 [29]. In this case, the partitioning is irregular, but fixed. That is, the partitioning does not follow a rigid structure as in an octree, but it must be decided at the very first frame based on all frames in the sequence. While this allows for a more compact representation, it still requires prior knowledge of the entire sequence.

3. Proposed Approach

In many applications involving virtual reality and computer graphics, such as human-robot interaction, real-time processing is a requirement that cannot be compromised. With that application in mind, we devised a fast and compact representation for 3D animation using rigid partitioning of the space and an affine transformation. As we will show in Section 4, our claim is that such a combination can produce results comparable to or better than other methods in terms of compression, while still preserving a good signal-to-noise ratio.

3.1. Octree Structure

Our approach starts with the use of octrees for the partitioning of 3D objects. Octrees have been used in computer vision and computer graphics for many years [30]. This data structure has also been widely used in both time-independent methods, such as [2, 3], as well as time-dependent methods, such as in [11] and later improved in [16, 17, 23].

In our case, the partitioning using an octree is similar to that in other time-dependent methods; however, the decision as to when to partition and the termination criteria are different, making our method unique. That is, with octrees, the 3D space containing the object vertices is recursively divided into 8 subspaces, also known as octants, cells, or cubes. In this paper, we will use the terms cube, cell, node, and octant interchangeably.

The partitioning starts with the application of an affine transformation and the calculation of an error measurement based on the motion vectors. If this error is too high, the cell is subdivided and the process repeats for each subcell. As for the termination criteria, we propose an adaptive thresholding of the reconstruction error followed by a singular value decomposition and quantization using arithmetic coding to further increase the compactness of the representation. This whole process is simplified by rescaling (normalizing) all vertices to the range [0, 1]; that is, the size of the root cube is always regarded as 1 unit.
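The normalization and the eight-way split described above can be sketched as follows. This is only a minimal illustration: the function names, the uniform scaling by the longest side of the bounding box, and the bit-based octant numbering are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def normalize_to_unit_cube(vertices):
    """Shift and uniformly scale vertices so that they fit in the cube [0, 1]^3."""
    v = np.asarray(vertices, dtype=float)
    lo = v.min(axis=0)
    extent = (v.max(axis=0) - lo).max()   # longest side of the bounding box
    if extent == 0.0:
        extent = 1.0                      # all vertices coincide; avoid dividing by zero
    return (v - lo) / extent

def octant_index(vertices, center):
    """Return, for each vertex, the index (0..7) of the subcube it falls into."""
    bits = (np.asarray(vertices) >= np.asarray(center)).astype(int)
    return bits[:, 0] + 2 * bits[:, 1] + 4 * bits[:, 2]

verts = np.array([[2.0, 2.0, 2.0], [4.0, 2.0, 2.0], [2.0, 6.0, 2.0], [6.0, 6.0, 6.0]])
unit = normalize_to_unit_cube(verts)
octants = octant_index(unit, center=[0.5, 0.5, 0.5])
```

With the root cube fixed at one unit, the center of every subcube follows directly from its level and position, so no cube coordinates need to be stored.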

3.2. Algorithm

Our algorithm consists of an encoding of the motion vector of the current frame with respect to the previous one. That is, the algorithm performs the following steps.

(1) First, it applies a tightly bounded cube around all vertices in the previous frame.
(2) Next, it calculates the affine transformation matrix between all vertices in the bounding cube and the corresponding vertices from the current frame.
(3) It checks for singularities of the affine and then it quantizes and dequantizes the resulting affine matrix. This step is required in order to produce the reconstructed current frame and to calculate the error between the reconstructed and actual current frames.
(4) If the error in the previous step is too large, the algorithm partitions the bounding cube into eight smaller subcubes and steps (2) and (3) above are repeated for each of the subcubes.
(5) Otherwise, it stores the quantized affine transformation as the motion vector for that cube.

The steps above are highlighted by the blue box in Figure 1(a).
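The encoding steps listed above can be sketched as a short recursive routine. This is a hedged illustration rather than the authors' implementation: `encode_frame`, `encode_cube`, the fixed rounding step of 1/1024, and the `max_depth` guard are all assumptions introduced for the example, and the singularity handling is delegated to NumPy's `pinv`.

```python
import numpy as np

def fit_affine(P, Q):
    """Least-squares 4x4 matrix mapping the homogeneous columns of P onto Q."""
    return Q @ np.linalg.pinv(P)

def quantize(A, step=1.0 / 1024):
    """Round to multiples of `step` (assumed value): quantize, then dequantize."""
    return np.round(A / step) * step

def encode_cube(prev, curr, idx, lo, size, max_err, leaves, depth=0, max_depth=8):
    """Encode the vertices `idx` inside the cube with corner `lo` and side `size`."""
    if len(idx) == 0:
        return
    if len(idx) < 4 or depth == max_depth:       # too few points for an affine fit
        leaves.append(("raw", depth, idx, curr[idx]))
        return
    P = np.vstack([prev[idx].T, np.ones(len(idx))])
    Q = np.vstack([curr[idx].T, np.ones(len(idx))])
    A = quantize(fit_affine(P, Q))               # steps (2) and (3)
    err = np.linalg.norm((A @ P)[:3].T - curr[idx], axis=1).max()
    if err <= max_err:                           # step (5): keep the affine
        leaves.append(("affine", depth, idx, A))
        return
    half = size / 2.0                            # step (4): split into 8 subcubes
    bits = (prev[idx] >= lo + half).astype(int)
    octant = bits[:, 0] + 2 * bits[:, 1] + 4 * bits[:, 2]
    for k in range(8):
        offset = np.array([k & 1, (k >> 1) & 1, (k >> 2) & 1]) * half
        encode_cube(prev, curr, idx[octant == k], lo + offset, half,
                    max_err, leaves, depth + 1, max_depth)

def encode_frame(prev, curr, max_err):
    """Step (1): bound the previous frame with a cube, then recurse."""
    lo = prev.min(axis=0)
    size = (prev.max(axis=0) - lo).max()
    leaves = []
    encode_cube(prev, curr, np.arange(len(prev)), lo, size, max_err, leaves)
    return leaves
```

Each returned leaf holds its kind, its level in the tree, the vertex indices it covers, and either the quantized affine or the raw coordinates. A rigidly moving object should collapse to a single affine leaf at the root, while motion that differs between regions forces subdivision.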

Once a representation for the current frame is completed, the algorithm proceeds to the next frame. That is, it now uses the reconstructed current frame as the “previous” frame and the next frame as the “current” frame and steps are repeated until the last frame in the sequence is encoded. The idea is that only the positions of the vertices for the first frame are recorded and transmitted to the other side—in the case of 3D video streaming, for example, when frames are generated on one machine and rendered on another machine. After the first frame is transmitted, only motion vectors related to each cube of the octree are transmitted to the receiving end. In practice, in order to achieve better signal-to-noise ratios, intra frames could be inserted after an arbitrary number of frames to reset the error. However, in this paper we are interested in maximum compression only, and therefore, we will not offer any further discussion on how or when to insert intra frames.

The “dual” of the encoding algorithm described above is the decoding algorithm, which is presented in Figure 1(b). Since this algorithm consists of the same (dual) steps as the encoder, we leave it to the reader to explore the details of the decoder.

3.3. Computation of the Affine Transformation

One of the main steps in our approach is the calculation of a single motion vector that describes the movement of all the vertices in the cube between two consecutive frames. Since the correspondence between vertices from two different frames is known, this motion vector can be approximated using an affine transformation $A$ whose reconstruction error can be expressed as

$$E = \sum_{i=1}^{N} \left\| A\,p_i - q_i \right\|^2, \tag{1}$$

where $N$ is the total number of vertices in the cube, and $p_i$ is a 4 by 1 homogeneous vector with the coordinates of vertex $i$ in the previous frame. Similarly, $q_i$ holds the homogeneous coordinates of the corresponding vertex in the current frame. In other words, the affine transformation $A$ is the actual motion vector between the vertices of a cube in the previous frame and the corresponding vertices of the current frame.

Considering the entire structure of the octree, the total reconstruction error is the sum of all the errors at the leaf nodes of the tree. That is,

$$E_{\text{total}} = \sum_{j=1}^{M} E_j, \qquad E_j = \sum_{i=1}^{N_j} \left\| A_j\,p_{k_j(i)} - q_{k_j(i)} \right\|^2, \tag{2}$$

where $M$ is the number of leaf nodes, $N_j$ is the number of vertices in the $j$th leaf node, and $k_j(i)$ is the index of the $i$th vertex in that same leaf node.

In vector form, the homogeneous coordinates of the points in the leaf node $j$, at the previous frame, are given by

$$P_j = \begin{bmatrix} p_{k_j(1)} & p_{k_j(2)} & \cdots & p_{k_j(N_j)} \end{bmatrix}, \tag{3}$$

and the corresponding coordinates at the current frame are given by

$$Q_j = \begin{bmatrix} q_{k_j(1)} & q_{k_j(2)} & \cdots & q_{k_j(N_j)} \end{bmatrix}. \tag{4}$$

The affine $A_j$ that minimizes the error $E_j$, that is, minimizes $\left\| A_j P_j - Q_j \right\|$ in the least square sense, is given by a right pseudoinverse. That is,

$$A_j = Q_j P_j^{T} \left( P_j P_j^{T} \right)^{-1}. \tag{5}$$

The matrix $A_j$ is a 4 by 4 matrix with $[0\ 0\ 0\ 1]$ as the last row. Since each pair of corresponding points provides three constraints, to solve for the 12 unknowns in the system of equations above $N_j$ must be at least 4. Also, since the transformation between $P_j$ and $Q_j$ is not a perfect transformation, the calculated $A_j$ leads to a reconstruction error $E_j$. If $N_j$ is smaller than 4, no affine is calculated and the positions of the vertices in that cube are transmitted instead.
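The least-squares fit by a right pseudoinverse can be sketched with NumPy as below. The function name `fit_affine` and the choice to return `None` for cubes with fewer than 4 vertices are assumptions made for this illustration; the sketch also assumes the points are not coplanar, so that the 4 by 4 product being inverted is nonsingular.

```python
import numpy as np

def fit_affine(prev_pts, curr_pts):
    """Least-squares 4x4 affine A such that A @ [p; 1] approximates [q; 1].

    prev_pts, curr_pts: (n, 3) arrays of corresponding vertices.
    Returns None when fewer than 4 correspondences are available.
    """
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    n = len(prev_pts)
    if n < 4:
        return None
    P = np.vstack([prev_pts.T, np.ones(n)])   # 4 x n homogeneous, previous frame
    Q = np.vstack([curr_pts.T, np.ones(n)])   # 4 x n homogeneous, current frame
    return Q @ P.T @ np.linalg.inv(P @ P.T)   # right pseudoinverse of P
```

Because the last rows of both coordinate matrices are all ones, the last row of the fitted matrix comes out as [0 0 0 1] up to rounding, so only the first three rows carry information.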

3.4. Quantization and Singular Nodes

Each element of the affine transformation matrix is stored using integers, which reduces the precision but increases the compactness of the representation. To compensate for this loss of precision, a frame is encoded with respect to the reconstructed previous frame, rather than the actual previous frame. By doing so, the quantization error in the previous frame is corrected by the motion estimation for the current one. Therefore, quantization errors affect only the current frame and do not propagate throughout the whole sequence.
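The effect of encoding against the reconstructed frame can be seen in a scalar analogy. In this sketch a "frame" is a single number and the "motion" is a quantized difference; the function name, the sample values, and the step of 0.5 are assumptions chosen only to make the behaviour visible.

```python
def encode_decode(frames, step, closed_loop):
    """Reconstruct a sequence from quantized frame-to-frame differences.

    closed_loop=True  : differences are taken against the reconstructed previous frame.
    closed_loop=False : differences are taken against the actual previous frame.
    """
    recon = [frames[0]]                      # the first frame is sent uncompressed
    for t in range(1, len(frames)):
        reference = recon[-1] if closed_loop else frames[t - 1]
        delta = frames[t] - reference
        quantized = round(delta / step) * step
        recon.append(recon[-1] + quantized)  # the decoder only ever sees quantized deltas
    return recon

frames = [0.0, 0.3, 0.6, 0.9, 1.2]
closed = encode_decode(frames, step=0.5, closed_loop=True)
opened = encode_decode(frames, step=0.5, closed_loop=False)
closed_err = [abs(r - f) for r, f in zip(closed, frames)]
opened_err = [abs(r - f) for r, f in zip(opened, frames)]
```

In the closed-loop case each error stays within half a quantization step, whereas in the open-loop case the rounding errors add up from frame to frame.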

The quantized affine transformation matrix is derived from the original affine transformation matrix A by a linear quantization with a fixed quantization step. In order to be able to compare our method with the method developed in [11], we use the same linear quantization method with 16 bits. Ideally, the bounds of the quantization range would be the minimum and maximum elements among all affine matrices. However, that would require the prior calculation of the motion vectors for the entire sequence. Instead, we use predefined values for both bounds. This arbitrary choice is possible because, as we explained earlier, we normalize the dimensions of the root cube to 1 unit. That guarantees that the elements of A will only be large in the case of a singularity, for example, when points are too close to each other. In that case, two things happen: (i) we apply a singular value decomposition (SVD) to solve for A in (5); (ii) we fix the reconstruction error to 5%. That is, when approximating the pseudoinverse by its SVD, we use only the eigenvalues corresponding to the first 95% of the principal components.
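A minimal sketch of a 16-bit linear quantizer and of an SVD-based pseudoinverse with truncation is given below. Several details are assumptions, since the text does not fix them: the predefined range of [-2, 2], the reading of "first 95%" as the leading singular values that account for 95% of their sum, and all function and constant names.

```python
import numpy as np

A_MIN, A_MAX = -2.0, 2.0            # assumed predefined range (not from the paper)
LEVELS = 2 ** 16 - 1                # 16-bit codes: 0 .. 65535
STEP = (A_MAX - A_MIN) / LEVELS     # resulting quantization step

def quantize(A):
    """Map each matrix element to an unsigned 16-bit integer code."""
    clipped = np.clip(A, A_MIN, A_MAX)
    return np.round((clipped - A_MIN) / STEP).astype(np.uint16)

def dequantize(codes):
    """Map integer codes back to approximate real values."""
    return codes.astype(float) * STEP + A_MIN

def truncated_pinv(P, energy=0.95):
    """Pseudoinverse of P built from its leading singular values only.

    Keeps the smallest number of singular values whose cumulative sum
    reaches `energy` of the total (an assumed interpretation).
    """
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    keep = min(int(np.searchsorted(np.cumsum(s) / s.sum(), energy)) + 1, len(s))
    s_inv = np.zeros_like(s)
    s_inv[:keep] = 1.0 / s[:keep]
    return Vt.T @ np.diag(s_inv) @ U.T
```

Dropping the smallest singular values keeps the elements of the resulting affine bounded when the vertices of a cube are nearly degenerate, which is what allows a fixed quantization range to be used.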

3.5. Termination Criteria

In Section 3.2, we explained how the algorithm stops partitioning a cube. However, there are actually two criteria for such termination.

The first criterion to stop the partitioning of the octree comes from the reconstruction error. That is, the maximum reconstruction error allowed for any single vertex is defined by

where and are the original and reconstructed vertices of the jth node.

In other words, if the reconstruction error of any single vertex exceeds ME, the node is partitioned into eight subcubes. Otherwise, the algorithm stops. In Section 4 we explain the choice of this threshold.

The second criterion to stop the partitioning is the number of vertices inside a cell. As we explained in Section 3.3, if that number is less than 4, we store the coordinates of the vertices directly instead of the motion vectors (affine).
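The two termination criteria can be combined into a single decision per node, as in the sketch below. The function name, the returned labels, and the use of the Euclidean distance as the per-vertex error are assumptions for illustration.

```python
import numpy as np

def node_decision(original, reconstructed, max_error, min_vertices=4):
    """Decide how to handle one octree node.

    Returns "store_vertices" when the node has too few vertices for an affine,
    "split" when any single vertex exceeds the error threshold, and
    "store_affine" otherwise.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    if len(original) < min_vertices:
        return "store_vertices"
    per_vertex = np.linalg.norm(original - reconstructed, axis=1)
    return "split" if per_vertex.max() > max_error else "store_affine"
```

Because the test uses the worst vertex rather than the average, a single badly reconstructed vertex is enough to trigger a subdivision.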

3.6. Algorithm Complexity

The complexity of our algorithm derives mainly from the pseudoinverse used during the calculation of the affine transformation. That is, the cost of processing a node is dominated by the complexity of calculating the pseudoinverse of its coordinate matrix. Here, we use $O(\cdot)$ for an asymptotic upper bound, $\Theta(\cdot)$ for an asymptotic tight bound, and $\Omega(\cdot)$ for an asymptotic lower bound.

Hence, when we consider each step of the proposed algorithm in Algorithm 1, we arrive at the following recursive equation:

Algorithm(node)
 Affine(node)        
 Quantization(Affine)    
 Reconstruction Error(node)        
 MaxError(vertices):        
 If Max Error less than Threshold
  Stops
 Else
  Partition(node):   
 End
End

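which can be further simplified as in:

$$T(n) = a\,T\!\left(\frac{n}{b}\right) + f(n),$$

where $a$ is the number of subcubes created by a partition, $n/b$ is the number of vertices in each subcube, and $f(n)$ is the non-recursive cost of processing one node, which is dominated by the pseudoinverse. This equation can be solved using the third case of the Master Theorem [31], where $f(n) = \Omega\!\left(n^{\log_b a + \epsilon}\right)$ for some $\epsilon > 0$. That is, $f(n)$ is polynomially larger than $n^{\log_b a}$. Therefore, the solution for the recursion is $T(n) = \Theta(f(n))$.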
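As a worked illustration of the third case of the Master Theorem, assume, purely for the sake of example, that each partition creates 8 subcubes holding $n/8$ vertices each and that the non-recursive cost per node is $c\,n^{2}$ for some constant $c > 0$. These values are hypothetical and are not taken from the analysis above.

```latex
\begin{align*}
T(n) &= 8\,T\!\left(\tfrac{n}{8}\right) + c\,n^{2}
  && \text{(assumed recursion: } a = 8,\ b = 8,\ f(n) = c\,n^{2}\text{)} \\
n^{\log_b a} &= n^{\log_8 8} = n
  && \text{(cost contributed by the leaves)} \\
f(n) &= c\,n^{2} = \Omega\!\left(n^{1+\epsilon}\right)
  && \text{with } \epsilon = 1 \\
a\,f\!\left(\tfrac{n}{b}\right) &= 8\,c\left(\tfrac{n}{8}\right)^{2}
  = \tfrac{1}{8}\,c\,n^{2} \le k\,f(n)
  && \text{with } k = \tfrac{1}{8} < 1 \text{ (regularity condition)} \\
T(n) &= \Theta\!\left(n^{2}\right)
  && \text{(third case: the root cost dominates)}
\end{align*}
```

Under these assumptions the work done at the root of the octree dominates the total, so deeper levels of the tree do not change the order of growth.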

3.7. Internal Representation of the Data

The final data consists of the following items. The first frame of the sequence, with all uncompressed 3D coordinates of the vertices in single-precision floating point. The quantized affine transformation matrices corresponding to the leaf nodes found in the octree for each frame in the sequence. If a leaf node contains fewer than four vertices, the coordinates of these vertices are used instead. Finally, one byte for the level of each leaf node in the octree. To further increase the compression ratio, we apply arithmetic coding to this data.
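The size of this representation before arithmetic coding can be estimated with a short accounting routine. The byte counts here are assumptions for illustration: 4 bytes per coordinate, 12 stored elements per affine at 2 bytes each (treating the constant last row as implicit), directly stored vertices kept as floats, and the hypothetical leaf format `(kind, vertex_count)`.

```python
BYTES_PER_FLOAT = 4          # single-precision coordinate
BYTES_PER_AFFINE = 12 * 2    # assumed: 12 elements, 16 bits each
BYTES_PER_LEVEL = 1          # one byte for the level of each leaf

def encoded_size(num_vertices, frames):
    """Total bytes before arithmetic coding.

    frames: one list of leaves per frame after the first; each leaf is
    ("affine", vertex_count) or ("raw", vertex_count).
    """
    total = num_vertices * 3 * BYTES_PER_FLOAT        # uncompressed first frame
    for leaves in frames:
        for kind, count in leaves:
            total += BYTES_PER_LEVEL
            if kind == "affine":
                total += BYTES_PER_AFFINE
            else:                                     # fewer than four vertices
                total += count * 3 * BYTES_PER_FLOAT
    return total

# Hypothetical 1000-vertex object, 100 frames, one root-level affine per frame.
size = encoded_size(1000, [[("affine", 1000)]] * 99)
raw_size = 100 * 1000 * 3 * BYTES_PER_FLOAT
ratio = raw_size / size
```

Under these assumptions the first frame dominates the total, which is why sequences that need only one affine per frame reach high compression ratios.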

4. Results

We compared our algorithm against three other methods in the literature. We will refer to these as the ZhangOwen's algorithm [11, 17]; the AmjStarb's algorithm [20]; the FAMC DCT algorithm [14, 15]. As we already explained in Section 2, the first method uses an octree with a tri-linear interpolation, while the second one uses a PCA-based approach, and the third uses an irregular partitioning also followed by an affine transformation.

4.1. Quantitative Measurement

In order to compare our method against these other approaches, we calculated the Peak Signal-to-Noise Ratio, or PSNR, as defined in [11]

where is the size of the largest bounding box among the different frames. Also, MSE is the mean-square error of the distance between reconstructed and actual vertices, that is, .
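A PSNR of this kind can be computed as sketched below. The expression used, ten times the base-10 logarithm of the squared bounding-box size over the MSE, is one common form and is an assumption here; the definition in [11] may differ, for instance in whether the bounding-box size is squared. The function name and arguments are likewise illustrative.

```python
import numpy as np

def psnr(original, reconstructed, d_max):
    """PSNR in dB between two (n, 3) vertex arrays.

    d_max: size of the largest bounding box among the frames.
    MSE:   mean of the squared distances between corresponding vertices.
    Assumed form: 10 * log10(d_max**2 / MSE).
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    mse = np.mean(np.sum((original - reconstructed) ** 2, axis=1))
    return 10.0 * np.log10(d_max ** 2 / mse)

orig = np.zeros((2, 3))
recon = orig + np.array([0.01, 0.0, 0.0])   # every vertex displaced by 0.01
value = psnr(orig, recon, d_max=1.0)
```

With a unit bounding box, a uniform displacement of 0.01 per vertex gives an MSE of 0.0001 and therefore about 40 dB under this assumed form.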

4.2. Benchmark Dataset

We applied the four algorithms—ZhangOwen's algorithm, AmjStarb's, FAMC DCT, and our method—to a total of five different benchmark sequences. Each of these sequences contains objects with different rigidity and number of vertices. Table 1 summarizes the properties of these sequences.

4.3. Quantitative Results

First, we compared our approach against ZhangOwen's and AmjStarb's algorithms in terms of Compression Ratio versus PSNR. Figure 2 shows the relationship between these two performance measures for three of the benchmark sequences. For detailed numerical data, the reader should refer to Tables 2, 3, 4, and 5. The points of the curves in Figure 2 for our method correspond to the different percentages in parentheses in the tables, and they were obtained by controlling the choice of ME in (8). As for ZhangOwen's and AmjStarb's algorithms, the controlling parameters were, respectively, a similar maximum error threshold and the number of principal components (i.e., the number of bases), as we explain later in this section.

As Figure 2 shows, our approach consistently outperforms ZhangOwen's algorithm, both in terms of compression ratio and PSNR. Compared with AmjStarb's algorithm, our approach performs quite similarly; however, we must point out once again that a PCA-based approach requires prior knowledge of the entire sequence in order to achieve good results, while our method can be carried out on a frame-to-frame basis.

As the tables above imply, the compression ratio is highly dependent on the animation sequence. For example, for the “Ball” and “Chicken” sequences, which present a simpler motion, the compression ratio is also higher. In particular, for the “Ball” sequence, the entire set of 100 frames required only 99 matrices. That means that one single affine matrix was sufficient for each of these frames: no subnode was required under the root node of the octree.

Also, the value used for ME does not completely determine the PSNR, although they are highly correlated. For example, the Chicken sequence with an , our method achieved a PSNR of 32, while for the Cow sequence an turned out to reach a PSNR of only 26.35. We assume that this is because of the complexity of the motion, which still is the dominating factor in the final result.

Since the author did not provide the calculation for PSNR in [17], the data for some of the sequences was not available. However, the author did not make any changes between [11] and [17] that could have affected the PSNR. Therefore, we can assume that the PSNR would be the same for both methods in [11, 17].

Finally, for the PCA-based algorithm in [20], it is important to mention that we implemented their approach using the same parameters reported in their paper. Table 6 summarizes the results for four different numbers of principal components or bases, that is, 3, 5, 10, and 30.

After the tests above, we also compared our method to a state-of-the-art algorithm: the FAMC DCT [14, 15]. As we mentioned earlier, the FAMC is the standard method for the MPEG-4 AFX Amendment 2 [29] and it uses an irregular partitioning based on the topology of the mesh. That is, clusters of vertices are created by successively incorporating neighboring vertices that maintain the error in reconstruction from the affine transformation within a certain limit. This analysis of the error is performed over long stretches of the sequence, while a fixed partitioning, decided at the time of the first frame, is employed throughout the sequence. In Figure 3, we present a comparison between our method and the FAMC DCT for the first four benchmark sequences in Table 1. As the figure shows, our method outperforms the FAMC DCT for the Chicken sequence; it presents a performance similar to the FAMC DCT for the Snake sequence; it is outperformed by the FAMC + DCT for the other two sequences (Dance and Cow). The reason for that behavior comes mainly from the rigid versus non-rigid characteristics of the partitioning in both methods. That is, while the octree can lead to unnecessary partitioning of neighboring vertices with the same motion vectors simply because of the proximity of another group of vertices with a different motion, in a non-rigid partitioning as in the FAMC, vertices are grouped independently of the existence of this second group of vertices. That difference between the two approaches usually leads to a more efficient partitioning of the vertices by the FAMC. However, that same efficiency is compromised by long sequences—as it was the case for the Chicken, which has the largest number of frames in our tests (400)—and by radical motions of the vertices—as it was the case for the Snake, whose vertices move much more freely than in the other sequences. 
In other words, since the size and the number of clusters in the FAMC depend on those two factors (the number of frames and the nonrigidity of the motion of the vertices), the larger they become, the more likely it is for the FAMC to require more clusters with fewer vertices in each cluster, and consequently to present a reduction in compression ratio. This claim is supported by Figure 4, where we depict the performance of the FAMC DCT method for the Chicken sequence as a function of the number of frames used in the sequence. As the figure indicates, if we fix the PSNR, the compression ratio decreases as we add more frames to the sequence. It is important to notice, however, that the compression ratios used in those curves were obtained without taking into account the need to add a key frame at the end of each smaller sequence. This added reference frame is usually very large, and even if a static compression is performed as suggested in [14], the cost of constantly including such key frames would lower the compression ratio even further. In order to perform the comparison in Figure 3, we chose to use all 400 frames in the Chicken sequence for the FAMC DCT algorithm.

4.4. Qualitative Results

We also performed a qualitative analysis of our algorithm for different parameters. However, before we present these results, we would like to illustrate what it means visually for two algorithms to present a difference of 2 dB in PSNR. In that regard, Figure 5(a) presents one of the original frames in the Chicken sequence and the corresponding reconstructed frames achieved by our method and by the FAMC with different PSNRs. Our algorithm, in Figure 5(b), obtained a PSNR of 32 dB, while the FAMC in Figures 5(c) and 5(d) presented PSNRs of 30 dB and 34 dB, respectively. As the reader will notice, the change of 2 dB between (b) and (c) does not seem as pronounced as the same 2 dB difference between (b) and (d). This result illustrates that the same 2 dB can appear quite different to humans, and that sometimes the absolute number alone is not enough to determine which reconstructed frame looks “better.”

Finally, Figures 6, 7, and 8 depict various reconstructed frames for each of three sequences (Chicken, Dance, and Cow) using different values of ME.

5. Conclusion and Future Work

We proposed an affine-based motion representation with adaptive threshold and quantization that uses a rigid octree structure for partitioning. Our experimental results indicated that the proposed algorithm is, in terms of compression ratio, superior to other octree-based methods, while it achieved similar performance when compared to PCA-based methods or the FAMC DCT algorithm. However, our method has a much smaller computation complexity and it does not require prior knowledge of the sequence. Instead, it can be carried out on a frame-to-frame basis.

We have demonstrated that the computation complexity of our algorithm for one frame is , where is the number of vertices in the frame. Once again, this is a major advantage of our method when compared to other algorithms such as ZhangOwen's algorithm [11, 17] and the AmjStarb's algorithm [20]. Moreover, while state-of-the-art methods such as the FAMC + DCT can be improved by means of a PCA technique (as described in [18]) and outperform the compression ratio of our method in all tested sequences, the price tag to pay in that case is a complexity greater than .

Both the PSNR and the compression ratios for our method were very high, and a choice of provided an excellent compromise between these two performance measurements.

One serious limitation of most time-dependent methods, including ours, is the requirement for correspondence between vertices in different frames. This prevents the method from being applied to real 3D data, that is, clouds of points. In order to solve this problem, we propose to build a pseudocorrespondence between frames using the Iterative Closest Point (ICP) algorithm [32, 33]. Another option would be the use of Robust Point Matching (RPM) [34–36]; however, we anticipate that this technique would lead to a much more complex algorithm. A combination of ICP and RPM was proposed in [37], which has a much better performance, but the run time is still a concern given the complexities of both RPM and ICP. In the future, we intend to attack this problem of pseudocorrespondences.

Acknowledgments

The authors would like to thank Andrew Glassner for giving them access to Chicken data; Aljoscha Smolic, Nikolce Stefanoski, and Kivanc Kose for sharing the Chicken data; Hector Briceno for sharing the Cow, Ball, Snake, and Dance data. They would also like to thank Jinghua Zhang for explanations provided regarding their papers.