Abstract

A common approach to produce visual speech is to interpolate the parameters describing a sequence of mouth shapes, known as visemes, where a viseme corresponds to a phoneme in an utterance. The interpolation process must consider the issue of context-dependent shape, or coarticulation, in order to produce realistic-looking speech. We describe an approach to such pose-based interpolation that deals with coarticulation using a constraint-based technique. This is demonstrated using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.

1. Introduction

Film, computer games, and anthropomorphic interfaces need facial animation, of which a key component is visual speech. Approaches to producing this animation include pose-based interpolation, concatenation of dynamic units, and physically based modeling (see [1] for a review). Our approach is based on pose-based interpolation, where the parameters describing a sequence of facial postures are interpolated to produce animation. For general facial animation, this approach gives artists close control over the final result; and for visual speech, it fits easily with the phoneme-based approach to producing speech. However, it is important that the interpolation process produces the effects observed in natural visual speech. Instead of treating the pose-based approach as a purely parametric interpolation, we base the interpolation on a system of constraints on the shape and movement of the visible parts of the articulatory system (i.e., lips, teeth/jaw, and tongue).

In the typical approach to producing visual speech, the speech is first broken into a sequence of phonemes (with timing); these are matched to their equivalent visemes (where a viseme is the shape and position of the articulatory system at its visual extent for a particular phoneme in the target language, e.g., the lips would be set in a pouted and rounded position for the /u/ in “boo”); and then intermediate poses are produced using parametric interpolation. With fewer than sixty phonemes needed in English, which can be mapped onto fewer visemes since, for example, the bilabial plosives /p/, /b/, and the bilabial nasal /m/ are visually the same (as the tongue cannot be seen in these visemes), the general technique is low on data requirements. Of course, extra poses would be required for other facial actions, such as expressions or eyebrow movements.
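For concreteness, here is a minimal sketch of this baseline pipeline (in Python, with invented viseme labels and parameter values): phonemes with timings are mapped to viseme parameter vectors, which are then naively interpolated, with no treatment of coarticulation.

```python
# Naive pose-based interpolation sketch: phoneme -> viseme -> linear
# interpolation of viseme parameters. All names and values are invented.
import numpy as np

VISEMES = {                         # hypothetical parameter vectors,
    "A":     np.array([0.9, 0.1]),  # e.g., (jaw opening, lip rounding)
    "B_M_P": np.array([0.0, 0.3]),
    "O":     np.array([0.5, 0.8]),
}
PHONEME_TO_VISEME = {"a": "A", "m": "B_M_P", "o": "O"}

def pose_at(phones, times, t):
    """Linearly interpolate viseme parameters at time t."""
    targets = [VISEMES[PHONEME_TO_VISEME[p]] for p in phones]
    return np.array([np.interp(t, times, [v[d] for v in targets])
                     for d in range(len(targets[0]))])

# Pose halfway between the /o/ and /m/ targets of "omo":
print(pose_at(["o", "m", "o"], [0.0, 0.15, 0.3], 0.075))
```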

To produce good visual speech, the interpolation process must cater for the effect known as coarticulation [2], essentially context-dependent shape. As an example of forward coarticulation, the lips will round in anticipation of pronouncing the /u/ of the word “stew,” thus affecting the articulatory gestures for “s” and “t.” The de facto approach used in visual speech synthesis to model coarticulation is to use dominance curves [3]. However, this approach has a number of problems (see [4] for a detailed discussion), perhaps the most fundamental of which is that it does not address the issues that cause coarticulation.

Coarticulation is potentially due to both a mental planning activity and the physical constraints of the articulatory system. We may plan to over- or underarticulate, and we may try to, say, speak fast, with the result that the articulators cannot realize their ideal target positions. Our approach tries to capture the essence of this. We use a constraint-based approach to visual speech (first proposed in [4, 5]), which is based on Witkin and Kass’s work on physics-based articulated body motion [6]. In [7], we presented the basics of our approach. Here, we show how it can be used to produce controllable visual speech effects, whilst varying the speed of speech.

Section 2 will present an overview of the constraint-based approach. Sections 3, 4, and 5 demonstrate how the approach is used to create Mexican-Spanish visual speech for a synthetic 3D head. Section 3 outlines the required input data and observations for the constraint-based approach. Section 4 describes the complete system. Section 5 shows the results from a synthetic talking head. Finally, Section 6 presents conclusions.

2. Constraint-Based Visual Speech

A posture (viseme) for a phoneme is variable within and between speakers. It is affected by context (the so-called coarticulation effect), as well as by such things as mood and tiredness. This variability needs to be encoded within the model. Thus, a viseme is regarded as a distribution around an ideal target. The aim is to hit the target, but the reality is that most average speakers do not achieve this. Highly deformable visemes, such as an open-mouthed /a/, are regarded as having larger distributions than closed-lip shapes, such as /m/. Each distribution is regarded as a constraint which must be satisfied by any final speech trajectory. As long as the trajectory stays within the limits of each viseme, it is regarded as acceptable, and infinite variety within acceptable limits is possible.

To prevent the ideal targets from being met by the trajectory, other constraints must be present. For example, a global constraint can be used to limit the acceleration and deceleration of a trajectory. In practice, the global constraint and the distribution (or range) constraints produce an equilibrium, where they are both satisfied. Variations can be used to give different trajectories. For example, low values of the global constraint (together with relaxed range constraints) could be used to simulate underarticulation (e.g., mumbling). In addition, a weighting factor can be introduced to change the importance of a particular viseme relative to others.

Using the constraints and the weights, an optimization function is used to create a trajectory that tries to pass close to the center of each viseme. Figure 1 gives a conceptual view of this. We believe that this approach better matches the mental and physical activity that produces the coarticulation effect, thus leading to better visual speech. In using a constrained optimization approach [8], we need two parts: an objective function $E(x)$ and a set of bounded constraints

$$b_l \le c(x) \le b_u, \qquad (1)$$

where $b_l$ and $b_u$ are the lower and upper bounds, respectively. The objective function specifies the goodness of the system state for each step in an iterative optimization procedure. The constraints maintain the physicality of the motion.

The following mathematics is described in detail in [4]; only a summary is offered here. The particular objective function we use is

$$E = \sum_{i=1}^{n} w_i \, \bigl(x(t_i) - v_i\bigr)^2, \qquad (2)$$

which uses the square difference between the speech trajectory $x(t)$ and the sequence of ideal targets (visemes) $v_i$, given at times $t_i$. The weights $w_i$ are used to give control over how much a target is favored. Essentially, this governs how much a target dominates its neighbors. Note that in the absence of constraints, the $w_i$ will have no impact, and the $v_i$ will simply be interpolated.
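As a minimal numerical illustration of (2), assuming a one-dimensional trajectory already sampled at the target times (all values invented):

```python
# Weighted square-difference objective of equation (2).
import numpy as np

def objective(x_at_t, v, w):
    """E = sum_i w_i * (x(t_i) - v_i)^2."""
    return float(np.sum(w * (x_at_t - v) ** 2))

# The heavily weighted middle target contributes most to the error.
print(objective(np.array([0.1, 0.4, 0.2]),   # trajectory samples x(t_i)
                np.array([0.0, 0.5, 0.2]),   # ideal targets v_i
                np.array([1.0, 5.0, 1.0])))  # weights w_i
```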

A speech trajectory will start and end with particular constraints, for example, a neutral state such as silence. These are the boundary constraints, as listed in Table 1, which ensure that the articulators are in the rest state. If necessary, these constraints can also be used to join trajectories together.

In addition, range constraints can be used to ensure that the trajectory stays within a certain distance of each target,

$$l_i \le x(t_i) \le u_i, \qquad (3)$$

where $l_i$ and $u_i$ are, respectively, the lower and upper bounds of the ideal targets $v_i$.

If (3) and Table 1 are used in (2), the ideal targets will simply be met. A global constraint can be used to dampen the trajectory. We limit the parametric acceleration of the trajectory,

$$|\ddot{x}(t)| \le A_{\max}, \qquad (4)$$

where $A_{\max}$ is the maximum allowable magnitude of acceleration across the entire trajectory. As this value tends to zero, the trajectory cannot meet its targets, and thus the weights $w_i$ in (2) begin to have an effect. The trajectory bends more towards a target whose $w_i$ is high relative to its neighbors. As the global constraint is reduced, the trajectory will eventually reach the limit of at least one range constraint.

The speech trajectory is represented by a cubic nonuniform B-spline. This gives the necessary continuity to enable (4) to be applied. The optimization problem is solved using a variant of the sequential quadratic programming (SQP) method (see [6]). The SQP algorithm requires the objective function described in (2). It also requires the derivatives of the objective and constraint functions: the Hessian $H$ of the objective function and the Jacobian $J$ of the constraints. This algorithm follows an iterative process with the steps described in (5): a Newton step that reduces the objective, followed by a projection of that step onto the constraints,

$$\Delta \hat{x} = -H^{-1} \nabla E, \qquad \Delta x = \Delta \hat{x} + J^{+}\bigl(-c(x) - J\,\Delta \hat{x}\bigr), \qquad (5)$$

where $J^{+}$ is the pseudoinverse of $J$. The iterative process finishes when the constraints are met and there is no further reduction in the optimization function (see Section 5 for discussion of this):
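To make the formulation concrete, here is a self-contained sketch that optimizes the control points of a clamped cubic B-spline under the range constraints (3) and the acceleration limit (4), using SciPy's SLSQP routine (an SQP variant) as a stand-in for the solver described above; all numerical values are illustrative.

```python
# Sketch of the trajectory optimization in (2)-(4) for one parameter.
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

t_i = np.array([0.0, 0.2, 0.4, 0.6, 0.8])        # target times
v_i = np.array([0.0, 0.8, 0.1, 0.9, 0.0])        # ideal targets (visemes)
w_i = np.array([1.0, 1.0, 3.0, 1.0, 1.0])        # target weights
lo, hi = v_i - 0.15, v_i + 0.15                  # range constraints, eq. (3)
A_MAX = 60.0                                     # acceleration limit, eq. (4)

n_ctrl, deg = 9, 3                               # cubic spline, 9 control pts
knots = np.concatenate([[0.0] * deg,
                        np.linspace(0.0, 0.8, n_ctrl - deg + 1),
                        [0.8] * deg])            # clamped knot vector
t_dense = np.linspace(0.0, 0.8, 50)              # acceleration sample times

def spline(c):
    return BSpline(knots, c, deg)

def objective(c):                                # eq. (2)
    return np.sum(w_i * (spline(c)(t_i) - v_i) ** 2)

constraints = [
    {"type": "ineq", "fun": lambda c: spline(c)(t_i) - lo},   # x(t_i) >= l_i
    {"type": "ineq", "fun": lambda c: hi - spline(c)(t_i)},   # x(t_i) <= u_i
    {"type": "ineq",                                          # |x''| <= A_MAX
     "fun": lambda c: A_MAX - np.abs(spline(c).derivative(2)(t_dense))},
]

res = minimize(objective, np.zeros(n_ctrl), method="SLSQP",
               constraints=constraints, options={"maxiter": 200})
print("converged:", res.success, " objective:", round(float(res.fun), 4))
```

Reducing A_MAX in this sketch reproduces the damping effect discussed above: the spline can no longer hit every target and is pulled hardest towards the heavily weighted one.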

3. Input Data for the Range Constraints

In order to produce specific values for the range constraints described in Section 2, we need to define the visemes that are to be used and measure their visual shapes on real speakers. In English, there is no formal agreement on the number of visemes to use. For example, Massaro defines 17 visemes [9], and both Dodd and Campbell [10], as well as Tekalp and Ostermann [11] use 14 visemes. We chose 15 visemes for Mexican-Spanish, as listed in Table 2.

Many of the 15 visemes we chose are similar to the English visemes, although there are exceptions. The phoneme /v/ is an example, where there is a different mapping between Spanish and English visemes. In English speech, the phoneme maps to the /F/ viseme, whereas in Spanish, the /v/ phoneme corresponds to the /B_M_P/ viseme. There are also letters, like /h/, that do not have a corresponding phoneme in Spanish (they are not pronounced during speech) and thus have no associated viseme. Similarly, there are phonemes in Spanish that do not occur in English, such as /ñ/, although in this case there is an appropriate mapping to the /N/ viseme.
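These exceptions amount to entries in a phoneme-to-viseme map; a sketch (viseme labels as in Table 2, the dictionary itself being illustrative rather than the full mapping):

```python
# Fragment of a Spanish phoneme-to-viseme map capturing the exceptions
# discussed in the text; a sketch, not the complete Table 2 mapping.
SPANISH_PHONEME_TO_VISEME = {
    "v": "B_M_P",  # unlike English, where /v/ maps to the /F/ viseme
    "ñ": "N",      # Spanish-only phoneme, mapped to the /N/ viseme
    # "h" has no entry: it is silent in Spanish, so it has no viseme
}
```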

To create the range constraints for the Mexican-Spanish visemes listed in Table 2, three native Mexican-Spanish speakers were observed, labeled Person A, Person B, and Person C. Each was asked to make the ideal viseme shapes in Mexican-Spanish, and these were photographed from front and side views. Figure 2 gives examples of the lip shapes for the consonant M (labeled as B_M_P in Table 2) and for the vowel A for each speaker, as well as the modeled synthetic head (which was produced using FaceGen, www.facegen.com). Figure 3 shows the variation in the lip shape for the consonant M when Person B pronounces the word “ama” normally, with emphasis, and in a mumbling style. This variation is accommodated by defining upper and lower values for the range constraints. Figure 4 illustrates the issue of coarticulation. Person B was recorded three times pronouncing the words “ama,” “eme,” and “omo,” and the frames containing the center of the phoneme “m” were extracted. Figure 4 shows that the shape of the mouth is more rounded in the pronunciation of “omo” because the phoneme “m” is surrounded by the rounded vowel “o.”

4. The System

Figure 5 illustrates the complete system for the Mexican-Spanish talking head. The main C++ module is in charge of communication between the rest of the modules. This module first receives text as input and then gets the corresponding phonetic transcription, audio wave, and timing from a Festival server [12]. The phonetic transcription is used to retrieve the relevant viseme data. Using the information from Festival together with the viseme data, the optimization problem is defined and passed to a MATLAB routine, which contains the SQP implementation. This returns a spline definition, and the main C++ module then generates the rendering of the 3D face in synchronization with the audio wave.
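A high-level orchestration sketch of this pipeline follows. Every function is a hypothetical stand-in for a component named in the text (Festival client, viseme lookup, SQP solver, renderer); this is not the authors' actual C++/MATLAB code, and the viseme labels are illustrative.

```python
# Orchestration sketch of the Figure 5 pipeline, with stub components.
from typing import List, Tuple

def festival_transcribe(text: str) -> Tuple[List[str], List[float], bytes]:
    # Stand-in for the Festival server call [12]: phonemes, timings, audio.
    return ["o", "l", "a"], [0.05, 0.12, 0.22], b"<audio-wave>"

def viseme_targets(phones: List[str]) -> List[str]:
    # Stand-in for the viseme-data lookup (cf. Table 2); note the silent
    # "h" of "hola" never reaches this stage.
    table = {"o": "O", "l": "L", "a": "A"}
    return [table[p] for p in phones]

def solve_trajectory(targets: List[str], times: List[float]) -> str:
    # Stand-in for the MATLAB SQP routine; returns a spline definition.
    return f"spline through {targets} at {times}"

def render_in_sync(spline: str, audio: bytes) -> None:
    # Stand-in for the C++ rendering of the 3D face, synced to audio.
    print("rendering:", spline, "with", audio)

phones, times, audio = festival_transcribe("hola")
render_in_sync(solve_trajectory(viseme_targets(phones), times), audio)
```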

Each viseme is represented by a 3D polygon mesh containing 1504 vertices. Instead of using the optimization process on each vertex, the amount of data is reduced using principal component analysis (PCA). This technique reconstructs a vector $\mathbf{x}$ that belongs to a randomly sampled vector population using

$$\mathbf{x} \approx \bar{\mathbf{x}} + \sum_{k=1}^{K} c_k \mathbf{e}_k, \qquad (6)$$

where $\bar{\mathbf{x}}$ is the mean vector, $\mathbf{e}_k$ are the eigenvectors obtained after applying the PCA technique, and $c_k$ are the weight values. With this technique, it is possible to reconstruct, at the cost of minimal error, any of the vectors in the population using a reduced number $K$ of eigenvectors and their corresponding weights. To do the reconstruction, all the vectors share the reduced set of eigenvectors (PCs), but each uses its own weights for those eigenvectors. Thus, each viseme is represented by a vector of weight values.
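A sketch of the reconstruction in (6), using random stand-in data in place of the captured 1504-vertex meshes:

```python
# PCA reconstruction per equation (6): x ≈ mean + sum_k c_k * e_k.
import numpy as np

rng = np.random.default_rng(0)
meshes = rng.normal(size=(15, 4512))   # 15 visemes x (1504 vertices * 3)

mean = meshes.mean(axis=0)
_, _, vt = np.linalg.svd(meshes - mean, full_matrices=False)
pcs = vt[:8]                           # shared set of 8 eigenvectors (PCs)

coeffs = (meshes - mean) @ pcs.T       # per-viseme weight vectors (15 x 8)
recon = mean + coeffs @ pcs            # equation (6) for all visemes at once
print("max reconstruction error:", float(np.abs(recon - meshes).max()))
```

Each viseme is then stored as its eight weights rather than 4512 vertex coordinates.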

With this technique, the potential optimization calculations for 1504 vertices are reduced to calculations for a much smaller number of weights. We chose 8 PCs by observing the differences between the original mesh and the reconstructed mesh using different numbers of PCs. Other researchers have used principal components as a parameterization too, although the number used varies from model to model. For example, Edge uses 10 principal components [4], and Kshirsagar et al. have used 7 [13], 8 [14], and 9 [15] components.
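The selection procedure can be mimicked by sweeping the number of retained components and observing how the reconstruction error falls (same random stand-in data as in the sketch above):

```python
# Reconstruction error versus number of retained principal components.
import numpy as np

rng = np.random.default_rng(0)
meshes = rng.normal(size=(15, 4512))
mean = meshes.mean(axis=0)
_, _, vt = np.linalg.svd(meshes - mean, full_matrices=False)

for n in (2, 4, 8, 12):
    pcs = vt[:n]
    recon = mean + ((meshes - mean) @ pcs.T) @ pcs
    rms = np.sqrt(((recon - meshes) ** 2).mean())
    print(f"{n:2d} PCs -> RMS error {rms:.3f}")
```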

It is the PCs that are the parameters (targets) that need to be interpolated in our approach. In the results section, we focus on PC 1, which relates to the degree to which the mouth is open. To determine the range constraints for this PC, the captured visemes were ordered according to the amount of mouth opening. Using this viseme order, the range constraint values were set accordingly using a relative scale. The same range constraint values were set for all other PCs for all visemes. Whilst PC 2 does influence the amount of mouth rounding, we decided to focus on PC 1 to illustrate our approach. Other PCs only give subtle mouth shape differences and are difficult to determine manually. We hope to address this by measuring range constraints for static visemes from continuous speaker video. The acceleration constraint is also set for each PC.
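A sketch of how the PC 1 range constraints might be derived from that ordering, assuming invented PC 1 values and an illustrative ±10% relative margin:

```python
# Order visemes by mouth opening (PC 1) and assign range bounds on a
# relative scale; all values and the margin are invented for illustration.
pc1 = {"B_M_P": 0.05, "O": 0.45, "E": 0.60, "A": 0.95}

ordered = sorted(pc1, key=pc1.get)          # closed lips -> open mouth
span = max(pc1.values()) - min(pc1.values())
bounds = {v: (pc1[v] - 0.1 * span, pc1[v] + 0.1 * span) for v in ordered}
print(bounds)
```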

5. Results

The Mexican-Spanish talking head was tested with the sentence “hola, ¿cómo estás?”. Figure 6 shows the results of the mouth shape at the time of pronouncing each phoneme in the sentence. Figures 7 and 8 illustrate what is happening for the first PC in producing the results of Figure 6. The pink curves in Figures 7 and 8 show that the global constraint value is set high enough so that all the ideal targets (mouth shapes) are met (visual results in Figure 6(a)). Figure 6(b) and the blue curves in Figures 7 and 8 illustrate what happens when the global constraint is reduced. In Figure 8, the acceleration (blue curve) is restricted by the global acceleration constraint (horizontal blue line). Thus, the blue spline curve in Figure 7 does not meet the ideal targets, and some of the mouth shapes in Figure 6(b) are restricted. The most notable differences are at the second row (phoneme l), the fifth row (phoneme o), and the tenth row (phoneme t).

In each of the previous examples, both the global constraint and the range constraints could be satisfied. Making the global constraint smaller could, however, lead to an unstable system, where the two kinds of constraints are “fighting.” In an unstable system, it is impossible to find a solution that satisfies both kinds of constraints; as a result, the system jumps between a solution that satisfies the global constraint and one that satisfies the range constraints in an undetermined way, leading to no convergence. To make the system stable under such conditions, there are two options: relax the range constraints or relax the global constraint. The decision on which constraint to relax will depend on what kind of animation is wanted. If we were interested in preserving speaker-dependent animation, we would relax the global constraint, as the range constraints encode the boundaries of the manner of articulation of that speaker. If we were interested in producing mumbling effects, or in animation where preserving the speaker’s manner of articulation does not matter, then the range constraints could be relaxed.

Figure 6(c) and the green curves in Figures 7 and 8 illustrate what happens when the global constraint was reduced further so as to make the system unstable, and the range constraints were relaxed to produce stability again. In Figure 7, the green curve does not satisfy the original range constraints (solid red lines), but does satisfy the relaxed range constraints (dotted red lines). Visual differences can be observed in Figure 6 at the second row (phoneme l), where the mouth is less open in Figure 6(c) than in Figures 6(a) and 6(b). This is also apparent at the fifth row (phoneme o) and at the tenth row (phoneme t).

For Figure 6(d), the speed of speaking was decreased, resulting in a doubling of the time taken to say the test sentence. The global constraint was set at the same value as for Figure 6(c), but this time the range constraints were not relaxed. However, the change in speaking speed means that the constraints have time to be satisfied, as illustrated in Figures 9 and 10.

As a final comment, the shape of any facial pose in the animation sequence will be most influenced by its closest visemes. The nature of the constraint-based approach means that the neighborhood of influence includes all visemes, but it is at its strongest within a region of 1-2 visemes either side of the facial pose being considered. This range covers most common coarticulation effects, although contextual effects have been observed up to 7 visemes away [16].

6. Conclusions

We have produced a Mexican-Spanish talking head that uses a constraint-based approach to create realistic-looking speech trajectories. The approach accommodates speaker variability and the pronunciation variability of an individual speaker, and produces coarticulation effects. We have demonstrated this variability by altering the global constraint, relaxing the range constraints, and changing the speed of speaking. Currently, PCA is employed to reduce the amount of data used in the optimization approach. However, it is not clear that this produces a suitable set of parameters to control. We are currently considering alternative parameterizations.

Acknowledgments

The authors would like to thank Miguel Salas and Jorge Arroyo. They would also like to express their thanks to CONACYT.