Abstract

This paper describes a method for estimating the internal state of a user of a spoken dialog system before his/her first input utterance. When actually using a dialog-based system, the user is often perplexed by the prompt. A typical system provides more detailed information to a user who is taking time to make an input utterance, but such assistance is a nuisance if the user is merely considering how to answer the prompt. To respond appropriately, the spoken dialog system should be able to consider the user’s internal state before the user’s input. Conventional studies on user modeling have focused on the linguistic information of the utterance for estimating the user’s internal state, but this approach cannot estimate the user’s state until the end of the user’s first utterance. Therefore, we focused on the user’s nonverbal output, such as fillers, silence, or head movements, up to the beginning of the input utterance. The experimental data were collected on a Wizard-of-Oz basis, and the labels were decided by five evaluators. Finally, we conducted a discrimination experiment with the trained user model using the combined features. For three-class discrimination, we obtained about 85% accuracy in an open test.

1. Introduction

Speech is the most basic medium of human-human communication and is expected to be one of the main modalities of more flexible man-machine interaction, along with various intuitive interfaces, rather than traditional text-based interfaces. One major topic in speech-based interfaces is the spoken dialog system. Studies on spoken dialog systems have attempted to introduce a user model, which models the user’s internal states, to make the dialog more flexible. The user’s internal states represent various aspects of the user, such as belief [1], preference [2], emotion [3], and familiarity with the system [4–6]. These aspects can also be categorized according to their persistency: as the user’s knowledge and preferences are persistent, they can be used for personalizing the dialog system [7]. Other internal states such as emotion or belief are transient and so are used for making a dialog more natural and smooth. These kinds of internal state should be estimated session by session. In this paper, we focus on the latter, transient states.

These internal states are estimated based on the verbal and nonverbal information included in the interaction between the user and the system. In this work, we consider a system-initiative dialog system that presents a prompt message at the beginning of a session. In such a system, a session between the user and the dialog system can be divided into three phases: before the system prompt (Phase 1), after the prompt and before the user’s response (Phase 2), and the rest (Phase 3). Figure 1 shows the three phases in a session.

Many conventional studies on user modeling have focused on the linguistic information of the user’s utterance and estimated the user’s internal states based on the dialog history (i.e., previously observed utterances made by the user and the system) [8–10]. As these works use the user’s utterances as the basis for estimating the internal states, their methods can only be applied in Phase 3, because no linguistic information (history of user utterances) is available in Phase 1 or Phase 2.

Recognition of the internal states of the conversation partner (i.e., the user of the system) is related to social interaction in human-human communication. Studies on the social interaction of human-computer interfaces have included conversations with robots [11–13] and virtual agents [14–17]. An important cue for recognizing social interaction is nonverbal information. These studies employ multimodal information such as hand gestures, head nods, face direction, and gaze direction, as well as spoken language, to build teamwork in collaboration with a robot [11, 12] or to create an opportunity to address the user [13]. Meanwhile, Maatman et al. [14] and Kopp et al. [15] studied the natural behavior of the agent while the user is speaking. These virtual agents need to generate nonverbal output toward the conversation partner, and this output directly affects the naturalness of the dialog. Buß and Schlangen [18] defined short utterance segments in continuous speech as subutterance phenomena and analyzed their roles in turn-taking and backchannels. These works also focused on Phase 3 of a session.

On the other hand, several works have investigated internal states in Phase 1. Hudson et al. [19] and Begole et al. [20] discriminated a user’s interruptibility by analyzing subjects in an office and users of instant messaging, respectively. In addition, Satake et al. [21] studied the appropriateness of approaching behavior when addressing people in human-robot dialog.

In the present work, we investigate how to estimate the user’s internal state in Phase 2. We believe that user modeling in Phase 2 is important, even though it has not been investigated to date. First, we explain the issue of user modeling in Phase 2 in detail. For example, if the user does not understand the meaning of the system’s prompt, he or she may abandon the session without uttering a word (case 1). Another possibility is that the user is considering how to answer the prompt (case 2). In both of these cases, conventional systems respond uniformly even though different responses are desired, because they treat both cases as “the user did not know what to say.”

In real systems, a heuristic solution such as an incremental prompt [22] is employed, where the system offers a different prompt if the user does not respond within a certain time after the first prompt. The system should provide more detailed information to clarify the intention of the prompt in case 1; in case 2, however, intervention from the system is undesirable. In fact, Kobayashi et al. [17] reported that a system response that does not consider the difference between cases 1 and 2 confuses the user. This problem is especially serious for systems whose dialogs finish within one or two user utterances.

We considered these two dialog cases and assumed that they are derived from different internal states. Here, we define three internal states. In the first one (State A), the user does not know how to answer the prompt. In the second one (State B), the user is taking time to consider the answer. In the third one (State C), the user has no difficulty in answering the system.

The goal of our task is to discriminate the internal state of a user among these three states, A, B, and C, by observing the user’s behavior. This challenge is closely related to research on turn-taking in dialog. There have been a number of works that analyze turn-taking behavior in human-human dialogs [23] as well as human-machine dialogs [24–28]. Introducing a turn-taking mechanism into a spoken dialog system is believed to enable the system to interact with the user in a more natural and effective way. Notably, Edlund and Nordstrand [29] studied a multimodal dialog system and examined turn-taking between the user and an agent (an animated talking head) that made gestures such as head motion and gaze control.

Most of these works focused on turn-taking behavior after the dialog is established (i.e., Phase 3). Some of the knowledge obtained from these works might be useful for our problem; however, it seems difficult to discriminate the user’s internal states in Phase 2, such as States A and B, using only cues for turn-taking.

To discriminate the user’s internal state without observing the user’s verbal utterances, we exploit audio and visual features and investigate which features can be used for discriminating these internal states. Ideally, all features should be extracted automatically and the discrimination should be made incrementally; however, in this paper, we manually labeled the data to extract some of the features and used the entire video sequence for discrimination, because the objective of this paper is to investigate which features can be used for this purpose. Automatic extraction of the useful features and incremental discrimination are issues for future work.

This paper is organized as follows. Preparation of the experimental data is described in Section 2. Then the audio features and visual features are introduced in Sections 3 and 4, respectively. Finally, the results of the experiments are presented in Section 5.

2. Collection and Analysis of Dialog Data

2.1. Dialog Tasks

We collected dialog data to analyze the internal states of the users. There are two possibilities for collecting dialog data: collecting acted dialogs using actors and collecting natural dialogs using naïve participants. The merit of acted dialogs is the ease of collecting dialogs with various properties such as emotions and intentions, but such dialogs tend to be unnatural [30]. Therefore, we decided to collect natural dialogs.

For the experiments, we prepared two tasks: (1) a simple information retrieval task and (2) a “question-and-answer” task. The information retrieval task simulated a restaurant guidance system. This system was task-oriented, and the user had to answer a series of prompts to achieve the task. For example:

System: In which region are you looking for a restaurant?
User: In region A.
System: How much is your budget?

In the experiment, the operator asked the subjects to achieve an ambiguous task: to search for an “affordable” Japanese restaurant.

Because we thought that the user’s internal states would not be expressed frequently with such a natural system, we prepared the “question-and-answer” task, where the system asks a question and the user answers, in order to make the users express their internal states. An example of a “question-and-answer” session is as follows:

System: What is the date today?
User: Uhm…, it’s May … May 17th today.

The questions were independent of each other and presented in random order.

2.2. Data Collection Procedure

The data collection was carried out on a Wizard-of-Oz basis. We prepared an agent with a simple cartoon-like face and a synthesized voice and displayed it on an LCD monitor to encourage the user to pay attention to the front of the system. We prepared several facial expressions (neutral, joyful, angry, sad) for the agent. The agent always had the neutral face at the beginning of the system prompt and changed its facial expression after the user’s utterance. The expression was decided by the operator (the “wizard”) according to the appropriateness of the user’s response.

Figure 2 shows the experimental environment. We placed a digital video camera between the monitor and the user and recorded the user’s frontal face. The operator (the first author of this paper) operated the system from behind a partition. In either task, the prompt was repeated at 15-second intervals if the user did not make an input utterance.

After recording a dialog, we segmented the recorded dialog into “sessions,” where one session included the system’s prompt and the user’s response. When a user’s utterance was not observed, that part (from the beginning of the system’s prompt to the beginning of the next prompt) was used as a session.

Figure 3 shows an overview of one dialog session. We excluded the segment after the beginning of the user’s input utterance from a session because our interest is the duration before the user’s utterance. Because the agent’s facial expression changes after the user’s utterance, the influence of the agent’s facial expression change on the user’s attitude was also excluded from the later analysis.

As we split one dialog into more than one session, some of the sessions were turns in Phase 3. Here, the user’s internal state in Phase 3 just after the system’s prompt is supposed to be quite similar to that in Phase 2, except for the existence of contextual information. For this reason, we used all the “sessions” of the dialog data for the later experiment, even if a session belonged to Phase 3. The estimation of the user’s internal state in Phase 3 is expected to be improved by using the contextual information; however, this issue is not covered here.

We asked nine subjects (eight males and one female) to converse with the dialog system. The number of dialogs differed from subject to subject. The total number of sessions was 199 (22.1 sessions per user on average).

2.3. Subjective Evaluation of the Internal States

Next, we evaluated the users’ internal states to make “ground truths” of the internal states. There are two possibilities for making ground truths: one is based on the user’s own introspection, and the other is based on an observer’s opinion. Here, the goal of our work is to develop a multimodal dialog system that behaves like a human receptionist, who must determine the user’s internal state only by observing the user’s behavior. Therefore, we decided to evaluate the sessions by the evaluators’ observation. The agreement between these two kinds of evaluation is an interesting issue [31] to investigate in future work.

The gathered sessions were labeled by five evaluators. They were asked to watch the recorded video of the entire session and classify each session into one of three states:

State A: the user was perplexed by the system’s prompt.
State B: the user was considering an answer to the prompt.
State C: neither of the above (neutral).

Table 1 shows the number of matches of evaluations by the five evaluators, and Table 2 shows the results of a majority vote. As we expected, internal states such as States A and B frequently appeared in task (2). Moreover, the concordance ratio between each pair of evaluators’ decisions is shown in Table 3. The concordance ratio was calculated by the following equation:
$$C_{ij} = \frac{M_{ij}}{N},$$
where $M_{ij}$ is the number of sessions for which the labels of evaluator $i$ and evaluator $j$ matched, and $N$ is the total number of sessions ($N = 199$). The results show that the decisions of evaluators E2, E3, and E4 accord well; however, the decisions of evaluator E5 do not match those of the others very well. Therefore, we decided to determine the label of each session by majority vote and used the resulting 195 sessions as the experimental data. Four sessions were excluded because two different labels received the same number of votes.
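To make the labeling procedure concrete, the following is a minimal Python sketch of how the pairwise concordance ratios and the majority-vote ground truth could be computed. The data layout (a list per evaluator, one label per session) and the function names are our own illustrative assumptions, not the scripts used in the study.

```python
from collections import Counter

# Hypothetical layout: labels[e][s] is the label ('A', 'B', or 'C') that
# evaluator e assigned to session s (five evaluators, 199 sessions).

def concordance_ratio(labels_i, labels_j):
    """Fraction of sessions on which two evaluators gave the same label (M_ij / N)."""
    matches = sum(1 for a, b in zip(labels_i, labels_j) if a == b)
    return matches / len(labels_i)

def majority_vote(labels):
    """Majority label per session; returns None for ties (such sessions were excluded)."""
    ground_truth = []
    for session_labels in zip(*labels):      # iterate over sessions
        counts = Counter(session_labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            ground_truth.append(None)        # two labels with the same number of votes
        else:
            ground_truth.append(counts[0][0])
    return ground_truth
```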

As mentioned, we estimate the user’s internal state without referring to the user’s previous utterances (i.e., in Phase 2). Therefore, we have to obtain features for estimation from the audio signal observed in the segment before the user’s input. Note that the user may make utterances other than the answer, such as filler words or interjections, which can provide clues to the user’s internal state.

3. Speech-Based Features

3.1. The Length until User’s Input

First, we examined the length between the end of the system’s prompt utterance and the beginning of the user’s answering utterance (denoted as $T$ hereafter) as a speech feature. This period contains silence, repairs, fillers, and breathy voice of the user. We manually measured this length for each dialog. Figure 4 shows the mean length of this segment in the “neutral” and “other” dialogs, where we can see a large difference between the two types of dialog.

This result reflects the fact that the evaluators tended to label the user’s internal state as “neutral” in sessions where the user answered the prompt immediately.

3.2. The Length of Speech Classification

Next, we investigated the audio signal of the segment before the user’s input in detail. As we can discriminate State C from the other states using the feature explained above, the remaining problem is how to discriminate dialogs of States A and B.

To find features that will assist the discrimination, we classified the acoustic events in the observed signal into six classes as shown in Table 4, then investigated the total length of events belonging to each class. Among the classes shown in Table 4, the “system” segment is for utterances by the system, and all of the other classes are for utterances by the user.

We investigated the length of events of each class for all dialogs classified into States A and B and observed the difference in length between the two internal states in order to find features effective for discrimination. Let $D$ be the number of dialogs, $n_{c,d}$ the number of acoustic events of class $c$ observed in the $d$th dialog, and $l_{c,d}$ the total length of events belonging to class $c$ observed in the $d$th dialog. Then we observed the length of events of a specific class using two normalization methods. The first one is the length of events normalized by the number of dialogs:
$$\bar{L}^{(D)}_c = \frac{1}{D}\sum_{d=1}^{D} l_{c,d},$$
and the other one is that normalized by the number of events:
$$\bar{L}^{(E)}_c = \frac{\sum_{d=1}^{D} l_{c,d}}{\sum_{d=1}^{D} n_{c,d}}.$$

The results yielded by the two normalization methods differ because of differences in the frequency of events. Using $\bar{L}^{(D)}_c$, the value for a rare event tends to be small, whereas the value can be larger when evaluated by $\bar{L}^{(E)}_c$.

$\bar{L}^{(D)}_c$ and $\bar{L}^{(E)}_c$ for each class are shown in Figures 5 and 6, respectively. We performed the unpaired $t$-test for each segment class to identify effective features, then chose those features that showed a significant difference between the two states at the 5% significance level. As a result, significant differences were observed for fillers and for silences. Therefore, we chose the lengths of the filler and silence segments as effective features for discriminating between States A and B. These results indicate that subjects who were thinking about the answer (i.e., those labeled as State B) tended to be silent before answering, and a long filler was also considered to be a sign of “thinking.”
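For concreteness, the following sketch shows how the two normalizations and the significance test could be computed. The per-session data structure (a dict holding the total length and count of each acoustic event class) is a hypothetical layout of our own; the unpaired $t$-test is taken from SciPy.

```python
from scipy import stats

# Hypothetical session record:
#   {"state": "A",
#    "filler":  {"len": 1.2, "count": 2},
#    "silence": {"len": 3.4, "count": 1}, ...}

def normalized_lengths(sessions, cls):
    """Event length of class `cls`, normalized by number of dialogs and by number of events."""
    total_len = sum(s[cls]["len"] for s in sessions)
    total_cnt = sum(s[cls]["count"] for s in sessions)
    per_dialog = total_len / len(sessions)                    # corresponds to L^(D)
    per_event = total_len / total_cnt if total_cnt else 0.0   # corresponds to L^(E)
    return per_dialog, per_event

def is_discriminative(sessions, cls, alpha=0.05):
    """Unpaired t-test on per-session lengths of class `cls` between States A and B."""
    len_a = [s[cls]["len"] for s in sessions if s["state"] == "A"]
    len_b = [s[cls]["len"] for s in sessions if s["state"] == "B"]
    _, p_value = stats.ttest_ind(len_a, len_b)
    return p_value < alpha
```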

3.3. Length of Filled Pause

Upon analyzing the sessions of State B, we found that vowels tended to be lengthened in utterances other than filler words. This phenomenon is called a filled pause. Goto et al. proposed a method to detect filled pauses [32] and noted that filled pauses serve to maintain the speaker’s turn and to express the mental state of thinking about the next utterance.

Goto et al. focused on characteristic features of filled pauses, namely small F0 variation and small spectral variation, and formulated a filled-pause likelihood. The F0 transition and the spectral envelope deformation at frame $t$ are denoted as $f(t)$ and $d(t)$, respectively. $f(t)$ is obtained as the slope of the linearized temporal transition of F0, while $d(t)$ is the product of the slope and the fitting error of the linearized temporal transition of the spectral envelope; both transitions are linearized by least-squares fitting. The filled-pause likelihood $p_{\mathrm{FP}}(t)$ is then calculated from $\bar{f}(t)$ and $\bar{d}(t)$, which are $f(t)$ and $d(t)$ averaged over a short period (10 frame shifts), together with two heuristically set thresholds $\theta_f$ and $\theta_d$; the likelihood becomes high when both averaged values are small relative to their thresholds. Finally, Goto et al. accumulate $p_{\mathrm{FP}}(t)$ over consecutive frames, and if the accumulated sum at frame $t$ exceeds a certain threshold, the segment up to frame $t$ is judged to be a filled-pause segment.
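The following is a rough Python sketch of this detection scheme under our own simplifying assumptions: the per-frame F0 slope and spectral deformation are taken as given, the likelihood is formed as a simple product of threshold-relative terms (the exact combination in Goto et al. [32] may differ in detail), and the threshold values are placeholders rather than values from the paper.

```python
import numpy as np

def detect_filled_pauses(f0_slope, spec_deform, theta_f, theta_d, acc_threshold):
    """Rough sketch of filled-pause detection in the spirit of Goto et al. [32].

    f0_slope, spec_deform : per-frame F0 transition and spectral deformation,
                            assumed to be already smoothed over ~10 frames.
    theta_f, theta_d      : heuristic thresholds (values not taken from the paper).
    Returns a boolean mask marking frames judged to lie inside a filled pause.
    """
    n = len(f0_slope)
    is_fp = np.zeros(n, dtype=bool)
    acc = 0.0
    for t in range(n):
        # The likelihood is high when the pitch is flat and the spectrum is stable.
        p = max(0.0, 1.0 - abs(f0_slope[t]) / theta_f) * \
            max(0.0, 1.0 - spec_deform[t] / theta_d)
        acc = acc + p if p > 0.0 else 0.0   # accumulate while the likelihood stays positive
        is_fp[t] = acc > acc_threshold
    return is_fp

# The total duration of frames marked True (times the frame shift) would then
# give the filled-pause length feature L_FP used below.
```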

Filled-pause features are thought to affect the evaluators’ choice of label. We extracted the filled pauses contained in the user’s utterances using Goto’s method and used the total length of filled pauses (denoted as $L_{\mathrm{FP}}$) as a feature for discriminating States A and B.

4. Vision-Based Features

4.1. Distribution of Face Orientation

We analyzed the face orientation of the user in the segment before the user’s input as a visual feature. First, we investigated the tendency of the face orientation in the dialog data. We manually labeled the user’s face orientation as one of nine directions, including frontal. Figure 7 shows the distribution of face orientation. We conducted the unpaired $t$-test on the frequency of each face direction in States A and B and found a significant difference in the frequency of frontal frames. This result shows that users who were considering the answer tended to turn their faces away from the system compared to the perplexed users. From this observation, face orientation is considered to be effective for discriminating between States A and B.

4.2. Face Orientation Feature

As mentioned above, the face orientation feature is thought to be effective for identifying a user’s internal state, but the classification into nine orientations is too coarse and is difficult for an actual system to use because the labels were assigned manually. Therefore, we carried out automatic estimation of face orientation, estimating the face direction as continuous values.

First, we detected the facial parts (eyes and nose) in each frame. This was done by template matching within the face region of a frame, which was determined through face detection [33] and tracking [34]. We then checked the detection results and manually corrected all misdetected frames in order to investigate the effectiveness of the feature while excluding the effect of estimation errors.

Next, we calculated the three-dimensional face orientation based on the face region and the positions of the facial parts. We approximated the shape of the human head by a sphere and used the sine of each rotation angle instead of the angle itself (Figure 8). The sine is obtained as the displacement between the corresponding feature points divided by the head radius $w/2$, where $w$ is the diameter of the head, estimated as the width of the face region, and the feature points used for each axis are those shown in Table 5. We calculated the three sine values frame by frame and denote their values at frame $t$ as $s_x(t)$, $s_y(t)$, and $s_z(t)$, respectively.
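A minimal sketch of this sphere-based approximation follows. The specific point pairs used for each axis below (eye midpoint, nose, eye line) are illustrative assumptions; the paper’s actual pairs are those listed in Table 5.

```python
import math

def face_orientation_sines(face_width, eye_l, eye_r, nose, face_center):
    """Illustrative sphere-based approximation of the three face rotation angles.

    All points are (x, y) pixel coordinates; face_width is the width of the
    detected face region, used as the head diameter w.
    """
    radius = face_width / 2.0
    eye_mid = ((eye_l[0] + eye_r[0]) / 2.0, (eye_l[1] + eye_r[1]) / 2.0)
    # Horizontal rotation: horizontal offset of the eye midpoint from the face center.
    s_horizontal = (eye_mid[0] - face_center[0]) / radius
    # Vertical rotation: vertical offset of the nose from the face center.
    s_vertical = (nose[1] - face_center[1]) / radius
    # In-plane rotation: tilt of the line connecting the two eyes.
    s_inplane = math.sin(math.atan2(eye_r[1] - eye_l[1], eye_r[0] - eye_l[0]))
    return s_horizontal, s_vertical, s_inplane
```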

Figure 9 shows an example of the face orientation calculation. In this example, the user turns his face from the front to the lower right between frames 50 and 80. There are large changes in two of the three orientation components around frames 50 to 80, together with only a small change in the remaining component.

4.3. Data Compression

The results of a preliminary examination showed that simple descriptive statistics of face orientation (e.g., mean, variance, maximum, and minimum) were not effective for discrimination. Therefore, we decided to treat the face orientation feature as sequential data. However, the number of face orientation values depends on the number of frames in the session and varies from around 100 to 700. To simplify the calculation of the discrimination function, we need to extract feature vectors of fixed dimension from these face orientation values. To achieve this, we used the piecewise aggregate approximation (PAA) method [35], which linearly compresses the face orientation vectors into a fixed number of vectors.

The compressed feature vectors $\bar{\mathbf{s}}_1, \dots, \bar{\mathbf{s}}_M$ are calculated from the face orientation vectors $\mathbf{s}(t) = (s_x(t), s_y(t), s_z(t))$, $t = 1, \dots, n$, as follows:
$$\bar{\mathbf{s}}_m = \frac{M}{n} \sum_{t = \frac{n}{M}(m-1)+1}^{\frac{n}{M}m} \mathbf{s}(t), \qquad m = 1, \dots, M.$$
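The PAA compression itself is straightforward; the sketch below, written for a single orientation component with hypothetical variable names, follows the standard formulation of [35] (each frame is assigned to one of $M$ equal-width segments, and frames within a segment are averaged).

```python
import numpy as np

def paa(sequence, m):
    """Piecewise aggregate approximation: compress a length-n sequence into m segment means."""
    x = np.asarray(sequence, dtype=float)
    n = len(x)
    # Assign each frame to one of m (nearly) equal-width segments and average within them.
    segment = (np.arange(n) * m) // n
    return np.array([x[segment == k].mean() for k in range(m)])

# Example: compress a 110-frame orientation component into 15 values (cf. Figure 10).
compressed = paa(np.sin(np.linspace(0, 3, 110)), 15)   # dummy signal for illustration
# Applying this to each of the three orientation components gives 15 x 3 = 45 values.
```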

Figure 10 shows an example of a 110-point face orientation sequence compressed into 15 points by PAA.

The original data in this example are the same as the uppermost sequence in Figure 9, and we can see that the number of data points is reduced while the rough shape of the original signal is retained.

5. Discrimination Experiment

5.1. Experimental Method

We carried out an experiment to discriminate the three classes of the user’s internal state (States A, B, and C) using a Support Vector Machine (SVM) [36]. The elements of the feature vector are taken from the features examined above (details of the feature sets are described in Section 5.2). We used libSVM [37] with a linear kernel for the experiments. The simple pairwise (one-versus-one) method was employed for multiclass discrimination. The experiments were carried out by subject-open cross-validation; that is, we conducted a nine-fold validation test in which each fold held out all sessions of one subject, so the amounts of training data and test data differed from fold to fold.
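A sketch of this evaluation protocol is shown below, using scikit-learn’s linear SVM and leave-one-group-out splitting in place of the original libSVM setup; `features`, `states`, and `subjects` are assumed arrays with one entry per session and are not part of the original study’s code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut

def subject_open_cv(features, states, subjects):
    """Nine-fold, subject-open cross-validation with a linear SVM.

    features : (n_sessions, n_dims) array of feature vectors (set (a), (b), or (c))
    states   : (n_sessions,) array of labels 'A', 'B', 'C'
    subjects : (n_sessions,) array of subject IDs (nine subjects -> nine folds)
    """
    fold_accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(features, states, groups=subjects):
        clf = SVC(kernel="linear")      # multiclass handled by pairwise (one-vs-one) voting
        clf.fit(features[train_idx], states[train_idx])
        fold_accuracies.append(clf.score(features[test_idx], states[test_idx]))
    return float(np.mean(fold_accuracies)), fold_accuracies
```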

5.2. Feature Set

We prepared two feature sets for the discrimination because we investigated both “discrete” and “continuous” face orientation features. The discrete face orientation is represented by the manually assigned face direction symbols. We define the feature set including the discrete face orientation feature as set (a). On the other hand, the continuous face orientation feature is calculated by the image processing explained above and compressed by PAA. The feature set including the continuous face orientation feature is defined as set (b). Furthermore, we prepared feature set (c) to examine the effect of the length of filled pauses.

Feature Set (a)
The face direction of each frame is expressed as a nine-dimensional vector $\mathbf{o}(t)$ whose elements correspond to the nine face orientations (see Figure 11). These vectors are then averaged over the period: let $t_s$ and $t_e$ be the frames at which the system’s prompt ends and the user’s answer starts, respectively. We calculate the face orientation feature as
$$\bar{\mathbf{o}} = \frac{1}{t_e - t_s} \sum_{t = t_s}^{t_e - 1} \mathbf{o}(t).$$
Next, we add the speech-based features to the face orientation feature. Finally, the feature vector is composed as follows:
$$\mathbf{x}_{(a)} = \left(\bar{\mathbf{o}}^{\top},\, T,\, L_{\mathrm{silence}},\, L_{\mathrm{filler}}\right)^{\top},$$
where $T$ is the length until the user’s input, $L_{\mathrm{silence}}$ is the total length of the silence segments, and $L_{\mathrm{filler}}$ is the total length of the filler segments.
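A sketch of how the resulting 12-dimensional vector of set (a) could be assembled per session is given below. The orientation label strings are hypothetical placeholders for the nine directions of Figure 11, and the function name is ours.

```python
import numpy as np

ORIENTATIONS = ["upper-left", "up", "upper-right",
                "left", "front", "right",
                "lower-left", "down", "lower-right"]   # placeholder names for the nine directions

def feature_set_a(frame_orientations, t_input, l_silence, l_filler):
    """12-dimensional vector of set (a): averaged face orientation + three speech features.

    frame_orientations : orientation label for each frame between the end of the
                         system prompt and the start of the user's answer.
    t_input, l_silence, l_filler : the features T, L_silence, and L_filler (in seconds).
    """
    one_hot = np.zeros((len(frame_orientations), len(ORIENTATIONS)))
    for t, label in enumerate(frame_orientations):
        one_hot[t, ORIENTATIONS.index(label)] = 1.0
    return np.concatenate([one_hot.mean(axis=0), [t_input, l_silence, l_filler]])
```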

Feature Set (b)
Feature set (b) includes the continuous face orientation values described in Section 4.2. This set was prepared to compare the effectiveness of the continuous face orientation feature with that of the discrete one. We employed the compressed face orientation data; according to a preliminary experiment, we compressed the face orientation vectors into $M = 15$ points, which gave the best classification accuracy. This set also included the same three speech-based features as feature set (a), so the total number of dimensions of the feature vector was 48 ($15 \times 3 + 3$).

Feature Set (c)
This feature set was prepared to examine the effect of the filled-pause feature. In addition to the features in set (b), feature set (c) includes the length of filled pauses, $L_{\mathrm{FP}}$. Therefore, the total number of dimensions of the feature vector was 49. The elements of each feature vector are summarized in Table 6.

5.3. Experimental Results

Finally, we carried out a discrimination experiment using the above-mentioned feature sets. The results are shown in Table 7; each row shows the discrimination results for one feature set. From Table 7, we find that the discrimination accuracy using feature set (b) was higher than that using feature set (a), and that using feature set (c) was the highest. To validate the statistical significance of the differences among the total discrimination rates, we conducted a one-way repeated-measures ANOVA, where the feature set was the factor and the nine results obtained from the cross-validation folds were the repeated measures. As a result, we could not find any significant difference among the three feature sets.

Although we could not observe significant differences among the feature sets, we can conclude that the continuous face orientation feature is preferable to the discrete face orientation distribution because the continuous feature can be obtained automatically, which is indispensable for realizing automatic estimation of the internal state. The filled-pause feature did not give a significant improvement either, but further improvement might be obtained by combining the filled-pause feature with other multimodal features.

6. Conclusion

In this paper, we defined three internal states (States A, B, and C) of a user of a dialog system before the first user utterance and investigated methods of modeling these states. It is important to estimate the user’s internal state before the user’s input in order to make the response of the system more appropriate, but this issue had not been studied to date. In this paper, we focused on speech-based and face orientation features before the user’s input because they are considered to express the user’s internal state. As speech-based features, we used four features: the length until the user’s input, the length of filler segments, the length of silence segments, and the length of filled pauses. As face orientation features, we used the sequence of three-dimensional face rotation angles and the discrete face orientation frequencies. Through discrimination experiments, we examined the effectiveness of these features. We obtained a discrimination accuracy as high as 85.6%, but we could not observe significant differences between the two face orientation features.

A remaining problem is that the features proposed in this study are not available until the user’s input utterance is observed, because we examined the segment up to just before the user’s input. In addition, the speech-based features and face orientation features were examined independently; however, the correlation between the two kinds of features is important in practice. Therefore, in future work, we will employ a sequential discrimination method such as an HMM or CRF to analyze and select the features. Moreover, the amount of data in our experiment was not sufficient, so we need to collect more data and confirm the validity of our proposed discrimination method and features.