Abstract

The production of emotional speech is determined by the movement of the speaker's tongue, lips, and jaw. To combine speakers' articulatory data and acoustic data, articulatory-to-acoustic conversion of emotional speech has been studied. In this paper, the parameters of a least squares support vector machine (LSSVM) model are optimized with particle swarm optimization (PSO), and the resulting PSO-LSSVM model is applied to articulatory-to-acoustic conversion. The root mean square error (RMSE) and mean Mel-cepstral distortion (MMCD) are used to evaluate the conversion results: the MMCD of the MFCC is 1.508 dB, and the RMSE of the second formant (F2) is 25.10 Hz. The results of this research can be further applied to feature fusion in speech emotion recognition to improve recognition accuracy.

1. Introduction

In recent decades, artificial intelligence (AI) has developed rapidly, and human-computer interaction requires a harmonious communicative relationship between humans and intelligent machines [1, 2]. All emotional speech processing methods are grounded in the human pronunciation mechanism: speech is produced by the movements of the articulators, such as the tongue, lips, and jaw [3]. The mapping between articulatory kinematics and acoustics is formed through the accumulation of extensive pronunciation experience.

At the same time, since the notion of affective computing was proposed by the MIT Media Lab, many physiological signals have been adopted as feature information in emotional speech recognition research to help computers better analyze the speaker's emotional state from the speech signal [1]. However, as an important part of emotional speech production, the kinematic data of the articulators have not been widely used in speech emotion recognition [4]. The main reasons are as follows: collecting data from the vocal organs is difficult [5]; the relationship between articulatory features and mature acoustic features is unclear, and fusing multiple types of features is difficult [6]; and there is no standard bimodal emotional speech database combining articulatory and acoustic data [7]. Nevertheless, articulatory data are not susceptible to noise, and their features are more robust than acoustic features [8].

In order to integrate articulatory features into the feature set of speech emotion recognition and thereby improve the recognition rate, researchers at home and abroad have studied methods for converting articulatory features to acoustic features. Common conversion methods can be divided into statistical mapping methods and data-driven methods [9–12]. Among them, the typical statistical mapping method is the codebook-based method, which builds a codebook to store acoustic-articulatory feature pairs and applies a search algorithm to find the optimal pair, thereby establishing the relationship between articulatory and acoustic features. This method was first proposed in 1996 by Hogden et al. [13], who constructed the mapping between acoustic and articulatory features; the method uses vector quantization to encode the features and requires a large amount of data to achieve accurate mapping.

The idea of the data-driven method is to train a model on data to realize the conversion between articulatory and acoustic features; typical examples are neural networks and Gaussian mixture models (GMMs). As early as 2002, Richmond [9] successfully mapped acoustic features to articulatory features using a neural network, constructing a mixture density network containing multilayer perceptrons and applying data from the MOCHA-TIMIT database to study acoustic-to-articulatory inversion, finally achieving an inversion with a root mean square error of 1.4 mm. As long as the data are sufficient, the method can achieve satisfactory results. However, the number of hidden layers and the number of nodes in each hidden layer, which determine the structure of the network, require special attention. Meanwhile, the method does not fully consider constraints such as phoneme identity and time. In addition, GMM-based inversion methods suffer from oversmoothing and overfitting, which degrade the inversion results [14, 15].

The least squares support vector machine (LSSVM) is a variant of the support vector machine (SVM) that is often used for pattern classification and function estimation. Particle swarm optimization (PSO) is an optimization algorithm inspired by the motion of animal swarms and is commonly used for path planning and parameter optimization. Fairee et al. [16] implemented acoustic-to-articulatory inversion using a PSO algorithm in 2015, and Yogesh et al. [17] used a hybrid PSO algorithm to study emotion and stress recognition in speech and achieved good results. With the development of deep learning, a BiLSTM-CNN-word attention model has been adopted to realize articulatory-to-acoustic conversion in Mandarin, achieving an RMSE of 22.10 Hz for F2, but the conversion under different emotions has not been studied [18].

Thus, combining PSO optimization and LSSVM, articulatory-to-acoustic conversion of emotional speech in Mandarin is studied in this paper. The paper is organized as follows. Section 2 reviews related work on articulatory-to-acoustic conversion, LSSVM, and PSO. Section 3 elaborates the proposed method, and Section 4 presents our experiments and results. Section 5 discusses and concludes the work.

2. Related Work

To explore articulatory-to-acoustic conversion and improve the conversion performance, we propose a PSO-LSSVM model, which combines LSSVM and PSO. Meanwhile, the Gaussian mixture model (GMM), as a classical conversion model [8], has been adopted to compare the conversion performance with LSSVM and PSO-LSSVM in our study. In this section, we give a brief introduction to GMM, LSSVM, and PSO.

2.1. Articulatory-to-Acoustic Conversion Based on GMM Used in Emotional Speech

GMM has been used to compute the joint probability density of acoustic and articulatory features to achieve the conversion. The conversion can be described by the following equation:

Here, M represents the number of Gaussian mixture components, shows the probability of the acoustic feature vector , and shows the full covariance matrix of the conditional Gaussian distribution. and are the articulatory and acoustic features, respectively. For the articulatory features of frame i, the first-order dynamic features can be expressed as follows:
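As a point of reference, a common definition of the first-order dynamic feature (a central-difference sketch in conventional notation, not necessarily the exact window used in this work) is

\Delta x_i = \tfrac{1}{2}\,(x_{i+1} - x_{i-1}),

where x_i denotes the articulatory feature vector of frame i.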

The articulatory features and the first-order dynamic features are concatenated as the input feature vector, and then the output vector is obtained. Thus, the joint probability distribution of the input and output vectors can be expressed as follows:

Here, shows the joint vector of articulatory and acoustic features, N represents the number of Gaussian components, shows the parameters of the GMM, and , , and represent the weight, mean, and covariance of Gaussian component j, respectively. Among them, the model parameters can be estimated with the maximum likelihood estimation algorithm (MLEA) [6].
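For readers unfamiliar with the model, this joint density is usually written, in conventional notation, as a weighted sum of Gaussians over the stacked articulatory-acoustic vector z_t = [x_t^T, y_t^T]^T; a standard form (the symbols w_j, \mu_j, and \Sigma_j for the weight, mean, and covariance of component j are assumptions, since the original equation was not reproduced) is

P(z_t \mid \lambda) = \sum_{j=1}^{N} w_j\, \mathcal{N}\!\left(z_t;\ \mu_j,\ \Sigma_j\right), \qquad \sum_{j=1}^{N} w_j = 1.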

During the conversion, the input articulatory features are denoted as , and the output acoustic features are denoted as ; can be calculated by maximum likelihood estimation, as shown in the following equation, where W is the dynamic window coefficient matrix.

If we focus on only one Gaussian component, that component can be obtained through the maximum posterior probability, as shown in the following equation:

It is supposed that the frames are independent; then, for frame i,

Among them, and can be regarded as the mean and covariance matrix, respectively, and they are calculated using the following two equations:
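As a sketch of what these two quantities look like in the standard conditional-Gaussian formulation (conventional notation assumed, with the joint mean and covariance of component j partitioned into articulatory (x) and acoustic (y) blocks), the conditional mean and covariance for frame i are

E_{j,i} = \mu_j^{(y)} + \Sigma_j^{(yx)}\,\Sigma_j^{(xx)\,-1}\left(x_i - \mu_j^{(x)}\right),
\qquad
D_j = \Sigma_j^{(yy)} - \Sigma_j^{(yx)}\,\Sigma_j^{(xx)\,-1}\,\Sigma_j^{(xy)}.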

Thus, we can obtain the output vector using the maximum likelihood criterion, as shown in the following equation:

Here, denotes a square matrix and can be calculated through .
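For illustration, in the standard trajectory-generation formulation of GMM-based conversion, the maximum likelihood criterion yields a closed-form solution of the following shape (the symbols here are assumptions in conventional notation, since the original equations were not reproduced):

\hat{y} = \left(W^{\top} D^{-1} W\right)^{-1} W^{\top} D^{-1} E,

where W is the dynamic window coefficient matrix and D and E collect the conditional covariance matrices and mean vectors of the selected Gaussian components over all frames.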

2.2. LSSVM

The support vector machine (SVM) algorithm is a data-driven machine learning method based on statistical learning theory, proposed by Vapnik [19] in the 20th century. Its idea is to map the input vector to a high-dimensional feature space by a nonlinear mapping and to solve the nonlinear feature estimation and regression problems of the original space using linear machine learning methods in that space. The original SVM algorithm was used to solve classification problems in pattern recognition; later, the SVM regression algorithm was developed to solve nonlinear regression problems by defining an insensitive loss function.

SVM transforms feature prediction problems into linear or quadratic programming problems, avoiding local minima and achieving global optimization. The kernel function replaces the inner product in the high-dimensional feature space, so that the high-dimensional problem can be solved efficiently. At the same time, SVM can balance fitting ability and generalization ability by adjusting the penalty parameter (capacity control), and it has the advantages of a simple structure and good sparsity. SVM not only achieves structural risk minimization but also effectively deals with nonlinear, high-dimensional, local-minimum, and overfitting problems, and it has been successfully applied to speech recognition, speech feature prediction, and so forth.

To overcome the disadvantages of SVM, Suykens [20] proposed the least squares support vector machine (LSSVM). Compared to SVM, LSSVM is more suitable for parameter prediction and pattern recognition with small sample sizes [21].

It is supposed that the training sample set is , where N is the number of training samples, is the m-dimensional input, and is the one-dimensional target output. The nonlinear function estimation problem can then be transformed into a linear function estimation in a high-dimensional feature space using the following equation:

Here, is the weight coefficient vector, and is the mapping function. This regression problem can be represented as an optimization problem with equality constraints, the objective of which is shown in the following equations:
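In conventional LSSVM notation (assumed here as a sketch, since the equation symbols were not reproduced), the regression function and the constrained optimization problem take the following standard form:

y(x) = w^{\top} \varphi(x) + b,

\min_{w,\,b,\,e}\ J(w, e) = \tfrac{1}{2}\, w^{\top} w + \tfrac{\gamma}{2} \sum_{i=1}^{N} e_i^{2}
\quad \text{s.t.} \quad y_i = w^{\top}\varphi(x_i) + b + e_i,\ \ i = 1, \dots, N,

where \gamma is the penalty factor (regularization parameter) and e_i is the error variable of sample i.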

Then, the Lagrange method is used to solve the above optimization problem, where is a Lagrange multiplier:

The partial derivative of equation (15) is obtained according to the optimization conditions:

Then, the kernel function is defined according to the Mercer condition:
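Since the radial basis kernel is used later in this paper, its standard form is given here for reference (a conventional-notation sketch, assuming \sigma denotes the kernel width parameter):

K(x_i, x_j) = \varphi(x_i)^{\top}\varphi(x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right).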

Combining equations (16) and (17) gives the following:

Finally, the nonlinear model of LSSVM is obtained as follows:
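The resulting model has the well-known dual form (written here in conventional notation as a sketch, not as a reproduction of the missing equation):

y(x) = \sum_{i=1}^{N} \alpha_i\, K(x, x_i) + b,

where the \alpha_i are the Lagrange multipliers obtained from the linear system above.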

2.3. PSO

PSO is an evolutionary computation method proposed by Eberhart and Kennedy in 1995. Based on observations of the collective behavior of animal swarms, the algorithm uses the information shared by the particles in a group to guide the movement of each particle, generating an evolution from disorder to order in the solution space and thereby obtaining the optimal solution.

The PSO algorithm initializes the problem as a group of random particles and then finds the optimal solution through iteration. In each iteration, each particle updates itself by tracking the individual and global extrema, where the individual extremum can be expressed by the following equation:

The global extreme value can be expressed as follows:

In addition, if a small part of the population, rather than the whole population, is regarded as the neighborhood of a particle, then the extremum found among all the neighbors of the particle becomes the local extremum, which is expressed as

After the particle has iterated t times, its position is represented by equation (23) and its velocity by equation (24).

Equations (25) and (26) are used to update the velocity and position of particles iteratively.

In these two equations, is the number of iterations of the particle at this time, is the inertia weight coefficient of the particle, and are learning factors that represent the acceleration weights toward the individual extremum and the global extremum, respectively, and and are random numbers between 0 and 1; .
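The velocity and position updates corresponding to equations (25) and (26) are usually written as follows (a sketch in conventional notation, not a reproduction of the missing equations):

v_i^{t+1} = \omega\, v_i^{t} + c_1 r_1 \left(p_{best,i} - x_i^{t}\right) + c_2 r_2 \left(g_{best} - x_i^{t}\right),
\qquad
x_i^{t+1} = x_i^{t} + v_i^{t+1},

where \omega is the inertia weight, c_1 and c_2 are the learning factors, and r_1, r_2 \in [0, 1] are random numbers.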

In this paper, the RMSE of the model output is adopted as the fitness function of PSO to obtain the optimized parameters and .

3. Methods

3.1. Articulatory-to-Acoustic Conversion Model Based on LSSVM

The algorithm flow of the feature conversion model based on LSSVM is shown in Figure 1, and the specific process is as follows:
(1) Articulatory features and acoustic features were synchronously extracted from the bimodal emotional speech database.
(2) The training set and test set were randomly divided, and the feature data were normalized.
(3) The LSSVM model was initialized, and the radial basis kernel function was selected as the kernel function of the model.
(4) The parameters of the LSSVM model were optimized by cross validation, where and were the main parameters to be optimized.
(5) The articulatory features in the test set were imported into the LSSVM model for parameter prediction, and the acoustic feature probability sequence was generated.
(6) The maximum likelihood estimation algorithm (MLEA) was used to estimate the acoustic feature parameters.
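A minimal Python/NumPy sketch of steps (2)-(5) is given below. It implements a plain LSSVM regressor with an RBF kernel from first principles; all function and variable names (e.g., lssvm_fit, gamma, sigma) are illustrative assumptions rather than the authors' implementation, synthetic data stand in for the EMA features and F2, and the maximum likelihood parameter generation of step (6) is omitted.

import numpy as np

def rbf_kernel(A, B, sigma):
    # Pairwise RBF kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    # Solve the LSSVM linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]          # bias b, multipliers alpha

def lssvm_predict(X_train, alpha, b, X_test, sigma):
    # y(x) = sum_i alpha_i K(x, x_i) + b
    return rbf_kernel(X_test, X_train, sigma) @ alpha + b

# Illustrative usage with synthetic data in place of articulatory features and F2
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(224, 8)), rng.normal(size=(112, 8))
y_train = rng.normal(size=224)

# Step (2): z-score normalization using training-set statistics
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_n, X_test_n = (X_train - mu) / sd, (X_test - mu) / sd

# Steps (3)-(5): train the RBF-kernel LSSVM and predict the acoustic feature
b, alpha = lssvm_fit(X_train_n, y_train, gamma=10.0, sigma=1.0)
y_pred = lssvm_predict(X_train_n, alpha, b, X_test_n, sigma=1.0)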

In the LSSVM model, after the radial basis kernel function is selected, the values of the kernel parameter and the penalty factor must first be set. The smaller the value of is, the stronger the generalization ability and the better the smoothing ability of the model, but the weaker the fitting ability. The greater is, the smoother the training model and the stronger the generalization ability. In this paper, the PSO algorithm is used to optimize these two parameters of the LSSVM algorithm.

3.2. Optimization of PSO Algorithm

Figure 2 shows the flowchart of optimizing penalty factor and kernel parameter in LSSVM model with PSO algorithm.

It can be seen from the figure that the main steps of the PSO optimization are as follows:
(1) Initialize the algorithm; the specific content includes determining the fitness function, the population, and the movement speed.
(2) Calculate the fitness function and the scale.
(3) Determine the optimal solution of the algorithm by calculating the extrema of the particle positions and velocities and updating the velocities and positions.
(4) Obtain the output parameters and according to the optimal solution of the system.
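A compact sketch of this optimization loop is shown below, again with illustrative names only (pso_optimize, fitness); the fitness here would be the RMSE computed with the LSSVM sketch above, and the swarm settings follow those reported in Section 4 (30 particles, learning factors of 1.5).

import numpy as np

def pso_optimize(fitness, bounds, n_particles=30, n_iter=300, w=0.8, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over a box given by `bounds` = [(lo, hi), ...] with basic PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    dim = len(bounds)
    pos = rng.uniform(lo, hi, size=(n_particles, dim))   # particle positions
    vel = np.zeros((n_particles, dim))                    # particle velocities
    p_best = pos.copy()                                   # individual best positions
    p_val = np.array([fitness(p) for p in pos])
    g_best = p_best[np.argmin(p_val)]                     # global best position

    for _ in range(n_iter):
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (p_best - pos) + c2 * r2 * (g_best - pos)
        pos = np.clip(pos + vel, lo, hi)
        val = np.array([fitness(p) for p in pos])
        improved = val < p_val
        p_best[improved], p_val[improved] = pos[improved], val[improved]
        g_best = p_best[np.argmin(p_val)]
    return g_best                                          # optimized (gamma, sigma)

# Example fitness (hypothetical names, reusing the LSSVM sketch and a held-out split):
# def fitness(params):
#     gamma, sigma = params
#     b, alpha = lssvm_fit(X_train_n, y_train, gamma, sigma)
#     y_hat = lssvm_predict(X_train_n, alpha, b, X_val_n, sigma)
#     return float(np.sqrt(np.mean((y_val - y_hat) ** 2)))
# best_gamma, best_sigma = pso_optimize(fitness, bounds=[(0.1, 1000.0), (0.01, 10.0)])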

4. Experiments and Results

4.1. Material

The data used in this paper are from the TYUT bimodal Mandarin emotional speech database, which was recorded, screened, and preprocessed by us based on the theory of experimental phonetics and emotional speech [22]. The articulatory and acoustic data of the participants in this database were collected synchronously with a Carstens 3D electromagnetic articulograph (Germany), as shown in Figure 3 [23].

In this paper, 334 disyllabic-word materials from 10 participants were selected for the study, covering four emotions (happiness, anger, sadness, and neutral), as shown in Table 1.

Firstly, the articulatory data and acoustic data of the emotional speech were synchronized, with a frame length of 4 ms. Secondly, acoustic and articulatory features were extracted from the speech, respectively. The acoustic features are the 12-dimensional MFCC and the second formant (F2). The articulatory features are eight-dimensional, namely, the velocities of the tongue tip, tongue middle, and tongue root in the up-down and front-back directions, as well as the lip protrusion [24] and the lip aperture [25].

Among them, lip aperture and lip protrusion are secondary features determined by the position information of the four sensors attached to the upper and lower lips. Lip protrusion is the distance from the upper incisor to the upper lip, calculated as the Euclidean distance between the two sensors. The lip aperture can be obtained from the following equation, where and denote the z-coordinate values of the upper and lower lips, respectively, closedposition represents the distance between the upper and lower lips on the z-axis when the lips are closed, and represents the distance between the upper and lower lips on the z-axis at the maximum lip opening.
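One plausible normalized form consistent with the quantities named above (offered only as an assumption, since the original equation was not reproduced) is

LA_i = \frac{\left(z_{ul,i} - z_{ll,i}\right) - d_{closed}}{d_{open} - d_{closed}},

where z_{ul,i} and z_{ll,i} are the z-coordinates of the upper and lower lip sensors in frame i, d_{closed} is the lip distance at closure, and d_{open} is the lip distance at the maximum lip opening.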

4.2. Partition and Preprocessing of Datasets

The test set and training set were divided by random selection: 224 utterances were selected as training data and 112 as test data. Then, the features in the test and training sets were normalized.

4.3. Model Comparison of EMA-to-F2 Conversion on Emotional Speech

In the EMA-to-F2 conversion, the GMM-based, LSSVM-based, and PSO-LSSVM-based methods were compared. The RMSE and correlation coefficient between the true and predicted data were adopted to measure the conversion results. The fitness value of PSO was used to reflect the optimization ability of the PSO algorithm.

The fitness curve of PSO algorithm is shown in Figure 4.

In Figure 4, the learning factors and of PSO algorithm are both 1.5, the number of particles is 30, and the number of iterations is 300. The optimal fitness was achieved after 80 iterations of optimization.

The conversion results on the test set are shown in Figures 5 and 6. Among them, Figure 5 shows the conversion result with LSSVM, and Figure 6 shows the conversion result with the PSO-LSSVM algorithm. In order to compare the conversion results visually, only 80 randomly selected frames are plotted, and the converted acoustic feature shown is the second formant F2.

It can be seen from the figures that the LSSVM algorithm optimized by PSO performs better than the plain LSSVM algorithm. Furthermore, an error analysis of the conversion results was carried out. Here, RMSE is used as the evaluation standard for the conversion of kinematic features to F2, defined by the following equation, where and are the true and predicted acoustic features, respectively, i is the frame index, N is the number of frames in the test set, and denotes the expectation of the conversion model.
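With the symbols defined above, the standard RMSE takes the form (written here as a conventional-notation sketch rather than the paper's exact equation)

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^{2}},

where y_i and \hat{y}_i are the true and predicted F2 values of frame i.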

The conversion accuracy differs across the emotional categories, as shown in Table 2. It can be seen from the table that the conversion result of the LSSVM algorithm optimized by PSO is significantly better than that of the traditional LSSVM algorithm, while the conversion result of the LSSVM algorithm is significantly better than that of the GMM. The RMSEs of the three conversion models are 25.10 Hz, 30.20 Hz, and 36.52 Hz, respectively. In addition, anger and happiness are converted more effectively than sadness and neutral.

4.4. Model Comparison of EMA-to-MFCC Conversion on Emotional Speech

Since the converted MFCC in this paper is a multidimensional feature, MMCD [26] has been selected as the metric for evaluating the conversion from articulatory features to the 12-dimensional MFCC; it is defined as the mean over all frames of the Euclidean distance between the converted and real values.
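A commonly used form of the Mel-cepstral distortion, averaged over frames to give the MMCD (a conventional-notation sketch, not necessarily the paper's exact equation), is

MMCD = \frac{1}{N} \sum_{i=1}^{N} \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{12} \left(c_{i,d} - \hat{c}_{i,d}\right)^{2}},

where c_{i,d} and \hat{c}_{i,d} are the d-th real and converted MFCC coefficients of frame i, and N is the number of frames.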

Here, we compared the performances of the GMM-based, LSSVM-based, and PSO-LSSVM-based methods, and the comparison results are shown in Table 3.

It can be seen from the table that the conversion result of the LSSVM algorithm optimized by PSO is significantly better than that of the traditional LSSVM algorithm, while the conversion result of the LSSVM algorithm is significantly better than that of the GMM. The MMCDs of the three conversion models are 1.508 dB, 1.825 dB, and 2.384 dB, respectively. Moreover, both the GMM and PSO-LSSVM conversion models show that the conversion of anger and happiness is better than that of sadness and neutral, while in the LSSVM results the conversion of sadness is slightly better than that of anger.

4.5. Application of the Converted Features into Speech Emotion Recognition

To verify whether the converted features obtained in this paper can distinguish the emotion types in bimodal speech, in this section we used the converted features and the true features in an SVM-based emotion recognition system.

The data used in this section are exactly the same as those used in the PSO-LSSVM feature conversion system and were derived from the bimodal emotional speech database. In the experiment, the training and test sets were randomly divided, with one-third of the data selected as test data and two-thirds as training data. The recognition results are shown in Table 4.

As can be seen from Table 4, the average recognition rates of the converted F2 and MFCC features are 11.47% and 9.42% lower, respectively, than those of the original articulatory features, and neither reaches 50%. At the same time, compared with the MFCC features extracted from the corresponding audio data, the emotion recognition rate of the converted MFCC is 6.76% lower; compared with the F2 features extracted from the audio data, the recognition rate of the converted F2 is 10.65% lower. The reason the recognition rate of the converted features is lower than that of the real features is the error between the converted and real features, which requires further improvement of the conversion method to increase the conversion accuracy.

5. Discussion and Conclusion

In this paper, an articulatory-to-acoustic conversion method based on LSSVM has been proposed, and the PSO algorithm was used to optimize the model parameters, so as to convert the articulatory features of Mandarin bimodal emotional speech to the 12-dimensional MFCC and the second formant F2. The experimental results show that the LSSVM model optimized by the PSO algorithm achieves high-precision conversion results. In addition, the application of the converted features to bimodal speech emotion recognition has been compared with emotion recognition using articulatory features and real acoustic features. The average recognition rate of the converted features is lower than that of the real acoustic features and the original articulatory features, especially for the happiness and sadness emotions. This indicates that although the conversion accuracy has been improved to some extent compared with previous methods, the feature conversion of emotional speech still cannot meet the requirements of emotional speech recognition; therefore, the recognition rate remains well below 50%, and the algorithm needs to be further improved in future work on conversion methods. It also indicates that the features generated by the conversion system cannot be applied to emotion recognition as independent features but should be fused with kinematic and acoustic features for emotion recognition.

Research on articulatory-to-acoustic conversion of emotional speech can lay a foundation for research on bimodal emotional speech recognition. Of course, there are still some deficiencies in this paper. Firstly, the conversion accuracy of the features can be further improved. Secondly, the set of target conversion features can be further enriched.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

Thanks are due to all the subjects in the current experiment, to Guicheng Shao for technical assistance, to Jianmei Fu and Yanqin Xun for modal design, and to Jianzheng and Dong Li for assistance in data collection. This work was supported by the Science and Technology Project of Xinzhou Teachers University (2018KY15 and 2018KY22), the Educational Reform Innovation Project of Shanxi Province of China (J2019174), and the Academic Leader Project of Xinzhou Teachers University.