Abstract

To advance lip-reading research toward the norms of Chinese pronunciation, we investigated Mandarin tone recognition from visual information, in contrast to previous character-based Chinese lip-reading techniques. In this paper, we mainly study the tonal transformation of vowels in Chinese pronunciation and design a lightweight skipping convolutional network framework (SCNet). The experimental results show that the SCNet is more sensitive to detailed descriptions of pitch change than traditional models and achieves better tone recognition and outstanding anti-interference performance. In addition, we conducted a more detailed study of how deep texture information assists lip-reading recognition. We found that deep texture information has a significant effect on tone recognition, confirming the potential of multimodal lip reading for Chinese tone recognition. We also verified the role of the SCNet in syllable tone recognition and found that the vowel and syllable tone recognition accuracy of our model reached 97.3%, which demonstrates the robustness of the proposed method and its broad applicability to Chinese tone recognition.

1. Introduction

In recent years, the superior performance of lip reading in robust speech recognition has received widespread attention. The goal of lip reading is to improve the robustness of speech recognition in special situations such as low signal-to-noise ratio (SNR) or silent environments. However, due to the complexity and variability of Chinese pronunciation, the performance of lip-reading recognition in Chinese is not always satisfactory in real-world scenarios.

One of the most important tasks in lip-reading recognition is feature extraction. Currently, visual information extraction in lip-reading systems falls into two main categories, i.e., pixel-based methods and model-based methods. Pixel-based methods extract visual features from the image directly or after some preprocessing and transformation. Yuhas et al. [1] used the greyscale pixel information of the lips and their surrounding areas as features. Wolff et al. [2] used the horizontal and vertical scanning lines centred on the lips as the eigenvector. Since directly using the pixel information of the image as a feature is blind, more effective and targeted approaches, such as the discrete cosine transform (DCT), principal component analysis (PCA), singular value decomposition (SVD), the discrete wavelet transform (DWT), and linear discriminant analysis (LDA) [3–5], were proposed to reduce information redundancy. Pixel-based methods can make full use of pixel information to extract more comprehensive lip features, but the resulting feature vectors are high dimensional and redundant, and these methods are very sensitive to light, shadow, pronunciation, and other conditions. Model-based methods, in contrast, aim to establish a parametric mathematical model and then use the model parameters to describe lip contour information. Kaynak et al. [6] used the horizontal and vertical distances of the lip contour, the lip corner angle, and the first-order derivative of the lip corner angle. Zhang et al. [7] proposed geometric lip features, including the mouth width, the upper/lower lip width, the lip opening, and the distance between the horizontal lip line and the upper lip. Model-based methods express image content with low-dimensional features that are typically invariant to factors such as translation, rotation, scaling, and illumination. Nevertheless, both kinds of methods extract the relevant information directly from the region of interest (ROI) in a planar image [8].

With the development of high-sensitivity RGB-D cameras, the three-dimensional information of the speaker’s face can be extracted more accurately. For instance, Yargıç and Muzaffer [9] developed a lip-reading system that uses a Kinect camera to acquire depth feature points and then extracts angular lip-reading features. Palecek et al. [10] studied the fusion performance of face depth data in isolated-word visual speech recognition tasks. Rekik et al. [11, 12] proposed an adaptive lip-reading system based on image and depth data. Wang et al. [13] used 3D lip points obtained from a Kinect to improve the performance of multimodal speech recognition. These pioneering studies have demonstrated the effectiveness of depth information in lip-reading recognition. Since depth information is not affected by illumination, skin colour, etc. [14], it compensates for the defects of two-dimensional image information. However, because lip characteristics are usually obtained from discrete three-dimensional points or facial depth images, it is difficult for them to fully represent the lips.

Currently proposed lip-reading recognition based on 3D depth information does not consider the facial texture that inherently drives lip motion during natural speech. In our previous work [15], to explore the internal mechanism of the speech process, we conducted an in-depth study of the facial texture information that drives lip movement and showed that it has a significant influence on lip movement changes in Chinese vowel pronunciation. However, since Chinese is a strictly tonal language, pitch transformation plays a significant role in the understanding of Chinese. Therefore, exploring Chinese tonal transformation in 3D-information-based lip-reading research is important.

In this work, we focus on the study of vowel tonal changes in Chinese pronunciation. Our main contributions are as follows. (1) For Chinese tonal changes, we propose a new lightweight network framework, the SCNet, which is more sensitive to fine-grained transformations than traditional network architectures. (2) We explore in detail how our proposed deep facial texture information assists lip reading in recognizing vowel tone changes. (3) In syllable tone recognition with depth texture, the experimental results show the generality and good performance of the SCNet model in integrated tone recognition.

The rest of this paper is organized as follows. Section 2 introduces the data collection and preprocessing. Section 3 presents the proposed model architecture. Section 4 introduces our experimental results. Section 5 summarizes our work and discusses future work.

2. Data Collection and Feature Preprocessing

2.1. Data Collection

Eight native speakers of Chinese, four males and four females, served as the subjects. All subjects used standard Mandarin pronunciation without any accent influence. In Chinese pronunciation, each syllable has four different pitch contours (tones 1–4). There is in fact a fifth pronunciation type, the neutral (unvoiced) tone, a special light, toneless pronunciation that occurs in Chinese. In order to explore the effects of the different pitch transformations, we excluded this rarely pronounced neutral tone, so in the experiment each syllable carried one of the four commonly used tones. For the experimental data, we collected 5 vowels (/a/, /e/, /i/, /o/, and /u/) and 5 syllables (/ta/, /te/, /ti/, /fo/, and /tu/), for a total of 40 toned items. During the recording process, each toned item was pronounced 10 times per person. For example, four toned syllables (/ā/, /á/, /ǎ/, and /à/) were obtained by combining the four lexical tones with the atonal syllable /a/.

The data acquisition device was a Microsoft Kinect V2 real-time face-tracking camera, which generates a real-time 3D point cloud from facial key points (1347 points). In [16], Mallick et al. showed that facial expression recognition based on such point clouds is successful and verified that the generation of the 3D face point cloud is related to the muscle distribution. Their experiments also showed that the point cloud generation is independent of face shape and remains very stable at the same positions across different faces. Meanwhile, [17, 18] also demonstrated the stability and effectiveness of the Kinect V2. To ensure data quality, we collected the data in a standard silent room. The data collection scenario is shown in Figure 1.

During this process, we reindexed the 1347 points. The indices of the feature points in the lip area are shown in Figure 2(b); only the collected image information and the 3D depth information were used. To account for changes in the head pose during movement, we corrected the head rotation angle about the X, Y, and Z directions. As an example, the angle $\theta$ between the vector $\overrightarrow{P_1P_2}$ ($P_1$ and $P_2$ are two points in Figure 2(b)) and the XY plane is calculated as

$\theta = \arcsin\!\left(\dfrac{\left|z_{P_2} - z_{P_1}\right|}{\left\|\overrightarrow{P_1P_2}\right\|}\right), \qquad (1)$

where $(x_{P_1}, y_{P_1}, z_{P_1})$ and $(x_{P_2}, y_{P_2}, z_{P_2})$ are the coordinates of $P_1$ and $P_2$, and $(x_{P_1}, y_{P_1})$ and $(x_{P_2}, y_{P_2})$ are their projections onto the XY plane. The rotated face point coordinates parallel to the XY plane are constructed by the following algorithm.
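As an illustration only, the following Python sketch shows one way such a pose correction could be implemented with NumPy; the reference-point indices and the choice of a single Y-axis rotation are simplifying assumptions, not the authors' exact algorithm.

```python
import numpy as np

def angle_to_xy_plane(p1, p2):
    """Angle (radians) between the vector p1->p2 and the XY plane, as in Eq. (1)."""
    v = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    return np.arcsin(abs(v[2]) / np.linalg.norm(v))

def rotate_points(points, axis, theta):
    """Rotate an (N, 3) point set by theta radians about the 'x', 'y', or 'z' axis."""
    c, s = np.cos(theta), np.sin(theta)
    mats = {
        "x": np.array([[1, 0, 0], [0, c, -s], [0, s, c]]),
        "y": np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]]),
        "z": np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]]),
    }
    return points @ mats[axis].T

def level_face(points, i, j):
    """Rotate the face point set so that the reference vector between points i and j
    becomes parallel to the XY plane (the reference indices are placeholders)."""
    v = points[j] - points[i]
    theta = np.arctan2(v[2], v[0])      # a rotation about Y removes the Z offset of the vector
    return rotate_points(points, "y", theta)
```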

Finally, we acquired the standard point set of the real speaker’s face.

2.2. Feature Preprocessing
2.2.1. Image Feature Preprocessing

For the collected image information, we used the open-source OpenCV library to crop a lip region of interest, as shown in Figure 4(a), and then applied the image sequence representation method proposed by Saitoh et al. For each pronounced syllable, 16 consecutive frames were extracted from the middle of the pronunciation to form a continuous sequence of lip-motion images, arranged from left to right and top to bottom, and a gamma transform was used for light enhancement to augment the data, as shown in Figure 4(b) (16 frames are taken and then ordered).
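The following sketch, using OpenCV and NumPy, illustrates this kind of preprocessing. The 4 × 4 tiling, the 64 × 64 crop size, the ROI bounding box, and the gamma value are illustrative placeholders rather than the paper's exact settings.

```python
import cv2
import numpy as np

def gamma_transform(img, gamma=1.5):
    """Light enhancement via a gamma (power-law) transform on a uint8 image."""
    table = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    return cv2.LUT(img, table)

def build_mosaic(frames, roi, rows=4, cols=4):
    """Crop the lip ROI from the 16 middle frames and tile them
    left-to-right, top-to-bottom into a single input image."""
    x, y, w, h = roi                      # lip bounding box (placeholder values)
    mid = len(frames) // 2
    picked = frames[mid - 8 : mid + 8]    # 16 consecutive central frames
    crops = [cv2.resize(f[y:y + h, x:x + w], (64, 64)) for f in picked]
    grid = [np.hstack(crops[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(grid)
```

A brightened copy for augmentation can then be obtained with, e.g., gamma_transform(mosaic, gamma=1.5).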

2.2.2. Muscle Dynamics Features

According to this research, there are six main types of facial muscles that drive lip movement. Table 1 presents the distribution and function of each muscle, and Table 2 lists the name of each muscle and the feature-point indices associated with it in the Kinect data. For the depth texture feature representation, we extracted the two most representative depth features of the data points: the muscle length change and the muscle dynamics.

(1) Muscle Length Change Information. The length feature is expressed as

$\Delta L = L_s - L_r, \qquad (2)$

where $L_s$ represents the muscle length vector at the time of speech and $L_r$ represents the muscle length vector at the time of relaxation; taking the deviation from the resting state eliminates the differences between different speakers.

(2) Muscle Dynamics Information. The muscle dynamics information characterizes the relationship between the facial muscles and the facial feature points and reflects the intrinsic commonality between different speakers. We also analysed the effects of the different muscles, as the drivers of the dynamic transformation, on the displacement of the feature points. For this feature, the variation vector of each muscle is obtained by calculating the transformation trend of the corresponding feature points in adjacent frames:

$\mathbf{d}_i^{\,t} = \mathbf{p}_i^{\,t+1} - \mathbf{p}_i^{\,t}, \qquad m_i^{\,t} = \mathbf{d}_i^{\,t} \cdot \dfrac{\mathbf{e} - \mathbf{s}}{\left\|\mathbf{e} - \mathbf{s}\right\|}, \qquad (3)$

where $\mathbf{d}_i^{\,t}$ represents the momentum change of the feature point $\mathbf{p}_i$ between adjacent frames, $\mathbf{s}$ and $\mathbf{e}$ represent the start and end points, respectively, of the muscle, the direction of the muscle movement at each point is represented by decomposing the displacement vector of the point along the muscle, and $\left\|\mathbf{d}_i^{\,t}\right\|$ indicates the length of movement of each muscle point.
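A minimal NumPy sketch of these two features is given below; the function names, the index layout, and the projection-based decomposition follow our reading of the description above and of Eqs. (2) and (3), so they should be taken as an illustration rather than the exact implementation.

```python
import numpy as np

def length_change(points_speech, points_rest, muscle_endpoints):
    """Muscle length change (Eq. (2)): speaking-time length minus resting length,
    which removes per-speaker baseline differences."""
    deltas = []
    for s_idx, e_idx in muscle_endpoints:          # (start, end) point index per muscle
        l_speech = np.linalg.norm(points_speech[e_idx] - points_speech[s_idx])
        l_rest = np.linalg.norm(points_rest[e_idx] - points_rest[s_idx])
        deltas.append(l_speech - l_rest)
    return np.array(deltas)

def muscle_dynamics(frame_t, frame_t1, point_idx, s_idx, e_idx):
    """Displacement of one feature point between adjacent frames,
    decomposed along the direction of the muscle it belongs to (Eq. (3))."""
    d = frame_t1[point_idx] - frame_t[point_idx]   # inter-frame displacement vector
    axis = frame_t[e_idx] - frame_t[s_idx]
    axis = axis / np.linalg.norm(axis)
    along = float(d @ axis)                        # component along the muscle direction
    return along, float(np.linalg.norm(d))         # and the length of movement of the point
```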

3. Network Architecture

Considering the subtle differences in mouth shape across Chinese tonal changes, we designed a lightweight skip convolutional structure network (SCNet) that captures fine feature changes, in order to evaluate our proposed 3D lip features and to explore the feasibility of tone-change and syllable lip-reading recognition. The overall architecture is shown in Figure 3.

The network architecture was inspired by VGG [19] and ResNet [20]. In the initial phase of the network, we used three convolutional layers with a stride of 2 to extract the surface features of the image. This structure reduces the overall number of network parameters while limiting the loss of accuracy in the feature maps.

The main body of the network consists of two connected feature extraction blocks, which differ from the conventional residual block structure. The two blocks adopt different downsampling operations. At the end of block 1, max pooling is used to make the edge features more obvious and to highlight the distinctive responses of the different feature maps. At the end of block 2, global average pooling is used so that the feature maps are aggregated more smoothly and effectively. The two blocks also differ slightly in structure: to maximize the smoothing effect after block 1, the output channels of the last convolutional layer in block 2 are doubled, while the rest is the same as in block 1. This structure also showed good performance in the experiments. At the end of the network, a 128-dimensional linear layer is connected, after which the classification probability is obtained.

3.1. Skip Convolution Structure

We used a skip connection in each block. The structure of each block is shown in Figure 3(b), and the connection of each block is defined as

$y = \mathcal{F}\!\left(x, \{W_i\}\right) + \mathcal{G}(x), \qquad (4)$

where $x$ and $y$ represent the input and output, respectively, of each block and $\mathcal{F}$ represents the learning function of the direct connection. As Figure 3(b) shows, the direct connection is composed of three convolution layers, so $\mathcal{F}$ is specifically expressed as $\mathcal{F}(x) = W_3\,\sigma\!\left(W_2\,\sigma\!\left(W_1 x\right)\right)$, in which $\sigma$ is LeakyReLU, and $\mathcal{G}$ is the skip connection, a single-layer connection given by $\mathcal{G}(x) = W_s x$. Since a regularization (batch normalization) layer is introduced, the bias term is omitted to reduce the number of parameters in this architecture. Finally, the $+$ operation denotes the direct element-wise addition of the direct and skip connections, rather than the concatenation of their results.

Equation (4) mainly comprises two parts: the direct-connection structure and the skip structure. In the direct connection, we first use a convolution, followed by a strided convolution to obtain more detailed feature information, and then a convolution with a stride of 1 that mimics the effect of the Sobel operator on feature boundaries. This structure makes the boundary features more obvious, so that the features are better characterized in the decision regions. In the skip module, we use a strided convolution block and increase the number of channels, so that the skip branch produces feature maps with the same number of channels and the same spatial size as the direct branch, which makes feature fusion convenient and ensures that the image features are fused at the structural level. The purpose of the traditional residual block is to combine the characterization of local structure and global features so that the network is more representative. In our structure, the skip convolution retains the global features while the direct connection captures the local detail, which gives the block a multiscale representation.
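To make the block and the overall layout concrete, the following PyTorch sketch instantiates one possible realization of Eq. (4) and of the architecture described above. The kernel sizes, strides, and channel widths are our assumptions, since the exact values are not fully specified in the text, so this should be read as a sketch rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SkipConvBlock(nn.Module):
    """One SCNet block (Eq. (4)): a three-convolution direct path plus a strided
    1x1 skip convolution, combined by element-wise addition (not concatenation)."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        def conv_bn(ci, co, k, s):
            # bias omitted because every convolution is followed by batch normalization
            return nn.Sequential(
                nn.Conv2d(ci, co, k, stride=s, padding=k // 2, bias=False),
                nn.BatchNorm2d(co))
        self.direct = nn.Sequential(
            conv_bn(c_in, c_out, 1, 1), nn.LeakyReLU(inplace=True),
            conv_bn(c_out, c_out, 3, stride), nn.LeakyReLU(inplace=True),
            conv_bn(c_out, c_out, 3, 1))              # boundary-sharpening layer
        self.skip = conv_bn(c_in, c_out, 1, stride)   # match channels and spatial size
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.direct(x) + self.skip(x))

class SCNet(nn.Module):
    """Overall SCNet sketch: a three-convolution stride-2 stem, block 1 followed by
    max pooling, block 2 (doubled output channels) followed by global average
    pooling, and a 128-dimensional linear layer before the classifier.
    Channel widths are placeholders, not the paper's exact configuration."""
    def __init__(self, n_classes, in_channels=3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.LeakyReLU(inplace=True),
            nn.Conv2d(32, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.LeakyReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.LeakyReLU(inplace=True))
        self.block1 = SkipConvBlock(64, 64)
        self.pool1 = nn.MaxPool2d(2)                  # sharpen edge responses after block 1
        self.block2 = SkipConvBlock(64, 128)          # output channels doubled
        self.pool2 = nn.AdaptiveAvgPool2d(1)          # global average pooling after block 2
        self.embed = nn.Linear(128, 128)              # 128-dimensional feature layer
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):
        x = self.pool1(self.block1(self.stem(x)))
        x = self.pool2(self.block2(x)).flatten(1)
        feat = self.embed(x)                          # feature later used for fusion
        return self.classifier(feat)
```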

3.2. Feature Fusion Structure

The feature fusion structure is expressed as

$f = \Phi\!\left(f_v, f_d\right). \qquad (5)$

To better integrate the depth information and the picture information, we adopted a decision fusion method to deeply integrate the two kinds of information. In formula (5), $f_v$ represents the 128-dimensional feature acquired by the SCNet, $f_d$ represents the depth feature obtained by passing the shallowly concatenated depth features through two fully connected layers, and $\Phi$ indicates the fusion strategy. The fused feature $f$ was then decoded by a single linear layer to produce the output.
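A corresponding PyTorch sketch of this fusion head is shown below, assuming concatenation as the fusion strategy $\Phi$ and placeholder layer widths.

```python
import torch
import torch.nn as nn

class DecisionFusion(nn.Module):
    """Sketch of the decision-level fusion in Eq. (5): the 128-d SCNet visual
    feature is fused with a depth feature passed through two fully connected
    layers, then decoded by a single linear layer. Concatenation is assumed
    as the fusion strategy; hidden widths are placeholders."""
    def __init__(self, depth_dim, n_classes, visual_dim=128, hidden=64):
        super().__init__()
        self.depth_mlp = nn.Sequential(
            nn.Linear(depth_dim, hidden), nn.LeakyReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.LeakyReLU(inplace=True))
        self.decoder = nn.Linear(visual_dim + hidden, n_classes)

    def forward(self, f_visual, f_depth):
        f_d = self.depth_mlp(f_depth)                 # shallow depth embedding
        fused = torch.cat([f_visual, f_d], dim=1)     # fusion strategy Phi
        return self.decoder(fused)                    # one-layer decoding
```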

3.3. Implementation Detail

In the experiments, the input images were resized to a fixed size before being fed to the network; because the images were adjusted before input, no additional data augmentation was applied during training. Batch normalization (BN) [21] was adopted after each convolution and before the activation. The network weights were randomly initialized, and the network was trained from scratch. The Adam optimizer was used, with a mini-batch size of 30. The learning rate was decayed from its initial value according to

$lr_t = \gamma \cdot lr_{t-1}, \qquad (6)$

where $lr_{t-1}$ represents the learning rate of the previous round, the decay is applied once every fixed number of iterations, and $\gamma$ is the damping coefficient. We did not use dropout during the implementation.
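The training setup can be sketched as follows; the initial learning rate, the decay interval, and the damping coefficient $\gamma$ stand in for values elided in the text, and the model is assumed to take an image mosaic together with a depth-feature vector (for example, the SCNet plus the fusion head above).

```python
import torch
import torch.nn.functional as F

def train(model, loader, num_epochs, lr=1e-3, decay_every=10, gamma=0.5):
    """Illustrative training loop: Adam optimizer, mini-batches of 30 samples,
    and step learning-rate decay lr_t = gamma * lr_{t-1} (Eq. (6)). The decay
    is applied per epoch here for simplicity; the paper decays once every
    fixed number of iterations."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_every, gamma=gamma)
    for _ in range(num_epochs):
        for images, depth_feats, labels in loader:    # loader yields batches of 30
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images, depth_feats), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```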

4. Experiments and Results

In the experiments, to verify the stability of the proposed model over the whole dataset, we adopted five-fold cross-validation and report the average of all runs as the final experimental result.

4.1. Cross-Validation

To make full use of the data and ensure the reliability of the experimental results, we designed a 5-fold cross-validation. All the experimental data were randomly divided into 5 parts, with 1860 groups of data in each part. In each round, four parts were used for training and the remaining part for testing; the experiment was performed for a total of 5 rounds, so that every part served once as the test set, and each round gave an independent result.
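A schematic version of this protocol, using scikit-learn's KFold for the split, might look as follows; the train_and_evaluate helper and the data containers are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, labels, train_and_evaluate, seed=0):
    """Illustrative 5-fold protocol: shuffle once, split into five folds, let each
    fold serve once as the test set, and average the five accuracies."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kfold.split(samples):
        acc = train_and_evaluate(samples[train_idx], labels[train_idx],
                                 samples[test_idx], labels[test_idx])
        scores.append(acc)
    return float(np.mean(scores))
```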

Because vowels play a leading role in the whole pronunciation process, we first examined the recognition accuracy of each vowel, in order to compare the overall syllable recognition performance with the recognition of individual syllables, and then further analysed the tone recognition of vowels carrying different tones. Using these different speech units, and ignoring the unvoiced (neutral) tone in Chinese pronunciation, we verified that our proposed SCNet achieves strong results in both tone recognition accuracy and whole-syllable recognition accuracy.

4.2. Vowel Detection and Vowel Tone Detection

We first verified the validity of our proposed model by comparing it with traditional models (VGG, ResNet, and DenseNet [22]); in addition, we tested the effect of the different models on vowel recognition and vowel tone recognition. To ensure a fair comparison, a linear layer and a softmax classification layer were added to each traditional model, and the optimal parameter settings were selected.

Figures 5 and 6 show the single-vowel recognition results and the vowel tone recognition results, respectively. Comparing the two figures, we found that all the models showed good recognition performance; in particular, the proposed SCNet reached a vowel recognition rate of almost 100%, and its tone recognition performance was significantly higher than that of the traditional model structures. A comparison of the overall results of the models in terms of network depth, parameters, and accuracy is shown in Table 3. The SCNet achieved the best values in all three respects, especially in the number of parameters: compared with the previous models, the SCNet has only about 1/50 of the parameters of VGG, 1/4 of those of ResNet, and 1/3 of those of DenseNet, while obtaining even better experimental results. These results indicate that the designed model is advantageous for processing real-time data and performs better than the existing traditional frameworks.

We attribute this experimental result to the sensitivity of the SCNet architecture to the subtle transformations present in the dataset: the architecture describes the fine details of the data well.

Overall, the excellent experimental results can be attributed to the following characteristics of the network structure: (1) in tone recognition, the difference in mouth shape between different tones of the same syllable is very small, and we used 3 × 3 filters in the experiments; such small convolution kernels enhance the discrimination of fine feature structures. (2) As verified previously, skip convolutions preserve the feature transformations between feature maps and are more conducive to gradient propagation than traditional direct connections; the skip connection proposed in this paper allows the network to capture more delicate structural features and thus improves fine-grained discrimination. (3) Using different downsampling methods in the different structural blocks aids feature selection, highlights the propagation of different features, and makes the network structure smoother, which is more conducive to expressing different detailed features.

4.3. Texture Depth Information Fusion

To better verify the validity of the depth texture information in tone recognition, we designed a series of experiments to confirm the correctness of our conjecture.

The results of tone recognition using pictures only and tone recognition after fusing the depth information are shown in Figure 7. After fusion of the texture depth information, the recognition rate increased by 2% over the image-only result, and the improvement was especially obvious when the image-only recognition rate was low, which indicates that our proposed 3D depth texture information significantly assists tone recognition. This effect occurs because image-based features are not sufficient to fully represent continuous lip motion: tone recognition from colour-image features is sensitive to lighting, speaker skin colour, and camera acquisition quality. Three-dimensional information, by contrast, offers good anti-interference against these factors and is hardly affected by them. Our proposed facial texture depth information largely compensates for the environmental shortcomings of image-only lip-based tone recognition and complements the image-only approach.

Figure 8 shows the recognition results of the models under different types of added noise. In the experiment, random Gaussian noise was added to simulate recognition scenarios with different photographic definitions, and a gamma transform was used to mimic the lighting changes that occur in real life. Adding such dynamic noise better reflects the robustness of the different models in natural scenes and the generalization ability of the different frameworks. Notably, the performance of the proposed SCNet was much higher than that of the traditional models, which shows that our framework has better application performance in real-world scenarios. Similarly, comparing the recognition performance before and after adding the texture depth information, there was a stable improvement of more than 0.5% after fusing the depth information, indicating that the fused depth information is all the more meaningful for recognition in real scenes.
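For clarity about the perturbation setup, the following sketch shows how such noise could be injected; the noise standard deviation and the gamma range are illustrative stand-ins for the elided values.

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0):
    """Simulate lower camera quality with zero-mean Gaussian noise;
    sigma is a placeholder for the elided noise setting."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_gamma(img, low=0.7, high=1.5):
    """Simulate lighting changes with a random gamma transform;
    the gamma range is illustrative, not the paper's exact value."""
    gamma = np.random.uniform(low, high)
    return ((img / 255.0) ** gamma * 255).astype(np.uint8)
```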

4.4. Syllable Recognition

Since tone changes occur in all Chinese pronunciations and consonants are attached to vowels, syllable recognition is more difficult than vowel recognition. To further verify the effectiveness of our proposed SCNet in recognizing all Chinese tones, we also evaluated the model on 40 mixed tones based on 5 vowels (/a/, /e/, /i/, /o/, and /u/) and 5 syllables (/ta/, /te/, /ti/, /fo/, and /tu/).

The recognition results are shown in Figure 9. Although the pitch recognition of syllables is theoretically more difficult, our SCNet model remained robust and obtained a high recognition rate of 97.364%, indicating that our model achieves not only good vowel tone recognition but also excellent Chinese syllable tone recognition. Moreover, after adding the depth texture information, the average pitch recognition result improved by 0.2%. Since the pronunciation of a syllable is more complicated than that of a vowel and involves more articulatory organs, the facial depth texture information may have a greater impact on syllable recognition. Together with our earlier conjectures, this indicates that deep texture information clearly assists Chinese lip reading for both syllable and vowel tone recognition.

5. Summary

This work focused on the difficulty of tone recognition in Chinese lip-reading recognition. We designed an efficient lightweight network framework, the SCNet, based on a comprehensive and effective lip-reading feature extraction method, and verified its effectiveness through several experiments. Comparison experiments showed that the framework can accurately identify the tones of Chinese pronunciation. In addition, the fusion of facial texture depth information with picture information demonstrated the feasibility of using facial texture depth information to help recognize Chinese tones.

With the wide application of depth cameras in video equipment, lip reading will better assist speech recognition in the future and improve its robustness in different environments. The dataset used in this paper consisted of independent syllables, but the results show that the proposed method is practical and can be effectively applied to future large-scale datasets.

Data Availability

The data used to support the findings of this study are available from the first author upon request.

Conflicts of Interest

The authors declare no potential conflicts of interest with respect to the authorship and/or publication of this article.

Acknowledgments

This study was financially supported by the National Natural Science Foundation of China (grant no. 61977049) and by the Tianjin Key Laboratory of Advanced Networking.