Abstract

Nonnative Mandarin speakers often pause unnaturally when speaking Mandarin because of their native pronunciation habits. Accurately predicting the prosodic structure of Chinese sentences is key to improving their fluency. This paper investigates the influence of Chinese prosodic boundaries on the Mandarin spoken by international students. First, we propose a new method that automatically predicts prosodic word and prosodic phrase boundaries from Chinese sentences. Then, we use the predicted boundaries to improve the spoken Mandarin fluency of international students. To train the prosodic boundary prediction model, we first constructed a Chinese prosodic boundary corpus of 100,000 Chinese sentences whose prosodic boundaries were labeled manually under the guidance of a linguist. We also propose an end-to-end Chinese prosodic boundary prediction model based on the sequence-to-sequence model with a new feature named number of syntax hierarchy (NSH). Finally, we assess Mandarin fluency using 1300 utterances recorded by six international students and a native Mandarin speaker, read without and with the predicted prosodic boundaries. The experimental results show that the F1 scores of the prosodic word prediction model and the prosodic phrase prediction model are 98.14% and 85.24%, respectively. The fluency assessment results show that the international students score higher when they read text labeled with the prosodic boundaries than when they read freely, with an improvement of between 7 and 16 points. Therefore, our method can be applied in Mandarin education systems to improve the spoken Mandarin fluency of nonnative speakers.

1. Introduction

With the gradual increase in the number of international students coming to China, teaching Chinese to international students has become one of the priorities of higher education in China. However, because of the particular prosodic structure of Chinese, international students with limited Chinese proficiency often speak Mandarin without fluency. The prosodic structure of Mandarin is mainly reflected in speaking rate and pauses. An effectively organized prosodic structure conveys the emotions and thoughts of the speaker [1]. Prosodic boundaries divide continuous speech into prosodic units of various sizes and so determine where pauses fall in a sentence, which directly affects how the speech is understood. For example, “打/死老虎” (hit the dead tiger) and “打死/老虎” (kill the tiger) have different meanings because of the pauses produced by the different prosodic boundaries [2].

Different languages have different prosodic segmentations, and the same language yields different prosodic hierarchies under different segmentation methods. For example, according to syntactic criteria, the English prosodic structure can be divided, in descending order, into phonological utterance, intonation phrase, phonological phrase, prosodic word, foot, syllable, and mora [3]. According to the intonation mode, the English prosodic structure can be divided into the middle phrase and intonation phrase [4]. For the Chinese prosodic structure, Li [5] followed the prosodic structure of English and divided the Chinese prosodic levels, from small to large, into the foot, prosodic word, minor prosodic phrase, major prosodic phrase, and intonation phrase. Zheng [6] proposed a more detailed division into six levels: degenerate syllable, normal syllable, minor phrase, main phrase, breath group, and rhythm group. From the perspective of phonology, the prosodic hierarchy from small to large includes mora, syllable, foot, intonation segment, and tone group segment [7]. Chinese scholars generally agree that the prosodic hierarchy is mainly divided into three levels: prosodic word, prosodic phrase, and intonation phrase [8]. In general, a prosodic word is a group of syllables that are closely and continuously connected in pronunciation. It usually consists of two or three syllables with no pause inside it, while a prosodic phrase consists of one or a few prosodic words, with a relatively stable phrase intonation mode and phrase stress configuration mode. There are usually apparent pauses between intonation phrases. In oral communication, the speaker helps the listener understand an utterance by placing pauses or accents at prosodic boundaries [9].

In the past, many researchers used rule-based methods or traditional machine learning-based methods to predict Mandarin prosodic structure. C4.5 and transformation-based learning (TBL) [10] are typical rule-based learning algorithms. The maximum entropy model (ME) [11] and conditional random field (CRF) [12] have been used to predict the prosodic phrase boundary. The rule templates of these methods are determined manually. Although such systems are relatively simple and easy to understand and implement, the manual work imposes many limitations. In recent years, deep learning-based methods have been widely used for prosodic boundary prediction, such as recurrent neural networks (RNN), bidirectional long short-term memory (BLSTM), and the BLSTM-CRF model formed by combining BLSTM and CRF [13–16]. Du et al. [17] applied the self-attention mechanism [18] to the task of prosodic structure prediction. Pan et al. [19] proposed a Mandarin prosodic boundary prediction model based on a multitask learning (MTL) architecture. Lu et al. [20] modeled the relationships between prosodic boundaries and lexicon words for prosodic boundary prediction by combining self-attention with MTL and setting word segmentation as an auxiliary task. Because the prosodic structure of Mandarin is related to the syntactic structure [21], the traditional shallow linguistic features have been augmented or replaced by embedding features and some syntactic features [22–26]. There is still room to improve the prosodic prediction of Chinese and to use the predicted results to improve international students’ spoken Mandarin.

This paper proposes a new Chinese prosodic boundary prediction method to help international students improve their fluency in speaking Mandarin. We first constructed a corpus for modeling the Chinese prosodic structure. Then, we proposed a new method to predict the boundaries of both prosodic words and prosodic phrases in Chinese sentences. Finally, we used the predicted results to help international students studying Mandarin. The contributions of this paper are as follows:
(1) We constructed a Chinese prosodic boundary corpus in which lexicon words carry prosodic word (PW) and prosodic phrase (PPH) labels, produced under the guidance of a linguistic expert
(2) We proposed an end-to-end Chinese prosodic prediction method by adding a new feature named the number of syntax hierarchy (NSH)
(3) We assessed how the results of prosodic boundary prediction improve the fluency of international students in speaking Mandarin

The rest of the paper is organized as follows. We first introduce temporal sequence modeling in Section 2. Then, we present our framework for assessing the fluency of international students in speaking Mandarin and explain each module in Section 3. The experimental setup and results are presented and discussed in Section 4. Finally, brief conclusions and future work are provided in Section 5.

2. Temporal Sequence Modeling

For temporal sequence modeling, the RNN plays an important role. However, the standard RNN structure has limited ability to model long-term dependencies. Therefore, a variant of the RNN, named long short-term memory (LSTM), was proposed to solve this problem effectively. This section briefly describes the LSTM and BLSTM networks and introduces how attention mechanisms address the loss of hidden state information in LSTM networks.

2.1. LSTM

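The key components of LSTM are memory cells and gates. Figure 1 shows the memory cell of a single LSTM. This structure can retain information over many time steps and can effectively capture long-term temporal dependencies. The calculation details of LSTM are as follows:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i), \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o), \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the sigmoid function and $f_t$, $i_t$, $\tilde{c}_t$, $c_t$, and $o_t$ are the outputs of the forget gate, input gate, candidate cell memory, cell memory, and output gate at time $t$. $h_t$ and $x_t$ stand for the hidden layer output and the input vector at time $t$. $h_{t-1}$ and $x_t$ together form the input for the current moment, and $W$ and $b$ are the weight parameter matrix and the bias vector. $\odot$ represents the element-wise product.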
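As an illustration, the gate computations of a standard LSTM cell can be sketched in NumPy as follows. This is a minimal sketch, not the implementation used in this work: the function names, the stacking of the four gate weight matrices into a single `W`, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenation [h_prev, x_t] to four stacked pre-activations
    (forget, input, candidate, output); b is the matching bias vector.
    This stacking is an illustrative choice, not taken from the paper.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * n:1 * n])       # forget gate
    i_t = sigmoid(z[1 * n:2 * n])       # input gate
    c_tilde = np.tanh(z[2 * n:3 * n])   # candidate cell memory
    o_t = sigmoid(z[3 * n:4 * n])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde  # element-wise cell update
    h_t = o_t * np.tanh(c_t)            # hidden layer output
    return h_t, c_t

# Toy run over a random sequence of four input vectors.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 3
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(4, input_dim)):
    h, c = lstm_step(x, h, c, W, b)
```

Because the output gate lies in (0, 1) and the hyperbolic tangent in (-1, 1), every component of the hidden output stays strictly between -1 and 1.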

2.2. BLSTM

The disadvantage of LSTM is that it can only access previous inputs. BLSTM uses a bidirectional structure to access both the preceding and the succeeding inputs. We unfold the time steps of the BLSTM forward and backward, as shown in Figure 2. Every box represents an LSTM memory cell. The forward LSTM reads the input sequence $x_1, \ldots, x_T$ and calculates the hidden state $\overrightarrow{h}_t$, while the backward LSTM reads the sequence $x_T, \ldots, x_1$ and calculates the hidden state $\overleftarrow{h}_t$.
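The bidirectional pass itself does not depend on the cell type, so it can be sketched with any recurrent cell. In the sketch below a plain tanh recurrent cell stands in for the LSTM cell to keep the example short, and the two directional states are concatenated at each position; the concatenation, the function names, and the dimensions are assumptions for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    # Plain tanh recurrent cell, used only as a stand-in for an LSTM cell.
    return np.tanh(W @ x_t + U @ h_prev)

def bidirectional_encode(xs, fwd_params, bwd_params, hidden_dim):
    """Return one state per position: [forward state, backward state]."""
    T = len(xs)
    fwd, bwd = [None] * T, [None] * T
    h = np.zeros(hidden_dim)
    for t in range(T):                 # left-to-right pass
        h = rnn_step(xs[t], h, *fwd_params)
        fwd[t] = h
    h = np.zeros(hidden_dim)
    for t in reversed(range(T)):       # right-to-left pass
        h = rnn_step(xs[t], h, *bwd_params)
        bwd[t] = h
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

rng = np.random.default_rng(1)
input_dim, hidden_dim, T = 4, 3, 6
xs = rng.normal(size=(T, input_dim))
fwd_params = (rng.normal(size=(hidden_dim, input_dim)),
              rng.normal(size=(hidden_dim, hidden_dim)))
bwd_params = (rng.normal(size=(hidden_dim, input_dim)),
              rng.normal(size=(hidden_dim, hidden_dim)))
states = bidirectional_encode(xs, fwd_params, bwd_params, hidden_dim)
```

Each row of `states` therefore summarizes the inputs to the left of a position (first half) and to the right of it (second half).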

2.3. Attention

The attention mechanism is mainly proposed to deal with the problem of information loss in the hidden state [27]. Given the input sentence $x$, the target sentence is generated after encoding-decoding operations; that is, we calculate the conditional probability of each possible word to search for the most likely word. The formula is as follows:

$$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i),$$

where $s_i$ represents the hidden layer state of the decoder at time $i$:

$$s_i = f(s_{i-1}, y_{i-1}, c_i),$$

where $c_i$ is the vector obtained by summing the hidden vector sequence of the encoding process according to the weights:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j,$$

and $T$ is the length of the input sequence.

Since the encoder uses a BLSTM network, $h_j$ carries information not only about the $j$th word of the input but also about the words before and after it. These hidden vectors are summed according to their weights, and they receive different proportions of attention when the $i$th output is generated, as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j),$$

where $e_{ij}$ evaluates the relationship between the input at position $j$ and the output at position $i$; it mainly relies on the hidden state $s_{i-1}$ of the LSTM decoder and the annotation $h_j$ of the input sequence. The larger $\alpha_{ij}$ is, the more attention output $i$ allocates to input $j$, and the greater the influence of input $j$ when output $i$ is generated. Parameterizing $a$ as a feedforward neural network lets the gradient of the loss function propagate through it, so that it is trained jointly with the model.
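As a hedged illustration of this weighting step, the sketch below scores each encoder hidden state against the previous decoder state with a small feedforward network, normalizes the scores with a softmax, and forms the weighted sum. The additive (tanh) form of the scoring network and the parameter names `W_s`, `W_h`, and `v` are assumptions for the example; the text above only states that the scoring function is a feedforward network.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()   # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """Attention over encoder states.

    s_prev: previous decoder hidden state, shape (dec_dim,)
    H:      encoder hidden states, shape (T, enc_dim)
    Returns the context vector (enc_dim,) and the weights (T,).
    """
    scores = np.tanh(H @ W_h.T + W_s @ s_prev) @ v  # one score per input position
    alpha = softmax(scores)                         # attention weights sum to 1
    context = alpha @ H                             # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(2)
T, enc_dim, dec_dim, att_dim = 4, 6, 3, 5
H = rng.normal(size=(T, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_s = rng.normal(size=(att_dim, dec_dim))
W_h = rng.normal(size=(att_dim, enc_dim))
v = rng.normal(size=att_dim)
context, alpha = additive_attention(s_prev, H, W_s, W_h, v)
```

Input positions with larger weights in `alpha` contribute more to `context`, which is what allows the decoder to focus on different words for different outputs.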

3. The Framework for Assessing Fluency of International Students in Speaking Mandarin

To assess the effect of prosodic boundaries on the fluency of international students in speaking Mandarin, we extend our earlier work [25] to predict the prosodic boundaries of Chinese sentences, as shown in Figure 3. First, in the model training stage, we extract text features from the Chinese sentences to train the prosodic boundary prediction model. Then, in the prediction stage, the features extracted from the input sentences are fed into the model to predict the prosodic boundary labels. Finally, in the speech assessment stage, we score the 1300 recorded utterances with the speech evaluation system to obtain the spoken Mandarin fluency scores of the six international students and analyze the results.

3.1. Text Corpus with Prosodic Boundary Labeling

We selected 100,000 Chinese sentences from the “People’s Daily” in 1998 and 2000 as the original text corpus. The corpus mainly consists of news and information, including social, financial, military, history, culture, science and technology, automotive, real estate, sports, entertainment, and health.

Statistical analysis shows that each sentence contains an average of 51.46 syllables and 25.73 grammatical words. The original sentences are first segmented into Chinese words and automatically tagged with part-of-speech (POS) labels by a lexical analysis tool. Then, guided by a linguistic expert, we manually labeled the prosodic boundaries (prosodic word boundaries and prosodic phrase boundaries) according to a labeling specification drawn up by the linguists. Figure 4 shows the labeling process of the corpus. During labeling, linguists randomly selected sentences for review and correction. Through these repeated corrections, our prosodic boundary labels reached a high degree of consistency with the linguists’ judgments.

3.2. Feature Extraction

We use the shallow semantic features, including the part-of-speech (POS) of lexicon word, lexicon word length (WL), lexicon word embedding (WE), and a new deep syntactic feature named the number of syntax hierarchy (NSH) as the features to train the prosodic boundary prediction model. We designed a feature extractor to obtain all features, as shown in Figure 5. The Chinese input sentence is first normalized with the text normalization algorithm. The sentence is then automatically segmented into lexicon words to obtain the WL and WE. Finally, a grammar rule library and a Chinese dictionary are employed to obtain each lexicon word’s POS.

Because there is a strong relationship between grammatical words and prosodic boundaries, we also run a syntactic analysis [28] on the normalized sentence to obtain its syntactic tree, from which the NSH is calculated. The NSH is the height of the lexicon word in the syntactic tree, which reflects the sentence’s syntactic hierarchy. A larger NSH value corresponds to a higher syntactic hierarchy and indicates a greater probability that a prosodic boundary appears. An example of calculating the NSH is shown in Figure 6.
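One way to obtain a per-word hierarchy count from a syntactic tree is sketched below. It is only one possible reading of the NSH: it assumes a bracketed parse string as input and counts how many constituents enclose each word, which may differ from the exact calculation illustrated in Figure 6. The function name, the bracket format, and the example tree are hypothetical.

```python
def leaf_depths(bracketed):
    """Return (word, depth) pairs for a bracketed parse such as
    "(S (NP (N a)) (VP (V b)))".

    depth is the number of constituents that enclose the word. Treating
    this count as the hierarchy value is an assumption of this sketch.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    depth = 0
    pairs = []
    previous = None
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif previous != "(":
            # A token that does not directly follow "(" is a word, not a label.
            pairs.append((token, depth))
        previous = token
    return pairs

# Hypothetical parse of a three-word sentence.
tree = "(S (NP (N 我)) (VP (V 喜欢) (NP (N 音乐))))"
depths = leaf_depths(tree)
```

For this hypothetical tree the function returns a depth of 3 for the first two words and 4 for the last one.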

3.3. Architecture of the Prosodic Boundary Prediction Model

We train the prosodic boundary prediction model by adopting the sequence-to-sequence (seq2seq) model based on the encoder-decoder structure [29], as shown in Figure 7. The encoder compresses the input information of the model into a fixed-length semantic vector. However, this compression loses information from the beginning of the sequence. As a result, the final semantic vector contains incomplete information, so the prediction sequence generated during decoding is inaccurate. Therefore, we added an attention mechanism to improve the prediction accuracy.

We use a BLSTM as the model’s encoder and an LSTM as the decoder. After a word embedding layer, we feed the feature vectors obtained from the feature extractor into the encoder. The encoder generates a hidden state for every time step and initializes the hidden state of the decoder. Next, the attention mechanism calculates the correlation between the hidden state of the decoder and all the hidden states of the encoder to obtain different attention weights. Finally, the decoder reads the hidden state of the sequence forward and generates a prosodic boundary label sequence represented by 0 or 1.
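To show how such a 0/1 label sequence can be turned back into readable text, the sketch below inserts a separator after every word labeled 1. It assumes one label per lexicon word and that 1 marks a boundary after that word; both the convention and the function name are assumptions, since the label semantics are not spelled out here.

```python
def mark_boundaries(words, labels, sep="/"):
    """Join words, inserting sep after each word labeled 1.

    Assumes one label per word; a boundary after the final word is not
    drawn because the sentence ends there anyway.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    parts = []
    last = len(words) - 1
    for index, (word, label) in enumerate(zip(words, labels)):
        parts.append(word)
        if label == 1 and index < last:
            parts.append(sep)
    return "".join(parts)

marked = mark_boundaries(["打死", "老虎"], [1, 0])
print(marked)  # 打死/老虎
```

Text marked in this way is the kind of material a reader can follow when asked to pause at the predicted boundaries.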

3.4. Fluency Assessment

To verify whether correct prosodic boundaries help improve international students’ fluency in speaking Mandarin, we asked six international students to serve as subjects and record Chinese sentences. The subjects all have an undergraduate degree, are between 19 and 23 years old, and have studied Chinese for one to two years, so they have an introductory level of Chinese. For comparison with the level of native speakers, we also invited a female graduate student who is a native Mandarin speaker and holds a Mandarin Proficiency Test secondary-level A certificate to record the same sentences. We selected 100 sentences for recording based on syllable and phoneme coverage. All sentences were automatically labeled with prosodic boundaries by the trained prosodic prediction model. Each speaker was first asked to read each sentence without prosodic boundary labels according to their own understanding and then to read the same sentence with the prosodic boundary labels, pausing at the labeled boundaries.

Because international students are unfamiliar with Chinese characters, the reading text was presented to them in both Chinese characters and Pinyin so that difficulty in recognizing the text would not reduce fluency. Before recording, each participant had 10 minutes to become familiar with the text, ensuring that there were no barriers to recognizing and understanding it. During recording, each subject was asked to read at as constant a speed as possible to reduce the influence of differences in speaking rate on the experimental results. In total, we recorded 1300 utterances. All recordings were first saved as sound files in the Microsoft Windows WAV format (mono, signed 16 bits, sampled at 16 kHz) and then fed into the iFlytek Speech Assessment System (https://www.xfyun.cn/services/ise) to obtain the fluency scores. The iFlytek Speech Assessment Technology, which scores speech on a scale of 0 to 100 based on the percentage of incorrect pauses, has been approved by the State Language Commission and has reached a practical level.
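Before recordings are submitted for scoring, it can be useful to confirm that each file matches the stated format. The following sketch uses Python's standard `wave` module to check for one channel, 16-bit samples, and a 16 kHz sampling rate; the helper names and the silent demo file are hypothetical and are not part of the recording procedure described above.

```python
import os
import tempfile
import wave

# Format stated for the recordings: mono, 16-bit samples, 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 16000}

def read_format(path):
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }

def matches_expected(path):
    return read_format(path) == EXPECTED

def write_silence(path, seconds=0.1):
    # Writes a short silent file in the expected format, for demonstration only.
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * int(16000 * seconds))

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path)
    demo_ok = matches_expected(demo_path)
print(demo_ok)  # True
```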

4. Experiments

We conducted several experiments to evaluate the performance of the prosodic boundary prediction model and to assess the effects of the prosodic boundaries on the fluency of Mandarin read by international students.

4.1. Experimental Setup

For the prosodic boundary prediction model, we combine the original word embedding (WE) with the prosody features to obtain a new embedding. The original WE is trained on the automatic word segmentation results, which contain 74,497 words. The context window size is five during training, and the word embedding dimension is 128. Finally, we concatenate the POS, WL, NSH, and original WE along the last dimension to form the final input of the model.
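The concatenation along the last dimension can be sketched as follows. How the POS, WL, and NSH features are encoded before concatenation is not specified above, so the one-hot POS encoding, the scalar WL and NSH columns, and the tag-set size `n_pos` are assumptions made purely for illustration; only the 128-dimensional WE is taken from the setup.

```python
import numpy as np

def build_input(we, pos_ids, wl, nsh, n_pos):
    """Concatenate per-word features along the last dimension.

    we:      word embeddings, shape (T, 128)
    pos_ids: integer POS tag ids, length T (one-hot encoded here by assumption)
    wl, nsh: word lengths and NSH values, length T (kept as scalar columns)
    """
    pos_onehot = np.eye(n_pos)[np.asarray(pos_ids)]   # (T, n_pos)
    wl_col = np.asarray(wl, dtype=float)[:, None]     # (T, 1)
    nsh_col = np.asarray(nsh, dtype=float)[:, None]   # (T, 1)
    return np.concatenate([pos_onehot, wl_col, nsh_col, we], axis=-1)

rng = np.random.default_rng(3)
we = rng.normal(size=(3, 128))   # three words, 128-dimensional embeddings
features = build_input(we, pos_ids=[0, 2, 1], wl=[2, 1, 2], nsh=[3, 3, 4], n_pos=4)
```

With these toy settings each word is represented by a vector of 4 + 1 + 1 + 128 = 134 values.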

Three kinds of models, including BLSTM, seq2seq, and seq2seq+attention, were compared in the experiments. The architectures of these frameworks are shown in Table 1. The batch size of the three models is 64, the learning rate is 0.003, and the decay rate is 0.2.

4.2. Experiment Results on Prosodic Boundary Prediction

We used precision ($P$), recall ($R$), and F1 score ($F_1$) to evaluate the experimental results. Precision is the ratio of correctly identified prosodic boundaries to all identified prosodic boundaries; higher precision means fewer false identifications. Recall is the ratio of correctly identified prosodic boundaries to all prosodic boundaries in the test set; higher recall means fewer missed boundaries. We used the F1 score to reconcile $P$ and $R$, as defined in

$$F_1 = \frac{2 \times P \times R}{P + R}.$$
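As a worked illustration of these metrics, the sketch below computes precision, recall, and their harmonic mean from a predicted and a reference 0/1 boundary sequence. The function name and the toy label sequences are hypothetical and are not data from the experiments.

```python
def boundary_prf(predicted, reference):
    """Precision, recall, and F1 for binary boundary labels (1 = boundary)."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # correctly identified
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # falsely identified
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 2 correct boundaries, 1 false boundary, 1 missed boundary.
precision, recall, f1 = boundary_prf([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In this toy example precision and recall are both 2/3, so their harmonic mean is also 2/3.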

We compare the proposed seq2seq+attention prosodic boundary prediction model with the BLSTM and seq2seq models on PW and PPH prediction, as shown in Table 2. The F1 scores of the seq2seq+attention model for PW and PPH reached 98.14% and 82.88%, respectively. We can see from Table 2 that the proposed prosodic boundary prediction model achieves the highest F1 score on both PW and PPH.

To evaluate the effect of the NSH on PPH prediction, we compare the results of the proposed seq2seq+attention model with and without the NSH, as shown in Table 3. We can see from Table 3 that the F1 score of the seq2seq+attention+NSH model is 2.36 percentage points higher than that of the seq2seq+attention model (85.24% versus 82.88%). This demonstrates the effectiveness of the proposed NSH feature for PPH boundary prediction.

We also compared the PPH prediction performance of our model with that of other models after adding the NSH feature, using the same data set, as shown in Table 4. We can see from Table 4 that the proposed seq2seq+attention model with NSH reaches the highest F1 score.

4.3. Fluency Assessment

In order to analyze the influence of prosodic boundaries on fluency for international students speaking Mandarin, we conducted two experiments with the iFlytek Speech Assessment System. The first experiment used the recordings without prosodic labeling, while the second used the recordings with prosodic labeling.

In the first experiment, we assess each international student’s preliminary fluency in Mandarin (S1 to S6), as shown in Table 5. We can divide the international students into two groups: poor reading fluency (S1 to S3) and good reading fluency (S4 to S6), based on the statistical results of Table 5.

Figure 8 shows the fluency scores of S3 and S6. We can see from Figure 8 that the fluency score of the native speaker is relatively stable, with an average score of 81.52. The international students’ fluency scores are generally lower and fluctuate more around their average. After the 70th sentence, the fluency scores drop significantly. One possible reason is that the sentences after the 70th are predominantly long, which makes them more difficult for international students. Another possible reason is that the students pause at incorrect positions within words and phrases during reading.

In the second experiment, we assess the effect of the prosodic boundaries on the fluency score for international students, as shown in Table 6. Similarly, we selected S3 and S6 for visual analysis of fluency scores, as shown in Figure 9. The statistics of the six international students’ final assessment results are shown in Figure 10. For international students, labeling the prosodic boundaries notably improves their fluency scores.

However, there is still a gap between them and native speakers. We believe the reasons are as follows:
(i) The international students had not studied Chinese for long and were at the elementary Mandarin level. They have not yet been able to systematically grasp Chinese phonetics, grammar knowledge, prosodic characteristics, and the relationship between Mandarin pronunciation, pause, and semantic expression
(ii) The international students lacked Mandarin prosodic awareness. Because they lack an understanding of Chinese prosody, they cannot handle the relationship between Mandarin tone and intonation well

After simple training on prosodic pauses, the international students’ oral fluency improved greatly in the speech evaluation experiment. The improvement in score is between 7 and approximately 16 points. The speech assessment results are consistent with the expected results. This indicates that our work can help nonnative Mandarin learners grasp the prosodic structure better and improve their fluency in speaking Mandarin.

5. Conclusions

This paper assesses the influence of Chinese prosodic boundaries on the Mandarin spoken by international students. We proposed a novel Mandarin prosodic boundary prediction model based on seq2seq with an attention mechanism to help international students learn Mandarin. The model uses a new feature, the NSH, to predict PPH. The experimental results show that the proposed model improves the accuracy of prosodic boundary prediction and that the NSH feature further improves PPH prediction accuracy. We also conducted a fluency assessment experiment with utterances that international students read without and with the prosodic boundaries predicted by the proposed model; the results indicate that our work can help international students better grasp the prosodic structure and improve their spoken Mandarin fluency. Because international students also find it challenging to master the Chinese accent, in future work we will study Chinese accent prediction methods and the effect of accent on international students’ Mandarin level. In addition, because the prosodic boundaries of Chinese have a hierarchical structure, we will study hierarchical networks for predicting the prosodic structure to improve the accuracy of prosodic boundary prediction.

Data Availability

The data used to support the findings of this study were supplied by the corresponding author under license and so cannot be made freely available. Requests for access to these data should be made to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62067008 and No. 31860285). Additionally, part of this work was performed in Promoting High-quality Education Development in Ethnic Minority Regions through Collaborative Innovation in Intelligent Education (key scientific research project for Double World-Class Initiative in Gansu Province) (Project No. GSSYLXM-01) and the Science and Technology Program of Gansu Province (Grant No. 21JR7RA117).