Abstract

To improve the pronunciation accuracy of spoken English reading, this paper combines artificial intelligence technology to construct a correction model for the spoken pronunciation accuracy of AI virtual English reading. It analyzes the speech synthesis process used in intelligent speech technology, proposes a statistical parametric speech synthesis method based on hidden Markov chains, and improves the system algorithm so that it meets the requirements of the correction system for the spoken pronunciation accuracy of AI virtual English reading. Finally, simulation experiments are used to analyze English reading, spoken pronunciation, and pronunciation correction in the intelligent system. The experimental results show that the correction system for the spoken pronunciation accuracy of AI virtual English reading proposed in this paper meets the basic requirements of the system it sets out to build.

1. Introduction

The virtual spoken English system has become an important English communication tool. With the continuous development of artificial intelligence technology, AI virtual spoken English tools have gradually moved from theoretical research to real-world applications, and AI-based pronunciation correction is increasingly widely used in virtual English teaching.

Most English speech synthesis models based on the pronunciation mechanism contain three main modules. The articulatory movement model simulates the morphological structure of the articulation organs, the coarticulation model simulates their dynamic characteristics, and the acoustic model simulates the aerodynamic process to generate the corresponding English speech signal. Any inappropriate approximation in these three modules degrades the quality of the synthesized English speech. We therefore try to build a more accurate articulatory movement model that approximates the morphological characteristics of the articulation organs, so as to obtain a better pronunciation synthesis system. At present, there are two mainstream modeling strategies: physiological models and geometric models. The physiological pronunciation model uses the finite element method to simulate the biomechanical properties of soft tissue and embeds a muscle structure to drive the model. However, it faces the high computational load of the finite element module and the intricate distribution of the articulation organs and related muscles, which makes its control extremely complicated. The geometric pronunciation model models the contours of the vocal organs, and the shapes of the vocal organs and vocal tract can be controlled directly by a predefined parameter set [1]. This parameter set is obtained through statistical analysis. Compared with the physiological model, the geometric model does not need to describe the biomechanical properties of soft tissues or the underlying activity of the related muscles, so the computational cost is greatly reduced and control of the vocal tract shape becomes simple. Therefore, the geometric model is more suitable for English speech animation applications that do not require understanding or analysis of the internal structure of the articulation organs [2].

Although in most cases clear audio alone is sufficient for basic communication, visual information provides a more effective and vivid communication channel. In addition, when the voice is missing or unclear, visual information can help people infer and understand what the speaker wants to express. For example, for people with hearing impairment, effective lip reading, or inference based on changes in the speaker's facial expressions, can help them understand the speaker's meaning accurately [3].

Based on the above analysis, this paper combines intelligent speech technology to construct the correction system for the spoken pronunciation accuracy of AI virtual English reading, explores the effectiveness of the model, and improves the correction effect of spoken English reading.

2. Related Work

On the basis of speech visualization, speech-driven face modeling and animation technology are of great significance for improving the teaching effect of multimodal Mandarin pronunciation teaching systems [4]. In recent years, many 3D speaker simulation technologies have been proposed, which can be divided into six categories: animation based on vector graphics, animation rendering based on raster graphics systems, data-driven synthesis, anatomical modeling of the head, deformation algorithms, and machine learning [5]. 3D speaker modeling based on vector graphics animation uses simple vector graphic animation to show the outline of the main facial articulation organs (mouth, tongue, teeth, soft palate, etc.) [6]. 3D speaker modeling with animation rendering based on a raster image system uses complex polygons to form a human head model. The advantage is that the raster image system can provide a high rendering level and a more realistic head model; the disadvantage is that the time-varying motion parameters are difficult to calculate, the raster image system is very expensive, and animation rendering takes a long time [7]. 3D speaker modeling based on data-driven synthesis uses digital image processing technology to extract features from digital images [8]. Literature [9] established a speech-to-articulation inversion model based on generalized variable parameter hidden Markov models (GVP-HMM) to achieve 3D speaker modeling. Literature [10] modeled a 3D speaker from the anatomy of the head: based on the physiological structure of the face, a muscle model is proposed, and muscle vectors are used to simulate muscle movement and generate facial expression animation. The disadvantage of this method is that the muscle parameter derivation mechanism is very indirect, measurement is complicated, and the control parameters of muscle characteristics are only partially observable. 3D speaker modeling based on deformation algorithms calculates the positions of the deformation points of the entire face by capturing the displacements of a small number of facial control points [11]. This method places the face inside a regular control grid, such as an N × N × N cube, and establishes the correspondence between the cubic control grid and the object to be deformed. The control grid can then be moved to deform the object, so that local and global deformations of the object follow the local and global movements of the control grid. First, the coordinates of each point to be deformed are calculated relative to the neighboring control points, and the position of the point is then obtained from the displacements of the control points [12]. 3D speaker modeling based on machine learning uses artificial intelligence techniques to learn the correspondence between speech or text and the movements of the articulators and facial expressions [13], so that any speech or text can drive the 3D head model; this approach avoids extensive real-person data collection and is currently still at the research stage. There have been many studies on 3D speakers abroad. Literature [14] developed a FAP-driven facial animation Italian speaker head model based on the MPEG-4 standard, which is automatically trained on real data.
The three-dimensional kinematic information is used to create a lip articulation model and directly drive the speaker head model. The virtual speaker ARTUR developed in [15] shows the movement of the tongue and teeth, the articulation organs in the oral cavity. The visual speaker developed in [16] uses an electromagnetic articulography device to collect five control points on the tongue, two on the soft palate, and six around the mouth to simulate articulation; the model of the articulation organs is obtained by three-dimensional reconstruction of magnetic resonance images. Literature [17] developed a visual pronunciation system based on a physiological model, which simulates the movement of the articulation organs through the deformation of the biological characteristics of each muscle in the face and vocal tract. Literature [18] developed a face animation system that uses text/speech as the driving data and uses a hidden Markov model to extract features from the speech signal. The speech is represented by Mel-frequency cepstral coefficients (MFCC); based on this information, the keyframe sequence of the audio-viseme mapping is obtained through MFCC training, and a real-time synchronized face animation system is built from the mapping relationship. Literature [19] developed a text-driven 3D Chinese pronunciation system: the pronunciation corpus is collected with EMA equipment, the articulation model and acoustic model are trained based on the hidden semi-Markov model (HSMM), and a 3D mesh model of the articulation organs is obtained through MRI, realizing a Chinese pronunciation system with synchronized articulation. Literature [20] realized correct pronunciation animation of a 3D human head model, using EMA data as support and the Dirichlet Free-Form Deformation (DFFD) algorithm to drive the 3D talking-head model.

3. Statistical Parametric Speech Synthesis Based on Hidden Markov Chain

Generally speaking, unit-splicing (concatenative) technology does not involve much processing of the speech signal, and the quality of the synthesized speech largely depends on the database used. Since this paper focuses only on parametric speech synthesis technology, nonparametric speech synthesis methods are not within its scope. Parametric speech synthesis technology mainly uses data to train a model so that the model can learn the mapping from text to acoustic parameters from the data set. Compared with nonparametric speech synthesis, in the prediction stage it no longer depends on the data set, and the model directly synthesizes the text into speech. Among parametric speech synthesis models, statistical parametric speech synthesis based on hidden Markov chains is the most popular technology; it is generally divided into a text analysis module, an acoustic module, and a vocoder. The process of the statistical parametric speech synthesis model based on the hidden Markov chain is shown in Figure 1.

Character-to-phoneme conversion converts words into phonetic representations, which are generally described by phonemes. Prosodic units are composed of adjacent phonemes and can generally reflect the speaker's mood and the type of sentence (declarative, interrogative, imperative, etc.). Prosodic features embody the pitch, duration, and intensity of the speech, and adding prosodic information to the input helps to enhance the naturalness of the synthesized voice. Since independent phonemes cannot model context information, they are not conducive to synthesized speech quality. Therefore, after the input is converted into phonemes, contextual information is often added to the phoneme information, mainly including phoneme-related, stress-related, and location-related factors. The common context information in English is shown in Table 1.
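
To make the idea of context-dependent phoneme units concrete, the following is a minimal Python sketch that assembles a hypothetical full-context label from a phoneme, its neighbours, and a few prosodic attributes. The feature names and label format are illustrative assumptions, not the labels actually used by the system described in this paper.

    # Hypothetical sketch: assembling a full-context label for one phoneme.
    # The feature names below are illustrative, not the paper's actual format.

    def build_context_label(prev_phone, cur_phone, next_phone,
                            stressed, syllable_pos, word_pos):
        """Combine a phoneme with its neighbours and prosodic context
        into a single context-dependent unit, in the spirit of Table 1."""
        return (f"{prev_phone}-{cur_phone}+{next_phone}"
                f"/stress:{int(stressed)}"
                f"/syl_pos:{syllable_pos}"
                f"/word_pos:{word_pos}")

    # Example: the vowel /ih/ in "sit", stressed, first syllable, first word.
    label = build_context_label("s", "ih", "t", True, 1, 1)
    print(label)  # s-ih+t/stress:1/syl_pos:1/word_pos:1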

In the statistical parametric speech synthesis model based on the hidden Markov chain, the function of the acoustic module is to convert the phoneme-level context sequence output by the text analysis module into the corresponding acoustic parameters, which typically include Mel cepstrum coefficients, the fundamental frequency, and a voicing flag. The acoustic model itself is built with hidden Markov chains.

In the training stage, acoustic parameters such as the cepstral coefficient sequence, the fundamental frequency sequence, and the voicing sequence are extracted from the audio through the corresponding signal processing algorithms. Each different context corresponds to a different state (hidden variable) in the hidden Markov chain, and beginning and end substates are introduced in each state. The state is used to describe the prosodic and linguistic context. The acoustic parameter sequence corresponds to the observations of the states in the hidden Markov chain, and the observation distribution of each state is a multidimensional Gaussian mixture. At the same time, the fundamental frequency information contains both the fundamental frequency value and the voicing flag; the fundamental frequency is continuous while the voicing flag is discrete, so a multi-space distribution is needed. It is worth noting that, according to the hidden Markov model, the probability of the duration of the state sequence is shown in formula (1).
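
The equation itself does not survive in the extracted text. A standard form consistent with the surrounding description (states k = 1, …, K with durations d_k and parameter set λ) is the following hedged reconstruction; the paper's exact notation may differ:

    % Hedged reconstruction of formula (1)
    P(d_1,\dots,d_K \mid \lambda) \;=\; \prod_{k=1}^{K} p_k(d_k)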

Among them, K is the total number of states of the hidden Markov model obtained from the input context sequence, λ is the parameter set of the hidden Markov model, and p_k(d_k) is the probability that state k lasts for d_k basic time slots; its probability expression is shown in formula (2).
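
The expression itself is again missing. Under the standard HMM assumption that a state's duration is governed by its self-transition probability a_kk, the implied geometric form is:

    % Hedged reconstruction of formula (2)
    p_k(d_k) \;=\; a_{kk}^{\,d_k-1}\,\bigl(1 - a_{kk}\bigr)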

Among them, a_kk is the self-transition probability from state k back to state k. This paper is based on the maximum conditional probability criterion. When the maximum of formula (2) is computed, the probability of the duration of each state decreases geometrically as the number of consecutive time slots increases, so the most likely duration of each state is a single time slot. A hidden semi-Markov model is therefore introduced to model the state duration explicitly, making the duration of each state obey a Gaussian distribution; the mean and variance of each state's duration distribution are taken from the final iteration of the forward-backward algorithm.

At the same time, the context sequence affects acoustic parameters such as the cepstral coefficients, the fundamental frequency, and the state durations differently. Therefore, a separate decision tree and corresponding question set need to be built for each parameter. Training is based on the maximum conditional probability criterion. This causes the acoustic parameters output by the model in a given state to be the mean of the Gaussian mixture distribution, which produces large jumps between different states and degrades the coherence of the synthesized speech. In order to improve coherence, the first-order and second-order difference values of the acoustic parameters are introduced into the observations of the hidden Markov chains.

As the number of context factors in Table 1 increases, the number of possible factor combinations grows exponentially. This leads to an exponential increase in the number of states in the hidden Markov chain, and training such a model requires a much larger training data set. At the same time, the increase in the number of states makes uneven data distribution more likely: some combinations have many training samples while others have very few, which ultimately leads to insufficient training of the model. When a combination unseen in training appears in the test phase, the incorrect acoustic parameters predicted by the model degrade the quality of the synthesized speech. In order to improve the generalization of the model and alleviate data sparseness, decision trees are introduced into the model. Each leaf of the decision tree corresponds to a state in the hidden Markov chain. During decision tree training a pruning strategy is adopted, so some leaves correspond to multiple contexts; the final number of states is thus reduced, the data are spread over a smaller space, and the model is trained more adequately. In the prediction phase, when an unseen context is encountered, the model can still determine its corresponding state through the decision tree. The parameters of the hidden Markov chain model are trained from the training data set, and the mathematical expression of the maximum conditional probability of the observations is shown in formula (3), where o, l, λ, q, a, and b_q are, respectively, the acoustic feature sequence (Mel cepstrum coefficients, fundamental frequency), the context feature sequence, the parameters of the hidden Markov chain, the state sequence of the hidden Markov chain, the state transition probabilities, and the state observation probabilities.
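
Formula (3) is missing from the extracted text. A standard maximum-likelihood training criterion consistent with the symbols listed above is the following reconstruction; the paper's exact notation may differ:

    % Hedged reconstruction of formula (3)
    \hat{\lambda} \;=\; \arg\max_{\lambda} P(o \mid l, \lambda)
                  \;=\; \arg\max_{\lambda} \sum_{\text{all } q} \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)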

During training, the number of frames T of the observations and the state sequence of the hidden Markov chain are known, and the forward-backward algorithm and the EM algorithm are used to estimate the parameters λ. In the testing phase, this paper first analyzes the input text, extracts the context sequence, and obtains the hidden Markov model sequence from the context sequence together with the duration of the state corresponding to each context. The state durations are then used to expand the hidden Markov model sequence into a frame-level sequence, which is combined with the speech parameter generation algorithm to obtain smooth acoustic parameters. Finally, a vocoder is used to synthesize speech.

Given a text input, the context sequence obtained from the input text is l, and the probability that the model generates the observations o (the observations correspond to the acoustic parameters) is shown below.
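
The referenced formula does not survive in the extracted text. A standard form of the output probability, consistent with the symbols described in the next paragraph, is the following reconstruction (the paper's exact notation may differ):

    % Hedged reconstruction of the generation probability
    P(o \mid l, \lambda) \;=\; \sum_{q} P(o \mid q, \lambda)\, P(q \mid l, \lambda),
    \qquad o = (o_1, \dots, o_T),
    \qquad \hat{o} \;=\; \arg\max_{o} P(o \mid l, \lambda)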

Among them, q is the state sequence, T is preset and determines the duration of the synthesized speech, and λ is the acoustic model based on the hidden Markov chain.

The vocoder restores the frame-level acoustic features (fundamental frequency, Mel cepstrum coefficients) predicted by the acoustic model to a time-domain signal, that is, the final speech, through a digital filter. Its mathematical expression is as follows:
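
The expression itself is missing from the extracted text. The standard source-filter formulation that matches the description below is the convolution

    % Hedged reconstruction; the paper's exact notation may differ.
    x(n) \;=\; e(n) * h(n) \;=\; \sum_{m=-\infty}^{\infty} e(m)\, h(n-m)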

Among them, x(n) is the synthesized speech signal, h(n) is the formant filter, whose parameters are determined by acoustic parameters such as the Mel cepstrum coefficients, and e(n) is the excitation signal, which corresponds to the output of the acoustic model.

Models based on deep learning have gradually come to the fore in many fields, such as image classification and segmentation, video understanding, machine translation and understanding, and speech recognition and synthesis, and their performance records are constantly being refreshed. It is worth noting that, once the data set grows beyond a certain size, the performance of traditional machine learning algorithms no longer increases significantly as the data set becomes larger, as shown in Figure 2.

Because adjacent samples of a speech signal are highly similar, predicting the raw waveform directly would add great redundancy while the model implicitly learns the alignment between text and audio. Therefore, 80 power values on the Mel scale within a specified frequency range are used to represent the 1024 points of each frame. If T is used to denote the number of decoder time slots, the decoder eventually generates a prediction of dimension T × 80. The output of the decoder is processed by the postprocessing module to synthesize the final speech. The rest of this section describes the encoder, decoder, and postprocessing network of the Tacotron model. The overall structure of Tacotron is shown in Figure 3.
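
As an illustration of this frame-level representation (not the paper's actual feature extraction code), the following Python sketch computes 80 Mel-scale power values from 1024-point frames with librosa; the synthetic test tone, hop length, and log compression are assumptions added for the example.

    import numpy as np
    import librosa

    # Illustrative sketch: compress each 1024-point frame into 80 mel-scale
    # power values, as described in the text. A 1 s test tone stands in for
    # real speech so the snippet runs without an audio file.
    sr = 22050
    t = np.linspace(0, 1.0, sr, endpoint=False)
    y = 0.5 * np.sin(2 * np.pi * 220 * t)

    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=1024,      # 1024-point analysis frames
        hop_length=256,  # assumed hop size
        n_mels=80,       # 80 mel bands per frame
        power=2.0)       # power spectrogram
    log_mel = np.log(mel + 1e-6)   # log compression is common before training
    print(log_mel.shape)           # (80, number_of_frames)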

The decoder of the Tacotron model is composed of recurrent neural networks and a preprocessing (pre-net) module. It mainly takes the output of the Tacotron encoder as its input and predicts the acoustic parameters of the next moment. The recurrent part consists of a recurrent neural network combined with the attention mechanism and a two-layer recurrent neural network, as shown in Figure 4.

The Tacotron model concatenates the context obtained by the attention mechanism with the decoder output of the previous moment and uses it as the decoder input for the next moment (the concatenated value may need to be projected to a specified dimension by a fully connected layer). The structure of the attention mechanism in Tacotron's decoder is shown in Figure 5, and the specific mathematical operations of the attention mechanism are shown in formulae (7) to (11).

In formula (7), score(·) is a score function used to compute the similarity between the decoder state s at each time step and the encoder output h at each time step. The common forms of the score function are shown in formula (8).
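
Formulae (7) and (8) are missing from the extracted text. Standard content-based attention forms consistent with this description are (the paper's notation may differ):

    % Hedged reconstruction of formulae (7) and (8)
    e_{i,j} \;=\; \mathrm{score}(s_{i-1}, h_j),
    \qquad
    \mathrm{score}(s, h) \;=\;
    \begin{cases}
      s^{\top} h & \text{(dot product)}\\
      s^{\top} W h & \text{(general / bilinear)}\\
      v^{\top} \tanh\bigl(W[\,s\,;\,h\,]\bigr) & \text{(additive / concat)}
    \end{cases}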

In formula (9), c_i is the context information at time i, obtained as a weighted sum of the encoder outputs. In formula (10), the concatenation of the decoder output y_{i-1} and the context c_{i-1} at time i-1 is used as the decoder input at time i, and f is the recurrent neural network. In formula (11), g is a fully connected neural network that takes the current decoder state and the context as input to obtain the decoder output at the current time.
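
Formulae (9) to (11) are likewise missing; a standard reconstruction consistent with the description above is:

    % Hedged reconstruction of formulae (9)-(11)
    \alpha_{i,j} \;=\; \frac{\exp(e_{i,j})}{\sum_{k}\exp(e_{i,k})},
    \qquad c_i \;=\; \sum_{j} \alpha_{i,j}\, h_j,
    \qquad s_i \;=\; f\bigl(s_{i-1}, [\,y_{i-1}\,;\,c_{i-1}\,]\bigr),
    \qquad y_i \;=\; g(s_i, c_i)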

Combining the Seq2Seq model with the attention mechanism gives it the ability to model longer sequences, but different attention mechanisms lead to different performance on different tasks. Formulae (7) to (11) correspond to the attention mechanism in the Tacotron model, that is, to the decoder output y_i at each moment, and its solution process is as follows. First, the context information c_i is obtained by the attention mechanism from the decoder state s_{i-1} at the previous time and the encoder output h; then c_{i-1} and y_{i-1} are used as the input of the recurrent neural network to obtain s_i. Finally, s_i and c_i are used as the input of the fully connected neural network to predict the decoder output at time i. In addition to the attention mechanism used in Tacotron, researchers have designed other forms of attention. In one variant, the decoder output at each moment is obtained as follows: the decoder state s at the current moment and the encoder output h are first used by the attention mechanism to obtain the context variable c_i, and then s_i and c_i are used to obtain the output. In addition, a location-based attention mechanism has also been proposed.
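
To make the data flow of formulae (7) to (11) concrete, here is a minimal NumPy sketch of a single decoder step with additive attention. The dimensions, random weights, and the single tanh layer standing in for the recurrent cell and output projection are illustrative assumptions, not the Tacotron implementation.

    import numpy as np

    # One decoder step with additive (concat) attention, illustrative only.
    rng = np.random.default_rng(0)
    T_enc, d_enc, d_dec = 12, 16, 16           # encoder length, hidden sizes
    H = rng.normal(size=(T_enc, d_enc))        # encoder outputs h_1..h_T
    s_prev = rng.normal(size=d_dec)            # previous decoder state s_{i-1}
    W = rng.normal(size=(d_dec + d_enc, d_dec))
    v = rng.normal(size=d_dec)

    # (7)/(8): additive score between s_{i-1} and every encoder output
    scores = np.array([v @ np.tanh(np.concatenate([s_prev, h]) @ W) for h in H])
    # (9): normalize to attention weights and form the context vector c_i
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    c_i = alpha @ H
    # (10)/(11): a real model feeds [y_{i-1}; c_{i-1}] to an RNN cell and a
    # projection layer here; a single tanh layer stands in for both.
    W_dec = rng.normal(size=(d_dec + d_enc, d_dec))
    s_i = np.tanh(np.concatenate([s_prev, c_i]) @ W_dec)
    y_i = s_i  # placeholder output projection
    print(c_i.shape, s_i.shape)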

A local attention mechanism has also been proposed: at each moment the decoder only attends to the local information within a specific window of the input sequence. The width of this window is generally much smaller than the length of the input, which saves a large amount of computation. Three strategies for computing local attention are presented. (1) The window size D is specified empirically, and the model calculates the position p of the decoder's center of attention in the encoder output sequence at each moment; the context is then computed over the window [p − D, p + D], clipped to the encoder output of length S. (2) The window size D is specified empirically, and the correspondence between input and output is assumed to increase monotonically, so the center point of attention at each decoder output moment is determined directly from the output position (scaled by the ratio of the encoder length S to the decoder length T_dec); the context is then computed over the corresponding window. (3) The window size D is specified empirically, and the center point position p_i is predicted by the model; the solving process of p_i is shown below.
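
The prediction of the center point p_i is missing from the text. In the standard predictive local attention formulation it is computed from the current decoder state s_i as follows (S is the encoder output length and σ is the logistic sigmoid; the paper's notation may differ):

    % Hedged reconstruction of the center-point prediction
    p_i \;=\; S \cdot \sigma\!\bigl(v_p^{\top} \tanh(W_p\, s_i)\bigr)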

At the same time, in order to reflect the fact that points closer to the center have a greater influence on the output, the attention weights are reweighted: the weight α at each position is assumed to follow a Gaussian shape with mean p_i and variance σ². This relationship is shown below.
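
A standard form of this Gaussian reweighting, consistent with the description (D denotes the window half-width, and σ is commonly set to D/2), is the following reconstruction; the paper's notation may differ:

    % Hedged reconstruction of the Gaussian-weighted local attention
    \alpha_i(s) \;=\; \mathrm{align}(s_i, h_s)\,
    \exp\!\left(-\frac{(s - p_i)^2}{2\sigma^2}\right),
    \qquad \sigma = \frac{D}{2}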

Among them, s is the position index in the encoder output, and the variance and the alignment weights are obtained using formula (8).

In monotonic attention, if the attention position at the current moment is t_i and the attention position at the previous moment is t_{i-1}, then t_i ≥ t_{i-1}. The specific process is as follows. We assume that the attention position at the previous time is t_{i-1}; then the candidate range of the attention position at the current time i is the set of positions j ≥ t_{i-1}. Whether the model stops at each candidate position is assumed to follow a Bernoulli distribution, and Bernoulli trials are carried out starting from t_{i-1}. If the trial at a position j, where j ≥ t_{i-1}, outputs 1, then the attention position at the current time i is taken to be t_i = j, and the context information at the current moment is c_i = h_{t_i}, as shown in Figure 6.

In another variant (commonly called location-sensitive attention), the historical alignment information is taken into account when calculating the context information at the current moment. The newly added historical alignment information helps to further strengthen the model's ability to model long sequences; that is, the score function in formula (7) becomes the form shown below.
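
The modified score is missing from the text. The standard location-sensitive form that matches this description, in which the previous alignment α_{i-1} is convolved with a filter bank F, is the following reconstruction (the paper's notation may differ):

    % Hedged reconstruction of the location-sensitive score
    f_i \;=\; F * \alpha_{i-1},
    \qquad
    e_{i,j} \;=\; w^{\top} \tanh\bigl(W s_{i-1} + V h_j + U f_{i,j} + b\bigr)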

Among them, f_i = F * α_{i-1}, where the convolution with F adjusts the previous alignment α_{i-1} to the specified dimension.

4. Research on Correction Method of Spoken Pronunciation Accuracy of AI Virtual English Reading

The correction system of spoken pronunciation accuracy of AI virtual English reading is constructed on the basis of the previous algorithm improvement. The core framework of the AI virtual English reading system is shown in Figure 7.

We can directly create a subprocess to call the corresponding module and transfer data through an anonymous pipe. The process of AI virtual English reading is shown in Figure 8.

When the main process needs to call other subroutines according to the system logic, the system first creates an anonymous pipe, sets its read handle to A and write handle to B, and sets the startup information of the subprocess according to these handle pointers. After the preparatory work is completed, the child process can be created. After the child process starts, it first sets the read handle of the process to B and the write handle to A according to the startup information, which is just the opposite of the main process. After the interface setting is completed, the system can run the main program and wait for the input command. The main process waits for the child process to start, then writes the command through the B handle, and reads the output result through the A handle. When the child process gets the command, it calls the corresponding function according to the logic and writes the output result to the A handle. If it is the end command, it exits the process. In this way, the subroutine call is completed.
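
As an illustration of this call pattern (not the system's actual code), the following Python sketch reproduces the parent/child pipe exchange using subprocess pipes in place of raw anonymous-pipe handles; the inline stand-in child script and the returned score value are placeholders.

    import subprocess
    import sys

    # Stand-in child: reads commands from its pipe, echoes a fake score,
    # and exits on the end command, mirroring the logic described above.
    child_code = (
        "import sys\n"
        "for line in sys.stdin:\n"
        "    cmd = line.strip()\n"
        "    if cmd == 'exit':\n"
        "        break\n"
        "    print('score', cmd, 87.5, flush=True)\n"
    )
    child = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdin=subprocess.PIPE,    # parent's write end -> child's read end
        stdout=subprocess.PIPE,   # child's write end  -> parent's read end
        text=True,
    )

    def call_child(command: str) -> str:
        """Write one command to the child and read back one result line."""
        child.stdin.write(command + "\n")
        child.stdin.flush()
        return child.stdout.readline().strip()

    print(call_child("utterance_001.wav"))  # e.g. "score utterance_001.wav 87.5"
    child.stdin.write("exit\n")
    child.stdin.flush()
    child.wait()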

After the above model system is constructed, the performance of the AI virtual English reading system is verified with experiments, and English reading and spoken pronunciation are analyzed and evaluated through simulation research. The scoring results are shown in Table 2 and Figure 9.

On the basis of the above research, the effect of pronunciation correction in English reading is evaluated, and the results are shown in Table 3 and Figure 10.

From the above research, we can see that the correction system of spoken pronunciation accuracy of AI virtual English reading has good practical effects.

5. Conclusion

Traditional speech synthesis technology is often divided into nonparametric and parametric speech synthesis. Nonparametric speech synthesis is mainly based on unit selection: speech is spliced from speech unit fragments, and a speech unit database with sufficient coverage is constructed. In the prediction stage, the text is transformed into a phoneme sequence annotated with prosodic features (fundamental frequency, duration, etc.); using a predefined loss function as the evaluation criterion, the optimal speech units are selected from the database, and the selected unit sequence is spliced into the final speech. This paper combines intelligent speech technology to build a correction model for the spoken pronunciation accuracy of AI virtual English reading, verifies the performance of the AI virtual English reading system, and analyzes English reading, spoken pronunciation, and pronunciation correction. The experimental results show that the correction system for the spoken pronunciation accuracy of AI virtual English reading proposed in this paper meets the basic requirements of the system it sets out to build.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the Heilongjiang Philosophy and Social Sciences Program, Research on Innovation and Practice of Project-based Business English Teaching Strategies in the Post-MOOC Era (No. 22YYB020), and the Key Project of Heilongjiang Province Economic and Social (The Special Project for Foreign Language Discipline), Research and Practice of Business English Teaching Based on "Introducing Enterprises into Education" under the Concept of Project Teaching (No. WY20211083-C).