Scientific Programming for Industry 5.0: Theory, Applications, and Technological Development (Special Issue)
English Speech Recognition and Pronunciation Quality Evaluation Model Based on Neural Network
A deep learning approach based on neural networks is proposed to build a model for English speech recognition and pronunciation quality evaluation. A deep nonlinear network structure can approximate complex functions, form distributed representations of the input data, learn the essential characteristics of a data set from a small number of samples, and better simulate the analysis and learning of the human brain. The author uses deep learning to recognize English speech and builds a speech recognition model from a deep belief network and Mel-frequency cepstral coefficient (MFCC) features, which are based on the human auditory model. In the test, machine and manual evaluations agree exactly for 210 samples and differ by one grade for 30 samples. The overall agreement rate between machine and human evaluation is reported as 90.65%, the adjacent agreement rate is 100%, and the correlation coefficient is 0.798. This indicates a strong correlation between machine and human evaluations, so the model can be used to evaluate the quality of English speech and pronunciation.
With the development of pattern recognition technology, intelligent human-computer interaction is gradually being applied across industries, voice interaction in particular. Speech recognition is rapidly entering many fields and has improved the level and efficiency of intelligent interaction in applications such as mobile phones, driverless vehicles, and industrial robots. Compared with text interaction, voice interaction is faster and more efficient, but it is harder to implement and to keep stable, so speech recognition technology with high confidence is very important. Current research on speech recognition focuses on two aspects: speech signal feature extraction and speech feature recognition. The former concentrates on filtering and signal analysis techniques that extract the features of the original speech signal as accurately as possible; the latter concentrates on training on speech signal features and minimizing recognition errors. Neural network speech recognition, which underpins artificial intelligence technology, relies on a widely used neural network information processing system. In such a system, each neuron acts as an independent processing unit, and the transfer of information and the functional connections between different functional blocks arise from the interaction of the individual neurons. Information is stored in the neurons of the system, which simulates the information processing model of the human brain and converts the abstract encoding of information into a concrete form for transfer.
By encoding and transforming nonlinear information in a way that mimics neural connections and follows their basic operating principles, the efficiency of data transmission and processing can be improved as much as possible. An intelligent neural information processing network therefore has many advantages over traditional data transmission and processing modes. Building on the application patterns of traditional neuron-simulating information processing systems, a deep belief network for deep learning was developed (Figure 1). Applying it to English pronunciation quality evaluation and speech recognition will gradually promote the development of English phonetics teaching, the structure of the learning system, and a multilevel development model. English teachers engaged in educational research have pointed out that a network built on deep learning and neural structure models learns mainly by self-learning, a form of learning that belongs to the category of unsupervised learning.
2. Literature Review
Long proposed preprocessing the acoustic signal to increase the accuracy and robustness of feature extraction and optimizing the combination of Mel-frequency cepstral coefficient (MFCC) acoustic features; the way this method preprocesses speech feature data is worth learning from. Another study used a parallel model combination algorithm to adjust the relevant parameters and weights in the model and thereby compensate for noisy speech distortion, which improves the robustness of the speech recognition system in a noisy environment. However, that method considers only the influence of stationary noise, whereas most real noise is nonstationary. A further approach adjusted the parameters of hidden Markov models and improved the Viterbi algorithm so that models with relatively low confidence are pruned during recognition, which yields higher recognition accuracy and shorter matching time. However, the back-end decoding algorithm of this method has a large space complexity, which restricts the scalability of the overall speech recognition system. Qin et al. established a convolutional neural network (CNN) acoustic model for speech recognition; compared with a deep neural network with the same number of layers, performance improves by more than 10%. Wang and IBM's Watson research team optimized the parameters of the convolutional neural network, such as the number of convolutional layers, the number of hidden layer nodes, and the choice of input features, and applied the network to a large-vocabulary continuous speech recognition system. Experimental results show that the CNN structure improves on the performance of traditional GMM acoustic models by 13.30%. Yun used GPU parallel processing with the compute unified device architecture (CUDA) to implement the complex layer structure and fast matrix operations of the neural network, which speeds up parameter transfer in the network. Experiments show that software optimization with NVIDIA's CUBLAS library saves nearly 70% of the running time.
Yang et al. used cluster technology to build a large-scale distributed neural network and proposed a model-parallel training method: the model is divided into blocks that are handed to the machines in the cluster for processing, so that the resources of each machine are fully used and only the nodes on the edges of the partition need to be synchronized, which reduces network overhead. Bang added the ReLU activation function and the dropout strategy to the Hessian-free training method; this improves the performance of the neural network by preventing the co-adaptation of local filters and also reduces the risk of falling into a local optimum. Liu proposed a convolutional restricted Boltzmann machine as a pretraining mechanism for CNN models, and experiments show that it performs better in large-vocabulary speech recognition tasks. Ryu et al. reviewed in detail deep learning algorithms, the state of research in speech recognition, and the key problems still to be solved, and proposed a research direction for applying deep learning to speech recognition. Arora et al. used recurrent neural networks for speech recognition; the recognition accuracy is high, but the recognition efficiency is too low to meet the requirements of real-time recognition.
Based on current research, a deep learning method based on neural networks is proposed. By studying the structure of deep nonlinear networks, the author uses deep learning to recognize English speech and builds a speech recognition model from a deep belief network and Mel-frequency cepstral coefficient features, which are based on the human auditory model.
3. Neural Network Models
3.1. Overview of Neural Networks
An artificial neural network (ANN), also known as a neural network (NN), is a complex network system formed by interconnecting a large number of processing units (neurons). It simulates the way the human brain processes information and performs parallel processing and nonlinear transformation of information. Because a neural network is an abstraction, simplification, and simulation of the human brain, it has a massively parallel distributed structure, nonlinear characteristics, and the ability to learn and generalize, and it has become prominent in areas such as signal processing and control, pattern recognition, modeling, and time series analysis.
The basic information processing unit of an ANN is the neuron, also known as a node. The neuron model is shown in Figure 2. It generally has three elements.
(1) A set of connections. The weight on each connection represents the connection strength and can be positive or negative: a positive value indicates activation and a negative value indicates inhibition.
(2) An adder, which sums the input signals weighted by the corresponding synaptic weights of the neuron.
(3) An excitation function, which limits the amplitude of the neuron output. The excitation function restricts the output signal to an acceptable range so that it is a finite value, usually in the closed interval [0, 1] or [−1, 1].
In addition, an external bias (threshold), denoted by $b$, can be added to the neuron model. The bias increases or decreases the net input to the excitation function, depending on whether it is positive or negative. An artificial neuron can therefore be expressed by

$$y = f\left(\sum_{i=1}^{r} w_i x_i + b\right),$$

where $x_i$ ($i = 1, 2, \ldots, r$) are the $r$ inputs of the neuron; $w_i$ is the connection strength (connection weight) of the $i$th input; $b$ is the bias (threshold) of the neuron; $f$ is the excitation function; and $y$ is the output of the neuron. It can be seen that the artificial neuron is a nonlinear structure with multiple inputs and a single output.
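The neuron described above (weighted sum of the inputs, plus a bias, passed through an excitation function) can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the logistic sigmoid is assumed as the excitation function, and the input values, weights, and the name `neuron_output` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic excitation function; maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    net = np.dot(w, x) + b
    return activation(net)

# Illustrative values only (assumptions, not taken from the article).
x = np.array([0.5, -1.0, 2.0])   # r = 3 inputs
w = np.array([0.4, 0.3, -0.2])   # connection weights; a negative weight inhibits
b = 0.1                          # bias (threshold)

y = neuron_output(x, w, b)                          # output limited to (0, 1)
y_tanh = neuron_output(x, w, b, activation=np.tanh) # output limited to (-1, 1)
```

Swapping `sigmoid` for `np.tanh` changes the output range from (0, 1) to (−1, 1), which corresponds to the two intervals mentioned for the excitation function.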
3.2. BP Neural Network
As shown in Figure 3, the BP network is a neural network with three or more layers of neurons. The BP learning algorithm has two phases: forward propagation of the input signal and backward propagation of the error.
The learning process of the BP algorithm is shown in Figure 4.
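As a rough illustration of BP learning, the sketch below trains a three-layer network (input, one hidden layer, output) by forward propagation of the signal and backward propagation of the error. Everything specific here is an assumption made for the example rather than something stated in the article: sigmoid units, a squared-error loss, plain gradient descent, the XOR toy data, and all function names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Forward pass: input -> hidden -> output, both layers sigmoid."""
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    return H, Y

def loss(params, X, T):
    """Half sum-of-squares error, averaged over the samples."""
    _, Y = forward(params, X)
    return 0.5 * np.sum((Y - T) ** 2) / X.shape[0]

def backward(params, X, T):
    """Backward pass: propagate the output error back through the layers.
    Returns gradients in the same order as params: [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    H, Y = forward(params, X)
    n = X.shape[0]
    delta2 = (Y - T) * Y * (1.0 - Y) / n        # output-layer error term
    delta1 = (delta2 @ W2.T) * H * (1.0 - H)    # hidden-layer error term
    return [X.T @ delta1, delta1.sum(axis=0), H.T @ delta2, delta2.sum(axis=0)]

def train(params, X, T, lr=0.5, epochs=2000):
    """Plain gradient descent; returns new parameters, inputs are not modified."""
    for _ in range(epochs):
        grads = backward(params, X, T)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy data (XOR), used only to exercise the code.
rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])
params = [rng.normal(0, 1, (2, 3)), np.zeros(3), rng.normal(0, 1, (3, 1)), np.zeros(1)]

initial_loss = loss(params, X, T)
trained = train(params, X, T)
final_loss = loss(trained, X, T)
```

The two error terms `delta2` and `delta1` are the quantities that flow backward: the output error is scaled by the derivative of the output units and then pushed through the transposed weights to the hidden layer.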
3.3. Deep Learning Neural Networks
The essence of deep learning is to learn more useful features by building machine learning models with many hidden layers and training them on large amounts of data, and thereby ultimately to improve the accuracy of classification or prediction. The "deep model" is thus the means, and "feature learning" is the goal. Unlike traditional shallow learning, deep learning emphasizes the depth of the model structure, which usually has five, six, or even ten hidden layers. The advantage of multiple layers is that complex functions can be expressed with fewer parameters. Deep learning also highlights the importance of feature learning: through layer-by-layer feature transformation, the representation of a sample in the original space is mapped into a new feature space that better reflects the distribution of the data, which makes classification or prediction easier. Compared with constructing features by hand-written rules, learning features from big data better captures the important information in the data.
3.3.1. Basic Concept and Training Process
The basic idea of deep learning is to pretrain each layer with unsupervised learning, using the output of each layer as the input of the next layer, and then to fine-tune all layers with supervised learning. The network is built layer by layer from single layers of neurons, so that only one single-layer network is trained at a time. After all layers have been trained, the wake-sleep algorithm is used to tune the parameters. The weights between all layers except the top layer are made bidirectional. In this way the top layer remains a single-layer neural network, while the other layers become a graphical model. The upward weights are used for "cognition" and the downward weights for "generation". The wake-sleep algorithm then adjusts all the weights so that cognition and generation agree as far as possible, that is, so that the top-level representation can restore the bottom-level nodes as accurately as possible.
The wake-sleep algorithm is divided into two parts: wake and sleep.①Wake phase: the cognitive process. External features and the upward weights (cognitive weights) produce an abstract representation (node states) for each layer, and gradient descent is used to modify the downward weights (generative weights) between the layers. In other words, "if reality differs from what I imagined, change my weights so that what I imagine matches reality."②Sleep phase: the generative process. The top-level representation (the concepts learned while awake) and the downward weights generate the states of the lower layers, and the upward weights between the layers are modified. In other words, "if what appears in the dream is not the corresponding concept in my mind, change my cognitive weights so that what I see corresponds to that concept."
The training process of deep learning is as follows:①Bottom-up unsupervised learning (training layer by layer from the bottom to the top). The parameters of each layer are trained layer by layer, without supervision, using unlabeled data (labeled data can also be used). This step is the biggest difference from traditional neural networks and can be regarded as a feature learning process.②Top-down supervised learning (training with labeled data, propagating the error from the top down, and fine-tuning the network parameters). Starting from the parameters of each layer obtained in the first stage, the parameters of the whole multilayer model are further adjusted with supervision. Unlike a traditional neural network, which starts from random values, the initial parameters of deep learning (DL) are obtained by learning the structure of the input data in the first stage, so the initial values are closer to the global optimum and better results can be achieved. The effectiveness of DL therefore owes much to the feature learning in the first stage.
3.3.2. Restricted Boltzmann Machine
①RBM overview
A restricted Boltzmann machine (RBM) consists of a visible layer and a hidden layer, with full connections between the layers and no connections within a layer. In other words, edges exist only between visible units and hidden units, and there are no edges among the visible units or among the hidden units. Figure 5 shows the structure of the RBM network, where $n$ and $m$ are the numbers of neurons in the visible and hidden layers, $i$ and $j$ are the corresponding subscripts, and $v$ and $h$ are the state vectors of the visible and hidden layers.
②RBM learning algorithm
An RBM is described by an energy function and a probability distribution function. Combining the two shows that the probability distribution is a function of the energy function. The energy function is

$$E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} v_i W_{ij} h_j,$$

where $\theta = \{W_{ij}, a_i, b_j\}$ is the parameter of the RBM model, $a_i$ represents the bias of the visible layer node $i$, $b_j$ represents the bias of the hidden layer node $j$, and $W_{ij}$ represents the connection weight between the visible layer node $i$ and the hidden layer node $j$. Given the energy function, the joint probability distribution of $(v, h)$ follows from the Boltzmann distribution:

$$P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{Z(\theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h \mid \theta)},$$

where $Z(\theta)$ is the normalization factor, also known as the partition function. Because the hidden layer nodes are conditionally independent given the visible layer (there are no connections between them),

$$P(h \mid v) = \prod_{j=1}^{m} P(h_j \mid v).$$

Factorizing further, given the visible layer $v$, the probability that the $j$th node of the hidden layer is 1 is

$$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{n} v_i W_{ij}\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},$$

and the probability that it is 0 is $1 - P(h_j = 1 \mid v)$. Similarly, given the hidden layer $h$, the probability that the $i$th node of the visible layer is 1 is

$$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{m} W_{ij} h_j\right),$$

and the probability that it is 0 is $1 - P(v_i = 1 \mid h)$. Given training samples, training an RBM means tuning the parameters $\theta$ so that the probability distribution represented by the RBM fits the training data as closely as possible.
Given a sample set that is independent and identically distributed, $S = \{v^{(1)}, v^{(2)}, \ldots, v^{(T)}\}$, the goal of training an RBM is to maximize the following likelihood function (maximum likelihood estimation: for a probability model, choose the parameters that maximize the probability of the observed samples):

$$L(\theta) = \prod_{t=1}^{T} P\left(v^{(t)} \mid \theta\right), \qquad P(v \mid \theta) = \sum_{h} P(v, h \mid \theta).$$

Since the product is troublesome to handle, and the function $\ln x$ is strictly monotonic, maximizing $L(\theta)$ is equivalent to maximizing

$$\ln L(\theta) = \sum_{t=1}^{T} \ln P\left(v^{(t)} \mid \theta\right).$$

③RBM evaluation method
The most direct measure for evaluating a trained RBM is the log-likelihood of the training data under the RBM. However, because of the normalization factor, this is very expensive to compute, and the quality of an RBM can only be assessed by approximation. The most commonly used approximation is the reconstruction error: with a training sample as the initial state, it is the difference between the original data and the data obtained after one Gibbs sampling step under the RBM distribution. The reconstruction error reflects, to some extent, the likelihood of the training samples under the RBM, although it is not fully reliable. In general, however, it is very simple to calculate and has a very low cost.
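The RBM quantities discussed above can be sketched as follows, writing `a` for the visible biases, `b` for the hidden biases, and `W` for the weights. The article does not say which training procedure it uses; one-step contrastive divergence (CD-1) is assumed here because it is a common choice, and the toy data, learning rate, and function names are all hypothetical. The mean squared reconstruction error is returned as the cheap evaluation measure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v.W.h for single binary state vectors v and h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j (v may be a vector or a batch)."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v_i = 1 | h) for every visible unit i (h may be a vector or a batch)."""
    return sigmoid(a + h @ W.T)

def cd1_step(V, W, a, b, rng, lr=0.1):
    """One CD-1 update on a batch V (rows are samples).
    Returns the updated parameters and the mean squared reconstruction error."""
    ph0 = p_h_given_v(V, W, b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = p_v_given_h(h0, W, a)                        # reconstruct the visible layer
    ph1 = p_h_given_v(pv1, W, b)
    n = V.shape[0]
    W = W + lr * (V.T @ ph0 - pv1.T @ ph1) / n
    a = a + lr * (V - pv1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    recon_error = float(np.mean((V - pv1) ** 2))
    return W, a, b, recon_error

# Toy binary data: 4 visible units, 2 hidden units (illustrative only).
rng = np.random.default_rng(0)
V = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
W = 0.01 * rng.standard_normal((4, 2))
a = np.zeros(4)
b = np.zeros(2)

errors = []
for _ in range(200):
    W, a, b, err = cd1_step(V, W, a, b, rng)
    errors.append(err)
```

Tracking `errors` over the updates gives the inexpensive, if imperfect, progress signal that the reconstruction error is meant to provide; the exact log-likelihood is avoided because it requires the partition function.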
3.3.3. Pronunciation Quality Evaluation
The evaluation of pronunciation quality can be divided into subjective evaluation and objective evaluation, as shown in Figure 6. Subjective evaluation refers to the evaluation of pronunciation quality by language experts. The process generally has three steps: first, the expert listens to the test speech; then, drawing on accumulated knowledge of the language, the expert compares the test speech with the standard speech held in memory and identifies the differences between the two at all levels; finally, the differences at each level are integrated and an overall evaluation of the test speech is given. In general, an evaluation by language experts reflects the pronunciation quality of the test speech and the speaker's spoken English level more realistically. However, because language experts differ in knowledge structure and experience, different experts may judge the same test speech differently. In addition, since the evaluation of pronunciation quality is closely related not only to phonetics and linguistics but also to physiology and psychology, the same expert may evaluate the same test speech differently in different states. Subjective evaluation therefore ensures the authenticity of the evaluation results but also exposes the shortcomings of subjectivity. Objective evaluation refers to the use of machines to evaluate pronunciation quality automatically. Using computers to evaluate learners' pronunciation quality objectively can effectively overcome the shortcomings of subjective evaluation, reduce evaluation deviation, and improve evaluation efficiency. Objective evaluation applies a unified evaluation standard, so its advantages become more prominent when a large number of speech evaluation tasks must be handled.
The design of the objective evaluation system should simulate, as far as possible, the process by which English experts evaluate test speech.
①Pitch evaluation
Pitch evaluation mainly examines whether the content of the pronounced sentence is complete and accurate, whether the pronunciation is clear and fluent, and whether there are pronunciation errors. The author uses MFCC coefficients, which are based on the human auditory model, as the parameters for pitch evaluation, builds a speech recognition model with a deep belief network, and uses it to check the completeness and accuracy of the content. As shown in Figure 7, the MFCC features of the input sentence are compared with those of the standard sentence to assess whether the pronunciation is clear and fluent, and a comprehensive assessment of the English pronunciation quality is then given.
②Speaking speed evaluation
Because different speakers talk at different speeds, the duration of the same sentence differs from speaker to speaker. The speaker's emotional state also affects the speed of speech: in anger or joy, for example, speech is somewhat faster than in a calm state, while in sadness it is generally slower. Speech rate is evaluated from the sentence durations, as shown in the following formula:
The formula is defined in terms of the duration of the standard sentence and the duration of the test sentence.
The resulting speech rate measure is then compared with the set speech rate thresholds, as shown in Figure 8.
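A duration-based speech rate check of this kind might look like the sketch below. The exact formula and thresholds are not recoverable from the text, so two things are assumed purely for illustration: the measure is taken to be the ratio of the test sentence duration to the standard sentence duration, and the grade boundaries in `tolerances` are hypothetical values, not the article's thresholds.

```python
def speech_rate_ratio(test_duration_s, standard_duration_s):
    """Ratio of test to standard sentence duration (assumed form of the measure).
    A value above 1 means the test sentence was spoken more slowly."""
    if test_duration_s <= 0 or standard_duration_s <= 0:
        raise ValueError("durations must be positive")
    return test_duration_s / standard_duration_s

def grade_speech_rate(ratio, tolerances=(0.1, 0.2, 0.3)):
    """Map the deviation of the ratio from 1.0 to a grade A-D.
    The tolerance values are hypothetical placeholders."""
    deviation = abs(ratio - 1.0)
    for grade, tolerance in zip("ABC", tolerances):
        if deviation <= tolerance:
            return grade
    return "D"

# Example: a 2.0 s standard sentence compared with several test recordings.
grades = [grade_speech_rate(speech_rate_ratio(t, 2.0)) for t in (2.1, 1.7, 2.5, 3.0)]
```

Grading on the deviation from 1.0 penalizes speech that is too fast and speech that is too slow in the same way; an asymmetric rule would need separate thresholds for each direction.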
4. Experimental Simulation and Result Analysis
4.1. Data Sources
The author used the Spoken Arabic Digit data set from the UCI Machine Learning Repository, which was developed by the Laboratory of Automatic and Signals at Badji Mokhtar University. The data set consists of 13th-order MFCC feature parameters extracted from spoken Arabic digits. It contains 8,800 utterances (10 Arabic digits, each repeated 10 times by every speaker), and the speakers include 44 women aged 16–40 years. The parameters set before extracting the MFCC features are a 16 kHz sampling rate, 16-bit encoding, a Hamming window as the window function, and a pre-emphasis filter $1 - 0.97z^{-1}$.
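The first two preprocessing steps listed for MFCC extraction, the pre-emphasis filter with coefficient 0.97 and Hamming windowing, can be sketched as below. The 16 kHz sampling rate and the filter coefficient come from the text; the 25 ms frame length, the 10 ms hop, and the synthetic sine tone standing in for speech are assumptions made for the example.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[:1], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, frame_len, hop_len):
    """Split the signal into overlapping frames and apply a Hamming window.
    Assumes len(signal) >= frame_len; trailing samples that do not fill a
    whole frame are dropped."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

sample_rate = 16000                       # Hz, as stated in the text
t = np.arange(sample_rate) / sample_rate  # one second of samples
x = np.sin(2 * np.pi * 440 * t)           # synthetic tone used in place of speech

frames = frame_and_window(pre_emphasize(x), frame_len=400, hop_len=160)
```

Each row of `frames` is one windowed frame; a full MFCC pipeline would continue with a Fourier transform, a Mel filter bank, a logarithm, and a discrete cosine transform on each row.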
4.2. English Sentence Data Sources
The subjects were college students, 25 people in total (14 male and 11 female). They were recorded with recording and editing software at a sampling frequency of 16 kHz with 16-bit encoding. A total of 10 sentences were recorded.
(1) We always post it.
(2) Store clothes in the refrigerator.
(3) He will notify on Wednesday.
(4) I can tell her tonight.
(5) There is black paper on the edge of the paper.
(6) The leader will arrive in seven hours.
(7) What is that bag under the table?
(8) They had just moved to the second floor and are now descending again.
(9) I always come home to see Agnes.
(10) Take it and go with curly hair.
4.3. Experimental Results and Analysis
Here, the author's model is compared with the models mentioned above, and the comparison of recognition rates is shown in Figure 9.
As can be seen from Figure 9, the recognition rate of the DBN model constructed by the author is 96.64%, which is better than the above models.
4.4. Speech Evaluation Experiment
The purpose of the speech evaluation experiment is to test the performance of the proposed English pronunciation quality evaluation model and method and to verify the relationship between machine evaluation and manual evaluation on samples of the same English sentences. The process is as follows: first, the reliability of the manual evaluation is checked; then, given that the manual evaluation is reliable, the correlation between the machine evaluation and the manual evaluation is verified.
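The agreement rate is the ratio of the number of samples for which the machine evaluation and the manual evaluation are identical to the total number of samples. The specific method of calculation is as follows:

$$\text{agreement rate} = \frac{N_{\text{same}}}{N_{\text{total}}} \times 100\%.$$

The adjacent agreement rate is the ratio of the number of samples for which the machine evaluation and the manual evaluation are identical plus the number of adjacent samples to the total number of samples, where "adjacent" means a difference of one level between the machine evaluation and the manual evaluation. The specific calculation method is as follows:

$$\text{adjacent agreement rate} = \frac{N_{\text{same}} + N_{\text{adjacent}}}{N_{\text{total}}} \times 100\%.$$

The Pearson correlation coefficient between the machine scores $x_k$ and the manual scores $y_k$ is calculated as follows:

$$r = \frac{\sum_{k}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k}(x_k - \bar{x})^2}\,\sqrt{\sum_{k}(y_k - \bar{y})^2}}.$$

①Manual evaluation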
Taking into account the English speaking level and the pronunciation characteristics of college students, and in consultation with English pronunciation experts, each of the four evaluation indicators (pitch, speech rate, rhythm, and intonation) and the overall evaluation is graded on four levels, A, B, C, and D. The detailed evaluation descriptions and the related evaluation standards are listed in Table 1.
The manual evaluation was conducted by two experienced college English teachers. One by one, they evaluated the 10 English sentences read by 24 college students of our school, giving grades for the four indicators (pitch, speech rate, rhythm, and intonation) and for the overall evaluation.
Because the subjectivity of the teachers may affect the results of the manual evaluation, the author uses the Pearson correlation coefficient to verify the reliability of the manual evaluation results.
To make the calculation easier, grades A, B, C, and D are converted to 4, 3, 2, and 1, respectively. According to the Pearson correlation test (two-tailed), the two teachers' scores are positively correlated for each of the four evaluation indicators (pitch, speech rate, rhythm, and intonation) and for the overall score (r > 0, p < 0.05). This suggests that the two teachers applied essentially the same evaluation standards, which ensures the reliability of the experimental data.
Further, the evaluation results of the two teachers are averaged (and rounded up) to obtain, for each sentence of each student, the score on each evaluation indicator and the overall score, which serve as the final manual evaluation result.
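The grade conversion, the reliability check, and the averaging of the two raters can be sketched as follows. The grades below are invented for illustration and are not the experimental data; the function names are hypothetical. Only the correlation coefficient is computed here. A p-value for the two-tailed test can be obtained with `scipy.stats.pearsonr`, which is not used in this sketch so that it depends on NumPy alone.

```python
import math
import numpy as np

GRADE_TO_SCORE = {"A": 4, "B": 3, "C": 2, "D": 1}
SCORE_TO_GRADE = {score: grade for grade, score in GRADE_TO_SCORE.items()}

def to_scores(grades):
    """Convert letter grades to numeric scores (A=4, B=3, C=2, D=1)."""
    return np.array([GRADE_TO_SCORE[g] for g in grades], dtype=float)

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def final_manual_grades(grades_a, grades_b):
    """Average the two raters' scores and round up to the next whole grade."""
    mean = (to_scores(grades_a) + to_scores(grades_b)) / 2.0
    return [SCORE_TO_GRADE[math.ceil(m)] for m in mean]

# Hypothetical grades from two raters for six sentences.
teacher_1 = ["A", "B", "B", "C", "D", "A"]
teacher_2 = ["A", "A", "B", "C", "C", "B"]

r = pearson_r(to_scores(teacher_1), to_scores(teacher_2))
final = final_manual_grades(teacher_1, teacher_2)
```

Rounding up means that whenever the two raters differ by one level, the higher grade is kept, so the final manual grades are never lower than the plain average.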
Using the method described by the author, the 240 sentences (24 students, 10 sentences each) are scored by machine on the four evaluation indicators of pitch, speech rate, rhythm, and intonation and then compared with the manual evaluation. The experimental results are shown in Tables 2 and 3.
Regression analysis uses mathematical statistics to build statistical models and to study the statistical relationship between variables (the form and the closeness of the relationship). From a large number of experiments and observations, it looks for the statistical regularity hidden in seemingly uncertain phenomena, that is, the regression function relating the dependent variable to the independent variables, and uses it to make predictions. The author takes the overall score of the manual evaluation as the dependent variable and the pitch, speech rate, rhythm, and intonation scores as the independent variables. English sentences for which the manual evaluation and the machine evaluation are completely consistent are selected, multiple linear regression is applied, and the weight of each evaluation indicator is obtained with the SPSS software, as shown in the following formula:
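The weighting step can be illustrated with an ordinary least-squares fit. The article obtains its weights with SPSS; the sketch below uses NumPy instead and runs on synthetic scores, so the weights in `true_w` are made-up numbers chosen only to show that the fit recovers them. They are not the weights of formula (14), and the function name `fit_weights` is hypothetical.

```python
import numpy as np

def fit_weights(indicator_scores, overall_scores):
    """Least-squares fit of overall = indicators @ w + c.
    Returns the indicator weights w and the intercept c."""
    X = np.asarray(indicator_scores, dtype=float)
    y = np.asarray(overall_scores, dtype=float)
    A = np.column_stack([X, np.ones(len(X))])   # extra column for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

# Synthetic data: 40 sentences scored 1-4 on pitch, speech rate, rhythm, intonation.
rng = np.random.default_rng(1)
X = rng.integers(1, 5, size=(40, 4)).astype(float)
true_w = np.array([0.4, 0.1, 0.3, 0.2])   # made-up weights, for illustration only
y = X @ true_w                            # noiseless synthetic overall scores

w, c = fit_weights(X, y)
```

With real ratings the fit is not exact, and the fitted overall score would still have to be mapped back to a grade from A to D before it can be compared with the manual result.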
The author also carried out a comprehensive evaluation of the 240 sentences (24 students, 10 sentences each) using formula (14). Figure 10 and Table 4 show the test results. Machine and manual evaluations agree exactly for 210 samples and differ by one grade for 30 samples. The overall agreement rate between machine and human evaluation is reported as 90.65%, the adjacent agreement rate is 100%, and the correlation coefficient is 0.798. This indicates a strong correlation between machine and human evaluations, so the model can be used to evaluate the quality of English speech and pronunciation.
Some experimental data are shown in Table 5.
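The comparison between machine and manual grades relies on the agreement rate, the adjacent agreement rate, and the correlation coefficient. A minimal sketch of how these might be computed is given below; the eight score pairs are hypothetical and are not the data of Tables 4 and 5, and grades are assumed to be already converted to the numeric scores 4, 3, 2, 1.

```python
import numpy as np

def agreement_rates(machine, human):
    """Return (agreement rate, adjacent agreement rate) as fractions.
    'Adjacent' counts samples that are identical or differ by one level."""
    machine = np.asarray(machine)
    human = np.asarray(human)
    diff = np.abs(machine - human)
    exact = float(np.mean(diff == 0))
    adjacent = float(np.mean(diff <= 1))
    return exact, adjacent

# Hypothetical numeric grades (A=4, B=3, C=2, D=1) for eight sentences.
machine = [4, 3, 3, 2, 1, 4, 2, 3]
human = [4, 3, 2, 2, 1, 4, 3, 3]

exact, adjacent = agreement_rates(machine, human)
r = float(np.corrcoef(machine, human)[0, 1])   # Pearson correlation coefficient
```

In this toy example six of the eight pairs match and the other two differ by one level, so the agreement rate is 75% and the adjacent agreement rate is 100%.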
Although the results of this experiment are very good, the author believes that this may be due to the following factors: (1) the amount of data is small, and the model was adjusted by linear regression analysis; (2) the final manual score is the average of the two English teachers' scores, rounded up, which makes the scores higher; (3) a four-level grading method is used for the evaluation. In formula (14), pitch has the largest weight, followed by rhythm, intonation, and speech rate. After obtaining these results, the author consulted English phonetics experts at our school, who confirmed the reliability of the multi-parameter pronunciation quality evaluation model proposed by the author. According to the experts, when English pronunciation quality is evaluated, pitch (accurate content, fluent pronunciation, and no obvious pronunciation errors) is the most important indicator, whereas rhythm and intonation mainly reflect the speaker's emotional tone. Learners should therefore first make sure that they pronounce words correctly and avoid pronunciation errors.
With the advance of deep learning research, neural networks have gained new vitality, and speech recognition technology has developed rapidly. Deep neural networks, built on a multilayer nonlinear structure, represent complex functions with few parameters. Layer-by-layer unsupervised feature transformation better captures the distribution of the data and shows excellent learning ability, which helps improve the accuracy of classification or prediction. This is confirmed on the Spoken Arabic Digit data set from the UCI Machine Learning Repository, where the recognition results are better than those of the improved hidden Markov model, the BP neural network model, and the tree distribution approximation model. In addition, this article uses a deep belief network model to evaluate the quality of English pronunciation: a speech recognition model based on a deep belief network is used to evaluate pitch.
No data were used to support this study.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
K. I. Lee, Han, and L. Kun, “Speech recognition and lip shape feature extraction for English vowel pronunciation of the hearing-impaired based on SVM technique,” Journal of Rehabilitation Welfare Engineering & Assistive Technology, vol. 11, no. 3, pp. 247–252, 2017.
Z. Lv, Y. Li, H. Feng, and H. Lv, “Deep learning for security in digital twins of cooperative intelligent transportation systems,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, 2021.
G. Zhang, “Quality evaluation of English pronunciation based on artificial emotion recognition and Gaussian mixture model,” Journal of Intelligent and Fuzzy Systems, vol. 40, no. 2, pp. 1–11, 2020.
Z. Yun, “Research on spoken English speech recognition technology in computer network environment,” Boletin Tecnico/Technical Bulletin, vol. 55, no. 16, pp. 445–449, 2017.
T. T. Shi, X. B. Zhang, L. P. Guo, Z. X. Jing, and L. Q. Huang, “Research on remote sensing recognition of wild planted Lonicera japonica based on deep convolutional neural network,” Zhongguo Zhongyao Zazhi (China Journal of Chinese Materia Medica), vol. 45, no. 23, pp. 5658–5662, 2020.