To improve the effect of spoken English training, this paper combines multimedia information technology to reform the teaching of spoken English and integrates a BP neural network into spoken English training. Based on the actual needs of spoken English training and the teaching framework of the multimedia system, this paper constructs and cleans the data set and implements word vector representations of students and majors. In addition, this paper builds the overall framework of a spoken English resource recommendation algorithm based on the graph convolutional neural network and combines it with the BP deep neural network algorithm to construct the spoken English training system. Finally, this paper designs an experiment to evaluate the system. The experimental results show that the multimedia system based on the BP deep neural network proposed in this paper works well in spoken English training and can effectively improve students' spoken English training results.

1. Introduction

Oral expression, as an important English language skill, promotes the formation of students' English pragmatic competence and is an important manifestation of the function of English as a communication tool. Therefore, it has attracted great attention from teachers and students [1]. In particular, in the current college English classroom, teachers are intensifying students' oral expression training, hoping to seize the golden period of language ability formation at the university stage and lay the foundation for the development of students' oral expression ability [2]. However, many teachers find that students lack enthusiasm for oral training in English classrooms and that their participation is generally low, which leads to unsatisfactory training results and little improvement in students' oral level. In view of this situation, teachers need to explore the reasons for students' problems in spoken English training, design oral training strategies suited to students, and improve students' spoken English level [3].

As times change, teaching activities must keep pace with social development in order to cultivate more useful talent for society and provide more effective guidance for students. English teaching can be divided into four parts: listening, speaking, reading, and writing. Of these, students' ability to speak is becoming more and more important. However, current college spoken English teaching still has some problems that hinder further improvement of students' oral ability. This paper therefore gives a brief analysis of these problems.

For university students, another factor that limits improvement in oral English is a relatively weak oral foundation. College students do not start learning spoken language from scratch, so their existing oral skills affect their further learning. Students are commonly reluctant to speak and pronounce words inaccurately. Flexible communication in English requires long-term practice, and the first step is for students to have the courage to “speak” and to communicate boldly in English. Only when students have the courage to speak can they truly take the first step in learning spoken English. However, college students generally do not dare to “speak”: they are afraid of inaccurate pronunciation and of saying something wrong. This fear has seriously hindered the improvement of their oral English ability. Oral English learning also differs from the learning of other areas of English knowledge: only in a suitable environment can students truly practice speaking and truly improve their oral ability. At present, university students have no dedicated environment in which to practice oral English extensively. Their oral practice is limited to classroom conversations or English activities organized by the school, which is far from enough.

To improve the current state of college students' spoken English, both teachers and students must give it serious attention. First of all, teachers need to pay attention to spoken English teaching, which requires them to move away from the traditional English teaching concept and take the new teaching concept as their guide. Teachers should attend not only to the other parts of English but also to oral teaching, and should introduce effective spoken English teaching methods into the classroom to improve its effectiveness. In addition, teachers should help students understand the importance of spoken English so that students devote part of their English learning time to it.

This paper combines the BP deep neural network algorithm to construct a spoken English training system that improves the intelligence of spoken English training, and evaluates the performance of the system through experimental research.

2. Related work

Literature [4] holds that formulaic language is “a part of automatic or semi-automatic memory reserve, even through the screening mechanism of dividing these words, they are still automatically stored”. Literature [5] holds that spoken language is a series of words that people habitually use together. Literature [6] proposed that spoken language is a co-occurrence of words in natural texts that reaches a statistically significant frequency. From the perspective of formulaic language, Wray's term “formulaic language” has had a far-reaching impact. Formulaic language is a combination of continuous or discontinuous words or other elements that is ordered and prefabricated; it is stored and retrieved as a whole in the process of language use and is not subject to generation or analysis by the grammar. Literature [7] shows that word chunks are cognitively processed as single vocabulary units and that word chunks are learning materials well suited to human cognition.

Literature [8] pointed out that English contains a large number of lexical sequences and prefabricated sentences of this kind, which explains how the language of second language learners can be “as fluent as native speakers” and how learners can “choose like native speakers.” Literature [9] pointed out that much of native speakers' knowledge consists of spoken language that neither needs to be nor can be analyzed grammatically. The author explained that communicative competence is actually the ability to master a large number of assembled structures, formulaic expressions, and a set of rules, and to make the necessary adjustments according to the context. Literature [10] holds that language is not composed of traditional grammar and vocabulary but of multi-word prefabricated spoken language. Literature [11] pointed out that the degree of language fluency is determined by the amount of formulaic language stored in memory rather than by knowledge of grammar. Literature [12] pointed out that learning spoken language is more important than learning grammar: language knowledge is, to a considerable extent, spoken knowledge, and grammar comes second. Literature [13] holds that spoken language is the basis of second language learning and that the teaching of common phrases (stock phrases) should be as important as vocabulary and grammar teaching.

Literature [14] studied the differences between college students and middle school students in the use of spoken English and pointed out students' incorrect uses of spoken English. Chinese learners of English seldom use multi-word expressions and idioms in oral English, and there are differences in the frequency of use of phrase structures and sentence structures. There is a significant difference between the oral language use of domestic speakers and English speakers. There is no significant difference between college students and middle school students, while the oral language use of college students is significantly higher than that of middle school students. Spoken language errors in oral communication include mother tongue transfer errors, acquisition errors, redundancy errors, mixed-use errors, and so on.

Literature [15] holds that, in language learning, oral input allows the cognitive resources for information processing to be allocated sensibly; in speaking, prefabricated spoken language helps convey meaning promptly and coherently; in reading, prefabricated spoken language allows low-level information to be processed more quickly, thereby speeding up reading; and in writing, spoken language helps preserve short-term memory and express one's thoughts vividly in words. Literature [16] holds that the richer the store of spoken language and the more proficient its retrieval, the greater the positive impact on all aspects of language learning: listening, speaking, reading, and writing. Literature [17] pointed out that oral language also has its limitations. In language learning, oral input cannot be emphasized one-sidedly while grammar and other indispensable aspects are ignored. For learners to keep making progress over the long term, the relationship between oral input and grammar must be handled correctly so that each makes up for the other's shortcomings and the various language abilities develop in a coordinated manner. Literature [18] studied whether the use of spoken language affects spoken and written English. It found that the use of spoken language has a considerable effect on students' speaking and writing scores and that, compared with grammar knowledge, spoken language clearly occupies a more important position in the structure of English knowledge. Students' oral English level largely depends on their ability to retrieve the spoken English stored in their minds: students who can retrieve more of it, and more accurately, tend to have a higher oral English level. Literature [19] found that the higher a student's writing score, the more proficient and richer the student's oral English.

3. Construction of spoken English resource recommendation model based on BP deep neural network

This article treats students as users, majors as items, and students' class relationships as users' social networks, and establishes a major recommendation model based on students' social interactions. Student features spread through the social network, which affects students' choice of major. Therefore, an influence propagation layer is introduced into the social recommendation network, and a better loss function is then selected to improve the accuracy of the recommendation result. The spoken language resource recommendation model proposed in this paper is divided into three parts. The first part is student modeling, which consists of two models. The first model learns students' preferences for majors from the interactions in the student-major graph. The second model extracts student features from the social network and integrates the features of each student's friends into that student's node in the social graph. Combining the information from the student-major space and the social space yields the latent features of students. The second part is major modeling: in the student-major graph, the evaluations from students of the same major are aggregated to reflect the latent features of the major. The third part integrates the student model and the spoken English model. The model parameters are learned through fully connected layers, and the matching degree between the student and the recommended spoken English resource is calculated.

The model takes two sets as input: the set of students and the set of spoken English resources, each of known size. The student-English speaking scoring matrix, also known as the student-English speaking graph, records the scores: if a student has scored a spoken English resource, the corresponding entry is non-zero; otherwise the entry is 0, and the zero entries for a student correspond to the set of spoken English resources that the student has not evaluated. The social graph of students records which pairs of students are closely related.

The following is a detailed introduction to the spoken language resource recommendation algorithm based on the BP deep neural network.

The purpose of building a student model is to extract the latent features of students. The student model is divided into two parts: the student-English speaking graph and the social network. The two parts are described separately below.

Because a student's score for a spoken English resource reflects the student's preference for it, the scores and the features of each spoken English resource can both help model the student. The student-English speaking graph model in this paper therefore integrates spoken English features and spoken English evaluations to extract the latent features of students.

In formula (1), a multi-layer perceptron (MLP) produces the interaction representation of each spoken English resource and its score. Its input is the concatenation of two vectors: the feature vector of the spoken English resource and the feature vector of the score, which takes one of five evaluation levels. Formula (1) thus combines each spoken English resource with its corresponding score to perform feature extraction. Since each student evaluates many spoken English resources, all of them must be aggregated, so we introduce an aggregation function.

The aggregation function runs over the spoken English resources evaluated by the student; W and b are the weights and biases of the neural network, and the activation function is ReLU. The most common aggregation function directly averages the interaction representations of the spoken English resources evaluated by each student, that is, formula (3).
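One plausible way to write the interaction representation, the aggregation step, and the mean aggregator described above is sketched below. The notation is assumed for illustration and is not taken from the paper: q_a is the feature vector of spoken English resource a, e_r is the feature vector of score level r, g_v is the MLP, the circled plus is vector concatenation, C(i) is the set of resources scored by student i, and sigma is the ReLU activation.

```latex
% Sketch under assumed notation; symbols are illustrative, not the paper's own.
\begin{align}
x_{ia} &= g_v\bigl([\,q_a \oplus e_r\,]\bigr)
  && \text{interaction of resource } a \text{ and its score} \\
h_i &= \sigma\Bigl(W \cdot \mathrm{Aggre}\bigl(\{x_{ia} : a \in C(i)\}\bigr) + b\Bigr)
  && \text{aggregated student feature} \\
\mathrm{Aggre}_{\mathrm{mean}} &= \frac{1}{|C(i)|}\sum_{a \in C(i)} x_{ia}
  && \text{mean aggregator}
\end{align}
```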

However, because each spoken English resource carries a different weight for a student, plain average aggregation is not appropriate. To make up for this shortcoming, an attention mechanism is introduced that assigns an attention weight to each interaction.

The attention weight is computed from the interaction representation of the spoken English resource and its score, concatenated with the feature vector of the student, using a layer with its own weights and biases. The Softmax function is then used to normalize these attention weights to obtain the final attention weights, which can be understood as the contribution of each interaction to the student's latent factors in the student-English speaking space. After the calculation of formulas (1)-(7), we have extracted the latent features of each student from the structure of the student-English speaking graph. The latent features of students are also reflected in their social relationships. Therefore, we introduce the student social network model to extract latent features.
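The attention-weighted aggregation described above can be illustrated with a minimal sketch. It assumes a small two-layer scoring network (the parameters `W1`, `b1`, `w2`, `b2` and the function names are hypothetical, not from the paper) that scores each interaction vector concatenated with the student vector, normalizes the scores with Softmax, and returns the weighted sum.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    # Subtract the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(interactions, student_vec, W1, b1, w2, b2):
    """Return (attention weights, weighted sum of interaction vectors).

    The two-layer scoring network is an assumed form for illustration.
    """
    raw = []
    for x in interactions:
        # Concatenate the interaction vector with the student vector.
        hidden = relu([h + b for h, b in zip(matvec(W1, x + student_vec), b1)])
        raw.append(sum(w * h for w, h in zip(w2, hidden)) + b2)
    alpha = softmax(raw)
    dim = len(interactions[0])
    pooled = [sum(a * x[d] for a, x in zip(alpha, interactions)) for d in range(dim)]
    return alpha, pooled
```

When every interaction receives the same raw score, the weights are uniform and the result reduces to the mean aggregator, which is the baseline the attention mechanism is meant to improve on.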

This paper treats students as users, spoken English resources as items, and students' class relationships as users' social networks, and establishes a spoken English recommendation model based on students' social interactions. Because students' social relationships affect their choice of spoken English resources, the influence propagation layer is introduced into the social recommendation network. The improved social network model is shown in Figure 1.

The features of each student are divided into latent features and explicit features. Explicit features are reflected by students' evaluations of spoken English, and latent features are reflected by students' social relationships. Therefore, we set up latent features, apply free embedding to the latent features of students, and continuously update them through social relationship fusion.

The input of the fusion layer is the explicit feature vector and the latent feature vector of each student. The two feature vectors are fused, and a fully connected layer is used for feature extraction. The output is the complete feature of the student.

The fusion layer consists of a weight matrix and a fully connected layer applied to the student's latent feature and explicit feature.

The influence propagation layer is shown in Figure 2. The green node represents the current node, the gray nodes are the first-order neighbors of the current node, and the red nodes are the neighbors of those first-order neighbors.

Influence in social networks spreads over time. For the current node, the starting feature is the output of the fusion layer. In the first round, the features of the first-order neighbor nodes (all gray nodes) of the current node are averaged and fused into it, so that the current node carries the features of the four neighbor nodes on the blue background. In the next round, the first-order neighbors have themselves been fused, again by averaging, with their own first-order neighbors (the red nodes), so the current node now carries the features of all the nodes on the green background. In general, the student's feature vector after influence diffusion is obtained by fusing the features of the nodes in the student's neighborhood set after repeated rounds of social network influence.
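The round-by-round spreading of influence can be sketched as repeated neighbor averaging. This is a simplified illustration, not the paper's layer: the rule that a node's new feature is the plain average of its own feature and the mean of its neighbors' features is an assumption standing in for the learned fusion, and the function name `propagate` is hypothetical.

```python
def propagate(features, neighbors, rounds):
    """Spread features over a social graph for a number of rounds.

    features:  dict mapping node -> list of floats
    neighbors: dict mapping node -> list of neighbor nodes
    Assumed update rule: new = 0.5 * (own feature + mean of neighbor features).
    """
    current = {node: list(vec) for node, vec in features.items()}
    for _ in range(rounds):
        updated = {}
        for node, vec in current.items():
            nbrs = neighbors.get(node, [])
            if not nbrs:
                # An isolated node keeps its feature unchanged.
                updated[node] = list(vec)
                continue
            mean = [sum(current[j][d] for j in nbrs) / len(nbrs) for d in range(len(vec))]
            updated[node] = [0.5 * (own + m) for own, m in zip(vec, mean)]
        current = updated
    return current
```

On a chain a - b - c where only a starts with a non-zero feature, c is unchanged after one round and becomes non-zero after two, which mirrors how second-order neighbors only reach the current node in the second round.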

The social network and the student-English speaking graph provide the characteristics of the student from different angles. Therefore, in order to extract all the potential characteristics of the student, it is necessary to combine the student-English speaking graph model and the social network model.

The output combines the latent features from the student-English speaking graph model and the social network model, and the student features are extracted through the fully connected layer.

Just like the student model, in order to learn the latent features of spoken English, the spoken English model aggregates the evaluations of students who select the same spoken English to reflect the latent characteristics of spoken English.

The latent feature of a spoken English resource is obtained by concatenating the feature vector of a student with the feature vector corresponding to the score and passing the result through the fully connected layer. Since more than one student evaluates each spoken English resource, all the students who evaluated it must be aggregated, and a student aggregation function is designed for this purpose.

Because each student contributes differently to the latent features of a spoken English resource, the fusion function introduces an attention mechanism that gives different weights to the students who chose the same spoken English resource.

The result is the latent feature of the spoken English resource extracted by the aggregation function with the attention mechanism.

The spoken English recommendation algorithm simply matches the features of the students with the features of the spoken English resources. Fully connected layers have been shown to approximate a wide class of functions. Therefore, we use multiple fully connected layers to learn the similarity function between student features and spoken English features.

The input to these layers is the concatenation of two vectors: the latent feature of the student and the latent feature of the spoken English resource.

The number of hidden layers is a model parameter, and the final output is the matching degree of the student to the spoken English resource.

The above process applies multiple fully connected layers (MLP) to the result of the concatenation for feature extraction. Finally, the obtained feature matrix is mapped to a one-dimensional vector, that is, the predicted value for each spoken English resource.
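The matching step can be sketched as a small forward pass: concatenate the two latent vectors, apply a stack of fully connected layers with ReLU, and map the result to one number. The function names, layer shapes, and the linear output layer are assumptions for illustration; the paper does not specify them.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def dense(weights, biases, vec):
    # One fully connected layer: weights is a list of rows, one per output unit.
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, biases)]


def predict_match(student_vec, resource_vec, hidden_layers, out_weights, out_bias):
    """Predict the matching degree of a student to a spoken English resource.

    hidden_layers is a list of (weights, biases) pairs, applied in order.
    """
    g = student_vec + resource_vec  # list concatenation of the two latent features
    for weights, biases in hidden_layers:
        g = relu(dense(weights, biases, g))
    # Assumed linear output layer mapping the last hidden vector to a scalar.
    return sum(w * x for w, x in zip(out_weights, g)) + out_bias
```

In training, the weights in `hidden_layers` and the output layer would be adjusted by backpropagation against the observed scores; the sketch covers only the forward computation.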

The loss function of the BP deep neural network based on social recommendation is computed over a set of tuples, where each tuple represents a student and a spoken English resource that the student has graded. The loss is therefore calculated only for graded spoken English resources; ungraded ones do not contribute. The loss is an L2 loss. When the predicted value differs greatly from the original value, the L2 penalty becomes very large, and when gradient descent is used to solve the problem, the resulting large gradient may cause the gradient to explode.

Therefore, we smoothed the loss function and chose the Smooth loss function.

With the Smooth loss function, when the absolute value of the prediction error is less than 1, the loss differs from the original second-order loss by only one parameter. However, when the absolute value of the error is large, the Smooth loss function reduces the loss penalty compared with the original loss function and becomes a first-order loss. Therefore, compared with the L2 loss function, using the Smooth loss function reduces the weight of outliers and alleviates the model's bias toward them.
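The behavior described above matches the widely used smooth L1 form, sketched below for a single prediction error. This is an assumed form: the threshold parameter `beta` and the way it enters the formula are not given in the text, and with `beta=1.0` the function is quadratic (with a factor of 0.5) for errors below 1 and linear above.

```python
def l2_loss(error):
    # Second-order loss: the penalty grows quadratically with the error.
    return error * error


def smooth_l1_loss(error, beta=1.0):
    """Smooth L1 loss for one prediction error (assumed parameterization).

    Quadratic for |error| < beta, linear otherwise, so large errors
    produce a bounded gradient instead of a growing one.
    """
    magnitude = abs(error)
    if magnitude < beta:
        return 0.5 * magnitude * magnitude / beta
    return magnitude - 0.5 * beta
```

For an error of 3.0 the L2 loss is 9.0 while the smooth loss with `beta=1.0` is 2.5, which illustrates how the penalty on outliers is reduced.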

To justify the choice of loss function, we ran experiments comparing loss functions commonly used in recommendation systems: the MSE loss function, the Exp exponential loss function, and the Smooth loss function with different parameters. The experiments show that when the parameter in the Smooth loss function is 0.6, the MAE results, RMSE results, and accuracy are the best.
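For reference, the two error metrics named above follow their standard definitions, sketched here for lists of predicted and observed scores. The function names are illustrative.

```python
import math


def mae(predicted, observed):
    # Mean absolute error: average magnitude of the prediction errors.
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    # Root mean squared error: penalizes large errors more heavily than MAE.
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))
```

Lower values are better for both metrics, and RMSE is never smaller than MAE on the same data.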

4. Research on application of multimedia in spoken English training based on BP deep neural network

The voice part of the audio can be detected and extracted to enter the text-to-speech alignment process. For the sake of continuity, this paper first analyzes and discusses the forced alignment algorithm based on Viterbi decoding. On this basis, this paper proposes an improved fault-tolerant alignment algorithm based on extended matching network. Moreover, this paper presents a detection method for insertion, deletion, and replacement errors at the word and phrase level, and a larger-scale dynamic alignment algorithm for sentence level. Figure 3 shows the basic flow of text-to-sentence matching of a single sentence.
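The basic forced alignment step can be illustrated with a minimal Viterbi sketch. It assumes that the transcript has already been expanded into an ordered list of units (for example phone models) and that a table `loglik[t][s]` of per-frame log-likelihoods is available; both the table and the function name are hypothetical. Each frame either stays in the current unit or advances to the next one, and transition probabilities are omitted for brevity.

```python
def forced_align(loglik):
    """Return the best unit index for every frame.

    loglik[t][s] is the log-likelihood of frame t under the s-th unit of the
    known transcript. The path starts in unit 0, ends in the last unit, and
    moves monotonically (stay or advance by one unit per frame).
    """
    frames, units = len(loglik), len(loglik[0])
    if frames < units:
        raise ValueError("need at least one frame per unit")
    neg_inf = float("-inf")
    score = [[neg_inf] * units for _ in range(frames)]
    back = [[0] * units for _ in range(frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, frames):
        for s in range(units):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s], back[t][s] = move + loglik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + loglik[t][s], s
    # Trace back from the last unit at the last frame.
    path = [units - 1]
    for t in range(frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

The fault-tolerant variant discussed in the text would extend the set of allowed transitions (for example with silence and garbage models) rather than change this dynamic programming core.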

This paper uses the matching network shown in Figure 4 to extend the search network of the traditional forced alignment algorithm, so as to achieve a fault-tolerant mechanism at the word and phrase level. SIL stands for silent model, OOV stands for garbage model.

This article further discusses how to provide a solution based on the idea of dynamic programming for the alignment of continuous corpus and incomplete matching corpus on a large scale (sentence level). The overall steps of the algorithm are shown in Figure 5.
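The dynamic programming idea behind aligning an incompletely matching transcript can be sketched with a standard edit-distance alignment over word sequences. This is a generic illustration and not the algorithm of Figure 5: it works on recognized words rather than audio, uses unit costs, and the function name `align_words` is hypothetical.

```python
def align_words(reference, spoken):
    """Align two word lists and label each step.

    Returns a list of (kind, reference_word, spoken_word) tuples where kind is
    "match", "replacement", "deletion" (word missing from speech), or
    "insertion" (extra spoken word).
    """
    n, m = len(reference), len(spoken)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = int(reference[i - 1] != spoken[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + diff,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        diff = int(i > 0 and j > 0 and reference[i - 1] != spoken[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + diff:
            kind = "replacement" if diff else "match"
            ops.append((kind, reference[i - 1], spoken[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", reference[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, spoken[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

For the reference "thank you very much" and the spoken sequence "thank very much", the alignment reports a single deletion of "you" and matches for the other words.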

Due to insertion, deletion, and replacement errors, not all models in the matching process can have corresponding nodes, especially the context-dependent HMM model (TRIPHONE) located at the word boundary. In the TRIPHONE model corresponding to syllable K as shown in Figure 6, since the word YOU is skipped during matching, the corresponding TRIPHONE model changes from <NG-K-Y> to <NG-K-V>. Therefore, in addition to comparing the corresponding nodes, it is also necessary to compare the word boundaries to prevent the expansion of the search subspace caused by some models in the second stage of matching because they cannot find the corresponding nodes.
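The effect of a skipped word on a context-dependent model at a word boundary can be reproduced with a small sketch. The word sequence THANK YOU VERY and the pronunciations in `LEXICON` are hypothetical, chosen only so that the model for K changes from <NG-K-Y> to <NG-K-V> as in the example above; SIL pads the utterance boundaries.

```python
# Hypothetical pronunciation dictionary used only for this illustration.
LEXICON = {
    "THANK": ["TH", "AE", "NG", "K"],
    "YOU": ["Y", "UW"],
    "VERY": ["V", "EH", "R", "IY"],
}


def triphones(words, lexicon):
    """Expand a word sequence into <left-center-right> triphone labels."""
    phones = [p for word in words for p in lexicon[word]]
    padded = ["SIL"] + phones + ["SIL"]
    return ["<%s-%s-%s>" % (padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


full = triphones(["THANK", "YOU", "VERY"], LEXICON)
skipped = triphones(["THANK", "VERY"], LEXICON)
print(full[3], skipped[3])
```

Because the right context of K depends on the next word actually spoken, comparing only phone identities is not enough at word boundaries, which is why the boundaries themselves are compared as well.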

The system front-end is based on the standard feature extraction algorithm of the distributed speech recognition front-end of the European Telecommunications Standards Institute (ETSI), as shown in Figure 7. To ensure sufficient accuracy of the front-end features, a single-frame amplitude normalization technique is applied to make full use of the word length of the processor, so that the accuracy of fixed-point operations meets the requirements of the back-end recognition engine.
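One common way to realize per-frame amplitude normalization on a fixed-point processor is block scaling: each frame is shifted left as far as possible without overflowing the word length, and the shift is recorded so that later stages can compensate. The sketch below illustrates that idea only; it is an assumed interpretation, not the paper's implementation, and the 16-bit word length and function name are hypothetical.

```python
def normalize_frame(frame, bits=16):
    """Scale one frame of integer samples to use the available word length.

    Returns (scaled_frame, shift), where every sample was multiplied by
    2 ** shift and the peak magnitude still fits in a signed `bits`-bit word.
    """
    limit = (1 << (bits - 1)) - 1  # 32767 for 16-bit samples
    peak = max(abs(sample) for sample in frame)
    shift = 0
    if peak > 0:
        # Increase the shift while the scaled peak still fits.
        while (peak << (shift + 1)) <= limit:
            shift += 1
    return [sample << shift for sample in frame], shift
```

A quiet frame such as [100, -200, 50] is scaled by 2 ** 7, so its samples occupy most of the 16-bit range before further fixed-point processing.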

Figure 8 shows a schematic diagram of the voice processing flow of the system.

The modules of the speech recognition system include five parts: feature value extraction, phoneme recognition, phoneme association, pronunciation evaluation, and error detection. The speech recognition module is shown in Figure 9:

The external functions are presented in a visual window. The example sentence library view plays example sentences according to the specified sentence or learning strategy. The information view displays sentences and phonetic symbols, displays single-phoneme scores and corrections, displays sentence prosody scores and corrections, and plays the corrections. The user management interface displays the user's transcripts and study files and analyzes easily mispronounced phonemes and common errors. The system can recognize the speech of English learners with a strong Chinese accent.

The model of this paper is trained, and the results are shown in Figure 10. It can be seen from Figure 10 that the student's speech waveform is similar to the standard speech waveform, which indicates that the student's spoken English training is effective and that the student's spoken English is close to the standard.

On the basis of the above analysis, the effect of the spoken English training of this system is evaluated, and the results shown in Table 1 and Figure 11 below are obtained.

It can be seen from the above research that the multimedia system based on the BP deep neural network proposed in this paper works well in spoken English training and can effectively improve students' spoken English training results.

5. Conclusion

In recent years, with the continuous advancement of new curriculum reforms, great attention to spoken English teaching has been required. Although oral teaching has improved to some extent, it still struggles to train students' oral skills effectively, which seriously hinders the improvement of students' oral communication skills. An intelligent spoken English learning system needs to provide functions such as recognition of the user's pronunciation, comparison with expert pronunciation, and error correction. All of these functions are built on speech recognition, so the accuracy of speech recognition and the robustness of the recognition algorithm directly determine the overall performance of the learning system. This paper combines the BP deep neural network algorithm to construct a spoken English training system, improves the intelligence of spoken English training, and evaluates the performance of the system through experimental research. The experimental results show that the multimedia system based on the BP deep neural network proposed in this paper works well in spoken English training and can effectively improve students' spoken English training results.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The author declares no competing interests.


Acknowledgments

This work was sponsored in part by the 2020 Quality Engineering Project of Anhui Province (2020jxtd155).