Abstract

To improve the effect of English teaching practice, this paper constructs an intelligent English phonetic teaching system based on phonetic feature parameter recognition. The paper builds a simulation of a noisy self-mixing interference signal, analyzes the magnitude of the noise under different conditions, and selects the EEMD method as the English speech denoising algorithm. With the support of this denoising algorithm, the paper implements an intelligent English teaching system based on the recognition of English speech feature parameters. Finally, the proposed speech feature recognition algorithm and the teaching system are evaluated by means of simulation teaching. The results indicate that the English teaching system based on the proposed intelligent speech feature recognition algorithm improves both spoken English feature recognition and English teaching.

1. Introduction

In English learning, language input and language output are closely related. Input is the premise of and necessary preparation for output, and output is the ultimate goal of input and can in turn stimulate input. The two complement each other and jointly improve students’ comprehensive language ability. Although the Internet provides a large number of online learning resources, college students differ greatly in English proficiency. Moreover, students are not strongly motivated to seek out language input, and their ability to learn independently from online resources is weak. Therefore, college students in the Internet age still face the problem of insufficient effective language input.

The teaching hours and capacity of English classrooms in colleges and universities are limited. Public English courses are mostly taught in large classes, with anywhere from fifty or sixty to several hundred students per class. A class lasts 90 minutes, so each student has, on average, very few opportunities for language output, and students’ language output in the classroom is far less than their language input. Relying on an online learning platform to realize real-time online and offline interaction between teachers and students can expand teaching space and time, increase students’ opportunities for language output both on the online platform and in the offline classroom, and help resolve this dilemma of the traditional English classroom. The content and form of language output are not arbitrary; they should form an organically linked whole with the content and form of language input. Teachers therefore need to regulate the quantity and quality of language input, so that the input materials match the interests and needs of college students and their difficulty is in line with students’ learning ability and level. It is even more necessary for teachers to properly supervise students’ language input process, both in the classroom and on the online platform, to promote autonomous learning online and offline and to achieve effective language output.

Speaking and writing are the main forms of language output and the main way to test the effectiveness of language input. When designing the teaching content for the speaking, audio-visual, reading, and writing modules, attention should be paid to the thematic cohesion between modules, so that under the same learning theme students learn vocabulary and sentence patterns layer by layer and improve gradually through repeated practice [1]. Speaking content is mainly arranged in class, where students train their oral expression or speech ability under the guidance of teachers. For students with low English proficiency, oral homework can be set after class, and students can be encouraged to help each other offline [2]. In the writing module, short videos, microlectures, or courseware can be provided for self-study as preclass preview. If students have a weak English foundation, teachers should not set too many or too difficult preclass writing tasks. Instead, teaching can focus on the input of writing knowledge, strengthen the teacher’s intensive instruction and guidance during class, and encourage students to write on the spot in the classroom to complete effective language output [3].

In English teaching, listening teaching runs through the whole teaching process, and the application conditions of intelligent speech synthesis technology in English listening teaching are analyzed from two aspects of teaching activities and lesson types in the textbook.

This paper combines the method of speech feature parameter recognition to construct an intelligent English pronunciation teaching system and applies it in English teaching practice to improve the effect of modern English teaching.

2. Related Work

The artificial neural network essentially imitates the structure of human neurons: it places a series of decisions between the input and the output, each yielding 1 or 0, and produces the output after this chain of decisions and processing. After solving some simple linear problems, its development stagnated because the XOR problem was difficult to solve [4]. The study of backpropagation then arose. Backpropagation lets the model learn its weight parameters by itself, which solves the training problem of artificial neural networks. With the continuous development of Internet technology, more and more data has become available to artificial neural networks, learning has become faster and faster, and the field has become a research hotspot worldwide [4].

Deep learning is developed from the artificial neural network model and learns faster. Literature [5] designed an 8-layer convolutional neural network for speech recognition. The deep learning model has strong learning ability and good calculation results, and its training behavior is also reported to be better than that of the conventional artificial neural network. However, because it is an emerging technology, there is still relatively little research, and continued exploration is required.

Reference [6] divides a nonstationary signal into segments with short time intervals by applying a fixed time window to the signal. Within a sufficiently narrow window the signal can be regarded as stationary, so a Fourier transform of each segment yields the time-frequency correspondence of the signal; this is the Gabor transform. On the basis of the Gabor transform, the signal is decomposed using window functions of different sizes to obtain the time-frequency analysis method of the Short-Time Fourier Transform (STFT) [7]. STFT relies heavily on the size and shape of the time window. Once the window is chosen it is fixed, so the resolution of the STFT is fixed as well, and the window size cannot be adjusted as the frequency changes. Reference [8] proposed the Wigner-Ville Distribution (WVD), which applies the Fourier transform to the instantaneous autocorrelation function of the signal. Because it does not use a window function, it is also an effective method for nonstationary signals, and it gives high time-frequency concentration, but it suffers from serious cross-interference terms. Reference [9] proposes the Wavelet Transform (WT), a time-frequency analysis method that matches wavelet basis functions to the signal. The mother wavelet can be scaled and shifted, and the shape of the wavelet basis can be changed according to the frequency characteristics of the signal, so the time-frequency analysis has a certain “zoom” capability. However, the wavelet transform essentially treats the signal within the time window as approximately stationary, and the local time-frequency resolution sometimes deteriorates, which limits its application to seismic signals [10]. Reference [11] proposed the S transform (ST), whose time window adapts to the frequency. This overcomes the fixed time-frequency resolution of the STFT and better matches the time-varying characteristics of seismic signals, but the fixed form of the selected wavelet basis function limits its application. Reference [12] proposed the generalized S transform (GST), whose window function can be flexibly adjusted as the frequency changes, giving time-frequency analysis results with high time-frequency concentration. At the same time, the seismic signal can be analyzed flexibly according to actual needs.

In literature [13], the Hilbert transform is proposed as an analysis method suited to the spectral decomposition of seismic signals; it brings out the physical significance of seismic signals and yields three instantaneous attributes with geological significance. Reference [14] proposed the Hilbert-Huang Transform (HHT), which combines the Empirical Mode Decomposition (EMD) method with the Hilbert transform. EMD decomposes the signal into several subsignals, the Intrinsic Mode Functions (IMFs), and the Hilbert transform is applied to each IMF component to obtain the instantaneous attributes of each subsignal. EMD has the advantages of completeness and adaptivity but also suffers from mode aliasing, end effects, and inadequate sifting stop criteria. Reference [15] applied high-order spline interpolation to EMD to improve its calculation accuracy. Reference [16] improved EMD and proposed Ensemble Empirical Mode Decomposition (EEMD), which adds white noise to the original signal. This effectively solves the problem of modal aliasing but leaves residual noise in the reconstructed signal. Reference [17] proposed the Complete Ensemble Empirical Mode Decomposition (CEEMD), which is also a noise-assisted method; it adds white noise at the first stage of the decomposition, which greatly reduces the influence of the noise. As these examples show, although EMD-based decomposition has achieved certain results, the overall calculation process is complicated and hard to analyze because it lacks a rigorous mathematical foundation. Reference [18] proposes an adaptive and nonrecursive signal decomposition method that locates the instantaneous spectrum of the signal more accurately, because each decomposed mode has its own center frequency and the modes are compact. This effectively eliminates the modal aliasing commonly found in EMD, CEEMD, and other methods.

3. Denoising and Recognition of Speech Waveform Features

Signal denoising is an important step in the whole system and the basis for the subsequent work. Only when a relatively clean self-mixing interference signal is obtained after denoising can the subsequent theory and experiments be rigorous and free of unnecessary errors.

EMD decomposes the entire signal into multiple IMFs whose superposition reproduces the signal. Analyzing each IMF subsignal separately gives more meaningful frequency and power estimates. The whole procedure is a screening (sifting) process that breaks a complex signal into simpler parts. During sifting the signal is iterated many times, which removes superimposed waves and makes the resulting waveform more symmetric than the original; this is the purpose of decomposing the signal into multiple IMF subsignals.

EEMD stands for Ensemble Empirical Mode Decomposition. It is a noise-assisted data analysis algorithm built on EMD that makes up for the shortcomings of EMD by adding noise to assist the analysis. In EMD, whether proper IMFs are obtained depends on the distribution of the extreme points of the signal. If the extreme points are not uniformly distributed, modal aliasing will occur.

The principle of the EEMD algorithm is similar to that of EMD. The difference is that EEMD suppresses the modal aliasing that occurs in EMD. EEMD builds on EMD by adding white noise to the signal, and the white noise is spread evenly over the signal. Because the noise has zero mean, the noise added in different trials cancels out when the decomposition results are superimposed and averaged. Simply put, the principle of EEMD is to add white noise on the basis of EMD and then remove it by averaging. The specific steps of the EEMD algorithm are as follows:

(1) The algorithm adds white Gaussian noise $n_i(t)$ with a mean value of 0 to the signal to be decomposed $x(t)$ and normalizes it to obtain the signal $X_i(t)$, as shown in formula (1):

$$X_i(t) = x(t) + n_i(t) \tag{1}$$

(2) The algorithm performs EMD on the signal $X_i(t)$ and decomposes it into $n$ IMF components $c_{ij}(t)$ and one residual component $r_i(t)$, as shown in formula (2):

$$X_i(t) = \sum_{j=1}^{n} c_{ij}(t) + r_i(t) \tag{2}$$

(3) The algorithm repeats steps 1 and 2, each time adding a new white noise sequence that follows the normal distribution to the signal to be processed. The total number of trials is $N$, which gives formula (3):

$$X_i(t) = \sum_{j=1}^{n} c_{ij}(t) + r_i(t), \quad i = 1, 2, \ldots, N \tag{3}$$

(4) Because the added noise has zero mean, averaging over the $N$ trials cancels it and removes its influence on the signal. The IMF components of each order and $r(t)$ are finally obtained, as shown in formula (4):

$$c_j(t) = \frac{1}{N} \sum_{i=1}^{N} c_{ij}(t), \qquad r(t) = \frac{1}{N} \sum_{i=1}^{N} r_i(t) \tag{4}$$

(5) The final reconstructed signal is shown in formula (5):

$$x(t) = \sum_{j=1}^{n} c_j(t) + r(t) \tag{5}$$

In the above formulas, $c_{ij}(t)$ is the j-th IMF component obtained by decomposing the signal with the i-th white noise sequence added, $t$ is the time, $N$ is the number of white noise sequences added, $x(t)$ is the original signal, $r_i(t)$ is the residual term after decomposition, and $n$ is the number of IMF components obtained. In formula (5), $c_j(t)$ and $r(t)$ are the IMF components of each order and the final residual component obtained after EEMD decomposition, respectively. For ease of understanding, the flowchart of EEMD decomposition is shown in Figure 1.
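The ensemble steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Matlab implementation used in the paper: the envelopes use linear interpolation instead of the cubic splines usually used in EMD, the number of IMFs and sifting passes is fixed so that components line up across trials, the normalization in step 1 is omitted, and all function names and default values (`n_trials`, `noise_std`, `n_imfs`) are hypothetical.

```python
import math
import random


def _envelope(x, idx):
    """Piecewise-linear envelope through (i, x[i]) for i in idx, pinned at both ends."""
    knots = [0] + idx + [len(x) - 1]
    env = [0.0] * len(x)
    for a, b in zip(knots, knots[1:]):
        for k in range(a, b + 1):
            w = (k - a) / (b - a)
            env[k] = (1 - w) * x[a] + w * x[b]
    return env


def _sift_once(x):
    """One sifting pass: subtract the mean of the upper and lower envelopes."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] >= x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] <= x[i + 1]]
    if len(maxima) + len(minima) < 3:
        return None  # too few extrema: nothing left to extract
    upper, lower = _envelope(x, maxima), _envelope(x, minima)
    return [v - (u + l) / 2 for v, u, l in zip(x, upper, lower)]


def emd(x, n_imfs=4, n_sift=8):
    """Simplified EMD returning exactly n_imfs components plus a residual."""
    imfs, residual = [], list(x)
    for _ in range(n_imfs):
        h = residual
        for _ in range(n_sift):
            nxt = _sift_once(h)
            if nxt is None:
                break
            h = nxt
        # If no sifting pass succeeded, emit a zero component as padding.
        imf = [0.0] * len(x) if h is residual else h
        imfs.append(imf)
        residual = [r - c for r, c in zip(residual, imf)]
    return imfs, residual


def eemd(x, n_trials=50, noise_std=0.1, n_imfs=4, seed=0):
    """Average the EMD results of n_trials noisy copies of x."""
    rng = random.Random(seed)
    imf_sum = [[0.0] * len(x) for _ in range(n_imfs)]
    res_sum = [0.0] * len(x)
    for _ in range(n_trials):                                 # formula (3): N trials
        noisy = [v + rng.gauss(0.0, noise_std) for v in x]    # formula (1)
        imfs, res = emd(noisy, n_imfs)                        # formula (2)
        for j in range(n_imfs):
            imf_sum[j] = [s + c for s, c in zip(imf_sum[j], imfs[j])]
        res_sum = [s + r for s, r in zip(res_sum, res)]
    imfs_avg = [[s / n_trials for s in row] for row in imf_sum]  # formula (4)
    res_avg = [s / n_trials for s in res_sum]
    return imfs_avg, res_avg


# Two-tone test signal (an arbitrary stand-in, not the paper's signal).
signal = [math.sin(2 * math.pi * 5 * k / 256) + 0.5 * math.sin(2 * math.pi * 40 * k / 256)
          for k in range(256)]
imfs, residual = eemd(signal)
recon = [sum(c[k] for c in imfs) + residual[k] for k in range(256)]  # formula (5)
max_err = max(abs(a - b) for a, b in zip(recon, signal))
```

In this sketch the reconstruction differs from the original only by the average of the added noise sequences, so `max_err` shrinks as `n_trials` grows; this is the cancellation that step 4 relies on.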

Signal denoising has long been an area of wide interest. There are many denoising methods, which differ in their denoising effect and in their complexity. Choosing a suitable method is therefore very important, and the choice requires evaluation indicators. The most commonly used indicators are the signal-to-noise ratio, the root mean square error, and the correlation coefficient. The following uses these three indicators to compare different denoising methods in order to select the best algorithm. First, the three indicators are introduced.

3.1. Signal-to-Noise Ratio (SNR)

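The signal-to-noise ratio can be expressed mathematically as the ratio of two powers, and the expression is

$$\mathrm{SNR} = 10 \log_{10} \frac{P_s}{P_n}$$

In the formula, $P_s$ is the power of the original signal and $P_n$ is the power of the noise.

Among them, $P_s = \frac{1}{n} \sum_{i=1}^{n} f^2(i)$ and $P_n = \frac{1}{n} \sum_{i=1}^{n} [f(i) - \hat{f}(i)]^2$, where $f(i)$ is the original signal, $\hat{f}(i)$ is the denoised signal, and $n$ is the length of the signal.

In current research, the signal-to-noise ratio is often used as a measure of noise. It not only reflects the size of the noise but can also be used to evaluate the effect of denoising. In theory, the higher the signal-to-noise ratio, the better the denoising effect.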
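As a minimal sketch, the SNR of a denoised signal can be computed as below. It assumes the common convention SNR = 10 log10(P_s / P_n) in decibels, with the noise power taken as the mean squared difference between the original and the denoised signal; the function name `snr_db` is hypothetical.

```python
import math


def snr_db(original, denoised):
    """SNR in dB: signal power over the power of the remaining error."""
    n = len(original)
    p_signal = sum(v * v for v in original) / n
    p_noise = sum((v - w) ** 2 for v, w in zip(original, denoised)) / n
    return 10 * math.log10(p_signal / p_noise)


clean = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]   # every sample is off by 0.1
value = snr_db(clean, estimate)      # about 20 dB
```

The function is undefined when the two signals are identical, because the noise power is then zero.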

3.2. Root Mean Square Error (RMSE)

The root mean square error is the square root of the mean of the squared deviations between the reconstructed signal and the original signal. It expresses the precision of the measurement process well and is also called the standard error. Its expression is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} [f(i) - \hat{f}(i)]^2}$$

In the formula, $f(i)$ is the original signal, $\hat{f}(i)$ is the reconstructed signal after decomposition at a certain scale, and $n$ is the total length of the signal.

In current research, the root mean square error is often used as a measure of noise removal. In theory, the smaller the root mean square error, the better the denoising effect.
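A minimal sketch of this indicator, assuming the usual definition (square root of the mean squared deviation between the original and reconstructed samples); the function name `rmse` is hypothetical.

```python
import math


def rmse(original, reconstructed):
    """Root mean square error between two equally long sample lists."""
    n = len(original)
    return math.sqrt(sum((f - g) ** 2 for f, g in zip(original, reconstructed)) / n)


error = rmse([0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0])  # every deviation is 1
```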

3.3. Correlation Coefficient

The correlation coefficient is a statistical indicator first designed by the statistician Karl Pearson. It measures the degree of linear correlation between variables. It is generally represented by the letter r, and its expression is

$$r = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$$

In the formula, cov(X, Y) is the covariance of X and Y, Var[X] is the variance of X, and Var[Y] is the variance of Y.

The correlation coefficient is often used as a strong indicator of how well the noise is removed. In theory, the larger the correlation coefficient, the better the denoising effect, and the smaller the correlation coefficient, the worse the denoising effect.
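The correlation coefficient between an original and a denoised signal can be sketched as follows, using the standard Pearson definition (covariance divided by the square root of the product of the variances); the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sample lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


same_shape = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])  # scaled copy
inverted = pearson_r([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])    # reversed trend
```

A scaled copy of the signal still gives a coefficient close to 1, which is why this indicator is read together with the SNR and the RMSE rather than on its own.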

We add varying degrees of noise to the pure, undisturbed simulated signal. Moreover, we add three different levels of interference with SNR of 20 dB, 25 dB, and 30 dB to the signals of , , , , n = 2000, and C = 3, respectively, as shown in Figure 2.
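Adding noise at a prescribed SNR can be sketched as follows. The noise standard deviation is derived from the signal power and the target SNR in decibels. The cosine used here is only a stand-in, since the parameters of the paper's simulated self-mixing signal are not reproduced; the function names and the seed are hypothetical.

```python
import math
import random


def add_noise(signal, target_snr_db, seed=0):
    """Add zero-mean white Gaussian noise so that the SNR is about target_snr_db."""
    rng = random.Random(seed)
    p_signal = sum(v * v for v in signal) / len(signal)
    sigma = math.sqrt(p_signal / 10 ** (target_snr_db / 10))
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean version."""
    p_signal = sum(v * v for v in clean) / len(clean)
    p_noise = sum((a - b) ** 2 for a, b in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_signal / p_noise)


n = 2000  # signal length used in the paper's simulation
clean = [math.cos(2 * math.pi * 3 * k / n) for k in range(n)]  # stand-in signal
measured = {t: measured_snr_db(clean, add_noise(clean, t)) for t in (20, 25, 30)}
```

With 2000 samples the measured SNR lands within a fraction of a decibel of each target; it does not match exactly because the noise is a finite random sample.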

It can be seen from Figure 2 that the noise becomes smaller as the SNR of the added noise increases. Adding noise at several levels simulates the different degrees of interference encountered in real experiments, which brings the simulation closer to reality and allows a more comprehensive analysis of the feasibility of the method. The IMFs generated by EEMD decomposition are shown in Figure 3.

The simulation in this paper is carried out in Matlab, and EEMD is used to denoise the self-mixing interference signal. In the simulation, we set , , , , n = 2000, C = 3, and and add Gaussian white noise with a signal-to-noise ratio (SNR) of 20 dB.

It can be clearly seen from the first 8 IMFs in the figure that the high-frequency components are concentrated in the first 4 IMFs, while the components decomposed later are relatively clean. What we mainly need to remove is the high-frequency noise. The first four IMFs show that the noise mainly comes from the transition points, and the interference generated at the transition points is the dominant interference. Filtering out this noise makes it possible to pinpoint the transition points when processing the signal. Therefore, the first 4 IMFs are discarded, and all the remaining IMFs are summed to synthesize the denoised signal, as shown in Figure 4.
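The reconstruction described above amounts to a partial sum over the IMFs. A minimal sketch, assuming the IMFs are ordered from highest to lowest frequency and stored as equally long lists; the function name `denoise_by_dropping` and the toy values are hypothetical.

```python
def denoise_by_dropping(imfs, residual, n_drop=4):
    """Sum every IMF after the first n_drop, plus the residual."""
    kept = imfs[n_drop:]
    return [sum(c[k] for c in kept) + residual[k] for k in range(len(residual))]


# Toy decomposition: four "noisy" high-frequency IMFs followed by two smooth ones.
imfs = [
    [1.0, -1.0, 1.0],
    [0.5, -0.5, 0.5],
    [0.2, 0.2, -0.2],
    [0.1, -0.1, 0.1],
    [2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0],
]
residual = [0.5, 0.5, 0.5]
denoised = denoise_by_dropping(imfs, residual)
```

The number of discarded IMFs is a parameter here because it depends on where the noise concentrates for a given signal; the value 4 follows the observation made for Figure 3.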

From the above figures, it can be clearly seen that most of the noise is removed, and the residue is the signal left after the remaining high-frequency noise has been removed. The remaining noise is negligible, a clean interference signal is obtained, and the transition points are easy to identify, which lays the foundation for the subsequent phase unwrapping and displacement reconstruction.

In order to observe the simulation effect more intuitively, we comprehensively compare the influence of each signal-to-noise ratio on the signal. The simulated signal parameters are , , , , n = 2000, C = 3, and . Table 1 shows the data of each denoising evaluation index when C = 3.

4. English Teaching Based on Speech Feature Parameter Recognition

When setting up teaching units in English courses, the chapter topics can be modularized, with each teaching unit combining life themes with workplace or job-related themes. In addition, the commonly used teaching content can be selected according to a module combination of vocabulary, audio-visual, speaking, reading, writing, grammar, and translation (interpreting and written translation). Each learning module is configured with teaching content for the three stages of before class, in class, and after class, combining online learning + online tutoring with offline teaching + offline activities. Various online and classroom teaching activities are designed according to the teaching content. At the same time, it is necessary to optimize the online learning mode and the teacher-student interaction mode of the course, broaden the paths for language task output, and design corresponding homework, tasks, and supporting evaluation methods. The overall framework of the modular hybrid teaching model is shown in Figure 5.

Figure 6 shows four teaching layers: the cognitive construction layer, the embodied learning layer, the situational interaction layer, and the innovation generation layer. The cognitive construction layer and the embodied learning layer form the process layer, in which knowledge and skills are acquired; the situational interaction layer is the situational layer, in which problems are solved; and the innovation generation layer is the practice layer, in which knowledge and ability are applied. In the process layer, teachers act as students’ facilitators, supporters, and classroom observers. Students actively acquire tools and materials through “physical skills,” actively seek help and support from others, and construct their own understanding through repeated iterations of action and thinking. In the situational layer, teachers introduce the lesson with examples familiar to students, creating a learning atmosphere and motivation. They also set up problem situations that immerse students in the teaching content, help students see that learning is life and life is learning, draw on students’ associative ability and innovative thinking, and stimulate a strong desire to solve problems. In the practice layer, students use imagination and open thinking in brainstorming to conceive a variety of implementation plans and possibilities, open up new ideas for problem solving through hands-on practice, and creatively organize experience and knowledge to solve new problems.

The implementation of teaching links, the growth and development of students, and the improvement of ability and literacy all run through the four teaching layers in the model, and each layer plays a different role. The cognitive construction layer brings together experience and creation, and experience is transferred and applied in innovation. The embodied learning layer gives full play to students’ initiative and encourages them to use their hands and brains and to learn through “doing.” The situational interaction layer realizes the transfer of the situation: it engages students’ emotions, arouses their resonance, strengthens classroom interaction, and lets classroom teaching start, proceed, and end with students immersed in the situation. The innovation generation layer is the practice field in which students solve problems and an important activity layer for cultivating innovative thinking and creativity. The layers are independent of each other but not isolated, and multiple teaching activities may take place in one layer. For example, the cognitive construction layer includes learning activities such as experience reproduction, comprehension of new knowledge, and design and creation. Likewise, the same teaching task may be completed across more than one layer; students’ inquiry activities, for instance, occur in the cognitive construction layer, the embodied learning layer, and the innovation generation layer. The four teaching layers are an important part of the migration teaching model: they echo the top layer of the model, conform to the theoretical concept of this study, support the implementation of the three paths in the middle layer of the model, and constitute a teaching field that nurtures teaching, development, and ability literacy.

Speech recognition technology, broadly speaking, covers semantic recognition and voiceprint recognition. In a narrow sense, it refers to the understanding and recognition of speech semantics, also known as Automatic Speech Recognition (ASR). As the most natural form of human-computer interaction, speech recognition takes speech as its research object and converts the input speech signal into the corresponding text or commands through machine recognition and understanding, so that the machine can be controlled by speech. Although researchers have proposed many different solutions over the development of speech recognition technology, the basic principles are the same, and the general recognition principle of any speech recognition system can be represented by Figure 7. The most important modules of a speech recognition system are speech feature extraction and speech pattern matching.
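To make the feature extraction module concrete, the sketch below computes two classic frame-level speech features, short-time energy and zero-crossing rate. These particular features are an assumption chosen for illustration; the paper does not specify which feature parameters its system extracts, and the function name, frame length, and hop size are hypothetical.

```python
def frame_features(samples, frame_len=4, hop=2):
    """Return (short-time energy, zero-crossing rate) for each frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(v * v for v in frame) / frame_len
        # Count sign changes between neighbouring samples in the frame.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        features.append((energy, crossings / (frame_len - 1)))
    return features


# Toy input: a rapidly alternating segment followed by a constant segment.
samples = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5]
feats = frame_features(samples)
```

Real systems use much longer frames (tens of milliseconds of audio) and usually richer features, but the framing pattern is the same: the signal is cut into overlapping windows and each window is reduced to a short feature vector that the pattern matching module then compares.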

In the view of behaviorist learning theory, human learning is a process of stimulus, response, and reinforcement. If the learning process contains only stimulus and response, the lack of reinforcement greatly reduces the learning effect. Moreover, if stimuli are provided blindly, ignoring the learner’s existing knowledge level, learning needs, and interest, the stimulus will hardly trigger a response from the learner; that is, learning is unlikely to occur. In addition, because people process information in different ways, providing only a single kind of learning stimulus and ignoring individual differences in information processing makes it hard for learners to process new knowledge using their existing cognitive experience, which shows up as difficulty in learning new knowledge. At the same time, many English pronunciation learning systems lack an overall score for the learner’s pronunciation and cannot give timely and accurate feedback on it. They also provide no corrective guidance for mispronunciation, and learners, given their limited level, find it difficult to detect and correct their own pronunciation errors.

The integration of information technology and English courses is not simply a matter of presenting the learning content on a computer; three points need to be considered. The first is to use information technology and multimedia technology to systematically integrate the knowledge of English phonetic transcription. The second is to present knowledge in the way that best helps learners absorb it. The third is to strengthen and consolidate what learners have learned. In view of the problems of the English pronunciation learning software described above, behaviorist learning theory, multimedia cognitivist learning theory, and theories of integrating information technology with the curriculum should be taken into account when designing the platform. In the implementation of the system, an English phonetic transcription learning environment is constructed in which learners can choose the learning content independently. The system presents learners with various forms of learning stimuli, which helps them link old and new knowledge through different information processing channels and form their own knowledge structure. At the same time, the system uses template matching from speech recognition to measure the difference between the tester’s pronunciation and the standard pronunciation, and it reinforces learning by testing the learner’s pronunciation and giving feedback. This process of stimulation, response, connection, and reinforcement is more in line with the laws of human learning, so it can better promote learning. Based on the above analysis, the English speech learning assistance platform studied in this paper is shown in Figure 8.
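One common way to realize template matching between a learner's pronunciation and a standard pronunciation is dynamic time warping (DTW), which tolerates differences in speaking rate. The sketch below is an illustration under that assumption; the paper does not state which matching algorithm the system uses, and the one-dimensional feature sequences and the function name `dtw_distance` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


standard = [1.0, 2.0, 3.0, 2.0]                # reference template
slow_learner = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0]  # same contour, spoken more slowly
off_learner = [1.0, 3.0, 1.0, 3.0]             # different contour
slow_score = dtw_distance(standard, slow_learner)
off_score = dtw_distance(standard, off_learner)
```

A slower but otherwise correct utterance aligns with the template at zero cost, while a different contour yields a positive distance, which a system could map to a pronunciation score or to feedback.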

On the basis of the above research, the intelligent English speech feature recognition algorithm proposed in this paper and the intelligent teaching system of this paper are evaluated by means of simulation teaching, and the results shown in Table 1 and Table 2 are obtained.

From the above research, it can be seen that the English teaching system based on the intelligent speech feature recognition algorithm proposed in this paper can not only effectively improve the effect of spoken English feature recognition but also effectively improve the effect of English teaching.

5. Conclusion

The application of intelligent voice technology in education is never meant to replace teachers but to help them teach, make up for their limitations, improve teaching efficiency and effect, and achieve the goal of “reducing burden and increasing efficiency.” The conditions under which the technology is applied in teaching also need to be analyzed. Under suitable conditions, teachers can play their original role while the technology shows its own functional advantages. In English listening teaching links or scenarios, intelligent speech technology can give full play to its technical advantages and better assist teachers, so that the role of teachers and the functions of intelligent speech synthesis technology complement each other. In this paper, an intelligent English pronunciation teaching system is constructed using speech feature parameter recognition. The experimental study shows that the English teaching system based on the proposed intelligent speech feature recognition algorithm can effectively improve both the recognition of spoken English and the effect of English teaching.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The study was supported by the “Teaching Reform Project of Higher Education” in Heilongjiang Province, China: Construction of English Teaching Model Based on Blue-Ink Cloud Class + BOPPPS in Application-Oriented Universities (Grant no. SJGY20190707).