Conventional assessment of easily confused phonemes in spoken English cannot accurately evaluate groups of people with different levels of oral ability, and the assessment process suffers from poor robustness and low stability. To address these problems, we propose a spoken English assessment method based on an easily confused phoneme assessment model. Within the proposed framework, we design an evaluation model for easily confused English phonemes that adopts fuzzy logic for the assessment task. We present the HDP sets of confused phonemes and introduce the easily confused phonemes in spoken English. Moreover, we derive four fuzzy measure assessment grades (E/G/NI/GR) and present the assessment model for them. We annotate and recognize liaison in spoken English, find the maximum matched statement, and complete the recognition and assessment of easily confused phonemes. Empirical results show that the proposed model outperforms the conventional evaluation method: robustness improves by 30% and stability by 45%. The model is also suitable for assessing the spoken English of different groups of people.

1. Introduction

Conventional methods for assessing easily confused phonemes in spoken English cannot accurately evaluate groups of people with different speaking abilities. Their applicability is narrow, and their robustness and stability are low [1]. In this paper, we propose a spoken English assessment method based on an easily confused phoneme assessment model. We design a fuzzy logic μ-based Sugeno integral [2] and integrate the Sugeno integral framework with a customized HDP set of confused phonemes. Our model then uses four kinds of fuzzy measure ratings (E/G/NI/GR) to evaluate the language score [3].

We design an easily confused phoneme evaluation model. For the assessment model with “simple word list grammar” syntax, we collect the HDPs of Chinese speakers and classify them into HDP sets using Fourier transform and Mel cepstrum filtering. Each HDP set comprises phonemes that Chinese students do not easily distinguish [4]. The credibility with which the assessment model discriminates different HDP sets is estimated on the standard corpus, and the phoneme recognition results are integrated into the Sugeno integral framework [5]. Using an algorithm that finds the maximum matched statement, we annotate and recognize liaison in spoken English and process HDP recognition in batches [6]. To verify the effectiveness of the proposed method, we simulate a test environment with groups of speakers and compare two spoken English assessment methods in terms of robustness and stability. Experimental results show that the proposed spoken English assessment method is highly effective [6].

2. Building Assessment Model of English Easily Confused Phoneme

2.1. Determining Fuzzy Logic of Assessment

In the real world, not everything can be described by exact numbers. For example, there is no exact temperature at which weather becomes “hot” and no exact size at which a sea becomes “great”. Fuzzy logic was introduced to model such scenarios [7]. Fuzzy logic is a mathematical method for describing uncertain problems [8]. The fuzzy set is introduced as follows.

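For fuzzy sets A and B on a universe U with membership functions $\mu_A, \mu_B : U \to [0, 1]$, the complement, union, and intersection are defined as follows, where A’ represents the complement of fuzzy set A [9]:

\[
\mu_{A'}(x) = 1 - \mu_A(x), \qquad
\mu_{A \cup B}(x) = \max\bigl(\mu_A(x), \mu_B(x)\bigr), \qquad
\mu_{A \cap B}(x) = \min\bigl(\mu_A(x), \mu_B(x)\bigr).
\]

B includes A if and only if $\mu_A(x) \le \mu_B(x)$ for all $x \in U$.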
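The standard Zadeh operations on fuzzy sets (complement as one minus the membership grade, union as the pointwise maximum, intersection as the pointwise minimum, inclusion as a pointwise comparison) can be sketched in a few lines of Python. This is a minimal illustration only; the set names and membership grades below are made up and do not come from the paper.

```python
# Hypothetical membership grades on a small universe; the values are illustrative.
hot = {"20C": 0.1, "30C": 0.6, "40C": 1.0}
very_hot = {"20C": 0.0, "30C": 0.3, "40C": 0.9}


def complement(a):
    """Membership of the complement: 1 - mu_A(x)."""
    return {x: 1.0 - m for x, m in a.items()}


def union(a, b):
    """Membership of the union: pointwise maximum."""
    return {x: max(a[x], b[x]) for x in a}


def intersection(a, b):
    """Membership of the intersection: pointwise minimum."""
    return {x: min(a[x], b[x]) for x in a}


def includes(b, a):
    """True when fuzzy set b includes fuzzy set a, i.e. mu_a(x) <= mu_b(x) everywhere."""
    return all(a[x] <= b[x] for x in a)
```

Here `includes(hot, very_hot)` holds because every grade of `very_hot` is no larger than the corresponding grade of `hot`.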

The fuzzy integral is an important tool in the field of fuzzy sets. Among the existing fuzzy integral operations, the Sugeno integral is the most widely used [10]. We give below a brief overview of the essential concepts used in the easily confused phoneme evaluation model.

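Assume X is a non-empty finite feature set, and $2^X$ is its power set. Then $h(x) \in [0, 1]$ represents the reliability of the element $x \in X$. A set function $g : 2^X \to [0, 1]$ is a fuzzy measure over X if it satisfies [11]:

\[
g(\emptyset) = 0, \qquad g(X) = 1.
\]

Similarly, for $A, B \in 2^X$, if $A \subseteq B$, then $g(A) \le g(B)$. The intersection, union, and complement operations follow the definitions described in [12].

Let $h : X \to [0, 1]$, and let g be a fuzzy measure over X. Then the Sugeno integral of the function h with respect to g is defined as follows:

\[
\int h(x) \circ g = \max_{1 \le i \le n} \min\bigl(h(x_{(i)}),\, g(A_i)\bigr),
\]

where $h(x_{(1)}) \ge h(x_{(2)}) \ge \dots \ge h(x_{(n)})$, and $A_i = \{x_{(1)}, \dots, x_{(i)}\}$.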
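As a rough illustration, the discrete Sugeno integral in its standard textbook form sorts the elements by decreasing reliability and takes the maximum, over the growing prefix sets, of the minimum of the reliability and the measure of the prefix. The sketch below assumes that form. The names `sugeno_integral` and `cardinality_measure`, the example reliabilities, and the measure g(A) = |A| / n are all hypothetical choices made for the example, not the paper's definitions.

```python
def sugeno_integral(h, g):
    """Discrete Sugeno integral.

    h: dict mapping each element to its reliability in [0, 1].
    g: callable mapping a frozenset of elements to a fuzzy measure in [0, 1].
    """
    ordered = sorted(h, key=h.get, reverse=True)  # decreasing reliability
    best = 0.0
    for i in range(1, len(ordered) + 1):
        a_i = frozenset(ordered[:i])  # the i most reliable elements
        best = max(best, min(h[ordered[i - 1]], g(a_i)))
    return best


def cardinality_measure(n):
    """Hypothetical fuzzy measure that depends only on subset size: g(A) = |A| / n."""
    return lambda subset: len(subset) / n


# Illustrative reliabilities for three placeholders.
h = {"x1": 0.9, "x2": 0.6, "x3": 0.3}
score = sugeno_integral(h, cardinality_measure(len(h)))
```

With these values the three candidate terms are min(0.9, 1/3), min(0.6, 2/3), and min(0.3, 1), so the integral is 0.6.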

2.2. Introduction of Easily Confused Phoneme in Spoken English

For foreign language learners, certain sets of phonemes are consistently hard to tell apart. Each such set is called an HDP (phoneme that is hard to distinguish) set. For example, most Indian speakers find it difficult to distinguish between the English phonemes /t/ and /d/, and Chinese learners of English find it difficult to distinguish between /w/ and /v/ in spoken English [13]. Mastering the pronunciation of the different HDP sets helps language learners improve both their speaking skills and their ability to understand the foreign language. Providing accurate HDP assessment results and feedback to language learners is therefore an essential requirement for speaking assessment. Speakers with different native languages usually have different HDP sets. In this paper, we discuss the case of English learners whose native language is Chinese. The established assessment model can also be applied to other non-native English speakers [14].

To keep the recognition error rate low, and given the differences between speech recognition (SR) and language learning (LL), it is inappropriate to use an existing SR framework directly to identify which phonemes the language learner pronounced [15]. This paper therefore uses a different approach.

Figure 1 describes the HDP sets used in our model. To make the figure clear, we introduce two nodes. One is the “begin” node, which occurs before the first word of the sentence is pronounced. The other is the “end” node, which occurs after the last word of the sentence is pronounced. Since the assessment model is provided with all the possible pronunciations before recognition, it can easily detect the practitioner’s actual pronunciation [16].

The HDP assessment task can be described as follows. An HDP cluster script is provided to the speaking practitioner, and the practitioner is recorded reading its sentences [17]. Then:
(1) The actual pronunciation of each HDP of the practitioner is annotated
(2) According to the standard phonetic string found in the dictionary, the proportions of correctly identified HDP and erroneously identified HDP are statistically calculated
(3) The language score and feedback are provided to the language practitioner

The most challenging problem in assessing pronunciation level from the speech of language learners is the instability of the speech-processing evaluation model [18]. To illustrate this problem, the HDP recognition results for local recordings of 1,032 sentences were statistically analyzed. Because a native speaker’s pronunciation is usually correct, the native pronunciation corpus is taken as the standard corpus in this paper, and the recognition results of the SR assessment model are obtained on this standard corpus. Table 1 shows the statistical results. The meanings of the symbols in Table 1 are as follows:
(i) represents the phoneme in the corpus set
(ii) q represents the actual recognition result for the phoneme assessment model
(iii) n represents the number of different recognition results of q
(iv) nt represents the number of occurrences of the phoneme in the corpus

2.3. Determining a Fuzzy Measure of an Easily Confused Phoneme Assessment Model

To evaluate the reliability of the assessment model, two relative measures are introduced, the correct recognition rate and the false recognition rate, which are defined as follows:

where is the number of phonemes that are correctly recognized, and is the number of phonemes that are erroneously recognized. The function is the number of phonemes being identified as the , but actually being the phonemes, . For example, in Table 1, , then:

The HDP set is an attribute set . represents the phoneme , and other placeholders can be derived. The fuzzy measure depends only on the cardinality of an attribute set. The HDP assessment also uses a fuzzy approach. There are four assessment levels: excellent, good, medium, and need to be improved (NI). A fuzzy measure expresses the degree to which a speech instance belongs to, or is better than, a given assessment level. The definitions are given as follows:
(i) Fuzzy measure of “belong to or better than ‘medium’”:
where is the length of the HDP cluster, and is the subset of attribute set composed of the HDP placeholders.
(ii) Fuzzy measure of “belong to or better than ‘good’”:
(iii) Fuzzy measure of “belong to ‘excellent’”:

Fuzzy measures of medium, good, and excellent are shown in Figure 2.

2.4. Completing the Building of the Evaluation Model

To obtain robust assessment results, we consider the credibility of different speech-processing evaluation models using the two measures introduced in Section 2.3, the correct recognition rate and the false recognition rate. For HDP assessment, the following two cases must be considered independently.
(1) If the phoneme is correctly identified, the credibility of the phoneme is defined as:
(2) If the phoneme is recognized as another phoneme that belongs to the same HDP set, rather than as the phoneme itself, the credibility of the phoneme is defined as:

3. Easily Confused Phoneme Assessment Model

Because the test set in this paper consists of recordings, the front end of the evaluation model is configured for recognition from recordings [19]. The overall architecture of the assessment model is shown in Figure 3. The speech signal is processed by the following subprocessing units: pre-emphasis, windowing, Fourier transform, Mel cepstrum filtering, discrete cosine transform, batch CMN, and feature extraction.
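Three of the front-end stages named above are simple enough to sketch in plain Python: pre-emphasis, windowing, and batch cepstral mean normalization (CMN). This sketch makes several assumptions that the text does not state: a pre-emphasis coefficient of 0.97, a Hamming window, and cepstra stored as lists of floats. The Fourier transform, Mel filtering, and discrete cosine transform stages are omitted, and the function names are hypothetical.

```python
import math


def pre_emphasis(samples, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed, commonly used value."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))
    ]


def hamming_window(frame):
    """Apply a Hamming window (an assumed window type); needs at least 2 samples."""
    n = len(frame)
    return [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]


def batch_cmn(cepstra):
    """Subtract the per-coefficient mean computed over the whole utterance."""
    dims = len(cepstra[0])
    means = [sum(vec[d] for vec in cepstra) / len(cepstra) for d in range(dims)]
    return [[vec[d] - means[d] for d in range(dims)] for vec in cepstra]
```

Batch CMN needs the whole utterance before it can normalize, which fits a front end that reads complete recordings rather than a live stream.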

The assessment model must be able to change the system’s syntax at any time, so its syntax construction has to be improved to accept new syntax dynamically. We therefore adopt and extend the “simple word list grammar” and add a method for constructing the search tree from a grammatical sentence [20].

3.1. Annotation and Recognition of Liaison in Spoken English

To enable Sphinx-4 to recognize liaison, it must first accept the liaison grammar and then build a search grid from the input syntax nodes. The liaison annotation module or the HDP annotation module expands the syntax nodes. Because the assessment model does not know the input syntax in advance, its vocabulary and its acoustic model library should be large enough. The dictionary type used is FullDictionary, as it contains more detailed phoneme information for each word. WSJ-8gau-13dCep-16k_40mel-130Hz_6800Hz.jar is used as the acoustic model library of the assessment model. The aim is to make the acoustic model library large enough that the assessment model can find the acoustic model it needs [15].

The implementation extends the original “simple word list grammar” class with the ability to generate new syntax nodes dynamically, so that the assessment model can accept the syntax nodes produced by the liaison rules. The Sphinx-4 system is then used to recognize the input speech. Because liaison is being assessed, the recognition result must be compared with the statement produced by the liaison rules to give an assessment result. For liaison, the core modules are liaison annotation and recognition result analysis.

The first is the liaison annotation. Its primary function is, for a given syntax statement, to annotate all the possible liaisons in the grammatical text according to the existing liaison rules. Moreover, all possible liaison extensions have to be generated, and the new synthetic words have to be added to the dictionary of the assessment model. In this evaluation model, this function is implemented by the “liaison marker” and “liaison rule” classes. The liaison rules are stored in a hash table in the liaison rule class. Each rule is stored with the two liaison phonemes joined by an underscore as the key and the pronunciation of the two phonemes after liaison as the value. For example, the liaison rule file contains the entry “K_AH K AH”. After the hash table is loaded, the corresponding entry has the key “K_AH”. In this manner, the liaison rules can be used. For an input statement, the liaison marker first uses the dictionary in the assessment model to look up each word’s phonetic symbols and then processes every two adjacent words in sequence according to the liaison rules. If a pair matches a liaison rule, the first word is annotated as available for liaison, for use in all liaison extensions and new word extensions. The annotated sentence is then expanded into a set of statements that covers all the liaison possibilities.
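The rule lookup described above can be sketched with a Python dict standing in for the hash table. This is an illustration under assumptions, not the paper's implementation: the rule-file layout (`<first>_<second>` followed by the linked pronunciation), the second rule line, the tiny dictionary, and the stripping of stress digits such as the `1` in `AH1` before lookup are all hypothetical.

```python
# Hypothetical rule-file lines: "<first>_<second> <pronunciation after liaison>".
RULE_LINES = ["K_AH K AH", "D_IH D IH"]

# Tiny stand-in pronunciation dictionary (word -> phoneme list); illustrative only.
DICTIONARY = {
    "link": ["L", "IH1", "NG", "K"],
    "up": ["AH1", "P"],
    "now": ["N", "AW1"],
}


def load_rules(lines):
    """Build the rule table: key is the underscore-joined phoneme pair."""
    rules = {}
    for line in lines:
        key, *pronunciation = line.split()
        rules[key] = pronunciation
    return rules


def strip_stress(phoneme):
    """Drop a trailing stress digit so 'AH1' matches a rule written with 'AH'."""
    return phoneme.rstrip("012")


def mark_liaisons(words, rules, dictionary):
    """Return the indices i for which words[i] and words[i + 1] can be linked."""
    marks = []
    for i in range(len(words) - 1):
        last = strip_stress(dictionary[words[i]][-1])
        first = strip_stress(dictionary[words[i + 1]][0])
        if f"{last}_{first}" in rules:
            marks.append(i)
    return marks


rules = load_rules(RULE_LINES)
marks = mark_liaisons(["link", "up", "now"], rules, DICTIONARY)
```

For the three-word input, only the first junction (`link` followed by `up`) matches a rule, so `marks` contains the single index 0.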

The second is the dynamic addition of words. For the assessment model to detect and identify all the possible liaisons, these liaisons must be added to its knowledge base. Suppose two words, such as “link” and “up”, are adjacent in the input statement. According to the dictionary of the assessment model, the last phoneme of the first word is /K/, and the first phoneme of the second word is /AH/. In the hash table of the liaison rules, the key is “K_AH”, and the corresponding value is “K AH”; that is, the two phonemes can be read together, and the pronunciation after liaison is “K AH”. In this way, a new word is formed and added to the dictionary. From the entries “LINK L IH1 NG K” and “UP AH1 P” in the original dictionary, “link” and “up” are merged into a new word, “link_up”, whose phonetic symbol is “L IH1 NG K AH1 P”. The acoustic statistical model corresponding to each phoneme is then associated with the phonetic symbol. In this way, new words are added dynamically to the assessment model. When the assessment model processes the feature frames of the input speech, the newly synthesized words are also scored against those frames, which means that the assessment model can recognize the liaison.
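The merging step can be illustrated as follows. The sketch treats the dictionary as a plain Python dict from word to phoneme list, and `merge_entries` is a hypothetical helper name. Associating acoustic models with the new entry is outside the scope of the example.

```python
def merge_entries(dictionary, first, second):
    """Add the compound word 'first_second'; its pronunciation is the
    phoneme list of the first word followed by that of the second."""
    compound = f"{first}_{second}"
    dictionary[compound] = dictionary[first] + dictionary[second]
    return compound


dictionary = {"link": ["L", "IH1", "NG", "K"], "up": ["AH1", "P"]}
new_word = merge_entries(dictionary, "link", "up")
new_pronunciation = " ".join(dictionary[new_word])
```

After the call, `new_word` is `link_up` and `new_pronunciation` is `L IH1 NG K AH1 P`, matching the example in the text.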

The last is the analysis of the recognition results. Although the recognition rate of the Sphinx-4 system is very high for speech with a small vocabulary, errors are unavoidable. For example, when recognizing a recording of the sentence “that should be good enough for us”, the result may be “that that should be good enough_for us”, which cannot be evaluated directly. The result must be normalized: among all liaison extension sentences, the one that best matches the recognition result is taken as the final recognition result. The algorithm to find the maximum matched statement is as follows:
(1) Obtain the recognition result and the list of all liaison extension sentences, allLiaisonSentence (string list).
(2) Set the maximum length maxLen = 1 and the matching result string resultSentence (string) to null.
(3) Take one unprocessed extension sentence from allLiaisonSentence and store it in standardSentence (string).
(4) Compare the recognition result with standardSentence. For each word or liaison cluster in the recognition result, find its location in standardSentence and store it in the integer array match.
(5) Find the longest monotonically increasing subsequence in match and denote its length as len. The elements that do not appear in this subsequence are set to −1.
(6) If len is greater than maxLen, set maxLen to len and resultSentence to standardSentence.
(7) Mark the statement in allLiaisonSentence as processed.
(8) If all the statements in allLiaisonSentence are processed, go to step (9). Otherwise, go to step (3).
(9) Output resultSentence. The algorithm ends.
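The steps above can be sketched in Python as follows. This is a simplified reading rather than the authors' code, with three assumptions: repeated words in a candidate are located by their first occurrence, words missing from the candidate are dropped instead of being recorded as −1, and the running maximum starts at 0 so that a single-word match can still be selected. Function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def max_matched_statement(recognized, all_liaison_sentences):
    """Return the candidate sentence that best matches the recognition result."""
    rec_tokens = recognized.split()
    max_len, result_sentence = 0, None
    for standard_sentence in all_liaison_sentences:
        first_index = {}
        for i, token in enumerate(standard_sentence.split()):
            first_index.setdefault(token, i)
        # Positions of the recognized tokens inside the candidate sentence.
        match = [first_index[t] for t in rec_tokens if t in first_index]
        length = lis_length(match)
        if length > max_len:
            max_len, result_sentence = length, standard_sentence
    return result_sentence, max_len


candidates = [
    "that should be good enough for us",
    "that should be good enough_for us",
    "that_should_be good_enough_for_us",
]
best, best_len = max_matched_statement(
    "that that should be good enough_for us", candidates
)
```

The duplicated `that` maps to the same position twice, so the strictly increasing subsequence keeps only one copy, and the second candidate wins with a match length of 6.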

The assessment model also adds a rejection function. When the number of words in the recognized statement is less than 60% of the number of words in the sentence, the assessment model refuses to recognize it.
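The rejection rule amounts to a single comparison. The sketch below uses integer arithmetic to avoid floating-point edge cases at the threshold. The function name and the choice to count matched words against the words of the reference sentence are assumptions made for the example.

```python
def should_reject(matched_words, expected_words, threshold_percent=60):
    """Reject when fewer than threshold_percent of the expected words are matched."""
    return matched_words * 100 < threshold_percent * expected_words
```

For a seven-word sentence, four matched words (about 57%) are rejected, while five matched words (about 71%) are accepted.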

After the maximum matched statement is found, it is compared with the statement in which all segments that can be read together are linked, to see whether each liaison and its type are recognized. The final assessment result is then obtained with the speech assessment algorithm based on the Sugeno integral.

An assessment example using a local recording is given below. The recording text is “that should be good enough for us”. The recording format is PCM_SIGNED, 16000.0 Hz, 16 bit, mono, little-endian. The input recording text serves as the grammar statement, from which the syntax nodes are generated according to the liaison rules: {that, should, be, good, enough, for, us, that_should, should_be, that_should_be, good_enough, enough_for, for_us, good_enough_for, enough_for_us, good_enough_for_us}. The original 7 grammar words are thus extended to 16 grammar nodes according to the liaison rules. The new compound words are then added to the dictionary of the assessment model to produce all possible liaison pronunciations. Each liaison pronunciation is a sequence of grammar nodes along a possible path from the “begin” node to the “end” node, with nodes separated by spaces. There are 32 such pronunciations. The assessment model then builds a search grid from the syntax nodes and recognizes the input recording. The output recognition result is “that should be good_enough_for us”. The recognition result is compared with the 32 known pronunciations, and the best match is found. To make the match more accurate, both the number and the position of the words are used in matching. For “that should be good_enough_for_us”, the shortest path of this syntax consists of two liaison groups, “that should be” and “good_enough_for_us”, with lengths 2 and 3. The first group is analyzed, where is the CC liaison type, and is the CF liaison type. The actual result of recognition is that two places are not linked when read. According to the reliability building method, . After the Sugeno integral of this liaison group, the assessment result of the liaison group is “good”. The assessment result of the second liaison group is “excellent”.
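The counts in this example (16 grammar nodes and 32 pronunciations) can be reproduced by enumerating every way of linking or not linking the junctions between adjacent words. In the sketch below, the `LINKABLE` flags are an assumption inferred from the two liaison groups in the text: every junction is linkable except the one between “be” and “good”. The function name is hypothetical.

```python
from itertools import product

WORDS = ["that", "should", "be", "good", "enough", "for", "us"]
# LINKABLE[i] is True when WORDS[i] and WORDS[i + 1] can be read together.
# The False entry (be | good) is inferred from the example's two liaison groups.
LINKABLE = [True, True, False, True, True, True]


def pronunciations(words, linkable):
    """Enumerate every sentence obtained by linking any subset of linkable junctions."""
    free = [i for i, ok in enumerate(linkable) if ok]
    results = []
    for choice in product([False, True], repeat=len(free)):
        linked = dict(zip(free, choice))
        sentence = words[0]
        for i in range(1, len(words)):
            sentence += ("_" if linked.get(i - 1, False) else " ") + words[i]
        results.append(sentence)
    return results


all_sentences = pronunciations(WORDS, LINKABLE)
grammar_nodes = {node for s in all_sentences for node in s.split()}
```

Five linkable junctions give 2^5 = 32 sentences, and the distinct space-separated tokens across them form the 16 grammar nodes (7 single words plus 9 compounds).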

3.2. Recognition and Assessment of Easily Confused Phoneme

The Sphinx-4 system by itself cannot identify easily confused phonemes, so its grammatical structure must also be extended to recognize them. The implementation is similar to that of liaison recognition, but the details are different. The main classes are HDPMarker and HDPRule.

First, new words and their forms are added to the dictionary of the assessment model. To carry out the assessment, the assessment model must determine which phoneme the spoken language practitioner actually produced and then compare the identified phoneme with the standard phoneme.

Unlike liaison extension, the unit of expansion here is limited to the phonemes of each word. For each phoneme, all the easily confused phonemes are looked up and substituted one by one, so that all the possible pronunciations are obtained, each making up a new word. For example, in “this is where I work”, the phonemes of “this” in the Sphinx-4 dictionary are “DH IH S”, while the HDP set has the rules “DH Z”, “IH IY”, and “S TH”, so the word can be expanded to 8 pronunciations. The phonetic symbols of each newly extended word are then linked to the acoustic model and added to the dictionary of the assessment model. In this manner, if the recognition result of the input speech is an extended new word, the assessment model can detect whether the speaker made an error and then give an evaluation result.
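The expansion of “this” into 8 pronunciations is a Cartesian product over the alternatives of each phoneme. A minimal sketch follows, with the three rules from the example represented as Python sets. The helper names are hypothetical, and the sketch assumes each phoneme belongs to at most one HDP set.

```python
from itertools import product

# HDP sets taken from the example rules "DH Z", "IH IY", and "S TH".
HDP_SETS = [{"DH", "Z"}, {"IH", "IY"}, {"S", "TH"}]


def alternatives(phoneme, hdp_sets):
    """All phonemes that may be confused with this one, including itself."""
    for hdp in hdp_sets:
        if phoneme in hdp:
            return sorted(hdp)
    return [phoneme]


def expand_word(phonemes, hdp_sets):
    """Every pronunciation obtained by substituting confusable phonemes."""
    options = [alternatives(p, hdp_sets) for p in phonemes]
    return [" ".join(combo) for combo in product(*options)]


variants = expand_word(["DH", "IH", "S"], HDP_SETS)
```

Each of the three phonemes has two alternatives, so `variants` holds 2 × 2 × 2 = 8 pronunciations, including the standard “DH IH S”.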

The last is the analysis of the recognition results. The recognition results of the assessment model also need to be normalized. The normalization algorithm is the same as the maximum matching algorithm used for liaison. If fewer than 60% of the words match, the assessment model refuses to recognize the statement. One difference is that the recognition result is only compared with the standard phonetic string. To judge whether two words are the same, the model does not compare spellings alone; it also takes the easily confused phoneme set into account. The phonemes in the same HDP set can be regarded as the same phoneme. Another difference is that the assessment of confusing phonemes is based on the HDP cluster. After all the sentences in the cluster have completed the maximum matching, the assessment of the cluster is carried out, and an assessment result based on linguistic variables is given.
(i) That should be good enough for us.
(ii) That’s the pleasantest part of it.
(iii) She’s his sister.
(iv) He has five thousand pounds a year.
(v) I just had to come in and tell you the news.

The HDP cluster listed above corresponds to the HDP set {/i:/, /i/}. Since the cluster assessment involves five statements, the assessment model processes them one by one. Take “she’s his sister” as an example to illustrate how the assessment model processes each sentence in the HDP cluster. First, the phonetic symbols of all the words in the sentence are looked up in the dictionary, and the phonemes of each word are joined with underscores to form a new spelling. The original phonetic symbols are attached to the new synthetic word to form a new dictionary entry, which is added to the dictionary. Then each word of the grammar statement is expanded based on the HDP set. Taking “his” as an example, it can be expanded to the phoneme forms {HH_IH_Z, HH_IY_Z, HH_IH_DH, HH_IY_DH}. Other words are expanded following the same rule.

Finally, the syntax search tree is built from all the extended syntax nodes, the input recordings are recognized, and the recognition results are obtained. The maximum match of each recognition result must still be found. “HH_IY_DH” and “HH_IH_Z” are considered the same word by the assessment model because they originate from the extension of the same word, “his”. After the maximum matched statement is found, the actual pronunciation of the phonemes in the HDP set is determined by comparing the standard statement with the recognized statement. After the assessment model has processed all the statements in the HDP cluster, a set of credibility values with a prescribed length is obtained, and the Sugeno integral of this set is computed to obtain the assessment result.
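One way to treat phonemes from the same HDP set as equal during matching is to map each phoneme to a single representative of its set before comparing. The sketch below illustrates that idea. The two HDP sets, the underscore-joined word spelling, and the helper names are assumptions for the example, not the paper's classes.

```python
# Hypothetical HDP sets for the "his" example.
HDP_SETS = [{"IH", "IY"}, {"Z", "DH"}]


def canonical(phoneme, hdp_sets):
    """Map every member of an HDP set to one representative of that set."""
    for hdp in hdp_sets:
        if phoneme in hdp:
            return min(hdp)
    return phoneme


def same_word(a, b, hdp_sets=HDP_SETS):
    """Compare two underscore-joined phoneme spellings up to HDP confusion."""
    pa, pb = a.split("_"), b.split("_")
    return len(pa) == len(pb) and all(
        canonical(x, hdp_sets) == canonical(y, hdp_sets) for x, y in zip(pa, pb)
    )
```

Under this comparison, `same_word("HH_IY_DH", "HH_IH_Z")` is true, so both spellings count as the word “his” when the maximum matched statement is computed.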

4. Experimental Analysis

To verify the effectiveness of the proposed method, simulation experiments are carried out. Groups of people with different spoken English abilities, genders, and academic qualifications are selected for the robustness and stability experiments, and the results are compared with those of the conventional assessment method. Using the assessment models for liaison and easily confused phonemes in spoken English described above, experiments on the assessment of easily confused phonemes are performed to verify the effectiveness of the assessment model. The HDP model training experiment and the validation of the assessment model are described, and the results are analyzed.

4.1. Robust Liaison Experiment and Result Analysis

The recording format of the corpus used in this experiment is PCM_SIGNED, 16000.0 Hz, 16 bit, mono, little-endian. The training corpus T1 consists of 1,032 recordings of native speakers and their scripts. The test corpus, consisting of 100 recordings of natural speech by native speakers, is denoted as T2.

The liaison experiments include model training and model validation. The former is used to train the fuzzy measure of the assessment model and evaluate the credibility. The latter is used to verify the validity of the proposed model.

The training corpus T1 is used to obtain the credibility of the assessment model. The performance of the Sugeno integral is affected by determining the fuzzy measure and evaluating the credibility of the model. T1 is also used for the closed test. T2, T3, and TN are used for development tests. First, model training is introduced, and the process of training is as follows:
(1) Linguistics experts annotate the liaison in the corpus T1, which is taken as standard liaison annotation
(2) All the speeches in T1 are batch-processed to obtain the results after the normalization
(3) The results of recognition and artificial annotation are compared to obtain the number of different liaison combinations
(4) The results of recognition and artificial annotation are compared to obtain the number of correct, false, or missing identifications of different liaison types

Tests comprise the open test and the closed test. For the open test, the language material other than the training corpus is selected to test the assessment model. For the closed test, the training corpus is used to test the assessment model.

The results of the liaison assessment model based on Sugeno integrals are shown in Table 2. The model test results are shown in Figure 4.

In Table 2, GR is the ratio of “good”, defined as (G + E)/liaison groups. Table 2 shows that for the closed test T1, the “good” and “excellent” outputs of the model account for 78% of the total. For the open test T2, the ratio is 76%. The results show that the liaison assessment model has good robustness. The open test T3 gives the same result as the T2 test. The conventional evaluation method has low overall performance: its “good” and “excellent” outputs account for only 45%. The robustness of the proposed method is thus improved by about 30%.
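The GR figure is a simple ratio, sketched below. The counts passed to the function are hypothetical values chosen only to show the arithmetic. They are not taken from Table 2.

```python
def good_ratio(excellent, good, liaison_groups):
    """GR = (G + E) / liaison groups."""
    return (good + excellent) / liaison_groups


# Hypothetical counts for illustration only; not the values in Table 2.
gr = good_ratio(excellent=30, good=48, liaison_groups=100)
```

With these made-up counts, 78 of 100 liaison groups are rated “good” or better, so `gr` is 0.78.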

4.2. Stability HDP Experiment and Result Analysis

The training corpus T1 consists of recordings and scripts of 1,032 sentences of natural speech by native speakers. The test corpus, denoted as T2, is composed of recordings of 122 HDP clusters of natural speech by native speakers. Two students were chosen: one graduated with a major in English, and the other graduated in computer science, so their speaking abilities differ. Their recordings of the same corpus are denoted as T3 and T4, respectively.

HDP experiments also include model training tests and model validation tests. The corpus T1 is used for the model training test. T2, T3, T4, and TN are used for model validation tests.

First, the model training experiment is described as follows:
(1) For the given recording script and HDP set, the HDP assessment model is used to identify the corresponding recordings and obtain the recognition results
(2) By comparing the corresponding phonemes in recognition strings and annotation strings, the statistical data of the recognition results of different phonemes are obtained
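The second training step, comparing annotation strings with recognition strings phoneme by phoneme, can be sketched with a counter over (annotated, recognized) pairs. The sketch assumes the two strings are already aligned position by position. The sample strings are made up, and the rate shown (correct recognitions divided by all occurrences of the phoneme) is one plausible reading of the correct recognition rate, not necessarily the paper's exact definition.

```python
from collections import Counter


def confusion_counts(annotated, recognized):
    """Count (annotated phoneme, recognized phoneme) pairs over aligned strings."""
    counts = Counter()
    for ann, rec in zip(annotated, recognized):
        counts.update(zip(ann.split(), rec.split()))
    return counts


def correct_rate(counts, phoneme):
    """Share of occurrences of `phoneme` that were recognized as itself (assumed definition)."""
    total = sum(n for (p, _), n in counts.items() if p == phoneme)
    return counts[(phoneme, phoneme)] / total if total else 0.0


# Made-up aligned annotation and recognition strings.
annotated = ["HH IH Z", "SH IY Z"]
recognized = ["HH IY Z", "SH IY DH"]
counts = confusion_counts(annotated, recognized)
z_rate = correct_rate(counts, "Z")
```

In this toy data, /Z/ occurs twice and is recognized correctly once, so `z_rate` is 0.5, and the pair (`IH`, `IY`) is counted once as a confusion.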

Then the model verification experiment is given. T2, T3, and T4 are used for development tests. Considering the significant difference in speaking proficiency of different people, two spoken English assessment methods are applied. The assessment results of the corpus set T2–T4 are shown in Table 3. The precision validation under different data sets and the different parameter values are shown in Figure 5.

Table 3 shows the speech evaluation results of T2, T3, T4, TA (the proposed method), and TN (conventional assessment method). From Table 3, it can be seen that the assessment results of TA agree with the expected results. The evaluation results of TN deviate from the predetermined results. By weighted arithmetic analysis, the reliability of the proposed method is increased by 45%.

In this section, the experiment of testing liaison and HDP in spoken English is given. The former used development testing and closed testing. The experimental results show that the proposed spoken English assessment method has high robustness. The latter mainly tests the stability of the algorithm. Experimental results show that the proposed spoken English assessment method is highly reliable.

5. Conclusions

This paper proposes a spoken English assessment method based on an easily confused phoneme assessment model. We design and implement the English easily confused phoneme assessment model, presenting the assessment model’s configuration and the related recognition results. Experimental results show that the proposed method is very effective. The research in this paper can provide a theoretical basis for the assessment of easily confused phonemes of spoken English.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.


Acknowledgments

This research was financially supported by Chongqing Educational Science Planning Project (No. 2020-GX-325), the Humanities and Social Sciences Research Foundation of Chongqing Municipal Education Commission (No. 18SKH142), and the Fundamental Research Funds for Yangtze Normal University (No. 2016XJQNO3).