Abstract

Error detection and accuracy estimation in automated speech recognition (ASR) systems play a vital part in the design of human-computer spoken dialogue systems, as recognition errors can prevent the system from understanding the end user's intentions. The major aim is to identify the errors in an utterance so that the dialogue manager can provide proper clarifications to the user. Therefore, the design of accurate error detection and accuracy estimation techniques becomes essential in ASR systems. With this motivation, this paper presents a novel artificial intelligence-enabled accuracy estimation and error detection technique for the English speech recognition system (AIEDAE-ESRS). The AIEDAE-ESRS technique performs three tasks: confidence estimation, out-of-vocabulary (OOV) word identification, and error type categorization. In addition, the AIEDAE-ESRS technique performs different levels of preprocessing, including sampling of the input speech signal, bandpass filtering, and noise removal. Besides, a new deep neural network with hidden Markov model- (DNN-HMM-) based speech recognition model is designed, which also serves to estimate the accuracy and the errors. Finally, the hyperparameters of the DNN-HMM model are optimally chosen by the flower pollination algorithm (FPA), thereby improving recognition performance. In order to demonstrate the performance of the AIEDAE-ESRS technique, a series of simulations were conducted and the results were examined under varying aspects. The simulation results showed the enhanced performance of the AIEDAE-ESRS methodology over current advanced approaches.
In particular, the experimental results indicated that the AIEDAE-ESRS technique produced superior results across a variety of measures.

1. Introduction

The speech signal is one of the most essential and common means of communication between people. In such communication, the speaker's emotion plays an important role in conveying meaning, in such a way that a change in emotion may result in a different interpretation of the speech [1]. Therefore, to enable effective communication between human and machine, speech emotion recognition (SER) has become a hot research topic. For the selection of important features, together with an accurate SER system, an effective way to reduce the data dimension is needed [2]. With the continued growth of science and technology, the global village is shrinking, and the usage of English has become increasingly widespread. The development of artificial intelligence systems that can understand English speech will significantly enrich human life and work [3]. The SER system is built on CNNs and RNNs that have been trained on a database of emotional speech. Our primary objective is to offer an SER approach that is based on concatenated CNNs and RNNs and does not rely on any typical hand-crafted features. The literature on SER has employed a variety of approaches to extract emotions from signals, including numerous well-known speech analysis and classification techniques. Recently, deep learning approaches have been presented as a possible replacement for classic SER techniques. Language interaction and intelligent English speech recognition systems (SRS) affect people's study and work life and have extensive application in areas such as language promotion, the military, and education. At present, there are multiple implementation methods and system designs for SRS.
SRS can be classified in different ways, primarily into specific- and nonspecific-person SRS, continuous and isolated word SRS, embedded and server SRS, and small vocabulary and large vocabulary SRS. In everyday life, natural speech pauses only where the speaker needs to break at the end of a sentence or at a punctuation mark, and the other parts are pronounced continuously [4].

In earlier SRS, isolated word phonetic systems were based primarily on single words or characters [5]. Depending on the way the acoustic model is developed, SRS can be separated into specific- and nonspecific-person recognition. Specific-person recognition implies that the user needs to input a large number of pronunciations and train the recognizer in advance. Nonspecific-person recognition means that, once the system has been developed, the user does not need to input any training data beforehand and can use the recognizer directly [6]. The deep learning (DL) method has different areas of application, and several achievements have been reported. Another area where DL is effectively used is automated SRS, in which better language and acoustic models are integrated [7]. SRS problems involve time-series data. In several fields, such as read continuous speech, where the speech is usually recorded under clean conditions, the outcomes are satisfactory, with an error rate under 5%. However, in other fields with high speech variability, like distant conversational speech (meetings) or video speech, the outcomes are still not satisfactory, exhibiting an error rate of around 50% [8].

To handle these problems and improve the performance of inaccurate ASR systems, the automated detection and correction of transcript errors may be the only choice in some cases [9], especially when tuning the ASR system itself is impossible (for example, the system is purchased as a black box) or when manual correction is inconvenient or even impractical, as in the case where the transcriptions are not the ultimate objective of the system (for example, question answering, machine translation, and information retrieval systems). In that respect, ASR error classification and detection are also called confidence estimation [10]. The most commonly studied method is feature-based, where a classifier is built on features generated from distinct sources (that is, decoder and nondecoder features) to differentiate the correctly recognized words from the incorrectly recognized ones.

This paper presents a novel artificial intelligence-enabled accuracy estimation and error detection technique for the English speech recognition system (AIEDAE-ESRS). The AIEDAE-ESRS technique performs three tasks: OOV word identification, confidence estimation, and error type categorization. Furthermore, the architecture of the AIEDAE-ESRS technique incorporates a deep neural network with hidden Markov model- (DNN-HMM-) based speech recognition model, and the flower pollination algorithm (FPA) is used to fine-tune the hyperparameters of the DNN-HMM model. The FPA is a nature-inspired metaheuristic algorithm that replicates the pollination activities of flowering plants. The implementation of several FPA variants based on tweaks, parameter adjustment, and hybridization with other algorithms is addressed in this article. The design of the FPA for hyperparameter optimization of the DNN-HMM model constitutes the novelty of the work. The experimental result analysis of the AIEDAE-ESRS technique is carried out on a benchmark dataset, and the results are investigated under several aspects.

2. Literature Review

In Alhamada et al. [11], the usage of DL in SRS was examined and an appropriate DL framework was identified. A technique using a CNN is employed to improve the efficiency of SRS. Han et al. [12] examined the efficacy of different DL-based acoustic models for conversational telephone speech, especially CNN-bLSTM, bLSTM, and TDNN systems. They evaluated these models on research test sets, such as Switchboard and CallHome, and on recordings from real-time call center applications. Yenke et al. noted that, owing to the large variety of applications, interfaces, and computing equipment that can enable speech processing, automatic speech recognition (ASR) is a very active research subject. In most applications, well-resourced languages are far better represented than underresourced languages. ASR may also be used to support the languages of illiterate populations. Starting with a small vocabulary is one way to construct an ASR system for underresourced languages. ASR with a limited vocabulary recognizes a small set of words or sentences.

Grozdić et al. [13] extended a method for whispered SRS, which is among the most difficult challenges in ASR. Specifically, because of the profound differences between the acoustic features of whispered and neutral speech, the efficiency of a conventional ASR system trained on neutral speech drops greatly once whisper is used. Misbullah et al. [14] investigated the efficiency of SRS for dysarthric speakers using a time delay DNN. Furthermore, they examined the system performance when integrating dysarthric and normal speech corpora. Lastly, a well-tuned set of hyperparameters for the DNN structure gave promising outcomes on English and Mandarin dysarthric speech.

Ogawa and Hori [15] explored three kinds of ASR error detection tasks, that is, OOV word recognition, error type classification (ETC), and confidence estimation, and also evaluated the detection rate from the ETC result. The simulation results show that the DBRNN considerably outperforms the conditional random field (CRF). Ogawa et al. [16] presented a detection accuracy estimation method based on ETC. ETC is an extension of confidence estimation. In ETC, all the words in the recognition output (recognized word sequence) for the targeted speech data are categorized into three classes: insertion error (I), substitution error (S), and correct recognition (C).

3. Materials and Methods

In this study, an effective AIEDAE-ESRS technique has been developed for error detection and accuracy estimation in SRS. The AIEDAE-ESRS technique involves three major processes, namely, preprocessing, DNN-HMM-based speech signal recognition, and FPA-based hyperparameter tuning. The use of the FPA helps to properly tune the hyperparameters of the DNN-HMM model, which significantly boosts the detection performance. Figure 1 demonstrates the overall working procedure of the proposed AIEDAE-ESRS technique.

3.1. Level I: Speech Signal Preprocessing

The speech input is the original voice signal gathered by the recording device. The preprocessing stage chiefly consists of three steps: sampling the original input voice signal, antialiasing bandpass filtering, and removing the effect of noise. Feature extraction then derives the acoustic parameters that reflect the key characteristics of the speaker's voice, primarily the short-term average zero-crossing rate, cepstrum, short-term energy, and linear prediction coefficients. In the training phase, the feature parameters are processed to establish a reference model (the reference templates). In the recognition phase, the speech feature parameters are obtained and a test template is built; the test template is matched against the reference templates according to discriminative rules (i.e., semantic and grammar rules), and the best-matching reference template is taken as the recognition outcome. The quality of the match is closely associated with the matching templates, the quality of the speech feature parameters, and the speech technique.
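Two of the features named above, short-term energy and short-term zero-crossing rate, can be illustrated with a minimal Python sketch. The frame length (25 ms), hop (10 ms), sampling rate (8 kHz), and the synthetic test tone are illustrative assumptions; the paper does not state the values it used.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample list into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Hypothetical input: one second of a 100 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]

# 25 ms frames (200 samples) with a 10 ms hop (80 samples): assumed values.
frames = frame_signal(tone, frame_len=200, hop=80)
features = [(short_term_energy(f), zero_crossing_rate(f)) for f in frames]
print(len(frames), features[0])
```

In a fuller pipeline these per-frame values would sit alongside cepstral and linear prediction features; they are shown alone here only to keep the sketch short.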

3.2. Level II: Design of DNN-HMM-Based Speech Signal Recognitions

In traditional GMM-HMM-based recognition, the observation probability is modelled by a GMM under the maximum likelihood criterion. Such models are limited because the GMM is statistically inefficient at modelling data that lie on or near a nonlinear manifold in the data space. To overcome this limitation, we propose a DNN-HMM method for speech recognition, in which the outputs of the DNN are given to the HMM as a substitute for the GMM. The GMM models the probability distribution of the observed feature vector given a phone. It provides a principled way to measure the “distance” between a phone and the audio frame being observed. The GMM is a probabilistic model capable of representing normally distributed subpopulations; all of its components are Gaussian distributions. A hidden Markov model (HMM) is a kind of statistical Markov model. When the data are continuous, a Gaussian distribution is used to represent each hidden state.

3.2.1. Overview of Hidden Markov Models

The HMM is a statistical Markov model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states. An HMM, denoted by $\lambda = (A, B, \pi)$, contains the following elements:
(1) The number of states in the system, represented as $N$, the set of states, represented by $S = \{s_1, \ldots, s_N\}$, and the state at time $t$, $q_t$
(2) $A = \{a_{ij}\}$, the state transition probability distribution
(3) $B = \{b_j(o)\}$, the observation probability, in which $b_j(o)$ signifies the likelihood of observing $o$ at state $s_j$ and is expressed as a finite mixture:
$$b_j(o) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o;\, \mu_{jm}, U_{jm})$$
Let $c_{jm}$ be the mixture coefficient for the $m$th mixture in state $j$, and let $\mathcal{N}$ be an elliptically symmetric or log-concave density with covariance matrix $U_{jm}$ and mean vector $\mu_{jm}$ for the $m$th mixture component in state $j$
(4) $\pi = \{\pi_i\}$, the initial state distribution, in which $\pi_i = P(q_1 = s_i)$

In order to apply the HMM, two problems must be resolved:
(i) Learning problem: given a collection of ground-truth observations (the training set), the learning process finds the set of parameters $\lambda$ that best fits the training set. The forward-backward method is used to compute $\lambda$ [17]
(ii) Decoding problem: given the parameters $\lambda$ and a sequence of new observations $O = (o_1, \ldots, o_T)$ (the testing set), the decoding process is defined as
$$Q^{*} = \arg\max_{Q} P(Q, O \mid \lambda) \tag{3}$$
where $Q = (q_1, \ldots, q_T)$ is a state sequence.

In the case of speech recognition, one HMM $\lambda_c$ is trained for each discrete class $c$. For a novel speech input $O$,
$$c^{*} = \arg\max_{c} P(O \mid \lambda_c) \tag{4}$$
with $P(O \mid \lambda_c)$ estimated from the Viterbi algorithm.
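The decoding problem described above can be sketched with a log-domain Viterbi routine. This is a minimal illustration, not the authors' implementation: the two-state model, its discrete emission table, and the observation sequence are hypothetical toy values, whereas the paper uses continuous observation densities.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Return (best_state_path, best_log_probability).

    obs_loglik[t][j] = log-likelihood of the observation at time t in state j
    log_trans[i][j]  = log transition probability from state i to state j
    log_init[j]      = log initial probability of state j
    """
    n_states = len(log_init)
    delta = [log_init[j] + obs_loglik[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(obs_loglik)):
        new_delta, ptr = [], []
        for j in range(n_states):
            # Best predecessor state for landing in state j at time t.
            best_i = max(range(n_states), key=lambda i: delta[i] + log_trans[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j] + obs_loglik[t][j])
        delta = new_delta
        backptr.append(ptr)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical two-state toy model with three discrete observation symbols.
log = math.log
init = [log(0.6), log(0.4)]
trans = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # emit[state][symbol]
obs = [0, 1, 2]
obs_loglik = [[log(emit[j][o]) for j in range(2)] for o in obs]

path, logp = viterbi(obs_loglik, trans, init)
print(path, math.exp(logp))
```

For class-level recognition, the same routine would be run once per class model and the class with the highest returned score selected.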

3.2.2. Structure of DNN-HMM Model

The main difference between GMM-HMM and DNN-HMM is the use of a DNN (rather than a GMM) to evaluate the observation probability. We employ the DNN to model $p(s \mid o)$, the posterior probability of the state $s$ given the observation vector $o$, which is feasible, while the state prior $p(s)$ is easy to estimate from an initial state-level alignment of the training set. Figure 2 depicts the framework of the DNN technique.

3.3. DNN-HMM Training Process

The complete training procedure for recognition is given below:
(a) For each class $c$, a GMM-HMM with $N$ states is trained on the training sentences of that class, giving the parameters $\lambda_c$
(b) For each sentence $O$ in the training set of class $c$, the Viterbi algorithm of the GMM-HMM as per Equation (3) is executed on $O$ to obtain the optimal state sequence $Q^{*}$, and every state is assigned a label [18]
(c) Each training sentence and its labelled state sequence are used as input to train a DNN, whose output is the posterior probability of each output unit. The DNN is trained by the back-propagation (BP) algorithm using (i) unsupervised pretraining or (ii) discriminative pretraining

3.4. DNN-HMM Recognition Process

In the recognition procedure, for an input sentence $O$, one must evaluate the likelihood $P(O \mid \lambda_c)$ for all the classes $c$ and obtain the final recognition result as per Equation (4). In GMM-HMM, this likelihood is obtained by the Viterbi algorithm using Equation (3).

In DNN-HMM, the following process is adopted to estimate the likelihood $P(O \mid \lambda_c)$:
(a) The input sequence $O = (o_1, \ldots, o_T)$ is first fed to the DNN, which produces posterior probabilities as output. Next, the state posteriors $p(s \mid o_t)$ are read from the DNN output units $y_k(o_t)$ by mapping each output label $k$ to its class $c$ and state $s$, as follows:
$$p(s \mid o_t) = y_k(o_t), \quad k = \mathrm{label}(c, s)$$
(b) As per Bayes' rule, the likelihood $p(o_t \mid s)$ is estimated as
$$p(o_t \mid s) = \frac{p(s \mid o_t)\, p(o_t)}{p(s)} \tag{5}$$
In this process, the prior probability $p(s)$ of each state is estimated from its frequency of occurrence in the training set, and $p(o_t)$ is assigned a constant because the feature vectors are considered independent of one another [19]
(c) For each model $\lambda_c$, the Viterbi algorithm is executed to estimate the likelihood $P(O \mid \lambda_c)$, with the observation likelihood replaced by $p(o_t \mid s)$ estimated by Equation (5)
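The Bayes conversion in steps (a) to (c) can be sketched as follows: state priors are counted from the aligned training labels, and each DNN posterior is divided by its prior, which in the log domain is a subtraction. The function names, the probability floor, and the toy numbers are assumptions made for this illustration; the constant term for the observation probability is dropped because it does not affect the Viterbi arg max.

```python
import math

def state_priors(alignment_labels, n_states):
    """Estimate the prior of each state from its relative frequency in the aligned training labels."""
    counts = [0] * n_states
    for s in alignment_labels:
        counts[s] += 1
    total = len(alignment_labels)
    return [c / total for c in counts]

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame state posteriors into log-likelihoods up to an additive constant.

    Each entry is log p(s | o_t) - log p(s); the frame-level constant log p(o_t) is omitted.
    The floor (an assumed safeguard) avoids log(0) for unseen states.
    """
    return [[math.log(max(p, floor)) - math.log(max(q, floor))
             for p, q in zip(frame, priors)]
            for frame in posteriors]

# Hypothetical alignment: state 0 occupies three of four training frames.
priors = state_priors([0, 0, 0, 1], n_states=2)

# Hypothetical DNN posteriors for two frames.
posteriors = [[0.75, 0.25], [0.5, 0.5]]
loglik = scaled_log_likelihoods(posteriors, priors)
print(priors, loglik)
```

The resulting matrix has the same shape as a table of per-frame, per-state observation log-likelihoods, so it can be passed directly to a Viterbi decoder in place of the GMM scores.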

3.5. Level III: Design of FPA-Based Hyperparameter Tuning

The FPA is used to effectively tune the hyperparameter settings of the DNN-HMM model. Abiotic pollination is also incorporated into the flower pollination approach for optimization. Pollination is a complex process in plant reproduction. A pollen gamete or flower corresponds to a candidate solution of the optimization problem. The advantages of the FPA are listed below. Unlike GA, HS, and PSO, the FPA provides a simple flower analogy with few, computationally lightweight control parameters (that is, the switch probability $p$). It also balances diversification and intensification of solutions through Lévy flights (random walks punctuated by larger leaps) and the switch probability, which is used to transition between intensive local search and global search.

Flower constancy is interpreted in terms of how similar or different two solutions are. In global pollination, a pollinator carries pollen over a long distance toward a fitter location. In local pollination, pollen is transferred within a small neighbourhood of a single flower [20]. Whether global or local pollination is carried out is decided by a probability known as the switch probability; when global pollination is not selected, local pollination is performed instead. Pollinators such as bees are vital to the sexual reproduction of about ninety percent of wildflowers. Ecosystems depend on these plants to function, since they provide food, shelter, and other resources for many animal species, including humans. In the FPA technique, the following four rules are used (also shown in Algorithm 1):
(1) Biotic and cross-pollination are regarded as global pollination, in which the pollen-carrying pollinators perform Lévy flights (LF)
(2) Abiotic and self-pollination are regarded as local pollination
(3) Pollinators such as insects are able to develop flower constancy, which is defined as a reproduction probability proportional to the similarity of the two flowers involved
(4) The switching between local and global pollination is controlled by the switch probability $p \in [0, 1]$

Algorithm 1: Pseudocode of the flower pollination algorithm.
Input: Objective function
Initialize a population of n flowers/pollen gametes with random solutions;
Determine the best solution in the initial population;
Define a switch probability p ∈ [0, 1];
while (t < Maximum_iterations) do
 for i = 1:n (every flower in the population) do
  if rand < p then
   Draw a (d-dimensional) step vector L from a Lévy distribution;
   Apply global pollination;
  else
   Draw ε from a uniform distribution in [0, 1];
   Apply local pollination;
  end
  Evaluate the new solution;
  If the new solution is better, update it in the population;
 end
 Determine the current best solution;
end

Therefore, the first and third rules are given by
$$x_i^{t+1} = x_i^{t} + \gamma\, L(\beta)\, (g^{*} - x_i^{t})$$
in which $x_i^{t}$ is the pollen vector (solution) $i$ at iteration $t$; $g^{*}$ indicates the current best solution among all solutions of the present generation; $\gamma$ indicates the scaling factor that controls the step size; and $L(\beta)$ denotes the pollination strength, which corresponds to a step size drawn from a Lévy distribution with exponent $\beta$. The LF is a random walk whose step lengths follow the Lévy probability distribution, which has infinite variance. Accordingly, $L$ follows a Lévy distribution:
$$L \sim \frac{\beta\, \Gamma(\beta)\, \sin(\pi \beta / 2)}{\pi} \cdot \frac{1}{s^{1+\beta}}, \quad s \gg s_0 > 0$$
in which $\Gamma(\beta)$ is the standard gamma function.

In the case of local pollination, the second and third rules are formulated as
$$x_i^{t+1} = x_i^{t} + \varepsilon\, (x_j^{t} - x_k^{t})$$
in which $x_j^{t}$ and $x_k^{t}$ are pollens from different flowers of the same plant species. If $x_j^{t}$ and $x_k^{t}$ come from the same species and are chosen from the same population, this is equivalent to a local random walk, with $\varepsilon$ drawn from a uniform distribution in $[0, 1]$ [21].
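The global and local pollination updates, together with the loop of Algorithm 1, can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the authors' code: Lévy steps are drawn with Mantegna's algorithm (a common sampler that the paper does not name), candidate solutions are clipped to box bounds, and the default parameter values and the sphere test objective are hypothetical stand-ins for the DNN-HMM hyperparameter search.

```python
import math
import random

def levy_step(beta, rng):
    """Draw one heavy-tailed step using Mantegna's algorithm (assumed sampler)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / max(abs(v), 1e-12) ** (1 / beta)

def flower_pollination(objective, bounds, n_flowers=20, p_switch=0.8,
                       gamma=0.1, beta=1.5, max_iter=200, seed=0):
    """Minimize `objective` over the box `bounds`; returns (best_solution, best_fitness)."""
    rng = random.Random(seed)

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    # Random initial population and its best member.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_flowers)]
    fit = [objective(x) for x in pop]
    best_i = min(range(n_flowers), key=fit.__getitem__)
    best, best_fit = pop[best_i][:], fit[best_i]

    for _ in range(max_iter):
        for i in range(n_flowers):
            if rng.random() < p_switch:
                # Global pollination: Levy-scaled move relative to the current best.
                cand = [x + gamma * levy_step(beta, rng) * (g - x)
                        for x, g in zip(pop[i], best)]
            else:
                # Local pollination: move along the difference of two random flowers.
                eps = rng.random()
                j, k = rng.sample(range(n_flowers), 2)
                cand = [x + eps * (a - b) for x, a, b in zip(pop[i], pop[j], pop[k])]
            cand = clip(cand)
            f = objective(cand)
            if f < fit[i]:  # keep the new solution only if it improves
                pop[i], fit[i] = cand, f
                if f < best_fit:
                    best, best_fit = cand[:], f
    return best, best_fit

# Hypothetical objective standing in for the classification error rate.
def sphere(x):
    return sum(v * v for v in x)

best, err = flower_pollination(sphere, [(-5.0, 5.0)] * 3)
print(best, err)
```

For hyperparameter tuning, `objective` would train or evaluate the recognizer with the candidate settings and return its error rate; that step is omitted here because it depends on the model and data.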

The fitness function (FF) plays an important part in the optimization problem. It returns a positive value that represents how good a candidate solution is. In this work, the classification error rate is taken as the FF to be minimized: poorer solutions have higher fitness scores (error rates), and better solutions have lower ones.

4. Experimental Validation

In this study, the experimental result analysis of the AIEDAE-ESRS technique is carried out on the MIT lecture English speech corpus, referred to as the MIT dataset [22]. The MIT corpus includes speech data from invited talks and regular university classes. The length of a lecture ranges between 45 and 90 minutes. First, the error detection results of the AIEDAE-ESRS technique under deletion error detection, confidence estimation, OOV word detection, and CSI (correct/substitution/insertion) classification are given in Table 1.

Figure 3 presents the comparative result analysis of the AIEDAE-ESRS technique with existing methods under confidence estimation. The figure shows that the AIEDAE-ESRS technique achieves effective outcomes with increased values of accuracy, average F-score (AFS), and normalized cross entropy (NCE). The CRF and DNN models show the lowest performance, with the smallest values of accuracy, NCE, and AFS. The DURNN, DULSTM, and DBRNN techniques yield moderately close accuracy, NCE, and AFS values. However, the AIEDAE-ESRS technique outperforms the other techniques with higher accuracy, NCE, and AFS of 0.8819, 0.4410, and 0.8302, respectively.
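For readers unfamiliar with NCE, the sketch below computes it under the widely used definition for word confidence scores, in which the cross entropy of the confidences is compared with that of a baseline that assigns every word the average correctness rate. That the paper uses exactly this definition is an assumption, and the confidence values are hypothetical.

```python
import math

def normalized_cross_entropy(confidences, is_correct):
    """NCE = (H_max - H_conf) / H_max for per-word confidences in (0, 1).

    H_max is the cross entropy of a baseline that gives every word the
    empirical correctness rate; H_conf is that of the actual confidences.
    """
    n = len(confidences)
    n_correct = sum(1 for ok in is_correct if ok)
    if n_correct == 0 or n_correct == n:
        raise ValueError("NCE is undefined when all words are correct or all are wrong")
    p_c = n_correct / n
    h_max = -(n_correct * math.log2(p_c) + (n - n_correct) * math.log2(1 - p_c))
    h_conf = -sum(math.log2(c) if ok else math.log2(1 - c)
                  for c, ok in zip(confidences, is_correct))
    return (h_max - h_conf) / h_max

# Hypothetical recognition output: three correct words and one error.
labels = [True, True, True, False]
uninformative = normalized_cross_entropy([0.75, 0.75, 0.75, 0.75], labels)
informative = normalized_cross_entropy([0.99, 0.99, 0.99, 0.01], labels)
print(uninformative, informative)
```

Under this definition, a value near 1 indicates confidences that track correctness closely, a value near 0 indicates confidences no better than the baseline, and negative values indicate confidences that are worse than the baseline.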

Figure 4 presents the comparative result analysis of the AIEDAE-ESRS technique with existing methods under OOV word detection. The figure shows that the AIEDAE-ESRS technique achieves effective outcomes with increased values of accuracy, NCE, and AFS. The CRF and DNN models show the lowest performance, with the smallest values of accuracy, NCE, and AFS. The DURNN, DULSTM, and DBRNN techniques yield moderately close accuracy, NCE, and AFS values. However, the AIEDAE-ESRS technique outperforms the other techniques with higher accuracy, NCE, and AFS of 0.9708, 0.3720, and 0.7497, respectively.

Figure 5 presents the comparative analysis of the AIEDAE-ESRS technique with existing methods under CSI classification. The figure shows that the AIEDAE-ESRS technique achieves effective outcomes with increased values of accuracy, NCE, and AFS. The CRF and DNN models show the lowest performance, with the smallest values of accuracy, NCE, and AFS. The DURNN, DULSTM, and DBRNN techniques yield moderately close accuracy, NCE, and AFS values. However, the AIEDAE-ESRS technique outperforms the other techniques with higher accuracy, NCE, and AFS of 0.8579, 0.4120, and 0.6796, respectively.

Figure 6 presents the comparative analysis of the AIEDAE-ESRS technique with existing methods under deletion error detection. The figure shows that the AIEDAE-ESRS technique achieves effective outcomes with increased values of accuracy, NCE, and AFS. The CRF and DNN models show the lowest performance, with the smallest values of accuracy, NCE, and AFS. The DURNN, DULSTM, and DBRNN techniques yield moderately close accuracy, NCE, and AFS values.

Figure 7 shows the accuracy graph of the AIEDAE-ESRS technique on the MIT speech recognition test dataset. The figure shows that the AIEDAE-ESRS technique reaches improved training and validation accuracy as the number of epochs increases. It is also noticed that the training accuracy is lower than the validation accuracy.

Figure 8 shows the loss graph of the AIEDAE-ESRS technique on the MIT speech recognition test dataset. The figure shows that the AIEDAE-ESRS technique attains decreasing training and validation loss as the number of epochs rises. It is also noticed that the training loss is higher than the validation loss.

Finally, a brief RMSE analysis of the AIEDAE-ESRS technique under distinct training data sizes (TS) is given in Figure 9 and Table 2 [23]. The results show that the AIEDAE-ESRS technique attains improved performance, with lower RMSE values than the CRF and DBRNN techniques. For instance, with a TS of 10%, the AIEDAE-ESRS technique obtains a lower RMSE of 1.12%, whereas the CRF and DBRNN techniques attain higher RMSE of 2.21% and 1.81%, respectively. With a TS of 40%, the AIEDAE-ESRS technique attains an RMSE of 0.93%, while the CRF and DBRNN techniques attain 1.99% and 1.80%, respectively. With a TS of 60%, the AIEDAE-ESRS technique obtains an RMSE of 0.89%, whereas the CRF and DBRNN techniques attain 1.98% and 1.75%, respectively. With a TS of 80%, the AIEDAE-ESRS technique obtains an RMSE of 1.03%, while the CRF and DBRNN techniques attain 2.01% and 1.77%, respectively. Finally, with a TS of 100%, the AIEDAE-ESRS technique obtains an RMSE of 1.16%, while the CRF and DBRNN techniques attain 2.01% and 1.73%, respectively.
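As a reminder of how such an RMSE is computed, a minimal sketch follows. It assumes the RMSE is taken between estimated and reference recognition accuracies over a set of test items, which the text does not spell out; the numbers are hypothetical and are not taken from Table 2.

```python
import math

def rmse(estimated, reference):
    """Root mean square error between two equally long sequences of values."""
    if len(estimated) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-lecture accuracies in percent (estimated vs. reference).
estimated_acc = [81.0, 76.5, 90.2]
reference_acc = [80.0, 78.0, 89.7]
print(rmse(estimated_acc, reference_acc))
```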

From the abovementioned figures and tables, it is evident that the AIEDAE-ESRS methodology achieves the best performance among the compared techniques.

5. Conclusion

In this study, an effective AIEDAE-ESRS technique has been developed for accuracy estimation and error detection in speech recognition models. The AIEDAE-ESRS technique involves three major processes, namely, preprocessing, DNN-HMM-based speech signal recognition, and FPA-based hyperparameter tuning. The use of the FPA helps to properly adjust the hyperparameters of the DNN-HMM model, which greatly increases the detection performance. The experimental result analysis of the AIEDAE-ESRS technique was carried out on a benchmark dataset, and the results were investigated under several aspects. The simulation results showed the improved performance of the AIEDAE-ESRS methodology over recent approaches on various measures. With accuracy, NCE, and AFS values of 0.9921, 0.2640, and 0.6909, respectively, the AIEDAE-ESRS system outperformed the other techniques. With a TS of 100%, the AIEDAE-ESRS technique achieved a reduced root mean square error of 1.16%, whereas the CRF and DBRNN systems achieved higher root mean square errors of 2.01% and 1.73%, respectively. In the future, the performance of the AIEDAE-ESRS technique can be further improved by advanced DL models for speech recognition.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.