Abstract

Speech recognition technology is a multidisciplinary field comprising signal processing, pattern recognition, acoustics, artificial intelligence, and related areas. At present, speech recognition plays a vital role in the human-computer interface in information technology. Owing to advances in deep learning (DL) models, speech recognition systems have received significant attention among researchers in several application areas such as mobile communication, voice recognition, and personal digital assistance. This paper presents an automated English speech recognition using dimensionality reduction and deep learning (AESR-DRDL) approach. The proposed AESR-DRDL technique involves a series of operations, namely, preprocessing, feature extraction, dimensionality reduction, and speech recognition. During the feature extraction process, a hybridization of high-dimensional rich feature vectors is derived from the speech and glottal-waveform signals by the use of MFCC, PLPC, and MVDR techniques. The high dimensionality of the features is then reduced by the design of a quasioppositional poor and rich optimization algorithm (QOPROA). Moreover, the Bidirectional Long Short-Term Memory (BiLSTM) technique is employed for speech recognition, and the optimal hyperparameters of the BiLSTM technique are chosen using the Adagrad optimizer. The performance of the AESR-DRDL technique is validated against benchmark datasets, and the results report better performance than recent approaches, including a superior average running time of 0.50 days. Because of this, the AESR-DRDL approach can be used to recognize English speech.

1. Introduction

Voice is widely employed and is considered one of the most significant carriers of data when interacting with people. Voice recognition techniques permit machines to translate human voice signals into corresponding commands through understanding and recognition [1]. When people want to convey information, they use voice signals that carry the data. The speech signal comprises data about the speaker's personal identity and semantic content. The basis of speech recognition is that all speakers have distinct characteristics owing to their pronunciation and unique vocal tract features [2]. Speech recognition technology is a cross-discipline spanning pattern recognition, signal processing, artificial intelligence, and the mechanisms of hearing and sound production. Speech recognition has gradually become a basic technique of human-computer exchange in information technology, and with growing continuous-speech recognition rates, speech input is becoming an essential form of computer input [3].

Speech recognition is the method of translating speech communication into text transcripts. With the growth of computation power and the accessibility of voice-based interaction with computers, speech recognition systems have attracted growing interest. Speech-to-text and text-to-speech schemes are commonly employed in various applications, namely, personal digital assistance systems, mobile communication, and search engines [4]. However, speech recognition faces a number of problems. First, only a limited number of languages are supported by existing speech recognition systems. Second, speech recognition systems have to deal with variation in the data, such as gender and accent [5]. Third, the speech signal is subject to several distorting factors, namely, microphone quality, environment, and background noise. Achieving excellent recognition performance therefore necessitates a careful selection of discriminative features. Since deep neural networks can efficiently extract robust latent features that allow various recognition algorithms to demonstrate strong generalization capabilities under a wide range of application conditions, deep learning techniques have recently attracted increasing attention in the machine-learning community. Speech recognition software makes it possible to understand what someone is saying: it is the responsibility of ASR software to recognize human speech and convert it into text.

To address these challenges, deep learning (DL) architectures have gained much recognition in speech recognition research. The DL method is a subdivision of machine learning (ML) that employs a group of processes to model higher-level abstractions through a deep graph with multiple processing layers consisting of many linear and nonlinear transformations [6]. Speech recognition, or voice-to-text technology, is the capability of a system or software program to identify words spoken audibly and turn them into legible text. The DL method offers automated selection and ranking of features in the datasets with effective algorithms. In recent times, DL has attracted huge interest and achieved notable successes in application fields such as NLP, image processing, speech recognition, and sequence alignment. With the high performance and recent popularity of DL frameworks, authors have begun to adapt these architectures to speech recognition problems; RNNs and CNNs are the most preferred architectures utilized in speech recognition [7, 8].

This research develops a novel automated English speech recognition with dimensionality reduction and deep learning (AESR-DRDL) approach. The proposed AESR-DRDL technique designs the feature extraction process using a hybridization of high-dimensional rich feature vectors derived from the speech and glottal-waveform signals by the use of MFCC, PLPC, and MVDR techniques. In addition, the quasioppositional poor and rich optimization algorithm (QOPROA) is used for dimensionality reduction. Finally, the Adagrad optimizer with the BiLSTM model is applied for the recognition of speech signals.

2. Literature Review

This section offers a review of recently developed speech recognition models. In [9], three methods are examined to enhance speech recognition on Mandarin-English code-switching tasks. First, multitask learning (MTL) is presented, which allows language identity data to assist Mandarin-English code-switching ASR. Obtaining confidence scores from automatic speech recognition (ASR) systems is extremely important for downstream applications, and a number of recent studies have advocated the use of neural networks to learn confidence scores for words or utterances in end-to-end ASR; those investigations show that word confidence alone does not model deletions, while utterance confidence does not exploit word-level training signals. Second, word pieces are examined, as opposed to graphemes, as English modelling units to reduce the modelling-unit gap between English and Mandarin. Even in nonphonetic languages such as English, phoneme-based models consistently beat grapheme-based models in conventional speech recognition techniques, although the performance gap shrinks as more training data is collected. Third, a transfer learning (TL) method is used to exploit large amounts of monolingual English and Mandarin data, compensating for the data sparsity of code-switching tasks. TL is used in hybrid ASR systems to transfer knowledge from one language to another: the encoders and/or prediction networks for the destination language can be pretrained with the source language's models, which then serve to initialize the target-language acoustic model and the encoder and prediction networks.

Weng et al. [10] presented an attention-based sequence-to-sequence method for end-to-end speech recognition. They first proposed an input-feeding framework that feeds the previous decoder hidden state and context vector as input to the decoder. Next, they proposed a hypothesis generation scheme for sequential minimum Bayes risk (MBR) training of the sequence-to-sequence model, in which softmax smoothing is introduced into N-best generation during MBR training. Jiao et al. [11] proposed a DBN-SVM model to detect and classify pronunciation errors; the model corrects the errors and scores pronunciation quality. This method was then extended to a speech assessment mode, and various experiments were conducted to test features including the real-timeliness of recognition, the accuracy of pronunciation classification and error detection, and the recognition rate across distinct vocabularies and environments.

Sujatha et al. [12] designed a system for Speech Analysis using a Lexical Analyzer (SALA). This method is utilized in various fields of interest such as employment, teaching, assessing one's dexterity in English vocabulary, and communication skills; it takes as input audio comprising English speech. In [13], the linear prediction coding coefficient extraction technique is employed for summarizing information about English digit pronunciation. After extraction, the dataset is employed to train an ENN to identify the relations between the linear coding coefficients of an audio file and the pronounced digit.

Lin et al. [14] focused on robust speech recognition in air traffic control (ATC) by developing a processing model that integrates multilingual speech recognition into a single architecture using three cascaded models: an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The AM translates ATC speech into phoneme-based text sequences that the PM later converts to word-based sequences, which is the final objective of the study. In [15], a feature representation learning architecture is proposed that combines various extracted feature representations with Compact Bilinear Pooling (CBP), automatic speech recognition (ASR), a DNN as feature extractor, and final inference through optimized RNN classifiers.

3. The Proposed Model

In this study, an effective AESR-DRDL technique has been developed for the recognition of English speech signals. The proposed AESR-DRDL technique incorporates several stages of operation: preprocessing, feature extraction, QOPROA-based feature selection, Adagrad-based hyperparameter optimization, and BiLSTM-based speech recognition. The QOPROA model is designed to reduce the dimensionality of the features and improve recognition performance. Feature selection is important in classification since it helps decrease the high-dimensional feature space; reducing the feature space makes classification systems computationally cheaper and more accurate, so identifying the appropriate mix of features is critical. Figure 1 illustrates the overall process of the AESR-DRDL technique. The detailed working of each module is described in the succeeding subsections.

3.1. Preprocessing

The speech input is the raw voice signal gathered by the recording equipment. Preprocessing mainly involves three operations: sampling the original voice signal, antialiasing band-pass filtering, and eliminating the noise introduced by various factors. The subsequent feature extraction stage then extracts the informative content from the voice.
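As an illustrative sketch (not part of the paper's pipeline description), the following Python snippet shows how such a preprocessing stage could be realized with SciPy; the sampling rate, pass-band edges, and filter order are assumed values chosen for illustration only:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(signal, fs=16000, low_hz=80.0, high_hz=7600.0, order=6):
    """Band-pass filter a raw speech signal to suppress out-of-band noise.

    fs, low_hz, high_hz, and order are illustrative assumptions; the paper
    does not specify the exact filter parameters.
    """
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    filtered = sosfilt(sos, signal)
    # Normalize the amplitude to reduce recording-level variation.
    return filtered / (np.max(np.abs(filtered)) + 1e-12)
```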

3.2. Feature Extraction

Here, the Perceptual Linear Prediction Cepstral Coefficients (PLPC), Perceptual Minimum-Variance Distortionless Response Cepstral Coefficients (PMVDR), Mel-frequency Cepstral Coefficients (MFCC), pitch (F0), and their first- and second-order derivatives are derived as feature vectors of the input speech signals and glottal-waveform signals [16]. As the features are computed in distinct ways, the speech signals are described in several complementary ways, and the features are combined to build a complete feature vector. In this study, the MFCC and PLPC features are extracted using Dan Ellis' toolbox, while pitch extraction is carried out with the COVAREP toolbox. Every feature extraction step is performed separately on the input speech signals and the glottal waveforms, where the COVAREP toolbox is also applied to extract the glottal signals from the speech signals. Once feature extraction is complete, a feature matrix is derived whose rows and columns represent the frames and the distinct feature-vector components, respectively.
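For illustration, a minimal Python sketch of the MFCC-plus-derivatives part of this step is given below using the librosa library; note that the paper itself uses Dan Ellis' toolbox and COVAREP (MATLAB tools), so librosa and the parameter choices here are assumptions:

```python
import numpy as np
import librosa

def mfcc_with_deltas(path, n_mfcc=13):
    """Return a frames-by-components feature matrix of MFCCs plus their
    first- and second-order derivatives, mirroring the matrix layout above."""
    y, sr = librosa.load(path, sr=None)           # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)              # first-order derivative
    d2 = librosa.feature.delta(mfcc, order=2)     # second-order derivative
    return np.vstack([mfcc, d1, d2]).T            # rows = frames
```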

3.3. Dimensionality Reduction Using QOPROA

Once the high-dimensional features are derived, the QOPROA is utilized to choose an optimal subset of features. The PRO is inspired by the wealth-related behavior of people in a society [17]. Generally, a society is classified into two financial categories: richer people (wealth greater than average) and poorer people (wealth less than average). Every person in these groups seeks to improve their financial status. The poorer people try to enhance their financial status and decrease the class gap by learning from richer people. In the optimization problem, the solutions in the poorer population move toward the global optimum in the search space by learning from the richer solutions. In this study, every solution (person) in the population is represented as a binary vector: a binary code of 0 signifies that the corresponding feature is not selected, and a binary code of 1 signifies that the feature is selected.

An individual is represented as $X = (x_1, x_2, \ldots, x_D)$, where $D$ denotes the number of features in the corpus, and every position of the solution is a binary value. For instance, a solution defined as [0, 1, 0, 1, 0, 0, 1, 1, 1, 0] represents that the features with indexes 2, 4, 7, 8, and 9 are chosen, whereas the others are not selected [18]. The set of solutions in the present generation is named a population; let $N$ indicate the population size. Initially, $N$ solutions are created arbitrarily with real values between zero and one. Afterward, a digitization step converts the real values at all positions of each solution into binary values as follows:
$$x_{ij} = \begin{cases} 1, & \text{if } x_{ij} > rand, \\ 0, & \text{otherwise.} \end{cases}$$
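A minimal sketch of this initialization and digitization step is shown below in Python/NumPy, under the assumption that the rand-based threshold rule above is applied positionwise:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def digitize(real_population):
    """Map real-valued positions in [0, 1] to binary feature masks: a
    position becomes 1 (feature selected) when it exceeds a fresh random
    threshold, following the rand-based rule above."""
    thresholds = rng.random(real_population.shape)
    return (real_population > thresholds).astype(int)

population = rng.random((10, 20))        # N = 10 solutions, D = 20 features
binary_population = digitize(population)
```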

Here, $rand$ denotes an arbitrary value in the range of zero and one. The candidate solutions in the population are ranked according to the objective value. The topmost part of the population is designated as the richer economic class, and the bottom part as the poorer economic class:
$$pop_{rich} = pop(1 : N/2), \qquad pop_{poor} = pop(N/2 + 1 : N).$$

Richer people move to raise the economic class gap by observing the people in the poorer economic group, while poorer people move to decrease the gap by learning from the richer group to improve their financial position. This natural behavior of poor and rich people is utilized to generate novel solutions. The movement of a richer solution is determined by
$$x_{rich,i}^{new} = x_{rich,i}^{old} + r \left( x_{rich,i}^{old} - x_{poor}^{best} \right).$$

The movement of a poorer solution is determined by
$$x_{poor,i}^{new} = x_{poor,i}^{old} + r \left( \frac{x_{rich}^{best} + x_{rich}^{mean}}{2} - x_{poor,i}^{old} \right),$$
where $r$ is a random number in $[0, 1]$, $x_{poor}^{best}$ is the best poor solution, and $x_{rich}^{best}$ and $x_{rich}^{mean}$ denote the best and mean rich solutions, respectively.
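The following NumPy sketch illustrates one poor-and-rich movement step, under the assumption that the updates follow the standard PRO formulation reconstructed above (sorting, splitting the population into halves, and moving each half):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def pro_step(pop, fitness):
    """One poor-and-rich movement (minimization): sort by fitness, split the
    population into rich and poor halves, then move each half as in the
    reconstructed update equations above."""
    order = np.argsort(fitness)                  # best solutions first
    pop = pop[order]
    n = len(pop) // 2
    rich, poor = pop[:n], pop[n:]
    # Rich solutions widen the gap to the best poor solution.
    new_rich = rich + rng.random(rich.shape) * (rich - poor[0])
    # Poor solutions learn from the best and the mean of the rich class.
    pattern = (rich[0] + rich.mean(axis=0)) / 2.0
    new_poor = poor + rng.random(poor.shape) * (pattern - poor)
    return np.clip(np.vstack([new_rich, new_poor]), 0.0, 1.0)
```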

Opposition-based learning (OBL) is defined in order to reduce the computational complexity and enhance the convergence capability of distinct evolutionary algorithms (EAs) [19]. By considering every present population together with the opposite population derived from the OBL concept, the candidate solutions can be enhanced; in particular, the quasiopposite number tends to generate better solutions than the plain opposite number. The opposite number, opposite point, quasiopposite number, and quasiopposite point are defined as follows. For any arbitrary number $x \in [a, b]$, the opposite number is represented as
$$x^{o} = a + b - x,$$
where the opposite point for a multidimensional search space ($D$ dimensions) can be represented using
$$x_j^{o} = a_j + b_j - x_j, \quad j = 1, 2, \ldots, D,$$
and the quasiopposite number of any arbitrary number $x$ can be denoted using
$$x^{qo} = \mathrm{rand}\left( \frac{a + b}{2},\; x^{o} \right).$$
Likewise, the quasiopposite point for the multidimensional search space ($D$ dimensions) can be represented by
$$x_j^{qo} = \mathrm{rand}\left( \frac{a_j + b_j}{2},\; x_j^{o} \right), \quad j = 1, 2, \ldots, D,$$
where $\mathrm{rand}(u, v)$ denotes a uniformly random number between $u$ and $v$.
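A small NumPy sketch of quasiopposite point generation, following the definitions above, is given below; the bounds a and b are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def quasi_opposite(pop, a=0.0, b=1.0):
    """Generate the quasiopposite population for solutions bounded by [a, b]:
    each component is drawn uniformly between the interval centre (a + b) / 2
    and the opposite point a + b - x, as defined above."""
    centre = (a + b) / 2.0
    opposite = a + b - pop
    lo = np.minimum(centre, opposite)
    hi = np.maximum(centre, opposite)
    return lo + rng.random(pop.shape) * (hi - lo)
```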

The feature dimensionality reduction process takes place using the QOPROA technique. Each position of the vector takes the value 0 or 1, where 0 represents that the feature is not selected and 1 indicates a selected feature. A transfer function maps the probability of switching a position-vector element between 0 and 1, and it significantly affects both the performance of the FS process and the quality of its outcome during the search. The fitness function of the QOPROA technique is derived to evaluate solutions by trading off two objectives, as defined in the following:
$$Fitness = \alpha \, \gamma_R(D) + \beta \, \frac{|R|}{|C|},$$
where $\gamma_R(D)$ denotes the classifier error rate, $|R|$ indicates the number of features chosen, $|C|$ implies the total number of features in the existing dataset, and $\alpha \in [0, 1]$ (with $\beta = 1 - \alpha$) is a variable used to weight the classification error rate against the feature-subset size.
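The fitness computation can be sketched as follows; the weight alpha = 0.99 is a conventional choice in feature-selection studies, not a value reported in this paper:

```python
import numpy as np

def qoproa_fitness(error_rate, mask, alpha=0.99):
    """Trade-off fitness: alpha weights the classifier error rate and
    (1 - alpha) weights the fraction of features retained."""
    n_selected = int(np.sum(mask))
    return alpha * error_rate + (1.0 - alpha) * (n_selected / mask.size)
```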

3.4. Speech Recognition Using Optimal BiLSTM Model

Finally, the recognition of the speech signals takes place using the BiLSTM model. LSTM is a variant of the RNN that solves the issue of gradient vanishing; it improves the storage strategy of the NN for receiving input and training data and is useful for modelling time-series data such as text. The BiLSTM is the integration of a backward LSTM and a forward LSTM [20], and its major benefit is that the sequence details are completely exploited by the network. The LSTM unit includes an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$, as well as a memory cell state $c_t$; these influence the capability of the unit to store and update data. The gates output values in the range of 0 to 1 depending upon the input $x_t$ and the previous hidden state $h_{t-1}$: an outcome of 1 means that the cell state details are entirely retained, and an outcome of 0 means they are entirely discarded. The input gate layer decides which values need to be updated, and a $\tanh$ layer generates a novel candidate value vector $\tilde{c}_t$ that can be appended to the cell state. These are integrated to update the cell state $c_t$, and lastly, the output gate determines the outcome depending upon the cell state [20]:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$h_t = o_t \odot \tanh(c_t).$$
Here, $W$, $U$, and $b$ denote the intrinsic variables learned in the LSTM training process, $\sigma$ is the sigmoid activation function, and $\odot$ implies elementwise (Hadamard) multiplication.
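For clarity, a single LSTM time step implementing these gate equations can be sketched in NumPy as follows; the dictionary-based weight layout is an illustrative assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step implementing the gate equations above; W, U, and b
    are dictionaries of per-gate weights keyed by 'i', 'f', 'o', and 'c'."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde                          # cell update
    h_t = o_t * np.tanh(c_t)                                    # hidden state
    return h_t, c_t
```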

The BiLSTM model encompasses a forward as well as a backward LSTM. The forward network $\overrightarrow{LSTM}$ in the BiLSTM reads the input from $x_1$ to $x_T$ to generate $\overrightarrow{h}_t$, and the backward network $\overleftarrow{LSTM}$ reads the input from $x_T$ to $x_1$ to generate $\overleftarrow{h}_t$:
$$\overrightarrow{h}_t = \overrightarrow{LSTM}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{LSTM}(x_t, \overleftarrow{h}_{t+1}).$$

The forward and inverse sequence representations produced by $\overrightarrow{LSTM}$ and $\overleftarrow{LSTM}$ are concatenated into one long vector, and the integrated outcome represents the input at the present time:
$$h_t = [\overrightarrow{h}_t;\; \overleftarrow{h}_t].$$

At last, the outcome of the entire series is attained, where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are used for representing the outcome of the hidden layer. The recurrent layers in the BiLSTM model return the entire sequence and ensure that the outcome of every hidden layer retains the long-term information. Figure 2 demonstrates the structure of the BiLSTM technique.
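A compact Keras sketch of such a BiLSTM recognizer is shown below; the layer sizes, loss, and learning rate are illustrative assumptions rather than the paper's reported configuration:

```python
import tensorflow as tf

def build_bilstm(n_frames, n_features, n_classes, units=128):
    """A BiLSTM recognizer sketch compiled with the Adagrad optimizer used
    in this work; layer sizes and learning rate are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_frames, n_features)),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units)),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```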

For optimally tuning the hyperparameters of the BiLSTM model, the Adagrad optimizer is applied. In the Adagrad optimizer, the gradient and the accumulated squared gradients for every variable are computed at round $t$ as [21]
$$g_t = \nabla_\theta J(\theta_t), \qquad G_t = G_{t-1} + g_t \odot g_t,$$
where $\odot$ denotes elementwise multiplication and $g_t$ indicates the gradient of the present variable at round $t$. The variables in Adagrad can be upgraded using
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t,$$
where $\eta$ implies the learning rate and $\epsilon$ denotes a smoothing component that eliminates division by zero. Since the learning rate $\eta$ is fixed prior to the training process, the update can be expressed through an effective per-parameter learning rate
$$\eta_t = \frac{\eta}{\sqrt{G_t} + \epsilon}.$$

Because $G_t$ denotes the accumulation of all earlier squared gradients, it can be represented as follows:
$$G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau.$$

Therefore, Adagrad can be upgraded using
$$\theta_{t+1} = \theta_t - \eta_t \odot g_t,$$

which is identical in form to the update procedure of classical gradient descent. Therefore, the Adagrad optimizer can be employed for hyperparameter tuning using the gradient.
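A minimal NumPy sketch of this accumulate-and-scale update is given below:

```python
import numpy as np

def adagrad_update(theta, grad, G, eta=0.01, eps=1e-8):
    """One Adagrad step following the equations above: accumulate the squared
    gradients, then scale the fixed learning rate per parameter."""
    G = G + grad * grad                      # G_t = G_{t-1} + g_t (x) g_t
    theta = theta - eta / (np.sqrt(G) + eps) * grad
    return theta, G
```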

4. Experimental Validation

For experimental analysis, data from the enhanced TED-LIUM release 2 corpus is used [8]. The results are examined from several aspects. Table 1 and Figures 3–5 deal with the performance analysis of the AESR-DRDL technique under varying batch sizes (BS) and epoch counts. The figure offers a comprehensive result analysis of the AESR-DRDL technique under a BS of 32 and distinct epochs. The experimental results reported that the AESR-DRDL technique has accomplished maximum performance under every epoch count. For instance, with 100 epochs, the AESR-DRDL technique has attained WER of 76.17%, loss of 154.63%, and MED of 0.3247. Likewise, with 300 epochs, the AESR-DRDL technique has accomplished WER of 75.22%, loss of 151.19%, and MED of 0.3175. Similarly, with 500 epochs, the AESR-DRDL technique has obtained WER of 76.90%, loss of 154.11%, and MED of 0.3278.

The experimental results provide comprehensive outcome analysis of the AESR-DRDL approach under BS of 64 and distinct epochs. The experimental results reported that the AESR-DRDL system has accomplished maximum performance under every epoch. For instance, with 100 epochs, the AESR-DRDL technique has attained WER of 75.23%, loss of 153.80%, and MED of 0.3230. Following this, with 300 epochs, the AESR-DRDL algorithm has accomplished WER of 75.11%, loss of 155.64%, and MED of 0.3268. Also, with 500 epochs, the AESR-DRDL method has obtained WER of 76.60%, loss of 151.07%, and MED of 0.3173.

A comprehensive analysis of the AESR-DRDL technique under BS of 128 and varying epochs reported that the AESR-DRDL technique has accomplished maximal performance under every epoch. For instance, with 100 epochs, the AESR-DRDL technique has achieved WER of 75.18%, loss of 151.86%, and MED of 0.3189. In addition, with 300 epochs, the AESR-DRDL system has accomplished WER of 76.16%, loss of 154.60%, and MED of 0.3247. Eventually, with 500 epochs, the AESR-DRDL technique has reached WER of 76.49%, loss of 153.74%, and MED of 0.3229.

Table 2 and Figures 6–8 examine the performance analysis of the AESR-DRDL approach under different BS and layer counts. The figure depicts comprehensive result analysis of the AESR-DRDL approach under BS of 32 and distinct layers.

The experimental outcomes revealed that the AESR-DRDL technique has accomplished maximal performance under every layer. For instance, with 100 layers, the AESR-DRDL algorithm has attained WER of 75.39%, loss of 152.72%, and MED of 0.3207. Similarly, with 300 layers, the AESR-DRDL technique has accomplished WER of 76.67%, loss of 150.97%, and MED of 0.3170. Eventually, with 500 layers, the AESR-DRDL methodology has reached WER of 75.16%, loss of 156.26%, and MED of 0.3282.

The simulation values demonstrate the comprehensive result analysis of the AESR-DRDL system under BS of 64 and distinct layers. The experimental results reported that the AESR-DRDL methodology has accomplished increased performance under every layer. For instance, with 100 layers, the AESR-DRDL algorithm has gained WER of 76.05%, loss of 155.90%, and MED of 0.3274. Besides, with 300 layers, the AESR-DRDL technique has accomplished WER of 76.93%, loss of 155.40%, and MED of 0.3263. Lastly, with 500 layers, the AESR-DRDL technique has obtained WER of 75.10%, loss of 150.95%, and MED of 0.3170.

A comprehensive result analysis of the AESR-DRDL technique under BS of 128 and different layers revealed that the AESR-DRDL technique has accomplished higher performance under every layer. For example, with 100 layers, the AESR-DRDL approach has attained WER of 75.85%, loss of 153.22%, and MED of 0.3218. Also, with 300 layers, the AESR-DRDL technique has accomplished WER of 76.65%, loss of 156.37%, and MED of 0.3284. Similarly, with 500 layers, the AESR-DRDL method has obtained WER of 74.48%, loss of 151.71%, and MED of 0.3186.

A comparative study of the AESR-DRDL technique with existing techniques takes place in Table 3 [22].

Figure 9 offers the WER analysis of the AESR-DRDL technique with existing techniques. The figure reported that the PPCA and DNN techniques have obtained higher WER of 88.10% and 88.06%, respectively. Following this, the RNN and PQPSO techniques have attained slightly reduced WER of 87.02% and 87.67%, respectively. Moreover, the LSTM and GRU techniques have accomplished reasonable WER of 77.55% and 79.39%, respectively. However, the AESR-DRDL technique has resulted in the best outcome with a minimal WER of 75.53%.

Figure 10 gives the loss analysis of the AESR-DRDL method with existing techniques. The figure stated that the RNN, PPCA, and DNN approaches have obtained higher loss of 186.61, 185.53, and 185.01, respectively, whereas the PQPSO method has gained a slightly lower loss of 179.25. Furthermore, the LSTM and GRU techniques have accomplished reasonable loss of 160.51 and 162.22, respectively. However, the AESR-DRDL algorithm has resulted in the best outcome with a minimal loss of 152.85.

Figure 11 provides the MED analysis of the AESR-DRDL system with existing approaches. The figure depicted that the PPCA and DNN techniques have obtained higher MED of 0.4407 and 0.4574, respectively. The RNN and PQPSO techniques have attained somewhat lesser MED of 0.4484 and 0.4515, respectively. Moreover, the LSTM and GRU techniques have accomplished reasonable MED of 0.3853 and 0.3939, respectively. However, the AESR-DRDL methodology has resulted in the best outcome with the least MED of 0.3210.

Table 4 and Figure 12 define the running time (RT) analysis of the AESR-DRDL approach with existing techniques. The figure reported that the PPCA and DNN techniques have required higher RT of 2.00 days and 1.60 days, respectively. Besides, the PQPSO and RNN techniques have taken slightly lesser RT of 1.50 days and 1.10 days, respectively. Moreover, the GRU and LSTM techniques have accomplished reasonable RT of 1.02 days and 0.80 days, respectively. Finally, the AESR-DRDL technique has resulted in a superior outcome with a minimal RT of 0.50 days.

From the abovementioned tables and figures, it can be concluded that the AESR-DRDL algorithm has resulted in maximum speech recognition performance over the other techniques.

5. Conclusion

In this study, an effective AESR-DRDL technique has been developed for the recognition of English speech signals. The proposed AESR-DRDL technique incorporates several stages of operation, namely, preprocessing, feature extraction, QOPROA-based feature selection, BiLSTM-based speech recognition, and Adagrad-based hyperparameter optimization. During feature extraction, high-dimensional rich feature vectors are extracted from the speech and glottal-waveform signals by the employment of the MFCC, PLPC, and MVDR approaches, and the design of the QOPROA technique helps in reducing the dimensionality of the features and improving the recognition performance. To highlight the enhanced performance of the AESR-DRDL technique, a wide range of simulations was run on benchmark datasets, and the results indicated the superior outcomes of the AESR-DRDL technique compared to recent approaches, including an average running time of 0.50 days. Therefore, the AESR-DRDL technique can be applied as an effective tool for English speech signal recognition. In future, hybrid DL models can be used in place of the BiLSTM model to further improve the recognition performance.

Data Availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest.