Abstract

We introduce a multiengine speech processing system that can detect the location and the type of an audio signal in variably noisy environments. The system detects the location of the audio source using a microphone array, examines the audio to determine whether it is speech or nonspeech, and then estimates the signal-to-noise ratio (SNR) using a Discrete-Valued SNR Estimator. Using this SNR value, instead of trying to adapt the speech signal to the speech processing system, we adapt the speech processing system to the environment in which the speech signal was captured. In this paper, we introduce the Discrete-Valued SNR Estimator and a multiengine classifier, using either Multiengine Selection or Multiengine Weighted Fusion, and we use speaker identification (SI) as the example speech processing task. The Discrete-Valued SNR Estimator achieves an accuracy of 98.4% in characterizing the environment's SNR. Compared to a conventional single-engine system, the improvement in accuracy was as high as 9.0% and 10.0% for Multiengine Selection and Multiengine Weighted Fusion, respectively.

1. Introduction

Speech processing systems, such as speaker identification (SI) and Automatic Speech Recognition (ASR), have two operating modes: a training mode and a testing mode. The goal of the training mode is for the classifier to learn how to discern between different speech classes, using exemplars whose classes are known. The purpose of the testing mode is to classify speech data, from an unknown class, based on the training.

The performance of most speech processing systems degrades severely when the training and the testing audio are captured in different acoustic environments (e.g., different levels of noise [14]). For example, training speech may be captured using a telephone handset in a noise-free environment, while the test speech is captured using a hands-free phone in a noisy environment. The conventional method to improve the performance of speech processing in noisy environments is to perform speech enhancement prior to speech processing [27]. In this method, training speech signals are obtained in a clean environment (i.e., no noise). The test speech signals, however, are obtained in environments that may contain noise. Denoising and/or speech enhancement methods may be applied to these signals to reduce the effects of the environment; however, these methods often enhance the audibility of the speech signal from the human perspective but ignore the fact that the enhancement may remove or distort information in the speech signal that is essential to the speech processing.

In [2], the effect of five different speech enhancement techniques on a GMM-based ASR system was examined. These enhancement techniques were (1) perceptual wavelet adaptive denoising (PWAD), (2) VAD-based noise floor estimation (VAD-NFE), (3) minimum-mean-square-error-based two-dimensional spectral enhancement (MMSE-TDSE), (4) two-pass adaptive noise estimation (TPANE), and (5) combined two-pass spectral and PWAD (TPS-PWAD). The performance of the speech enhancement techniques was evaluated using the Perceptual Evaluation of Speech Quality-Mean Opinion Score (PESQ-MOS) [7, 8]. PESQ-MOS predicts the subjective quality of input speech by returning a score between 0 and 4.5; the higher the score, the cleaner the input speech (i.e., the higher the SNR). These enhancement techniques were used as preprocessing blocks to the ASR engine. The overall performance was evaluated based on the ASR classification accuracy using each enhancement technique.

The author of [2] noted the following.
(i) Speech enhancement of signals corrupted with white noise generally improved the classification accuracy of the ASR; however, this improvement is inconsistent across SNR values for a single enhancement technique.
(ii) The MMSE-TDSE has the greatest average improvement in the PESQ-MOS speech quality score over an SNR range from 0 to 20 dB, an average improvement of 0.7 points. This improvement, however, did not translate into the best ASR classification accuracy.
(iii) At one SNR level, the MMSE-TDSE speech enhancement resulted in a decrease in ASR classification accuracy of 0.71%, relative to ASR without speech enhancement.
(iv) The VAD-NFE achieved the best improvements of 0.5 dB and 3.7 dB at SNRs of 20 and 15 dB, respectively; however, this did not translate into the best ASR classification accuracy at these SNR values.

The results of [2] suggest that optimizing the quality of speech perception or the SNR, using speech quality enhancement techniques, does not optimize speech processing performance. Indeed, speech quality enhancement techniques may actually degrade speech processing performance instead of improving it.

In [5], four speech enhancement techniques were combined to enhance ASR in noisy environments. These speech enhancement techniques were (1) spectral subtraction with smoothing in the time direction, (2) temporal-domain SVD-based speech enhancement, (3) GMM-based speech estimation, and (4) KLT-based comb-filtering. Two combination methods were introduced to combine the speech enhancement techniques: selection of a front-end processor and combination of results from multiple recognition processors. The speech enhancement techniques are applied to the captured noisy speech, and their outputs are combined using one of the two proposed combination methods. The overall performance was evaluated based on the ASR classification accuracy using the two combination techniques. The authors of [5] reported an ASR accuracy improvement of 5.4% for the proposed combination methods, compared to the best-performing standalone speech enhancement technique. This improvement was achieved using a noise environment detector with a detection accuracy of 54%. Using an ideal noise environment detector with a detection accuracy of 100% improves the ASR accuracy by only a further 1.9%, compared to the environment detector with 54% detection accuracy. As indicated in [5], the main factor in enhancing the ASR accuracy is the proposed combination of different speech enhancement techniques. However, the authors of [5] did not discuss the finding of [2] that speech enhancement techniques may actually degrade speech processing performance instead of improving it.

In this paper, we introduce a speech processing system that adapts to the surrounding environment of the speech signal, rather than trying to adapt the speech signal to the system; in this system, no speech enhancement is applied to the captured speech signal. This speech processing system characterizes its environment from the captured speech signal and adapts the speech processing to better suit the environment, enhancing the overall performance in variable environments. In this paper, speaker identification (SI) is used as the speech processing system.

The system proposed in this paper uses multiple SI engines, each trained in a noisy environment with a different SNR level. Two multiengine fusion systems are proposed: Multiengine Selection and Multiengine Weighted Fusion (Figure 1). These proposed systems permit the intelligent fusion of the multiple SI engines using noise information about the surrounding environment. The noise information is determined from the incoming speech signal by a Discrete-Valued SNR Estimator. The proposed system enhances SI performance, improving classification accuracy and robustness to different noise conditions.

The multiengine fusion system is being proposed for use in a novel security instrument using a microphone array. This security instrument is capable of robustly determining the identity and location of a talker, even in a noisy environment. The instrument is also capable of identifying the audio type and location of nonspeech audio signals (e.g., footsteps, windows breaking, and cocktail noise) [9]. This instrument can be incorporated in a variety of applications, such as hands-free audio conferencing systems that perform voice-based security authentication for the conference users [10]. The instrument can also be utilized in smart homes and health monitoring for older adults, supporting independent living and enabling them to continue to age in their own homes [11]. Smart homes can benefit from this instrument by monitoring their occupants through analysis of the audio signals present in the home and recognition of certain sound sources. For instance, respiration, falls to the ground, and water leaking sounds could be detected. The proposed instrument could perform further analyses and make decisions according to the captured audio type, such as extracting the respiration rate, calling emergency help, or alerting the occupant that a plumber may be needed. The analysis of audio data preserves the privacy of the users better than video monitoring.

The applications mentioned above require high performance to function successfully. In addition, the audio processing system needs to be robust even in variable conditions, such as environments with different levels of noise. Poor instrument robustness would limit the utility of these systems.

The remainder of the paper is organized as follows. Section 2 describes the overall implementation of the proposed approach, introducing and discussing the speech/nonspeech classifier, the Discrete-Valued SNR Estimator, the multiengine SI system, and the two methods of multiengine fusion: Multiengine Selection and Multiengine Weighted Fusion. Performance of the multiengine system is evaluated using a library of speech signals in noisy environments. In Section 3, the results of this performance evaluation are discussed and compared against the baseline single-engine SI method. Major conclusions are summarized in Section 4.

2. Proposed System

2.1. Instrument Overview

A block diagram of the overall instrument is shown in Figure 2. The microphone array captures the audio signal and then sends the digitized audio signal to the localization block, which determines the location of the sound source [12, 13] and steers the microphone array to reduce undesired acoustic noise.

A speech/nonspeech classifier is used as a front end to select the next stage of audio processing, which is either the SI system or the nonspeech audio classification system [9]. The speech/nonspeech classifier is also adapted to the environment, so that it is robust to varying levels of noise. In this paper, we utilize the Discrete-Valued SNR Estimator (Section 2.3) to adapt the speech/nonspeech classifier to the surrounding environment based on the estimated SNR value. If the detected signal is nonspeech, the audio classification system determines the audio type of the detected signal (e.g., footsteps, windows breaking, and door opening sounds); otherwise, the speech signal is sent to both the Discrete-Valued SNR Estimator and the multiengine SI system, consisting of multiple SI engines, to identify the talker. Each SI engine is trained in an environment with a specific noise condition. Using the SNR estimate from the Discrete-Valued SNR Estimator, the system fuses the outputs of the SI engines.

Two fusion methods are used: Multiengine Selection and a newly introduced method, Multiengine Weighted Fusion. Multiengine Selection selects the SI engine whose training environment best matches the surrounding environment; the multiengine system decision is the decision of this single engine. The Multiengine Weighted Fusion system intelligently weights and combines the output decisions of all SI engines; the higher the similarity between the testing environment and an engine's training environment, the higher the weighting that engine receives.

The resultant talker identity (or audio type for nonspeech audio), along with the location information, is sent to the data processing system for application-specific tasks. For example, these data can be utilized in audio-video hands-free conferencing, where a video camera can be steered toward talkers and away from undesired audio sources.

2.2. Speech/Nonspeech Classification

In [9, 14], the pitch ratio (PR) was used to perform speech/nonspeech classification. The algorithm detects the presence of human pitch (70 Hz < human pitch < 280 Hz) in multiple short frames within an audio segment of a given duration, as shown in Figure 3. The PR is calculated as PR = NP/N, where NP is the number of frames that contain human pitch and N is the total number of frames in the segment (obtained with the floor operator from the segment duration, frame size, and overlap). The PR value is then compared to a threshold to make the speech/nonspeech decision: if the PR exceeds the threshold, the segment is classified as speech; otherwise, it is classified as nonspeech.
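A minimal sketch of this PR computation is given below, assuming non-overlapping frames and a simple autocorrelation-based check for human pitch; the frame length, pitch test, and decision threshold are illustrative assumptions rather than the exact implementation of [9, 14].

```python
import numpy as np

def frame_has_pitch(frame, fs, fmin=70.0, fmax=280.0, peak_threshold=0.3):
    """Crude voiced/unvoiced test: look for a normalized autocorrelation peak
    at a lag corresponding to the 70-280 Hz human pitch range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return False
    ac = ac / ac[0]
    lo, hi = int(fs / fmax), min(int(fs / fmin), len(ac) - 1)
    return np.max(ac[lo:hi + 1]) > peak_threshold

def pitch_ratio(segment, fs, frame_len=0.03):
    """PR = (frames containing human pitch) / (total frames in the segment)."""
    n = int(frame_len * fs)
    frames = [segment[i:i + n] for i in range(0, len(segment) - n + 1, n)]
    if not frames:
        return 0.0
    return sum(frame_has_pitch(f, fs) for f in frames) / len(frames)

def is_speech(segment, fs, pr_threshold=0.4):
    """Speech/nonspeech decision: PR above the threshold is taken as speech."""
    return pitch_ratio(segment, fs) >= pr_threshold
```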

In [9], the threshold was empirically determined as a function of frame size, segment duration, and overlap. Noise was shown to deteriorate the pitch period of speech, which reduces the capability of the pitch detection algorithm to detect the presence of pitch. To remedy this, the threshold of the algorithm is adapted to maximize the accuracy of the speech/nonspeech classifier. We have found that the optimal threshold (Figure 4(a)), that is, the threshold corresponding to the maximum speech/nonspeech classification accuracy (Figure 4(b)), decreases as the SNR decreases. This adaptive speech/nonspeech algorithm is termed the adaptive pitch ratio (APR) algorithm [9]. Adaptation is performed based on the SNR estimates from the Discrete-Valued SNR Estimator.
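To make the adaptation concrete, the sketch below maps an estimated SNR to a PR threshold by choosing the nearest entry in a lookup table; the table values are made-up placeholders, not the empirically determined thresholds of [9].

```python
# Hypothetical SNR (dB) -> PR threshold table; lower SNR, lower threshold.
APR_THRESHOLDS = {30: 0.45, 20: 0.40, 10: 0.35, 0: 0.25, -10: 0.15}

def apr_threshold(estimated_snr_db):
    """Pick the threshold of the training SNR nearest to the estimated SNR."""
    nearest = min(APR_THRESHOLDS, key=lambda s: abs(s - estimated_snr_db))
    return APR_THRESHOLDS[nearest]
```

With this placeholder table, an utterance estimated at 7 dB would use the 10 dB threshold of 0.35 in the speech/nonspeech decision sketched earlier.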

If the decision of the APR is nonspeech, further audio classification is performed to determine the audio type. Nonspeech audio classification is not discussed here, as it is not the focus of this paper; readers can refer to [9] for more details. If the decision of the APR is speech, then an SI algorithm is used to identify the talker.

2.3. Discrete-Valued SNR Estimator

An SNR estimator that is computationally efficient and has low complexity is desirable for this system, as it enables the selection of the SI engine in real time. The output of this estimator should be a discrete value that matches the SNR of one of the training environments. The estimator provides information about the noise in the surrounding environment by extracting features from the audio captured within that environment.

In this paper, the Discrete-Valued SNR Estimator is implemented using covariance matrix modeling [1, 15, 16]. The feature vector of the noisy environment consists of the mel-frequency spectrum coefficients (MFSCs) and the delta MFSCs. The MFSCs are computed in a similar fashion to the mel-frequency cepstrum coefficients (MFCCs), using a filter bank output but without the discrete cosine transform (DCT) [17]. The DCT of the MFSC vector decorrelates the feature vector's elements, which leads to a feature vector with low sensitivity to environment changes in the audio signal and high sensitivity to the talker (i.e., dependent on the talker) [18]. In our case we desire a feature vector with high sensitivity to the noise but independence from the talker (i.e., one that provides information about the environment without being affected by who uttered the speech). Thus, the DCT stage has been removed from the feature vector generation process.
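The following sketch illustrates how MFSC-plus-delta environment features can be extracted (mel filter-bank log-energies with the DCT stage omitted). It assumes the librosa library and generic frame and filter-bank settings purely for illustration; these are not necessarily the settings used by the authors.

```python
import numpy as np
import librosa

def mfsc_features(signal, fs, n_mels=19, frame_len=0.025, hop=0.010):
    """Return one (2 * n_mels)-dimensional MFSC + delta-MFSC vector per frame."""
    mel = librosa.feature.melspectrogram(
        y=signal, sr=fs,
        n_fft=int(frame_len * fs), hop_length=int(hop * fs), n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)      # MFSC: log filter-bank energies, no DCT
    delta = librosa.feature.delta(log_mel)  # delta MFSC
    return np.vstack([log_mel, delta]).T    # shape: (num_frames, 2 * n_mels)
```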

To implement the Discrete-Valued SNR Estimator, let C_i represent the covariance matrix of the training feature vectors for environment i, and let C_test represent the covariance matrix of the test feature vectors from an unknown environment. The feature vectors for both the training and testing covariance matrices have length L; therefore, C_i and C_test are L × L matrices, computed as C = (1/M) Σ_{m=1}^{M} (x_m − μ)(x_m − μ)^T, where the x_m are the M feature vectors and the vector μ is their mean (the training vectors and their mean for C_i, and the test vectors and their mean for C_test).

The distance d_i between training environment i and the test environment is calculated using the arithmetic harmonic sphericity measure [1, 16, 18], which in its commonly cited form is d_i = log[tr(C_test C_i^{-1}) tr(C_i C_test^{-1})] − 2 log(L), where L is the feature vector length. The distance d_i reflects the similarity between training environment i and the test environment; the smaller the distance, the greater the similarity. The distance vector D = [d_1, d_2, ..., d_K] is then generated over the K training environments.
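A sketch of the covariance modeling and distance computation is shown below; the sphericity distance uses the commonly cited form given above and is an assumption about the exact expression used in the paper.

```python
import numpy as np

def covariance(features):
    """features: (num_frames, L) matrix of MFSC + delta vectors -> L x L covariance."""
    return np.cov(features, rowvar=False)

def sphericity_distance(c_train, c_test):
    """Arithmetic harmonic sphericity distance; equals 0 when the covariances match."""
    L = c_train.shape[0]
    a = np.trace(c_test @ np.linalg.inv(c_train))
    b = np.trace(c_train @ np.linalg.inv(c_test))
    return np.log(a * b) - 2.0 * np.log(L)

def environment_distances(c_test, training_covs):
    """Distance vector D over the K training environments."""
    return np.array([sphericity_distance(c, c_test) for c in training_covs])
```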

The proposed instrument only requires a rough estimate of the SNR value to set the threshold of the APR and to calculate the distances between the surrounding environment and the training environments for fusing the outputs of the multiple engines. Other estimators, such as SNR estimation using high-order statistics [19], have higher resolution (i.e., they estimate the SNR on a continuous scale) but also higher complexity. The increased precision is not needed in the proposed instrument, so the increased complexity can be avoided.

2.4. Speaker Identification Engines

Gaussian mixture models (GMMs) are used to implement the individual SI engines. GMMs are commonly used for speaker identification [20, 21]. During training, the parameters of each talker's GMM are determined from a sequence of training vectors. First, the vectors are grouped into clusters using the Linde-Buzo-Gray (LBG) clustering algorithm [22]. Second, the mean and variance of each cluster are computed. During testing, the probability that a given talker uttered a certain test speech is computed from the GMMs. Using feature vectors extracted from the test utterance, the SI engine computes a log-likelihood vector whose number of elements is equal to the number of talkers. The engine selects as the talker who uttered the test speech the talker with the highest log-likelihood value. In this work, a 32-mixture GMM is used as the classifier, with a feature vector of 16 MFCCs.
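An illustrative GMM-based SI engine is sketched below using scikit-learn. The paper initializes the mixtures with LBG clustering; GaussianMixture's built-in k-means initialization is used here as a stand-in, and the class and method names are hypothetical.

```python
from sklearn.mixture import GaussianMixture

class SIEngine:
    """One speaker-identification engine: a 32-mixture GMM per enrolled talker."""

    def __init__(self, n_components=32):
        self.n_components = n_components
        self.models = {}  # talker id -> fitted GaussianMixture

    def train(self, training_data):
        """training_data: dict of talker id -> (num_frames, 16) MFCC matrix."""
        for talker, feats in training_data.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag", max_iter=200)
            gmm.fit(feats)
            self.models[talker] = gmm

    def log_likelihoods(self, test_feats):
        """Average log-likelihood of the test utterance under each talker's GMM."""
        return {t: m.score(test_feats) for t, m in self.models.items()}

    def identify(self, test_feats):
        """Return the talker with the highest log-likelihood."""
        ll = self.log_likelihoods(test_feats)
        return max(ll, key=ll.get)
```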

The accuracy of an SI engine can be degraded by a mismatch between the training and test speech. In the context of this work, this mismatch occurs when different levels of noise are present in the training and testing sets. In the proposed multiengine system, a set of SI engines is used, each trained with a different level of noise. For the multiengine SI system, one log-likelihood vector is produced by each SI engine. Multiengine fusion is then performed using the SNR estimate from the Discrete-Valued SNR Estimator.

2.5. Multiengine Selection

Each environment i in the set of K training environments is modeled by a single covariance matrix C_i. A noisy test utterance from an unknown environment is modeled by a covariance matrix C_test. The distance d_i is computed for each training environment. The SI engine associated with the minimum distance is selected, and the talker identity determined by this engine is used as the system output (Figure 5); that is, the selected engine index is the one that minimizes d_i over the K training environments.

The estimated SNR value is the SNR of the selected training environment.
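Putting the pieces together, a sketch of Multiengine Selection could look like the following, reusing the environment_distances and SIEngine helpers sketched above (both hypothetical names).

```python
import numpy as np

def multiengine_selection(test_feats, c_test, engines, training_covs, training_snrs):
    """engines[k] was trained in the environment with covariance training_covs[k]
    and SNR training_snrs[k]; return (talker identity, estimated SNR)."""
    d = environment_distances(c_test, training_covs)
    k = int(np.argmin(d))                      # closest training environment
    return engines[k].identify(test_feats), training_snrs[k]
```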

2.6. Multiengine Weighted Fusion

The performance of the SI engines varies across environments. This variation depends on the similarity between the test environment and each engine's training environment. Exploiting this fact, the system can utilize the output decision of each SI engine and combine the decisions of all engines (Figure 6).

Assume that we have S talkers and K SI engines trained in K environments. The conditional probability of each talker s given engine k is denoted P(s|k) and is computed using the following steps. Each SI engine trained in environment k generates a log-likelihood vector whose elements are the likelihoods of the S talkers, the element for talker s being the likelihood of talker s from engine k. These likelihoods are normalized across the S talkers so that, for each engine k, the resulting values P(s|k) sum to one.

This normalization ensures that the values of P(s|k) have properties consistent with probabilities; that is, each P(s|k) lies between 0 and 1 and the values sum to one over the S talkers.

The probability that the surrounding environment is environment k is denoted P(k) and is computed as follows. First, the distance between the surrounding environment and each training environment is measured using the Discrete-Valued SNR Estimator, and the distance vector is generated as described in Section 2.3. The training environment with the smallest distance has the highest similarity to the surrounding environment. The distances are then normalized into probabilities of the surrounding environment, with smaller distances mapped to larger probabilities. This normalization ensures that the values of P(k) have properties consistent with probabilities; that is, each P(k) lies between 0 and 1 and the values sum to one over the K training environments.

The probability of talker s for the overall system is then computed as P(s) = Σ_{k=1}^{K} P(k) P(s|k). The decision of the Multiengine Weighted Fusion system is the talker with the maximum probability P(s). The Multiengine Weighted Fusion method generalizes the speaker identification solution to variable environments.
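The sketch below illustrates the weighted fusion. The softmax-style normalizations that turn log-likelihoods and distances into probabilities are illustrative assumptions; the paper's exact normalizations may differ, but the structure (environment weights multiplying per-engine talker probabilities and summing over engines) follows the description above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_fusion(test_feats, c_test, engines, training_covs, talkers):
    # P(talker s | engine k): normalize each engine's log-likelihood vector.
    p_s_given_k = []
    for eng in engines:
        ll = eng.log_likelihoods(test_feats)
        p_s_given_k.append(softmax(np.array([ll[t] for t in talkers])))

    # P(environment k): smaller sphericity distance -> larger probability.
    d = environment_distances(c_test, training_covs)
    p_k = softmax(-d)

    # P(s) = sum_k P(k) * P(s | k); decide for the most probable talker.
    p_s = sum(p_k[k] * p for k, p in enumerate(p_s_given_k))
    return talkers[int(np.argmax(p_s))], p_s
```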

3. Experiment and Results

3.1. Audio Database

A speech/nonspeech audio database containing a total of 1356 audio segments was used to train and test the speech/nonspeech classifier. Audio segments were 0.45 seconds in length, with a sampling rate of 8 kHz. The speech/nonspeech audio database contains nonspeech audio (e.g., windows breaking, wind, fan, and footsteps sounds) and speech audio. A full list of the audio types can be found in Table 1. Half of each audio type in the speech/nonspeech audio database was used in training the speech/nonspeech classifier, and the remaining half was used as a test set.

The KING speech database is used for training and testing the SI system. The speech/nonspeech portion of the system was trained using the previously mentioned speech/nonspeech audio database. The KING database consists of speech from 51 speakers; each speaker has 10 utterances, with the exception of 2 speakers who have only 5 utterances [23]. Data are sampled at a rate of 8 kHz with 16-bit resolution, and utterance durations range from 30 to 60 seconds; all utterances were shortened to 15 seconds. The first 3 utterances of each speaker were used to form a training set (153 utterances), and the remaining utterances were used to form a test set (347 utterances). For the speech/nonspeech and KING databases, the training sets and test sets are mutually exclusive.

A total of 6 SNR levels (−10, 0, 10, 20, and 30 dB and a clean environment) were used for training, using additive white Gaussian noise to adjust the SNR. A total of 11 SNR levels (−10, −5, 0, 5, 10, 15, 20, 25, 30, and 35 dB and a clean environment) were used for testing; 5 of these SNR levels were different from the training levels. Additive white Gaussian noise and pink noise were used to create the different levels of noise; therefore, while the SNR level may be the same in the training and test sets, the type of noise may differ.
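For reference, white Gaussian noise can be added at a target SNR as in the sketch below (pink noise would additionally be spectrally shaped); this is a standard procedure, not code from the paper.

```python
import numpy as np

def add_white_noise(speech, snr_db):
    """Corrupt a speech vector with white Gaussian noise at the requested SNR."""
    speech_power = np.mean(speech ** 2)
    noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise
```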

3.2. Speech/Nonspeech Classification Results

The speech/nonspeech classifier was tested over an SNR range from −10 dB to a clean environment, using white and pink noise. The average performance of the APR algorithm with pink and white noise is shown in Table 2. The average classification performance with pink noise is similar to the performance with white noise. Also, the classification performance on the KING database is similar to that on the speech portion of the speech/nonspeech audio database. Due to the deterioration of the pitch period, the APR has poor speech classification performance of 32.5% and 28.2% for white noise and pink noise, respectively, at low SNR. The APR maintains an overall accuracy above 93.0% for SNRs above 0 dB.

3.3. Discrete-Valued SNR Estimator Results

The Discrete-Valued SNR Estimator was trained using one randomly selected talker from the KING database. Training involved the 3 utterances in the training set associated with this talker, at the 6 training SNR levels. Discrete-Valued SNR Estimator training and testing are conducted using a feature vector of 19 MFSCs and delta MFSCs. The minimum distance is used to classify test cases to one of the training environments, as described in Section 2.3. Performance of the Discrete-Valued SNR Estimator is evaluated on the KING database test set, using the 11 testing SNR levels with white and pink noise.

Table 3 summarizes the results of the Discrete-Valued SNR Estimator classification. For the clean environment, the test set contains 347 utterances; for the noisy environments, each test utterance appears in both white- and pink-noise versions. The elements of Table 3 are the percentages of the total number of utterances that were classified to each training SNR value.

The Discrete-Valued SNR Estimator output decision is assumed to be correct if the SNR of the selected training environment is the closest to the SNR of the surrounding test environment, or is the SNR of one of the adjacent training environments (these cases are set in bold in Table 3). For example, if the surrounding environment has an SNR lying between the 10 dB and 20 dB training environments, then the output decision is assumed to be correct if the selected training SNR is 10 or 20 dB.

The SNR classification is not required to have high precision, which justifies the inclusion of the adjacent SNR values in the definition of a correct decision. Overall, the Discrete-Valued SNR Estimator achieves an average accuracy of 98.4%.

3.4. Multiengine Selection Results

Six SI engines were trained using the SNR values of the training environments. Figure 7 plots the results of the 6 individual SI engines with two different noise types (white and pink noise) using the KING database. The maximum performance of a single SI engine occurs when the test data from a particular noisy environment are used with the engine trained in that environment. Also, the performance of engines trained in noise environments similar to the test environment is better than that of engines trained in environments that differ greatly.

The SNR estimate from the Discrete-Valued SNR Estimator is utilized to select the best-performing SI engine in the Multiengine Selection method. This method has an average classification accuracy of 88.7% in noisy environments over an SNR range from 10 dB to the clean environment. Compared to the accuracy of the SI engine trained in a clean environment, the improvement in accuracy was as high as 30.6%. The Multiengine Selection also achieved an improvement in accuracy as high as 17% over an SNR range of −10 to 10 dB. Due to inaccuracies in the Discrete-Valued SNR Estimator, the Multiengine Selection accuracy is slightly lower than that of the best individual engine at each SNR level. The Multiengine Selection maintains an accuracy above 75% for SNRs above 10 dB, whereas the clean-environment engine only does so for SNRs above 30 dB.

3.5. Multiengine Weighted Fusion Results

Figure 7 plots the performance of the Multiengine Weighted Fusion of the SI engines using the KING database. The SNR estimate from the Discrete-Valued SNR Estimator is utilized to weight and combine the decisions of the 6 SI engines. Figure 7 shows that the Multiengine Weighted Fusion method has an average classification accuracy of 91.3% in noisy environments over an SNR range from 10 dB to the clean environment. The identification accuracy of the Multiengine Weighted Fusion method outperforms that of the Multiengine Selection method (88.7%) and that of the SI engine trained in a clean environment (58.1%).

Above a moderate SNR, the classification accuracy of the Multiengine Weighted Fusion is superior to that of all individual SI engines. For example, at a test SNR lying between two training SNRs, the engines trained at the neighboring SNRs achieve accuracies of 68% and 87% (the latter for the 20 dB engine), the Multiengine Selection has an accuracy of 81%, and the Multiengine Weighted Fusion has an accuracy of 89%. Naturally, the Multiengine Selection method cannot perform better than all individual engines and, as previously mentioned, is typically slightly worse due to inaccuracies in the Discrete-Valued SNR Estimator.

The Multiengine Weighted Fusion method has higher complexity than the Multiengine Selection method, albeit only slightly higher, associated with the likelihood normalization, the distance-to-probability normalization, and the weighted sum. In return, the Multiengine Weighted Fusion method has higher classification accuracy.

For variable environments with a known set of SNRs, the training environments can be matched with the testing environments. In cases where the training and testing environments are the same (30, 20, 10, 0, and −10 dB and the clean environment), the Multiengine Selection method has nearly the same accuracy as the Multiengine Weighted Fusion method. This makes the Multiengine Selection method perhaps the better solution because of its lower complexity. However, when the testing environments differ from the training environments (−5, 5, 15, 25, and 35 dB), the classification performance enhancement of the Multiengine Weighted Fusion is 7.2% compared to the Multiengine Selection. This suggests that the Multiengine Weighted Fusion is the more generalizable solution.

For high-SNR environments, there appears to be no benefit to a multiengine system; indeed, the SI engine trained at the highest training SNR has accuracy similar to the multiengine systems in that range (Figure 7). However, that engine is clearly not robust, as its performance degrades severely at lower SNRs.

3.6. Speech/Nonspeech Classification with Multiengine SI

The overall adaptive system accuracy (with the SNR estimator and APR) is compared with a nonadaptive system (Figure 8) using the KING database. The main contributor to the accuracy enhancement of the introduced instrument at SNRs above 10 dB is the multiengine SI system (from Table 2, the performance of the APR is nearly perfect for SNRs above 10 dB). The performance curve of the Multiengine Selection, shown in Figure 8, has a large dip at one intermediate SNR. For SNRs below 10 dB, the performance curve could be smoothed by decreasing the SNR spacing between engines.

The classification accuracy using a perfect SNR estimator with the APR and Multiengine Selection is also shown in Figure 8. Using this perfect SNR value, the best-performing SI engine and the APR threshold are selected. Compared to the APR with Multiengine Weighted Fusion using the Discrete-Valued SNR Estimator, the improvement in accuracy is only 1.6% over the SNR range from the clean environment to −10 dB. This result justifies the use of a rough estimate of the SNR value.

4. Conclusion

This paper confirms that the closer the match between the training and testing environments, the higher the SI accuracy. The paper proposed a security instrument that can detect the location and identity of a talker in a noisy environment. To realize this instrument, an adaptive multiengine speaker identification system and a Discrete-Valued SNR Estimator were proposed. The Discrete-Valued SNR Estimator is also utilized to adapt the speech/nonspeech classifier. The proposed instrument is reliable and robust even in variable noise environments.

In the multiengine SI system, six speaker identification engines were utilized to accommodate five noisy environments in addition to the clean environment. The SI system using the Multiengine Selection and Multiengine Weighted Fusion methods achieved average identification accuracies of 88.7% and 91.3%, respectively, over an SNR range from 10 dB to the clean environment; this represents enhancements of 30.6% and 33.2%, respectively, over the engine trained in a clean environment. The Multiengine Selection and the Multiengine Weighted Fusion also achieved improvements in accuracy as high as 17% and 18.5%, respectively, over an SNR range of −10 to 10 dB. While the Multiengine Weighted Fusion has higher accuracy and reliability than the Multiengine Selection, the tradeoff is slightly higher complexity.

In future work, the framework of this system (i.e., the combination of the Discrete-Valued SNR Estimator and a multiengine fusion method) can be deployed in other applications, such as biomedical applications. An example of a biomedical application area that can benefit from the proposed instrument is hearing aid devices. Hearing aids could utilize the proposed instrument to customize their parameters to achieve optimal hearing performance in variable environments. Modern hearing aids allow the user to set their internal equalizer or beamformer variables to predefined settings by selecting the surrounding environment. Deploying the proposed instrument in hearing aids could automate the selection of these predefined settings according to the surrounding environment.

Acknowledgment

The authors would like to acknowledge the funding from CITO, Mitel, NSERC, and OGS.