Voice Activity Detection in Noisy Environments Based on Double-Combined Fourier Transform and Line Fitting

Park, Jinsoo; Kim, Wooil; Han, David K.; Ko, Hanseok

doi:https://doi.org/10.1155/2014/146040

The Scientific World Journal


Research Article | Open Access

Volume 2014 | Article ID 146040 | https://doi.org/10.1155/2014/146040


Jinsoo Park,¹ Wooil Kim,² David K. Han,³ and Hanseok Ko¹,⁴

Academic Editor: Juan Manuel Gorriz Saez

Received 07 Feb 2014

Revised 04 Jul 2014

Accepted 10 Jul 2014

Published 06 Aug 2014

Abstract

A new voice activity detector for noisy environments is proposed. In conventional algorithms, the endpoint of speech is found by applying an edge detection filter that finds the abrupt changing point in a feature domain. However, since the frame energy feature is unstable in noisy environments, it is difficult to accurately find the endpoint of speech. Therefore, a novel feature extraction algorithm based on the double-combined Fourier transform and envelope line fitting is proposed. It is combined with an edge detection filter for effective detection of endpoints. Effectiveness of the proposed algorithm is evaluated and compared to other VAD algorithms using two different databases, which are AURORA 2.0 database and SITEC database. Experimental results show that the proposed algorithm performs well under a variety of noisy conditions.

1. Introduction

The purpose of voice activity detection (VAD), or speech endpoint detection, is to determine the beginning and ending points of a speech signal. VAD is an important step in the signal flow of speech recognition. Accurate VAD can not only improve the accuracy of speech recognition but also reduce its computational complexity [1, 2]. VAD has been studied for decades, and many algorithms have been proposed, such as hidden Markov models [3], information entropy [4], wavelet transform technology [5], and variations of these algorithms.

The signal-to-noise ratio (SNR) has commonly been used as an energy cue for discrimination [6, 7]. Beyond such simple cues, more advanced energy-based features, such as Teager energy [8] and long-term speech information [9], have been derived to enhance the discrimination between speech and nonspeech. Developing a VAD for noisy environments with low SNR or nonstationary noise remains very challenging, and many VAD algorithms have therefore been developed to achieve better performance in real noise environments. Recently, many VAD methods have focused on statistical models to discriminate speech from nonspeech. Most of these models aim to construct classifiers for speech and nonspeech classification. The conventional classifier-based method employs a Gaussian statistical model with discrete Fourier transform (DFT) analysis [10, 11]. Building on this work, multiple observations [12] and multiple statistical models [13] were used to further improve classifier performance. Furthermore, contextual information derived from multiple observations has been incorporated into the likelihood ratio test (LRT) to improve the robustness of VAD in adverse acoustic environments [14]. A VAD based on a multivariate complex Gaussian observation model, together with the definition of an optimal LRT, has also been presented [15]. Besides these classical methods, a few statistical models aim at finding the change points between speech and nonspeech. Speech detection can also be accomplished by localizing a rapidly changing edge point in the features of the activity and inactivity zones. Accordingly, Canny's edge detector, originally used to detect edges in images, has been applied to speech detection [16]. For this purpose, an algorithm based on an edge detection filter and a state transition model is applied to the frame energy normalization feature for speech endpoint detection [17].
In addition, Gaussian mixture models have been applied to model the static harmonic structure information and the long-term temporal information of speech. VAD decisions are then based on the log-likelihood ratios computed from the speech and noise Gaussian mixture models (GMMs) [18]. However, these VAD algorithms suffer from high computational complexity for two specific reasons. First, they assume that speech and noise follow Gaussian distributions in the DFT domain. Second, noise estimation and adaptation algorithms are employed to improve robustness in nonstationary noise environments, at the cost of additional computation. Therefore, this paper proposes a new feature set that is robust in noisy speech environments while keeping complexity low enough for real-time implementation. The VAD based on the new features is further stabilized by applying an edge detection filter. In particular, a combination of the double-combined Fourier transform (DCFT) and subsequent envelope line fitting is proposed as a feature set, such that the pattern of the feature envelope in speech and nonspeech regions becomes more distinguishable, yielding more stable detection results. Experimental evaluations confirm the potential of the proposed algorithm under various real noise environments. The structure of this paper is as follows. In Section 2, the conventional VAD algorithms are reviewed. Section 3 describes the proposed VAD algorithm based on DCFT and feature envelope line fitting. In Section 4, the performance is evaluated through representative experiments. Lastly, Section 5 presents the conclusions of this paper.

2. Review of Conventional Feature Models for VAD

In this section, we review the conventional VAD approaches proposed in [17, 18], which are based on frame energy normalization and on GMM mapping of static harmonic features, respectively. The main contribution of our approach is described in Section 3.

2.1. Frame Energy Feature Based Edge Detection Filter with State Transition Model

In endpoint detection using a simple edge detection filter, the objective is to find the edges, that is, the points at which the feature changes abruptly because of the presence of speech. The filter looks for large changes in frame energy, on the premise that energy increases at the beginning of speech and decreases at its end. The final speech endpoint decision is made by feeding the edge detection filter output to a state transition model [17].

As shown in Figure 1, the edge filter is an odd function, so if the energy feature in a nonspeech region is smooth and nearly constant, the filter output is approximately zero regardless of the feature's magnitude. If the energy feature increases at the beginning of a speech region, the filter output increases. Conversely, if the energy feature decreases at the ending point of a speech region, the filter output decreases.

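The edge detection filter can be defined as

$$h(i) = \begin{cases} f(i), & -W \le i \le 0, \\ -f(-i), & 1 \le i \le W, \end{cases}$$

with

$$f(x) = e^{Ax}\left[K_1 \sin(Ax) + K_2 \cos(Ax)\right] + e^{-Ax}\left[K_3 \sin(Ax) + K_4 \cos(Ax)\right] + K_5 + K_6 e^{sx},$$

where $W$ represents the length of the filter, $i$ is an integer between $-W$ and $W$, and $A$ and $K_1, \ldots, K_6$ are filter parameters. For $W = 7$ and $s = 1$, the filter parameters are provided by [11] as $A = 0.41$ and $[K_1, K_2, K_3, K_4, K_5, K_6] = [1.538, 1.468, -0.078, -0.036, -0.872, -0.56]$. Figure 1 shows an example of the filter response.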

The edge detection filter output $F(t)$ is calculated by applying the frame energy feature $g(\cdot)$ to the edge detection filter $h(i)$ of length $W$ as

$$F(t) = \sum_{i=-W}^{W} h(i)\, g(t + i),$$

where $t$ is the frame number.
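The filtering step can be illustrated with a short sketch. It is not the authors' implementation: the profile `f`, the choices `W = 7` and `S = 1`, the sign convention of `h` (chosen so that a rising energy edge gives a positive output, as the text requires), and the edge padding in `filter_output` are all assumptions made for illustration. Only the values of `A` and the six coefficients are quoted from the text.

```python
import math

# A and the six coefficients are the values quoted in the text.
# W = 7 and S = 1 are assumptions made for this sketch.
A = 0.41
K = [1.538, 1.468, -0.078, -0.036, -0.872, -0.56]
W = 7
S = 1.0


def f(x):
    """Assumed filter profile, evaluated for x in [-W, 0]."""
    return (math.exp(A * x) * (K[0] * math.sin(A * x) + K[1] * math.cos(A * x))
            + math.exp(-A * x) * (K[2] * math.sin(A * x) + K[3] * math.cos(A * x))
            + K[4] + K[5] * math.exp(S * x))


def h(i):
    """Odd-symmetric filter tap for an integer i in [-W, W]."""
    return f(i) if i <= 0 else -f(-i)


def filter_output(energy, t):
    """F(t) = sum over i of h(i) * g(t + i).

    Frames outside the signal repeat the nearest edge value (an assumption).
    """
    last = len(energy) - 1
    return sum(h(i) * energy[min(max(t + i, 0), last)]
               for i in range(-W, W + 1))
```

Because the taps are odd-symmetric, a constant energy track produces an output of essentially zero, while a step up in energy produces a positive peak and a step down a negative one.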

The final decision on the beginning and ending points is made by comparing the filter output $F(t)$ with predetermined thresholds. Because of the sequential nature of the detector and the complexity of the decision procedure, a state transition model is used to make the final decisions.

Figure 2 shows the state transition model used to detect the beginning and ending points of a speech signal. As shown in Figure 2, the three states are silence, in-speech, and leaving-speech. Silence and in-speech represent the nonspeech and speech regions, respectively. Leaving-speech is a state that belongs to the speech region but may turn into a nonspeech region. In the state transition model, the input is the filter output $F(t)$ and the output is the detected frame numbers of the beginning and ending points. Count is a frame counter, $T_L$ (lower threshold) and $T_U$ (upper threshold) are two thresholds with $T_U > T_L$, and Gap is an integer indicating the required number of frames from a detected endpoint to the actual ending of speech. With silence as the starting state, the model stays in the silence state while $F(t) < T_U$. When $F(t) \ge T_U$, a beginning point is detected and the model enters the in-speech state. While in the in-speech state, it moves to the leaving-speech state and sets Count = 0 if $F(t) < T_L$. It stays in the leaving-speech state while Count < Gap and moves to the silence state once Count ≥ Gap, at which point an ending point is detected. The thresholds $T_L$ and $T_U$ can be computed from the filter output values by the least squares (LS) estimation method, which minimizes the squared error between the observation and the threshold. According to the estimated threshold values, $T_U$ and $T_L$ are set to 3.0 and −3.0 by averaging the filter output over 10 sequential frames. The speech endpoint detection procedure is completed by mapping the result of the edge detection filter to an appropriate state transition model [16, 17]. However, the performance of the conventional frame energy based algorithm is degraded in noisy environments.
Figure 3(a) shows the speech signal at SNR of −5 dB in a car-noise environment while Figure 3(b) shows the beginning and ending points of clean speech obtained manually. As shown in Figure 3, the feature of the frame energy in noisy environment does not provide good results, as it is very unstable due to the large value of fluctuations. In Figure 3, it fails to detect the ending point of the speech and yields a slightly incorrect beginning point.
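The state transition logic described above can be sketched as follows. This is a minimal illustration that implements only the transitions stated in the text; the function name `detect_segments`, the default `gap=10`, and the choice to report the ending point as the frame at which the model returns to silence are assumptions.

```python
def detect_segments(filter_out, t_upper=3.0, t_lower=-3.0, gap=10):
    """Return (begin, end) frame pairs from a sequence of edge-filter outputs."""
    state = "silence"
    count = 0
    begin = None
    segments = []
    for t, value in enumerate(filter_out):
        if state == "silence":
            if value >= t_upper:        # rising edge: beginning point detected
                state, begin = "in-speech", t
        elif state == "in-speech":
            if value < t_lower:         # falling edge: speech may be ending
                state, count = "leaving-speech", 0
        else:                           # leaving-speech
            count += 1
            if count >= gap:            # Gap frames elapsed: ending point detected
                segments.append((begin, t))
                state = "silence"
    return segments
```

For example, a filter output with a single peak above 3.0 at frame 5 and a single dip below −3.0 at frame 16 yields one segment that starts at frame 5 and ends `gap` frames after the dip.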


2.2. Static Harmonic Feature Based GMMs

This section outlines the procedure for extracting harmonic features from noisy signals and describes how these features are used to discriminate between speech and nonspeech frames with GMMs. The harmonic structures of speech and background noise are readily distinguishable, and harmonic structure is comparatively robust to noise. Based on this argument, Fukuda et al. [18] extracted a harmonic structure based feature from the middle range of the cepstral coefficients obtained from the discrete cosine transform (DCT) of the power spectral coefficients. The power spectrum of the observed speech is first obtained by the fast Fourier transform (FFT), followed by a logarithm to produce the log power spectrum $Y_k(n)$, where $n$ and $k$ are the frame number and frequency bin index, respectively. Then, a cepstrum $C_i(n)$ is obtained by applying the DCT to the log power spectrum as

$$C_i(n) = \sum_{k=0}^{K-1} Y_k(n) \cos\left(\frac{\pi i \,(k + 1/2)}{K}\right),$$

where $K$ is the length of $Y_k(n)$ and $i$ is the cepstral index. The cepstral coefficients with small and large indexes are liftered out because they represent long and short oscillations. Therefore, the following liftering process is applied to the cepstrum:

$$\hat{C}_i(n) = \begin{cases} \epsilon\, C_i(n), & i < D_L \text{ or } i > D_U, \\ C_i(n), & \text{otherwise}, \end{cases}$$

where $\epsilon$ is a small constant and $D_L$ and $D_U$ are the lower and upper limits of the cepstral indexes corresponding to the range of pitch frequencies in the human voice. The liftered cepstrum is converted back to the log power spectrum by an inverse discrete cosine transform (IDCT), followed by an exponential transform to produce the linear power spectrum $\hat{W}_k(n)$. These coefficients are finally converted to the mel-cepstrum $\hat{M}_b(n)$ by applying a mel-scale filter bank and DCT, where $b$ is the bin number of the harmonic structure based mel-cepstral coefficients. This feature captures the envelope information of the local peaks in the frequency spectrum corresponding to the harmonic information in speech signals.
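The liftering chain (log, DCT, liftering, IDCT, exponential) can be sketched in a few lines. This is an illustration only, not the implementation of [18]: the unnormalized DCT-II and its inverse, the function names, and the parameters `lo`, `hi`, and `eps` are assumptions, and the final mel-scale filter bank and DCT are omitted.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a real sequence."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * i * (k + 0.5) / n) for k in range(n))
            for i in range(n)]


def idct2(c):
    """Inverse of dct2 (a scaled DCT-III)."""
    n = len(c)
    return [(c[0] / 2.0
             + sum(c[i] * math.cos(math.pi * i * (k + 0.5) / n)
                   for i in range(1, n))) * 2.0 / n
            for k in range(n)]


def harmonic_spectrum(power, lo, hi, eps):
    """Keep cepstral indexes lo..hi and scale all others by eps.

    power: strictly positive power spectrum of one frame.
    lo, hi, eps: hypothetical lifter limits and attenuation constant.
    """
    log_spec = [math.log(p) for p in power]
    cep = dct2(log_spec)
    liftered = [c if lo <= i <= hi else eps * c for i, c in enumerate(cep)]
    return [math.exp(v) for v in idct2(liftered)]
```

With `eps = 1` the chain reduces to an identity, which is a convenient check that the transform pair is consistent; smaller values suppress the slow spectral envelope and the fine noise-like detail while keeping the pitch-related ripple.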

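GMM based VAD is a type of statistical model based VAD in which the feature vectors are characterized by a mixture of Gaussian distributions. The decision rule in each frame is obtained by comparing the log-likelihood ratio with a decision threshold $\eta$:

$$\Lambda(n) = \log p(\mathbf{x}_n \mid \lambda_1) - \log p(\mathbf{x}_n \mid \lambda_0) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta,$$

where $p(\mathbf{x}_n \mid \lambda_0)$ and $p(\mathbf{x}_n \mid \lambda_1)$ are the probability density functions of a speech absent frame and a speech present frame at frame $n$ ($\lambda_0$: nonspeech GMM, $\lambda_1$: speech GMM), respectively.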
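A minimal sketch of this decision rule is given below, assuming diagonal-covariance GMMs represented as `(weights, means, variances)` tuples. The representation, the function names, and the default threshold `eta=0.0` are illustrative assumptions rather than details taken from [18].

```python
import math


def gmm_logpdf(x, weights, means, variances):
    """Log density of x under a diagonal-covariance Gaussian mixture."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        logs.append(ll)
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


def is_speech(x, speech_gmm, noise_gmm, eta=0.0):
    """Declare speech when the log-likelihood ratio exceeds eta."""
    llr = gmm_logpdf(x, *speech_gmm) - gmm_logpdf(x, *noise_gmm)
    return llr > eta
```

With a one-dimensional toy pair of models centered at 0 (noise) and 3 (speech), an observation of 3 is labeled speech and an observation of 0 is labeled nonspeech.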

This algorithm extracts and concatenates two kinds of feature vectors (long-term temporal cepstra and harmonic structure information), which together form 26-dimensional feature vectors obtained by applying the FFT, DCT, and IDCT in sequence [18]. However, because it extracts 26-dimensional features and transforms repeatedly between domains, it requires large and complex computations. Our proposed algorithm instead extracts five-dimensional features that are computationally less burdensome while still achieving effective and reliable VAD results, as verified by the CPU time and computational cost comparison in Section 4.

3. The Proposed Line Fitting Model for Feature Set

The proposed feature extraction technique for VAD application is essentially based on DCFT and a line fitting procedure of the feature envelope, which is generally used for classification of noise sources. First, the DCFT output pattern (Section 3.1 and Figure 4) of a noisy speech (speech plus noise) signal and noise signal is obtained. Then, in order to describe the feature envelope using a small number of features, a feature envelope line fitting algorithm (Section 3.2) is developed. Finally, the distance between the input signal and estimated noise is utilized as a measure for segmentation (Section 3.3). In order to detect the activity of speech, the edge detection filter is applied to the distance measure over time.


3.1. Double-Combined Fourier Transform

Let $x(n)$ be the noisy speech signal at time index $n$. The signal is divided into overlapping segments of 256 samples with a shift of 128 samples. Each segment is windowed with a Hamming window and then transformed via FFT. A frequency-domain representation of the time-domain signal is obtained by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-\mathrm{i} 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1,$$

where $k$ and $N$ are the frequency bin index and FFT point size, respectively, and $|X(k)|$ is the magnitude of the 1st FFT at index $k$. To proceed to the next FFT, the magnitude of the 1st FFT is treated as a time-domain signal and used as the input to the 2nd FFT. Hence, the 2nd FFT, $D(j)$, is obtained by taking the FFT of the magnitude of the 1st FFT, $|X(k)|$, as

$$D(j) = \sum_{k=0}^{N-1} |X(k)|\, e^{-\mathrm{i} 2\pi j k / N},$$

where $j$ is the DCFT (i.e., 2nd FFT) bin index and 256 FFT points are used for all FFT steps. This transform is well suited for analyzing harmonic components to compute the fundamental frequency of the signal. When a harmonic signal is observed, its Fourier transform has a series of peaks in its spectrum magnitude corresponding to the harmonics of the signal. Figure 4(a) shows the magnitude of the 1st FFT of a noisy speech signal and that of a noise signal alone. It is difficult to distinguish between the two signals. Because the speech region contains harmonic components that fluctuate periodically across frequency, performing the 2nd FFT on the result of Figure 4(a) allows the two signals to be better distinguished.
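The two-stage transform can be sketched as follows. To keep the example dependency-free it uses a naive DFT from the standard library; in practice a library FFT would normally replace `dft_magnitude`. The function names and the symmetric Hamming window definition are assumptions made for illustration.

```python
import cmath
import math


def hamming(frame):
    """Apply a symmetric Hamming window to one frame."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n - 1)))
            for m, s in enumerate(frame)]


def dft_magnitude(x):
    """Magnitude of the DFT of x (naive O(N^2) evaluation)."""
    n = len(x)
    return [abs(sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m in range(n)))
            for k in range(n)]


def dcft(frame):
    """Double-combined Fourier transform magnitude of one frame:
    window, 1st FFT magnitude, then magnitude of the FFT of that magnitude."""
    first = dft_magnitude(hamming(frame))
    return dft_magnitude(first)
```

For a 256-sample frame containing harmonics spaced 16 bins apart, the first-stage magnitude is nearly periodic in frequency with period 16, so the second stage concentrates energy at bins that are multiples of 256/16 = 16. This is the periodicity cue that separates harmonic speech from noise.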

3.2. Line Fitting Feature Extraction

In this paper, the difference between the spectrum envelope pattern of the noise signal over the nonspeech region and that of the noisy speech signal over the speech region is exploited for information extraction. The resulting features are effective for classifying the spectrum envelope patterns of various noise types, and the algorithm is therefore expected to apply effectively to VAD. Five features are extracted from the DCFT output by constructing two logarithmic curves as the best fit to the harmonic structure of the DCFT. The five extracted features are the mean index and the coefficients used to line-fit the envelope. The proposed line fitting procedure is expected to separate the speech cluster from noise effectively by capturing the essence of the speech signal characteristics. Line fitting over the DCFT (i.e., the 2nd FFT, representing spectral dynamics) is essentially an envelope modulation of spectral dynamics that conveys a statistically significant amount of information. As observed in many speech samples in noise, the first line fitting (i.e., the first slope, over the lower frequency bins of the DCFT) captures the low frequency energy bins, whereas the second line fitting (i.e., the second slope, over the higher frequency bins of the DCFT) captures the high frequency energy bins.

Figure 5 which is plotted on logarithmic frequency scale shows the lines that are fit by the magnitude of the DCFT obtained from the noise-only signal and the noisy signal of Figure 4(b), respectively. As shown in Figure 5, it is also observed in the DCFT bins that while the first slope of the noisy speech features overlaps quite well over the noise-only features, the second slope clearly shows a prominent difference as the noisy speech features display distinguishable harmonics above the floor made by noise-only features.


Figure 6 shows the line fitting to the low frequency and high frequency slopes of the magnitude of the DCFT for the noisy speech signal. Figure 6(a) was plotted on a logarithmic frequency scale while (b) was plotted on a linear frequency scale. As shown in Figure 6, the segmentation between the first line fitting and second line fitting is made by taking the mean index of the horizontal axis (e.g., DCFT bin index). It reflects the center of gravity of the DCFT energy bins, thus providing a solid dividing line between low frequency energy bins and high frequency energy bins. Rather than taking the full spectrum of the DCFT energy bins, the envelope modulation modeled by two slopes and mean index works quite well in successfully capturing the essence of critical information. As a result, the combined feature set (e.g., two line fittings in the form of slopes and mean index as separating point) can be used to effectively classify the speech over noise hypothesis.


The DCFT energy bins present a spectrum magnitude that can be described by determining the region in which the spectral power is concentrated and whether the spectrum is broadband or narrowband. This is accomplished by computing the first moment of the spectrum on a logarithmic frequency scale to give the mean frequency bin index, which indicates the bin location around which the majority of the signal is contained. The mean frequency bin index of a spectrum is calculated as the sum of the product of the spectrum magnitude and the frequency bin index, divided by the total sum of the spectrum magnitude:

$$\bar{j} = \frac{\sum_{j=1}^{J} j\, |D(j)|}{\sum_{j=1}^{J} |D(j)|},$$

where $\bar{j}$ is the mean index, $|D(j)|$ is the magnitude of the DCFT at index $j$, and $J$ is the size of the DCFT.
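As a small worked illustration of the verbal definition above (the sum of magnitude times bin index, divided by the sum of magnitudes), the following sketch computes the mean index with bins numbered from 1. The function name and the 1-based numbering are assumptions made for the example.

```python
def mean_index(mag):
    """Magnitude-weighted mean bin index; mag[0] is taken as bin 1."""
    total = sum(mag)
    return sum(j * m for j, m in enumerate(mag, start=1)) / total
```

A flat three-bin magnitude gives a mean index of 2 (the center), while a magnitude concentrated entirely in the third bin gives a mean index of 3.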

The spectrum is to be modeled by a pair of line fitting segments on a logarithmic frequency scale. The slopes of both the low and high indices are determined by fitting the line to the log response. One line fitting of the low index slope describes the envelope below the mean index, and the other line fitting of the high index slope describes the envelope above the mean index. Each line fitting model is given by where is a constant and gives the slope. For the low index region, the mean-square error between the magnitude and the line fitting (first slope) to be minimized is given by where is the DCFT bin index just below the mean index and the factor is due to the logarithm frequency scale and degree of freedom used for the mean-square error.

The solution to the low index slope linear regression coefficients is given by

For the low index line fitting, index of the summation goes from 1 to and for the high index line fitting, index of the summation goes from to . These modifications give, for the high index slope linear regression coefficients ,

Finally, the mean index, the two coefficients used for the envelope line fitting of the low indices, and the other two coefficients used for the envelope line fitting of the high indices form the five-dimensional intermediate features.
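One way to assemble such a five-dimensional feature vector is sketched below. It should be read as an assumption-laden illustration rather than the authors' regression: the fit is a weighted least-squares line of log10 magnitude against log10 bin index, the 1/j weighting is an assumed stand-in for the logarithmic-scale factor mentioned in the text, and the rule that keeps at least two bins on each side of the split is added only so that both fits are well defined.

```python
import math


def fit_line(mag, j_start, j_end):
    """Weighted least-squares fit of log10|D(j)| ~ a + b * log10(j)
    over bins j_start..j_end (1-based). Returns (a, b)."""
    bins = range(j_start, j_end + 1)
    xs = [math.log10(j) for j in bins]
    ys = [math.log10(max(mag[j - 1], 1e-12)) for j in bins]
    ws = [1.0 / j for j in bins]          # assumed log-scale weighting
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def line_fit_features(mag):
    """[mean index, low intercept, low slope, high intercept, high slope]."""
    total = sum(mag)
    mean_idx = sum(j * m for j, m in enumerate(mag, start=1)) / total
    split = max(2, min(len(mag) - 2, int(mean_idx)))  # >= 2 bins per side
    a_lo, b_lo = fit_line(mag, 1, split)
    a_hi, b_hi = fit_line(mag, split + 1, len(mag))
    return [mean_idx, a_lo, b_lo, a_hi, b_hi]
```

For a magnitude that follows an exact power law, such as $|D(j)| = j^{-2}$, both fitted slopes come out as −2 and both intercepts as 0, which provides a simple sanity check of the regression.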

3.3. Distance Measure for Segmentation

It is assumed that the first 10 frames (160 ms) contain only noise and that the noise signal can be estimated by averaging over them. A five-dimensional feature set representing the noise signal is then obtained as a reference feature. From the next time point onward, a five-dimensional feature set is extracted frame by frame from the input signal. If the input signal is noise, the input feature will be similar to the reference feature. Conversely, if the current frame contains speech, the input feature will differ from the reference feature. With this motivation, a distance is proposed as a measure for segmentation. The distance at the $t$th frame is the Euclidean distance

$$d(t) = \sqrt{\sum_{p=1}^{P} \left( f_t(p) - \hat{f}_N(p) \right)^2},$$

where $P$ is the number of dimensions of the feature vector and $f_t(p)$ and $\hat{f}_N(p)$ represent the line fitting features of the input signal and the estimated noise, respectively.
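The distance track can be sketched as below. As a simplification, this example forms the noise reference by averaging the feature vectors of the initial frames, whereas the text averages the noise signal first and then extracts the reference feature; the function names and the `n_init` parameter are assumptions.

```python
import math


def noise_reference(features, n_init=10):
    """Average the feature vectors of the first n_init frames,
    which are assumed to contain noise only."""
    dims = len(features[0])
    return [sum(f[d] for f in features[:n_init]) / n_init for d in range(dims)]


def distance_track(features, n_init=10):
    """Euclidean distance of every frame's feature vector to the noise reference."""
    ref = noise_reference(features, n_init)
    return [math.sqrt(sum((x - r) ** 2 for x, r in zip(f, ref)))
            for f in features]
```

The resulting sequence plays the role of the frame energy in the conventional scheme: it stays near zero in noise-only frames and rises when speech is present, so an edge detection filter can be applied to it over time.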

Figures 7, 8, 9, and 10 show the detection results obtained when the edge detection filter is applied to the distance measure of speech data at −5 dB SNR in the car-, exhibition hall-, babble-, and restaurant-noise environments, respectively, from the AURORA 2.0 database. Figure 11 shows a further result for noisy speech collected under high speed highway driving (−8 dB SNR on average), a case that is relevant to the overall VAD performance comparison. This result indicates that the proposed VAD algorithm also performs stably in a real noise environment of high noise intensity.


4. VAD Performance Experiments

4.1. Experimental Setup

In this section, the performance of the proposed VAD algorithm is evaluated on the experimental data. Two well-known standard databases, AURORA 2.0 database and SITEC (Speech Information Technology & Industry Promotion Center) database, are employed for evaluations.

AURORA 2.0 database is a widely used training and testing standard database for speech recognition. Hence, it is appropriate as the database for evaluating VAD performance. It contains clean speech signal of English connected digits, and noisy speech signals have been recorded at different places representing both stationary noise (exhibition hall, car) and nonstationary noise (babble, restaurant) environments with varying SNRs including −5, 0, 5, 10, 15, and 20 dB. A total of 1,001 sentences are included in each SNR.

SITEC (Car03) is a database containing noisy speech collected from a microphone attached to the center of the sun-visor of a car at low speed (40–60 km/h) driving on city street and at high speed (70–90 km/h) driving on highway [20, 21]. The speech data were downsampled to 8 kHz for the experiments. 200 utterances, including a total of 400 Korean words collected at both low speed and high speed driving conditions, respectively, were used. In order to measure the intensity of the background noise for low speed driving and high speed driving, the SNR corresponding to each noise environment was obtained by means of a spectral subtraction procedure [19]. The obtained SNR shows −3 dB and −8 dB for low speed and high speed driving, respectively. The frame length for feature extraction is 256 samples (32 ms) while frame moving distance is 128 samples (16 ms).

4.2. Utterance Based Speech Segment Detection Test

First, we discuss the effectiveness of the proposed VAD in terms of detecting speech segments correctly. In the experiment, a connected digit string is treated as an utterance in the AURORA 2.0 database. The beginning and ending points of each clean speech utterance were labeled manually in units of time. This manual labeling is used as the reference for determining the accuracy of the points detected by the proposed algorithm. For the performance evaluation, the detection results of the conventional and proposed algorithms were compared to this ground truth reference. The following metrics are defined to evaluate the performance of the proposed algorithm.

(i) The probability of correctly detecting speech segments, computed as the ratio of the number of correctly detected utterances to the total number of test utterances for each environment.

(ii) The probability of falsely detecting speech segments, computed as the ratio of the number of falsely detected utterances to the total number of test utterances for each environment.

A detection is counted as correct when the speech region is not damaged, which includes cases where the detected beginning or ending point is identical or close to the manual result. In other words, the detected segment (detected beginning and ending points) is counted as a correctly detected utterance if it matches the manual result or falls within the margins. The margin was set to about 0.08 seconds ahead of the manually detected beginning point and behind the ending point of the utterance. Conversely, if the detected beginning point lies after the manually detected beginning point, or the detected ending point lies before the manually obtained ending point, the result is counted as a false detection. A detected beginning or ending point that lies beyond the margins is likewise counted as a false detection.
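The scoring rule can be expressed compactly. The sketch below is one reading of the rule stated above, with times in seconds; the function names and the exact treatment of boundary cases are assumptions.

```python
def is_correct_detection(det_begin, det_end, ref_begin, ref_end, margin=0.08):
    """True when the detected segment covers the reference segment and
    extends beyond it by no more than the margin on either side."""
    begin_ok = ref_begin - margin <= det_begin <= ref_begin
    end_ok = ref_end <= det_end <= ref_end + margin
    return begin_ok and end_ok


def detection_probability(flags):
    """Ratio of correctly detected utterances to all test utterances."""
    return sum(flags) / len(flags)
```

For a reference utterance spanning 1.0 s to 2.0 s, a detection of 0.95 s to 2.05 s is correct, a detection starting at 1.02 s is false because it clips the speech onset, and a detection starting at 0.5 s is false because it lies beyond the margin.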

The performance of the proposed algorithm was compared to that of the most prominent conventional algorithms (e.g., Sohn et al. [10], Ramírez et al. [14], Górriz et al. [15], Li et al. [17], and Fukuda et al. [18]). Note that the most recent conventional algorithms were selected among the edge detection based filtering procedures for performance comparison. Additionally, our algorithm was compared with the recent harmonic, long-term speech information and statistical based algorithms as well.

Table 1 shows the experimental results in terms of the correct detection probability over various noise types and SNRs. In the experiments, the high-SNR condition includes the clean, 20, 15, and 10 dB signals, while the low-SNR condition contains the 5, 0, and −5 dB signals. As shown in Table 1, the average correct detection probability of the proposed algorithm is higher than those of the conventional algorithms. This result demonstrates that the proposed feature set, which exploits the difference between the spectrum envelope pattern of the noise signal over the nonspeech region and that of the noisy speech signal over the speech region, is robust against background noise. The proposed algorithm shows an average improvement in correct detection probability of 5.9% over Sohn et al. [10], 0.9% over Ramírez et al. [14], 0.2% over Górriz et al. [15], 14.1% over Li et al. [17], and 2.7% over Fukuda et al. [18] across all the representative real noise environments considered. Li et al. [17] has the lowest results among the considered algorithms across all noise environments. Furthermore, Li et al. [17] and Sohn et al. [10] suffered poor speech segment detection performance in low SNR conditions. While Ramírez et al. [14] and Górriz et al. [15] produced slightly better performance in babble and restaurant noise, they performed worse than the proposed algorithm in the exhibition hall- and car-noise environments. Detecting or recognizing only the target speaker's voice is difficult in nonstationary noise environments such as babble or restaurant noise, which contain other people's voices as noise sources. In this situation, combining VAD with a beamforming algorithm using a microphone array, or with a blind source separation algorithm using nonnegative matrix factorization (NMF), seems promising.

4.3. Frame Based Speech and Nonspeech Discrimination Test

Second, the proposed VAD was evaluated in terms of its ability to discriminate between speech and nonspeech regions at different SNR levels. Again, the performance was measured using the AURORA 2.0 database. The beginning and ending points of each clean speech utterance were obtained manually in units of frames. To evaluate the performance of the proposed VAD algorithm, the experimental results were analyzed using two metrics, the nonspeech hit rate (HR0) and the speech hit rate (HR1).

(i) HR0 is computed as the ratio of the number of correctly detected nonspeech frames to the number of real nonspeech frames.

(ii) HR1 is computed as the ratio of the number of correctly detected speech frames to the number of real speech frames.
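Both hit rates follow directly from per-frame labels, as in the sketch below. The label convention (1 for speech, 0 for nonspeech) and the function name are assumptions made for illustration.

```python
def hit_rates(reference, decision):
    """Return (HR0, HR1) in percent from per-frame labels.

    reference, decision: equal-length sequences, 1 = speech, 0 = nonspeech.
    """
    n0 = sum(1 for r in reference if r == 0)
    n1 = sum(1 for r in reference if r == 1)
    hit0 = sum(1 for r, d in zip(reference, decision) if r == 0 and d == 0)
    hit1 = sum(1 for r, d in zip(reference, decision) if r == 1 and d == 1)
    return 100.0 * hit0 / n0, 100.0 * hit1 / n1
```

With four nonspeech frames of which three are detected correctly, and four speech frames of which two are detected correctly, the function returns HR0 = 75% and HR1 = 50%.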

Because there is always a trade-off between these two metrics, the average of HR0 and HR1 is used as the metric for performance comparison. Table 2 compares the performance of the proposed VAD with the conventional algorithms mentioned in Section 4.2 for the clean, 20, 15, 10, 5, 0, and −5 dB conditions. The results are hit rates averaged over the four types of noise considered in the AURORA 2.0 database. In Table 2, the proposed algorithm achieves performance similar to that of Górriz et al. [15] in detecting speech and nonspeech regions. However, the proposed algorithm achieves better performance in detecting speech, with an average value of 75.8%. The proposed algorithm also shows an improvement in HR0 compared to the conventional ones.

An additional test was conducted to compare the speech and nonspeech detection performance in a real driving environment by means of receiver operating characteristic (ROC) curves, which are used to describe the VAD error rate completely. The experiments were conducted using the SITEC Car03 database of Korean speech recorded in a moving car [20, 21]. Because the SITEC database used in the experiment does not contain clean speech, manual labeling was difficult. Therefore, the beginning and ending points were obtained manually in units of frames after estimating the clean speech from the noisy speech with the spectral subtraction algorithm [19], for more accurate speech detection. This manual labeling is used as the reference for determining the accuracy of the points detected by the proposed algorithm. The HR0 and the false alarm rate (FAR = 100 − HR1) were determined in each noise condition. Figure 12 shows the ROC curves of the proposed and conventional algorithms for high speed driving on a highway (−8 dB SNR, 70–90 km/h). The results show improvements in detection accuracy over representative VAD algorithms. Among all the VADs tested, our algorithm yielded the lowest FAR for a fixed HR0 and the highest HR0 for a given FAR. As the experimental results show, the proposed combination of DCFT and line fitting yields more robust features than the conventional algorithms, especially in heavy noise under driving conditions.

4.4. Computational Load and Robustness of the Proposed Feature Set

Another important performance measure is computational load, and an additional advantage of the proposed algorithm is its low computational complexity. The computational complexity of the proposed algorithm was compared with that of the conventional algorithm with the most comparable performance (Górriz et al. [15]). The computational load of the two algorithms was evaluated separately in terms of CPU time and computational cost. The CPU time was measured in MATLAB on a 2.9 GHz processor, and the computational cost was counted from the arithmetic operations with addition = 1, multiplication = 1, division = 5, and exponential = 10, for a noisy speech signal. The CPU time and computational cost of the proposed algorithm are compared with those of the conventional algorithm in Table 3. As Table 3 indicates, the conventional algorithm required a longer CPU time and a higher computational cost than the proposed approach. This shows that the proposed low-dimensional feature set is computationally less burdensome while achieving effective and reliable segmentation results.
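The weighted operation count can be illustrated with a few lines. The operation weights are those stated above; the function name and the example operation counts are hypothetical and do not correspond to the figures in Table 3.

```python
# Operation weights as stated in the text.
OP_WEIGHTS = {"add": 1, "mul": 1, "div": 5, "exp": 10}


def computational_cost(op_counts):
    """Weighted sum of arithmetic operation counts."""
    return sum(OP_WEIGHTS[op] * n for op, n in op_counts.items())
```

For a hypothetical routine that performs 100 additions, 50 multiplications, 2 divisions, and 1 exponential, the cost is 100 + 50 + 10 + 10 = 170.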

To validate the proposition set out in Section 3.2, an additional set of experiments was conducted. In particular, a model clustering analysis investigated whether the proposed feature extraction algorithm produces distinct segments. The proposed features were extracted from three sets of acoustic data (noise-only, noisy speech, and speech-only), yielding three clusters.

Table 4 shows the experimental results, which indicate that, at various SNRs, the Euclidean distance between the noise-only and noisy speech clusters is similar to that between the noise-only and speech-only clusters. As the SNR decreases from 20 to −5 dB, this similarity tends to decrease, but the overall error remains small. This supports the claim that a set of five-dimensional features representing envelope modulation is effective for separating noisy speech from noise-only regions.
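The comparison of cluster distances can be illustrated with a short sketch. It assumes that each cluster is summarized by its centroid (the mean feature vector) and that the inter-cluster distance is the Euclidean distance between centroids; the text does not spell out how the distance is computed, so this is only one plausible reading. The five-dimensional feature values below are made up for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical five-dimensional features for the three data sets.
noise_only = [[0.0, 0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0, 0.0]]
noisy_speech = [[3.0, 4.0, 0.0, 0.0, 0.0], [5.0, 4.0, 0.0, 0.0, 0.0]]
speech_only = [[5.0, 2.0, 0.0, 0.0, 0.0], [5.0, 4.0, 0.0, 0.0, 0.0]]

c_noise = centroid(noise_only)
d_noisy = euclidean(c_noise, centroid(noisy_speech))
d_clean = euclidean(c_noise, centroid(speech_only))
print("noise-only vs noisy speech:", d_noisy)
print("noise-only vs speech-only: ", d_clean)
```

If the two distances stay close as the SNR drops, the feature places noisy speech about as far from noise as it places clean speech, which is the behavior the analysis above looks for.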

5. Conclusions

This paper proposed a new VAD technique to improve automatic speech recognition in noisy environments. The proposed feature extraction technique, based on the DCFT and line fitting algorithm, was shown to be efficient yet reliable. In the final step, the Euclidean distance was obtained as a measure for segmentation, and an edge detection filter was subsequently applied. Representative detection experiments confirmed that the proposed algorithm is superior to the conventional algorithms by an average of 4.5% and 11.2% in terms of correct detection probability and hit rate, respectively. In addition, the proposed algorithm is superior in computational load and CPU time to its most comparably performing conventional algorithm. Based on the analysis and results, the proposed low-complexity feature set is attractive and feasible for real-time implementation of VADs.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by Seoul R&BD Program (WR080951).

References

[1] J. Beh, R. H. Baran, and H. Ko, “Dual channel based speech enhancement using novelty filter for robust speech recognition in automobile environment,” IEEE Transactions on Consumer Electronics, vol. 52, no. 2, pp. 583–589, 2006.
[2] J. Beh and H. Ko, “Spectral subtraction using spectral harmonics for robust speech recognition in car environments,” in Computational Science, vol. 2660 of Lecture Notes in Computer Science, pp. 1109–1116, 2003.
[3] J. G. Wilpon and L. R. Rabiner, “Application of hidden Markov models to automatic speech endpoint detection,” Computer Speech and Language, vol. 2, no. 3-4, pp. 321–341, 1987.
[4] B. Wu and K. Wang, “Robust endpoint detection algorithm based on the adaptive band-partitioning spectral entropy in adverse environments,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 762–774, 2005.
[5] N. S. A. Kadel and A. M. Refat, “End points detection for noisy speech using a wavelet based algorithm,” in Proceedings of the 16th National Radio Science Conference (NRSC ’99), pp. C18/1–C18/5, February 1999.
[6] “Voice Activity Detector (VAD) for Adaptive Multi-Rate (AMR) Speech Traffic Channels,” ETSI EN 301 708 Recommendation, ETSI, 1999.
[7] “Speech processing, transmission and quality aspects (STQ), Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm,” ETSI ES 202 050 Recommendation, ETSI, 2002.
[8] M. Bahoura and J. Rouat, “Wavelet speech enhancement based on the Teager energy operator,” IEEE Signal Processing Letters, vol. 8, no. 1, pp. 10–12, 2001.
[9] J. Ramírez, J. C. Segura, C. Benítez, Á. de la Torre, and A. Rubio, “Efficient voice activity detection algorithms using long-term speech information,” Speech Communication, vol. 42, no. 3-4, pp. 271–287, 2004.
[10] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.
[11] J. M. Górriz, J. Ramírez, E. W. Lang, and C. G. Puntonet, “Jointly gaussian pdf-based likelihood ratio test for voice activity detection,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 8, pp. 1565–1578, 2008.
[12] J. Ramírez, J. C. Segura, C. Benítez, L. García, and A. Rubio, “Statistical voice activity detection using a multiple observation likelihood ratio test,” IEEE Signal Processing Letters, vol. 12, no. 10, pp. 689–692, 2005.
[13] J. Chang, N. S. Kim, and S. K. Mitra, “Voice activity detection based on multiple statistical models,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 1965–1976, 2006.
[14] J. Ramírez, J. C. Segura, J. M. Górriz, and L. García, “Improved voice activity detection using contextual multiple hypothesis testing for robust speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2177–2189, 2007.
[15] J. M. Górriz, J. Ramírez, E. W. Lang, C. G. Puntonet, and I. Turias, “Improved likelihood ratio test based voice activity detector applied to speech recognition,” Speech Communication, vol. 52, no. 7-8, pp. 664–677, 2010.
[16] Q. Li and A. Tsai, “A matched filter approach to endpoint detection for robust speaker verification,” in Proceedings of the IEEE Workshop on Automatic Identification, Summit, NJ, USA, October 1999.
[17] Q. Li, J. Zheng, A. Tsai, and Q. Zhou, “Robust endpoint detection and energy normalization for real-time speech and speaker recognition,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 3, pp. 146–157, 2002.
[18] T. Fukuda, O. Ichikawa, and M. Nishimura, “Long-term spectro-temporal and static harmonic features for voice activity detection,” IEEE Journal on Selected Topics in Signal Processing, vol. 4, no. 5, pp. 834–844, 2010.
[19] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[20] Y. Lee, B. Kim, Y. Kim, D. Choi, K. Lee, and Y. Um, “Creation and assessment of Korean speech and noise DB in car environment,” in Proceedings of the International Conference on Language Resources and Evaluation, pp. 1403–1406, Lisbon, Portugal, May 2004.
[21] Y. Lee, B. Kim, and Y. Um, “Speech information technology & industry promotion center in Korea: activities and directions,” in Proceedings of the International Conference on Language Resources and Evaluation, pp. 1851–1854, 2002.

Copyright

Copyright © 2014 Jinsoo Park et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
