Performance of speech recognition systems strongly degrades in the presence of background noise, like the driving noise inside a car. In contrast to existing works, we aim to improve noise robustness focusing on all major levels of speech recognition: feature extraction, feature enhancement, speech modelling, and training. Thereby, we give an overview of promising auditory modelling concepts, speech enhancement techniques, training strategies, and model architectures, which are implemented in an in-car digit and spelling recognition task considering noises produced by various car types and driving conditions. We show that joint speech and noise modelling with a Switching Linear Dynamic Model (SLDM) outperforms speech enhancement techniques like Histogram Equalisation (HEQ) with a mean relative error reduction of 52.7% over various noise types and levels. Embedding a Switching Linear Dynamical System (SLDS) into a Switching Autoregressive Hidden Markov Model (SAR-HMM) prevails for speech disturbed by additive white Gaussian noise.
1. Introduction
The automatic recognition of speech, enabling a natural and
easy to use method of communication between human and machine, is an active
area of research as it still suffers from limitations such as the restricted
applicability whenever human speech is superposed with background noise
[1–3]. Since the interior of a
car is a popular field of application for speech recognisers, allowing
hands-free operation of the centre console or text messaging, the car noises
produced during driving are of great interest when designing a noise robust
speech recognition system [4, 5].
To enhance recognition performance in noisy
surroundings, different stages of the recognition process have to be optimised.
As a first step, filtering or spectral subtraction can be applied to improve
the signal before speech features are extracted. Well-known examples for such
approaches are applied in the advanced front-end feature extraction (AFE) or
Unsupervised Spectral Subtraction (USS). Then, suitable patterns for auditory
modelling have to be extracted from the speech signal to allow a reliable
distinction between the phonemes or word classes in the vocabulary of the
recogniser. Apart from widely used features like Mel-frequency cepstral
coefficients (MFCCs), the extraction of Perceptual Linear Prediction (PLP)
coefficients is an effective method of speech representation [6].
The third stage is the enhancement of the obtained
features to remove the effects of noise. Normalisation methods like Cepstral
Mean Subtraction (CMS) [7], Mean and Variance Normalisation
(MVN) [8], or Histogram Equalisation (HEQ) [9] are
techniques to reduce distortions of the frequency domain representation of
speech. Alternatively, model-based feature enhancement approaches can be
applied to compensate the effects of background noise. Using a Switching Linear
Dynamic Model (SLDM) to capture the dynamic behaviour of speech and another
Linear Dynamic Model (LDM) to describe additive noise is the strategy of the
joint speech and noise modelling concept in [10] which aims to estimate the clean
speech features of the noisy signal.
The derivation of speech models can be considered as
the next stage in the design of a speech recogniser. Hidden Markov Models
(HMMs) [11]
are commonly used for speech modelling whereas numerous alternatives, like
Hidden Conditional Random Fields (HCRFs) [12], Switching Autoregressive
Hidden Markov Models (SAR-HMMs) [13], or other more general Dynamic
Bayesian Network structures have been developed in recent years. Extending the
SAR-HMM to an Autoregressive Switching Linear Dynamical System (AR-SLDS), as in
[14], includes an explicit noise model and leads to an increased
noise robustness compared to the SAR-HMM.
Speech models can be adapted to noisy conditions when
the training of the recogniser is conducted using noisy training material.
Since the noise conditions during the test phase of the recogniser are not
known a priori, equal properties of the noises for training and testing hardly
occur in reality. However, if the recogniser is designed for a certain
field of application, such as in-car speech recognition, the approximate noise
conditions are known to a certain extent, for example, when using information
about the current speed of the car. Therefore, the speech models can be trained
using speech sequences corrupted by noise whose properties are similar to those of the
noise during testing.
In this article, the most promising approaches to increase
recognition performance in noisy surroundings are implemented in an isolated
digit and spelling recognition task. All denoising techniques applied in the
experimental section are introduced in Sections 3 to 5. They range from simple and
efficient methods such as CMS, MVN, and HEQ to more complex approaches like AFE, USS,
and SLDM feature enhancement, and also include novel noise robust model architectures
such as the HCRF or the AR-SLDS. While it is
impossible to take into account and implement all noise compensation techniques
that were developed in recent years, the selection of methods in this work
covers many of the different concepts that are conceivable for in-car, but also
for babble and white noise scenarios, with all their specific advantages and
disadvantages. Since we aim to focus on in-car speech recognition, noises
produced by four different cars and three different road surfaces and
velocities have been recorded and superposed with the speech sequences to
simulate the noise conditions during driving. However, the findings may be
transferred to many similar stationary noise situations.
Section 2 briefly outlines possible approaches to
enhance the noise robustness of speech recognisers. In Section 3, an
explanation of the different speech signal preprocessing techniques applied in
this article is given, while Section 4 focuses on the feature enhancement
strategies we used. Section 5 describes the speech model architectures which are
used as alternatives to Hidden Markov Models in some of the experiments of
Section 6.
2. Concepts for Noise Robust Speech Recognition
Aiming to
counter the performance degradation of speech recognition systems in noisy
surroundings, a variety of different concepts have been developed in recent
years. The common goal of all noise compensation strategies is to minimise the
mismatch between training and recognition conditions, which occurs whenever the
speech signal is distorted by noise. Consequently, two main methods can be
distinguished. One is to reduce the mismatch by focusing on adapting the
acoustic models to noisy conditions in order to enable a proper representation
of speech even if the signal is corrupted by noise. This can be achieved either
by using noisy training data [15] or by joint speech and noise modelling [14]. The
other method is trying to determine the clean features from the noisy speech
sequence while using clean training data [9, 16, 17]. For
that purpose, it is necessary to extract noise robust features and to find
appropriate means of signal or feature preprocessing for speech enhancement.
This section summarises selected methods for speech
signal preprocessing, auditory modelling, feature enhancement, speech
modelling, and model adaptation.
2.1. Speech Signal Preprocessing
Preprocessing
techniques for speech enhancement aim to compensate the effects of noise before
the signal or rather the feature-based speech representation is classified by
the recogniser which has been trained on clean data [18–20].
A state-of-the-art speech signal preprocessing technique that is
used as a baseline feature extraction algorithm for noisy speech recognition
problems like the Aurora2 task [21] is the advanced front-end
feature extraction introduced in [22]. It uses a two-step Wiener
filtering technique before the features are extracted, where filtering is
done in the time domain.
As shown in [23, 24], methods based on spectral subtraction like
Unsupervised Spectral Subtraction [17] reach similar performance while
requiring less computational cost than Wiener filtering. Like the two-step
Wiener filtering method included in the AFE, Unsupervised Spectral Subtraction
can be considered as a speech signal preprocessing step; however, USS is carried
out in the magnitude spectrogram domain.
2.2. Auditory Modelling and Feature Extraction
The two major
effects that noise has on speech representation are a distortion in the feature
space and a loss of information caused by its random behaviour. This loss has
to be considered as irreversible, whereas the distortion of the features can be
compensated depending on the suitability of the speech representation in noisy
environments [1, 4].
Widely used speech features for auditory modelling are
cepstral coefficients obtained through Linear Predictive Coding (LPC). The
principle is based on the assumption that the speech signal can be regarded as
the output of an all-pole linear filter that simulates the human vocal tract.
However, speech recognition systems which process the cepstrum calculated via
LPC tend to have low performance in the presence of noise [2]. For enhanced
noise robustness, the use of the Perceptual Linear Prediction analysis method
is a popular approach to extract spectral patterns [6, 25]. The
technique is based on a transformation of the speech spectrum to the auditory
spectrum that considers multiple perceptual relationships prior to performing
linear prediction analysis. Another well-known speech representation is the
extraction of Mel-frequency cepstral coefficients which provide a basis for
several speech signal analysis applications [17, 26–28]. They are calculated from the
logarithm of filterbank amplitudes using the Discrete Cosine Transform.
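The last step can be illustrated with a minimal sketch in plain Python. It assumes an HTK-style DCT-II convention for indexing and scaling, which the text above does not specify; the function name `mfcc_from_filterbank` and the example inputs are purely illustrative.

```python
import math


def mfcc_from_filterbank(fbank_amplitudes, num_ceps):
    """Return cepstral coefficients c_1..c_num_ceps from filterbank amplitudes.

    Assumed convention (not taken from the article):
    c_i = sqrt(2/N) * sum_j log(m_j) * cos(pi * i / N * (j - 0.5)), j = 1..N.
    """
    n = len(fbank_amplitudes)
    log_amps = [math.log(a) for a in fbank_amplitudes]
    scale = math.sqrt(2.0 / n)
    ceps = []
    for i in range(1, num_ceps + 1):
        acc = 0.0
        for j, log_amp in enumerate(log_amps, start=1):
            acc += log_amp * math.cos(math.pi * i / n * (j - 0.5))
        ceps.append(scale * acc)
    return ceps
```

Under this convention, a flat filterbank output carries no spectral shape, so all coefficients with index 1 or higher vanish, whereas a tilted spectrum yields a nonzero first coefficient.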
In [29], the TRAP-TANDEM features were
introduced. They describe the likelihood of subword classes at a time instant
by evaluating temporal trajectories of band-limited spectral densities in the
vicinity of the regarded time instant. Here, TRAP refers to the way the
linguistic information is obtained from speech, while TANDEM refers to the
technique that converts the evidence of subword classes into features for
HMM-based speech recognition systems. Unlike conventional feature extraction
techniques, which consider time windows of about 25 milliseconds to derive
spectral features, TRAP also includes relatively long time spans up to one
second to extract information for the recogniser. The strategy is motivated by
the finding that information about a phoneme spreads over about 300
milliseconds [30, 31]. Furthermore, this method is able to
remove slowly varying noise [32].
Another approach to suppress slow variations in the
short-term spectrum is the RASTA-PLP concept [33, 34] that makes PLP features
more robust to linear spectral distortions. The filtering of time trajectories
of critical-band filter outputs enables the removal of constant spectral
components caused by convolutive factors in the speech signal.
2.3. Feature Enhancement
Further
attempts to reduce the mismatch between test and training conditions are
Cepstral Mean Subtraction [7], Mean and Variance Normalisation
[8], or the Vector Taylor Series approach [35] which is
able to deal with the nonlinear effects of noise. Nonlinear distortions can also
be compensated by Histogram Equalisation [9], a technique which is often
used in digital image processing [36] to improve the contrast of pictures. In
speech processing, HEQ is a powerful means of improving the temporal dynamics
of feature vector components distorted by noise. A cepstrum-domain feature
compensation algorithm aiming to decompose speech and noise has also been
presented in [37].
Another preprocessing approach to enhance noisy MFCC
features is proposed in [10]: here a Switching Linear Dynamic
Model is used to describe the dynamics of speech while another Linear Dynamic
Model captures the dynamics of additive noise. Both models serve to derive an
observation model describing how speech and noise produce the noisy
observations and to reconstruct the features of clean speech. This concept has
been extended in [38] where time-dependencies among the discrete state variables of
the SLDM are included. To improve the accuracy of the noise model for
nonstationary noise sources, [39] employs a state model for
the dynamics of noise.
An enhancement of speech features can also be attained
by incremental online adaptation of the feature space as in the feature space
maximum likelihood linear regression (FMLLR) approach outlined in [40]. There, an
FMLLR transform is integrated into a stack decoder by collecting adaptation
data during recognition in real time.
2.4. Architecture for Speech Modelling
The most
popular model architecture to represent speech characteristics in automatic
speech recognition is the Hidden Markov Model [11]. Apart from optimising the principle of
auditory modelling and the methods for speech enhancement, finding alternative
model architectures that apply Dynamic Bayesian Network structures which
differ from the statistical assumptions of HMM modelling is an active area of
research and a promising approach to improve noise robustness [12, 14, 41].
Generative models like the Hidden Markov Model are
restricted in a way that they assume that the speech feature observations are
conditionally independent. This can be considered a drawback as the
restriction ignores long-range dependencies between observations. On the
contrary, the Conditional Random Fields (CRFs) introduced in [42] use an
exponential distribution to model a sequence, given the observation sequence.
In order to estimate the conditional probability of a class for an entire
sequence, the Hidden Conditional Random Field [12]
incorporates hidden state sequences.
Other model architectures like Long Short-Term Memory
Recurrent Neural Networks [43], which, in contrast to
conventional Recurrent Neural Networks, consider long-range dependencies between
the observations, were recently shown to be well suited for speech recognition
[44]. Even static classifiers like Support Vector Machines have been
successfully applied in isolated word recognition tasks [45], where a
warping of the observation sequence is less essential than in continuous speech
recognition.
An alternative to the feature-based HMM has been
proposed in [13] where the raw speech signal is modelled in the time domain.
In clean conditions, methods based on raw signal modelling like the Switching
Autoregressive HMM [13] work well; however, the performance quickly degrades
whenever the technique is used in noisy surroundings. To improve noise
robustness, [14] extended the SAR-HMM to a Switching Linear Dynamical System
(SLDS) which includes an explicit noise model by modelling the dynamics of both
the raw speech signal and the noise.
2.5. Model Adaptation
Not only joint
speech and noise modelling but also training with noisy data can incorporate
information about potential signal distortion in the recognition process.
Experiments as done in [46] show that recognition results are
highly dependent on how much the used training material reveals about the
characteristics of possible background noise during a test phase. Depending on
how similar the noise conditions for training and testing are, we can
distinguish between low, medium, and highly matched conditions training.
Multiconditions training refers to using training material with different noise
types. In real-world applications, matching the conditions of the training and
testing phases is only possible if information about the noise conditions in
which the recogniser will be used is available, for example, during the design
of an in-car speech recogniser as shown herein.
Apart from adapting models by using noisy training
material, the research area of model adaptation also covers widely used
techniques such as maximum a posteriori (MAP) estimation [47], maximum
likelihood linear regression (MLLR) [48], and minimum classification error
linear regression (MCELR) [49].
3. Speech Signal Preprocessing
3.1. Advanced Front-End Feature Extraction
In the
advanced front-end feature extraction (AFE) algorithm outlined in [22], noise
reduction is performed before the cepstral features are calculated. The main
steps of the algorithm can be seen in Figure 1. After noise reduction, the
denoised waveforms are processed, and the cepstral features are calculated.
Finally blind equalisation is applied to the features.
Figure 1: Feature extraction according to ETSI ES 202 050
V1.1.5.
The preprocessing algorithm for noise reduction is
based on a two-stage Wiener filtering concept. The denoised output signal of
the first stage enters a second stage where an additional dynamic noise
reduction is performed. In contrast to the first filtering stage, a gain
factorisation unit is incorporated in the second stage to control the intensity
of filtering dependent on the signal-to-noise ratio (SNR) of the signal. The
components of the two noise reduction cycles are illustrated in Figure 2.
First, the input signal is divided into frames. After estimating the linear
spectrum of each frame, the power spectral density (PSD) is smoothed along the
time axis in the PSD Mean block. A voice activity detector (VAD) determines
whether a frame contains speech or background noise, and so both the estimated
spectrum of the speech frames and the estimated noise spectrum are used to
calculate the frequency domain Wiener filter coefficients. To get a Mel-warped
frequency domain Wiener filter, the linear Wiener filter coefficients are
smoothed along the frequency axis using a Mel-filterbank. The Mel-warped
Inverse Discrete Cosine Transform (Mel IDCT) unit calculates the impulse
response of the Wiener filter before the input signal is filtered and passes
through a second noise reduction cycle. Finally, the constant component of the
filtered signal is removed in the “OFF” block.
Figure 2: Two-stage Wiener filtering for noise reduction
according to ETSI ES 202 050 V1.1.5.
Focusing on the Wiener filter approach as part of the
advanced front-end feature extraction algorithm, a great advantage with respect
to other preprocessing techniques for enhanced noise robustness is that noise
reduction is performed on a frame-by-frame basis. The Wiener filter parameters
can be adapted to the current SNR which makes the approach applicable to
nonstationary noise. However, a critical issue of the AFE technique is that it
relies on exact voice activity detection, a precondition that can be difficult
to fulfil, especially if the SNR level is negative like in our in-car speech
recognition problem (cf. Section 6). Further, compared with other noise
compensation strategies, the AFE is a rather complex mechanism and sensitive to
errors and inaccuracies within the individual estimation and transformation
steps.
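The core idea of computing frequency-domain filter coefficients from a speech and a noise spectrum estimate can be sketched as follows. This is the textbook Wiener gain under the assumption of additive, uncorrelated noise, not the full two-stage procedure of the standard (no Mel warping, gain factorisation, or VAD); the function names and the `gain_floor` parameter are illustrative assumptions.

```python
def wiener_gains(noisy_psd, noise_psd, gain_floor=1e-3):
    """Per-bin Wiener gains for one frame from smoothed PSD estimates."""
    gains = []
    for p_noisy, p_noise in zip(noisy_psd, noise_psd):
        if p_noise <= 0.0:
            # No noise estimated in this bin: pass the signal through.
            gains.append(1.0)
            continue
        # Estimated clean-speech power, clipped at zero.
        p_speech = max(p_noisy - p_noise, 0.0)
        snr = p_speech / p_noise
        gains.append(max(snr / (1.0 + snr), gain_floor))
    return gains


def apply_gains(spectrum, gains):
    """Scale each spectral bin by its gain."""
    return [g * x for g, x in zip(gains, spectrum)]
```

Bins dominated by speech receive a gain close to one, while bins at or below the estimated noise level are attenuated down to the floor. In a frame-by-frame scheme, `noise_psd` would be updated on frames that the VAD labels as noise.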
3.2. Unsupervised Spectral Subtraction
Another
technique of speech enhancement known as Unsupervised Spectral Subtraction was
developed in [17]. This Spectral Subtraction scheme relies on a two-mixture
model approach of noisy speech and aims to distinguish speech and background
noise at the magnitude spectrogram level.
3.2.1. Mixture Model
To derive a probabilistic model for speech distorted
by noise, a probability distribution for both speech and noise is needed. When
modelling background noise on silent parts of the time-frequency plane, it is
common to assume white Gaussian behaviour for real and imaginary parts
[50, 51]. In the magnitude domain, this
corresponds to a Rayleigh probability density function $f_0(m)$ for
noise:
$$f_0(m) = \frac{m}{\sigma^2} \exp\left(-\frac{m^2}{2\sigma^2}\right).$$
Apart from the Rayleigh silence model, a speech model
for “activity” that models large magnitudes only has to be derived to
obtain the two-mixture model. For the speech probability density function $f_1(m)$, a threshold $\delta$ is defined with
respect to the noise distribution $f_0(m)$, so that only magnitudes $m > \delta$ are modelled.
In [17], a threshold $\delta = \sigma$ is used,
whereas $\sigma$ is the mode of
the Rayleigh PDF. Consequently, we assume that magnitudes below $\sigma$ are background
noise. Two further constraints are necessary for $f_1(m)$.
(i) The derivative $f_1'(m)$ of the
“activity” PDF may not be zero when $m$ is just above $\delta$; otherwise, the threshold $\delta$ has no meaning
since it can be set to an arbitrarily low value.
(ii) As $m$ goes towards
infinity, the decay of $f_1(m)$ should be lower
than the decay of the Rayleigh PDF to ensure that $f_1(m)$ models large
amplitudes.
The “shifted
Erlang” PDF with $h = 2$ [52]
fulfils these two criteria and, therefore, can be used to model large amplitudes
which are assumed to be speech:
$$f_1(m) = 1_{m > \sigma} \cdot \lambda^2 (m - \sigma) \exp\left(-\lambda (m - \sigma)\right),$$
with $1_{m > \sigma} = 1$ if $m > \sigma$ and $1_{m > \sigma} = 0$, otherwise.
The overall probability density function for the
spectral magnitudes of the noisy speech signal is given as follows:
$$f(m \mid \Lambda) = P_0 \cdot f_0(m \mid \sigma) + P_1 \cdot f_1(m \mid \sigma, \lambda).$$
$P_0$ is the prior
for “silence” and background noise, respectively, whereas $P_1$ is the prior
for “activity” and speech, respectively. All the parameters of the
derived PDF $f(m \mid \Lambda)$ summarised in
the parameter set $\Lambda = [P_0, \sigma, P_1, \lambda]$ are independent of time and
frequency.
3.2.2. EM Training of Mixture Parameters
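The parameters $\Lambda$ of the
two-mixture model can be trained using an Expectation Maximisation (EM)
training algorithm [53]. In the “Expectation” step, the posteriors are
estimated as follows:
$$p(\mathrm{sil} \mid m, \Lambda) = \frac{P_0 \cdot f_0(m \mid \sigma)}{f(m \mid \Lambda)}, \qquad p(\mathrm{act} \mid m, \Lambda) = 1 - p(\mathrm{sil} \mid m, \Lambda).$$
For the “Maximisation” step, the moment method
is applied: all data is used to update $\sigma$ before all data
with values above the new $\sigma$ is used to
update $\lambda$. The method can be described by the following two
update equations, in which the sums run over the observed magnitudes:
$$\hat{\sigma} = \sqrt{\frac{2}{\pi}} \cdot \frac{\sum_{m} m \cdot p(\mathrm{sil} \mid m, \Lambda)}{\sum_{m} p(\mathrm{sil} \mid m, \Lambda)},$$
$$\hat{\lambda} = 2 \cdot \frac{\sum_{m > \hat{\sigma}} p(\mathrm{act} \mid m, \Lambda)}{\sum_{m > \hat{\sigma}} (m - \hat{\sigma}) \cdot p(\mathrm{act} \mid m, \Lambda)}.$$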
3.2.3. Spectral Subtraction
After the training of all mixture parameters $\Lambda$, Unsupervised Spectral Subtraction is applied using
the parameter $\sigma$ as floor
value:
$$m_{\mathrm{USS}} = \max\left(1, \frac{m}{\sigma}\right).$$
Flooring to a nonzero value is necessary whenever MFCC
features are used, since zero magnitude values after spectral subtraction would
lead to unfavourable dynamics in the cepstral coefficients.
Overall, USS is a simple and computationally efficient
preprocessing strategy, allowing unsupervised EM fitting on observed data. A
weakness of the approach is that it relies on appropriately estimating a speech
magnitude PDF which is a difficult task. Since the PDFs do not depend on
frequency and time, the applicability of USS is restricted to stationary
noises. USS only models large magnitudes of speech so that low speech
magnitudes cannot be distinguished from background noise.
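The following sketch illustrates how such a two-mixture model could be fitted and applied in plain Python. Several points are assumptions rather than details taken from [17]: the speech PDF is written as a shifted Erlang density with shape two, the priors are re-estimated as the mean posterior, the initial values are crude guesses, and the flooring is written as max(1, m/sigma). All function names are hypothetical.

```python
import math


def rayleigh_pdf(m, sigma):
    """Noise ("silence") model for spectral magnitudes."""
    return (m / sigma ** 2) * math.exp(-m * m / (2.0 * sigma ** 2))


def shifted_erlang_pdf(m, sigma, lam):
    """Speech ("activity") model; zero at or below the threshold sigma."""
    if m <= sigma:
        return 0.0
    return lam * lam * (m - sigma) * math.exp(-lam * (m - sigma))


def fit_uss(magnitudes, iterations=30):
    """EM with moment-matching updates; returns (sigma, lam, p_sil)."""
    mean = sum(magnitudes) / len(magnitudes)
    sigma, lam, p_sil = 0.5 * mean, 1.0 / mean, 0.5  # assumed initial guesses
    for _ in range(iterations):
        # E step: posterior probability of "silence" for every magnitude.
        post = []
        for m in magnitudes:
            sil = p_sil * rayleigh_pdf(m, sigma)
            act = (1.0 - p_sil) * shifted_erlang_pdf(m, sigma, lam)
            total = sil + act
            post.append(sil / total if total > 0.0 else float(m <= sigma))
        # M step: update sigma from all data (Rayleigh mean = sigma*sqrt(pi/2)).
        weight_sil = sum(post)
        weighted_sum = sum(m * p for m, p in zip(magnitudes, post))
        sigma = math.sqrt(2.0 / math.pi) * weighted_sum / weight_sil
        # Then update lam from data above the new sigma (Erlang mean = 2/lam).
        num = sum(1.0 - p for m, p in zip(magnitudes, post) if m > sigma)
        den = sum((1.0 - p) * (m - sigma)
                  for m, p in zip(magnitudes, post) if m > sigma)
        if num > 0.0 and den > 0.0:
            lam = 2.0 * num / den
        # Prior update (an assumption; kept away from 0 and 1).
        p_sil = min(max(weight_sil / len(magnitudes), 1e-6), 1.0 - 1e-6)
    return sigma, lam, p_sil


def spectral_subtraction(magnitudes, sigma):
    """Normalise by the noise parameter and floor to a nonzero value."""
    return [max(1.0, m / sigma) for m in magnitudes]
```

Because the parameters do not depend on time or frequency, all magnitudes of a spectrogram can be passed to `fit_uss` as one flat list, which is what makes the fit cheap and unsupervised.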
4. Feature Enhancement
4.1. Feature Normalisation
4.1.1. Cepstral Mean Subtraction
A simple
approach to remove the effects of noise and transmission channel transfer
functions on the cepstral representation of speech is Cepstral Mean Subtraction
[7, 54]. In many surroundings, for example, in a car where the speech
signal is superposed by engine noise, the noise source can be considered as
stationary, whereas the characteristics of the speech signal change relatively
fast. Thus, a goal of preprocessing techniques for speech enhancement is to
remove the stationary part of the input signal. As this quasi-non-varying part
of the signal corresponds to a constant global shift in the cepstrum, speech
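can usually be enhanced by subtracting the long-term average cepstral
vector $\boldsymbol{\mu}$ from the received distorted cepstrum vector sequence $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$
of length $T$:
$$\boldsymbol{\mu} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t.$$
Consequently, we get a new estimate $\hat{\mathbf{x}}_t$ of the signal
in the cepstral domain:
$$\hat{\mathbf{x}}_t = \mathbf{x}_t - \boldsymbol{\mu}, \quad 1 \leq t \leq T.$$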
This method also exploits the advantage of MFCC speech
representation: if a transmission channel is inserted on the input speech, the
speech spectrum is multiplied by the channel transfer function. In the
logarithmic cepstral domain, this multiplication becomes an addition which can
easily be removed by subtracting the cepstral mean from all input vectors.
However, unlike techniques like Histogram Equalisation, CMS is not able to
treat nonlinear effects of noise.
4.1.2. Mean and Variance Normalisation
Subtracting
the mean of each feature vector component from the cepstral vectors (as done in
CMS) corresponds to an equalisation of the first moment of the vector sequence
probability distribution. In case noise also affects the variance of the speech
features, a preprocessing stage for speech enhancement can profit also from
normalising the variance of the vector sequence which corresponds to an
equalisation of the first two moments of its probability distribution. This
technique is known as Mean and Variance Normalisation and results in an
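estimated feature vector
$$\hat{\mathbf{x}}_t = \frac{\mathbf{x}_t - \boldsymbol{\mu}}{\boldsymbol{\sigma}},$$
where $\boldsymbol{\mu}$ is the mean vector and the division by the vector $\boldsymbol{\sigma}$, which contains the standard deviations of the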
feature vector components, is carried out elementwise. After MVN, all features
have zero mean and unity variance.
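Both normalisation schemes, CMS and MVN, can be written in a few lines. The sketch below is plain Python operating on a list of feature vectors for one utterance; it assumes that statistics are computed per utterance and uses the population standard deviation, neither of which is specified in the text. The function names and the `eps` guard are illustrative.

```python
import math


def cepstral_mean_subtraction(frames):
    """Subtract the per-component long-term mean from every frame."""
    num_frames, dim = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / num_frames for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]


def mean_variance_normalisation(frames, eps=1e-12):
    """Normalise every component to zero mean and unit variance."""
    centred = cepstral_mean_subtraction(frames)
    num_frames, dim = len(frames), len(frames[0])
    std = [math.sqrt(sum(f[i] ** 2 for f in centred) / num_frames)
           for i in range(dim)]
    # eps guards against division by zero for constant components.
    return [[f[i] / max(std[i], eps) for i in range(dim)] for f in centred]
```

Both functions are linear in the input, which is why neither can undo a nonlinear distortion of the features.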
4.1.3. Histogram Equalisation
Histogram
Equalisation is a popular technique for digital image processing where it aims
to increase the contrast of pictures. In speech processing, HEQ can be used to extend
the principle of CMS and MVN to all moments of the probability distribution of
the feature vector components [9, 55]. It enhances noise robustness by
compensating nonlinear distortions in speech representation caused by noise and
therefore reduces the mismatch between test and training data.
The main idea is to map the histogram of each
component of the feature vector onto a reference histogram. The method is based
on the assumption that the effect of noise can be described as a monotonic
transformation of the features which can be reversed to a certain degree. As
the effectiveness of HEQ is strongly dependent on the accuracy of the speech
feature histograms, a sufficiently large number of speech frames have to be
involved to estimate the histograms. An important difference between HEQ and
other noise reduction techniques like Unsupervised Spectral Subtraction is that
no analytic assumptions have to be made about the noise process. This makes HEQ
effective for a wide range of different noise processes independent of how the
speech signal is parameterised.
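When applying HEQ, a transformation $\tilde{x} = F(x)$ has to be found in order to
convert the probability density function $p(x)$ of a certain
speech feature into a reference probability density function $\tilde{p}(\tilde{x}) = p_{\mathrm{ref}}(\tilde{x})$. If $x$ is a
unidimensional variable with probability density function $p(x)$, a transformation $\tilde{x} = F(x)$ leads to a
modification of the probability distribution, so that the new distribution of
the obtained variable $\tilde{x}$ can be
expressed as
$$\tilde{p}(\tilde{x}) = p(G(\tilde{x})) \frac{\partial G(\tilde{x})}{\partial \tilde{x}},$$
with $G(\tilde{x})$ being the
inverse transformation of $F(x)$. To obtain the cumulative probabilities out of the
probability density functions, we have to consider the following
relationship:
$$C(x) = \int_{-\infty}^{x} p(x')\,dx' = \int_{-\infty}^{F(x)} p(G(\tilde{x}')) \frac{\partial G(\tilde{x}')}{\partial \tilde{x}'}\,d\tilde{x}' = \int_{-\infty}^{F(x)} \tilde{p}(\tilde{x}')\,d\tilde{x}' = \tilde{C}(F(x)).$$
Consequently, the transformation converting the
distribution $p(x)$ into the
desired distribution $\tilde{p}(\tilde{x}) = p_{\mathrm{ref}}(\tilde{x})$ can be
expressed as
$$\tilde{x} = F(x) = \tilde{C}^{-1}[C(x)] = C_{\mathrm{ref}}^{-1}[C(x)],$$
where $C_{\mathrm{ref}}^{-1}$ is the inverse
cumulative probability function of the reference distribution, and $C$ is the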
cumulative probability function of the feature. To obtain the transformation
for each feature vector component in our experiments, 500 uniform intervals
between $\mu_i - 4\sigma_i$ and $\mu_i + 4\sigma_i$ were considered
to derive the histograms, with $\mu_i$ and $\sigma_i$ representing
the mean and the standard deviation of the $i$th feature vector
component. For each component, a Gaussian probability distribution with zero
mean and unity variance was used as reference probability distribution.
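The mapping through the two cumulative distributions can be sketched for a single feature component as follows. Note the simplification: instead of a binned histogram, the sketch uses the empirical cumulative distribution obtained from sorted values (order statistics), which is a swapped-in technique and not the procedure used in our experiments. The reference is a Gaussian with zero mean and unity variance, and the function name is hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def histogram_equalise(values):
    """Map one feature component onto a standard normal reference.

    Each value is passed through the empirical CDF of the sequence and
    then through the inverse CDF of the reference distribution.
    """
    ordered = sorted(values)
    count = len(ordered)
    reference = NormalDist(0.0, 1.0)
    equalised = []
    for v in values:
        rank = bisect_right(ordered, v)   # number of samples <= v
        cdf = (rank - 0.5) / count        # kept strictly inside (0, 1)
        equalised.append(reference.inv_cdf(cdf))
    return equalised
```

The transformation is monotonic, so the ordering of the feature values is preserved. With very few frames the empirical distribution is coarse, which mirrors the requirement that many speech frames are needed for reliable equalisation.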
Summing up the three feature normalisation strategies,
CMS is the simplest and most common technique which, however, cannot treat
nonlinear effects of noise. MVN constitutes an improvement, but it still only
provides a linear transformation of the original variable. By contrast, HEQ
also compensates nonlinear distortions. However, its effectiveness and accuracy
heavily depend on the quality of the estimated feature histograms in a way that
numerous speech frames are needed before HEQ can be expected to work well.
Furthermore, Histogram Equalisation is intended to correct only monotonic
transformations but the random behaviour of noise makes the actual transformation
nonmonotonic which causes a loss of information.
4.2. Model-Based Feature Enhancement
Model-based
speech enhancement techniques are based on modelling speech and noise. Together
with a model of how speech and noise produce the noisy observations, these models
are used to enhance the noisy speech features. In [10], a
Switching Linear Dynamic Model is used to capture the dynamics of clean speech.
Similar to Hidden Markov Model-based approaches to model clean speech, the SLDM
assumes that the signal passes through various states. Conditioned on the state
sequence, the SLDM furthermore enforces a continuous state transition in the
feature space.
4.2.1. Modelling of Noise
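Unlike speech,
which is modelled applying an SLDM, the modelling of noise is done by using a
simple Linear Dynamic Model obeying the following system
equation:
$$\mathbf{x}_t = A\mathbf{x}_{t-1} + \mathbf{b} + \mathbf{g}_t.$$
Thereby the matrix $A$ and the vector $\mathbf{b}$ simulate how
the noise process evolves over time, and $\mathbf{g}_t$ represents a
Gaussian noise source driving the system. A graphical representation of this
LDM can be seen in Figure 3. As LDMs are time-invariant, they are suited to
model signals like coloured stationary Gaussian noises as they occur in the
interior of a car. Alternatively to the graphical model in Figure 3, the
equations
$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; A\mathbf{x}_{t-1} + \mathbf{b}, C),$$
$$p(\mathbf{x}_{1:T}) = p(\mathbf{x}_1) \prod_{t=2}^{T} p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$
can be used to express the LDM.
Figure 3: Linear dynamic model for noise.
Here, $\mathcal{N}(\mathbf{x}_t; A\mathbf{x}_{t-1} + \mathbf{b}, C)$ is a
multivariate Gaussian with mean vector $A\mathbf{x}_{t-1} + \mathbf{b}$ and covariance
matrix $C$, whereas $T$ denotes the
length of the input sequence.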
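To make the system equation concrete, the following sketch draws a trajectory from an LDM in plain Python. It assumes a diagonal covariance (one standard deviation per component) to keep the code short; the function name and all parameter values are illustrative, not parameters trained in this work.

```python
import random


def simulate_ldm(A, b, noise_std, x0, steps, seed=0):
    """Draw x_1..x_steps from x_t = A x_{t-1} + b + g_t, starting at x0.

    g_t is zero-mean Gaussian with independent components
    (diagonal covariance, an assumption of this sketch).
    """
    rng = random.Random(seed)
    dim = len(b)
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = [sum(A[i][j] * x[j] for j in range(dim)) + b[i]
             + rng.gauss(0.0, noise_std[i])
             for i in range(dim)]
        trajectory.append(x)
    return trajectory
```

For a scalar example with A = [[0.5]] and b = [1.0], the noise-free trajectory converges towards b / (1 - a) = 2, and the driving noise makes the process fluctuate around that value, which is the kind of stationary coloured behaviour an LDM can capture.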
4.2.2. Modelling of Speech
The modelling
of speech is realised by a more complex dynamic model which also includes a
hidden state variable $s_t$ at each time $t$. Now $A$ and $\mathbf{b}$ depend on the
state variable $s_t$:
$$\mathbf{x}_t = A(s_t)\mathbf{x}_{t-1} + \mathbf{b}(s_t) + \mathbf{g}_t.$$
Consequently, every possible state sequence $s_{1:T}$ describes an
LDM which is nonstationary due to $A$ and $\mathbf{b}$ changing over
time. Time-varying systems like the evolution of speech features over time can
be described adequately by such models. As can be seen in Figure 4, it is
assumed that there are time dependencies among the continuous variables $\mathbf{x}_t$ but not among
the discrete state variables $s_t$. This is the major difference between the SLDM used
in [10] and the models used in [38] where time dependencies among the hidden
state variables are included. A modification like this can be seen as analogous
to extending a Gaussian Mixture Model (GMM) to an HMM. The SLDM corresponding to
Figure 4 can be described as follows:
$$p(\mathbf{x}_t, s_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; A(s_t)\mathbf{x}_{t-1} + \mathbf{b}(s_t), C(s_t)) \cdot p(s_t),$$
$$p(\mathbf{x}_{1:T}, s_{1:T}) = p(\mathbf{x}_1, s_1) \prod_{t=2}^{T} p(\mathbf{x}_t, s_t \mid \mathbf{x}_{t-1}).$$
Figure 4: Switching linear dynamic model for speech.
To train the parameters $A(s)$, $\mathbf{b}(s)$, and $C(s)$ of the SLDM,
conventional EM techniques are used. Setting the number of states to one
corresponds to training a Linear Dynamic Model instead of an SLDM to obtain the
parameters $A$, $\mathbf{b}$, and $C$ needed for the
LDM which is used to model noise.
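The generative process of such a switching model can be sketched as follows. For brevity the features are scalar, and the state at each frame is drawn independently from a prior, reflecting the assumption that there are no time dependencies among the discrete states. The function name, the parameter layout, and the example values are hypothetical.

```python
import random


def sample_sldm(params, prior, x0, steps, seed=0):
    """Sample from a scalar SLDM.

    params[s] = (a, b, std) holds the dynamics of state s; prior[s] is the
    probability of state s, drawn independently at every frame.
    """
    rng = random.Random(seed)
    x = x0
    features, states = [], []
    for _ in range(steps):
        s = rng.choices(range(len(prior)), weights=prior)[0]
        a, b, std = params[s]
        # The continuous variable still depends on its previous value.
        x = a * x + b + rng.gauss(0.0, std)
        features.append(x)
        states.append(s)
    return features, states
```

With a single state, the sampler reduces to an LDM, which corresponds to the remark that setting the number of states to one yields the noise model.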
4.2.3. Observation Model
In order to
obtain a relationship between the noisy observation and the hidden speech and
noise features, an observation model has to be defined. Figure 5 illustrates
the graphical representation of the zero variance observation model with SNR
inference introduced in [56]. Thereby it is assumed that speech and noise mix linearly in
the time domain corresponding to a nonlinear mixing in the cepstral domain.
Figure 5: Observation model for noisy speech $\mathbf{y}_t$.
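The nonlinear character of the mixing can be illustrated in the log-spectral domain, which is related to the cepstral domain by a linear transform. The sketch below assumes that speech and noise are uncorrelated so that their powers add in each channel; it is an illustration of the nonlinearity only, not the zero variance observation model of [56], and the function name is hypothetical.

```python
import math


def mix_log_spectra(speech_log, noise_log):
    """Log power of noisy speech per channel, assuming additive powers.

    Computes log(exp(x) + exp(n)) in a numerically stable way.
    """
    mixed = []
    for x, n in zip(speech_log, noise_log):
        hi, lo = max(x, n), min(x, n)
        mixed.append(hi + math.log1p(math.exp(lo - hi)))
    return mixed
```

When the speech is far stronger than the noise, the output approaches the speech value; when the noise dominates, the speech is masked. This dependence on the local SNR is what makes the distortion nonlinear in the feature domain.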
4.2.4. Posterior Estimation and Enhancement
A possible
approximation to reduce the computational complexity of posterior estimation is
to restrict the size of the search space applying the generalised
pseudo-Bayesian (GPB) algorithm [57]. The GPB algorithm is based on
the assumption that the distinct state histories whose differences occur more
than $r$ frames in the
past can be neglected. Consequently, if $T$ denotes the
length of the sequence and $S$ the number of hidden states, the inference complexity is reduced from $S^T$ to $S^r$, where $r \ll T$. Using the GPB algorithm, the three steps “collapse,” “predict,” and “observe” are conducted for each speech frame.
The Gaussian posterior obtained in the observation
step of the GPB algorithm is used to obtain estimates of the moments of $\mathbf{x}_t$. Those estimates represent the denoised speech
features and can be used for speech recognition in noisy environments. Thereby
the clean features are assumed to be the Minimum Mean Square Error (MMSE)
estimate $E[\mathbf{x}_t \mid \mathbf{y}_{1:t}]$.
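The “collapse” step can be pictured as merging several Gaussian hypotheses, one per state history, into a single Gaussian with the same first two moments. The scalar sketch below shows this moment matching; it is an illustration under simplifying assumptions (scalar features, given hypothesis weights) and not the complete GPB recursion of [57]. The function name is hypothetical.

```python
def collapse(weights, means, variances):
    """Merge a scalar Gaussian mixture into one Gaussian by moment matching.

    Returns the mean and variance of the mixture. The mean is also the
    MMSE estimate under the mixture posterior.
    """
    total = sum(weights)
    w = [wi / total for wi in weights]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Total variance: within-component variance plus spread of the means.
    var = sum(wi * (vi + (mi - mean) ** 2)
              for wi, mi, vi in zip(w, means, variances))
    return mean, var
```

Collapsing after every frame keeps the number of tracked hypotheses bounded, which is what avoids the exponential growth in the sequence length.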
Due to the noise modelling assumptions, SLDM feature
enhancement has shown excellent performance also for coloured Gaussian noise
even if the SNR level is negative. The linear dynamics of the speech model
capture the smooth time evolution of human speech, while the switching states
express the piecewise stationarity. The major limitation with respect to the
noise type is that the model assumes the noise frames to be independent over
time, so that only stationary noises are modelled accurately. Despite the GPB
algorithm, SLDM feature enhancement is relatively time-consuming compared to
simpler feature processing algorithms such as Histogram Equalisation. Another
drawback is that the whole concept relies on precise voice activity detection
in order to detect feature frames for the estimation of the noise LDM.
5. Model Architecture
5.1. Speech Modelling in the Feature Domain
To allow
efficient speech modelling, it is common to model features extracted from the
speech signal every 10 milliseconds instead of using the signal in the time
domain as described in Section 5.2. As an alternative to conventional HMM
modelling, the Hidden Conditional Random Field [58] will be
introduced in the following and examined with respect to its noise robustness
in Section 6.3.
5.1.1. Hidden Markov Models and Conditional Random Fields
Generative
models like the Hidden Markov Model assume that the observations are
conditionally independent, meaning that an observation is statistically
independent of past observations provided that the values of the latent
variables are known. Whenever there are long-range dependencies between the observations,
like in human speech [30], this restriction can be too strict.
Therefore, model architectures like the Conditional Random Field [42, 59, 60] make use
of an exponential distribution in order to model a label sequence, given the
observation sequence, and thereby drop the independence assumption between
observations. Nonlocal dependencies between state and observation as well as
unnormalised transition probabilities are allowed. As a Markov assumption can
still be enforced, efficient inference techniques like dynamic programming can
also be applied when using Conditional Random Fields. CRFs have been
successfully applied in various tasks like information extraction [42] or
language modelling [61].
5.1.2. Hidden Conditional Random Fields
As CRFs assign
a label to each observation, that is, to each frame of a time sequence,
they cannot directly estimate the probability of a class for an entire
sequence and need to be modified in order to be applicable to speech recognition
tasks. Hence, the CRF has been extended to a Hidden Conditional Random Field
which incorporates hidden state sequences [58]. The HCRF was successfully applied
in various pattern recognition problems like Phone Classification [12],
Gesture Recognition [62], Meeting Segmentation [63], or recognition of nonverbal
vocalisations [64] where it partly outperformed HMM approaches. An advantage of
HCRF is the ability to handle features that are allowed to be arbitrary
functions of the observations while not requiring a more complicated training.
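Similar to an HMM, the HCRF is used to model the
conditional probability of a class label $w$ representing a
word, given the sequence of observations $\mathbf{o} = (o_1, \ldots, o_T)$. With $\lambda$ denoting the
parameter vector and $f$ being the
so-called vector of sufficient statistics, the conditional probability
is
$$p(w \mid \mathbf{o}; \lambda) = \frac{1}{z(\mathbf{o}; \lambda)} \sum_{\mathbf{s}} \exp\left\{ \lambda^{\top} f(w, \mathbf{s}, \mathbf{o}) \right\}.$$
Here, $\mathbf{s} = (s_1, \ldots, s_T)$ represents the
hidden state sequence that is run through while the conditional probability is
calculated. The normalisation of the probability is realised by the function $z(\mathbf{o}; \lambda)$, which
is
$$z(\mathbf{o}; \lambda) = \sum_{w'} \sum_{\mathbf{s}} \exp\left\{ \lambda^{\top} f(w', \mathbf{s}, \mathbf{o}) \right\}.$$
The vector $f$ determines
which probability to model, whereas $\lambda$ can be chosen
in a way that the HCRF imitates a left-right HMM as shown in [12]. We
restrict the HCRF to be a Markov chain; however, the transition probabilities do
not have to sum to one and the observation scores do not need to be proper probability
densities.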
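Like an HMM, an HCRF can be parameterised by
transition scores $a_{ij}$ and observation
scores $b_j(o_t)$:
$$\lambda^{\top} f(w, \mathbf{s}, \mathbf{o}) = \sum_{t=1}^{T} \left( a_{s_{t-1} s_t} + b_{s_t}(o_t) \right).$$
The conditional probability can efficiently be
computed when using forward and backward recursions as derived for the HMM. The
forward probability is given as
$$\alpha_t(j) = \sum_{i=1}^{S} \alpha_{t-1}(i) \exp\left\{ a_{ij} + b_j(o_t) \right\},$$
where $S$ is the number
of hidden states. The backward probabilities $\beta_t(i)$ can be obtained
by using the recursion
$$\beta_t(i) = \sum_{j=1}^{S} \beta_{t+1}(j) \exp\left\{ a_{ij} + b_j(o_{t+1}) \right\}.$$
Given the forward probabilities $\alpha_T(j)$, the probability that the model
with parameters $\lambda$ representing
the word $w$ produces
observation $\mathbf{o}$ can be written
as
$$p(\mathbf{o} \mid w, \lambda) = \sum_{j=1}^{S} \alpha_T(j).$$
The conditional probability of a class label $w$ given the
observation $\mathbf{o}$ is
$$p(w \mid \mathbf{o}; \lambda) = \frac{p(\mathbf{o} \mid w, \lambda)}{\sum_{w'} p(\mathbf{o} \mid w', \lambda)}.$$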
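This HCRF definition makes it possible to use dynamic
programming methods for decoding as with HMM. As shown in [12], a
conditional probability density as for an HMM with transition probabilities $\tilde{a}_{ij}$, emission means, and (diagonal) covariances with entries $\mu_{j,d}$ and $\sigma_{j,d}^{2}$, respectively, can be obtained by setting the
parameters $\lambda$ as
follows:
$$\lambda^{(\mathrm{Tr})}_{ij} = \log \tilde{a}_{ij},$$
$$\lambda^{(\mathrm{Occ})}_{j} = -\frac{1}{2} \sum_{d=1}^{D} \left( \log 2\pi\sigma_{j,d}^{2} + \frac{\mu_{j,d}^{2}}{\sigma_{j,d}^{2}} \right),$$
$$\lambda^{(\mathrm{M1})}_{j,d} = \frac{\mu_{j,d}}{\sigma_{j,d}^{2}},$$
$$\lambda^{(\mathrm{M2})}_{j,d} = -\frac{1}{2\sigma_{j,d}^{2}}.$$
The superscripts mark the weights of the transition, state occupancy, first-moment, and second-moment statistics in $f$.
Thereby $d$ denotes the
dimension of the $D$-dimensional
observation, whereas $i$ and $j$ are states of
the model. For the sake of simplicity, (27) to (30) consider only one mixture
component. The extension to additional mixtures is straightforward.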
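To make the relation between HCRF scores and HMM parameters more tangible, the sketch below evaluates the class posterior of a small Markov-chain HCRF with a forward pass in the log domain. It is a minimal illustration under stated assumptions: features are scalar, each state has a single Gaussian, and the observation score is built from HMM-equivalent weights in the sense of [12]. All function names, the two word models, and the numbers are hypothetical and are not taken from the experiments.

```python
import math

NEG_INF = -math.inf


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def gaussian_weights(mu, var):
    """HMM-equivalent weights for the statistics 1, x and x^2 (one dimension)."""
    occ = -0.5 * (math.log(2.0 * math.pi * var) + mu * mu / var)
    return occ, mu / var, -0.5 / var


def obs_score(weights, x):
    """Observation score; equals the Gaussian log-density for these weights."""
    occ, m1, m2 = weights
    return occ + m1 * x + m2 * x * x


def log_word_score(obs, log_init, log_trans, state_weights):
    """Forward pass: log of the sum over all state sequences of exp(score)."""
    n = len(log_init)
    alpha = [log_init[j] + obs_score(state_weights[j], obs[0]) for j in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + obs_score(state_weights[j], x)
            for j in range(n)
        ]
    return logsumexp(alpha)


def class_posteriors(obs, models):
    """Normalise the per-word scores over all word models."""
    scores = {w: log_word_score(obs, *m) for w, m in models.items()}
    log_z = logsumexp(list(scores.values()))
    return {w: math.exp(s - log_z) for w, s in scores.items()}


def left_right_model(means, var=1.0):
    """Two-state left-right model (hypothetical toy parameters)."""
    log_init = [0.0, NEG_INF]                      # always start in state 0
    log_trans = [[math.log(0.5), math.log(0.5)],   # stay or advance
                 [NEG_INF, 0.0]]                   # final state loops
    return log_init, log_trans, [gaussian_weights(m, var) for m in means]


models = {"one": left_right_model([0.0, 1.0]),
          "two": left_right_model([3.0, 4.0])}
posteriors = class_posteriors([0.1, 0.2, 0.9], models)
```

Because the scores only enter through sums and exponentials, the transition scores could be replaced by arbitrary real numbers without changing the recursion, which reflects the statement that they do not have to be normalised probabilities.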
5.2. Speech Modelling in the Time Domain
An alternative
to conventional HMM modelling of speech is the modelling of the raw signal
directly in the time domain. As shown in [13], modelling the raw signal can
be a reasonable alternative to feature-based approaches. Such an architecture
offers the advantage that including an explicit noise model is straightforward,
as can be seen in Section 5.2.2.
5.2.1. Switching Autoregressive Hidden Markov Models
In [14], a
Switching Autoregressive HMM is applied for isolated digit recognition. The
SAR-HMM is based on modelling the speech signal as an autoregressive (AR)
process, whereas the nonstationarity of human speech is captured by the
switching between a number of different AR parameter sets. This is done by a
discrete switch variable $s_t$ that can be
seen as an analogue of the HMM states. One of $S$ different
states can be occupied at each time step $t$. Thereby, the state variable $s_t$ indicates which AR
parameter set to use at the given time instant $t$. Here, the time index $t$ denotes the
samples in the time domain and not the feature vectors as in Section 4.2. The
current state only depends on the preceding state with transition probability $p(s_t \mid s_{t-1})$. Furthermore, it is assumed that the current sample $v_t$ is a linear
combination of the $R$ preceding
samples superposed by a Gaussian distributed innovation $\eta_t$. Both $\eta_t$ and the AR
weights $c_r$ depend on the
current state $s_t$:
$$v_t = \sum_{r=1}^{R} c_r(s_t)\, v_{t-r} + \eta_t \quad \text{with} \quad \eta_t \sim \mathcal{N}\left(0, \sigma^2(s_t)\right).$$
The purpose of $\eta_t$ is not to model
an independent additive noise process but to model variations from pure
autoregression. For the SAR-HMM, the joint probability of a sequence of length $T$ is
$$p(s_{1:T}, v_{1:T}) = p(v_1 \mid s_1)\, p(s_1) \prod_{t=2}^{T} p(v_t \mid v_{t-R:t-1}, s_t)\, p(s_t \mid s_{t-1}),$$
corresponding to the Dynamic Bayesian Network (DBN)
structure illustrated in Figure 6.
Figure 6: Dynamic Bayesian network structure of the
SAR-HMM.
As the number of samples in the time domain which are
used as input for the SAR-HMM is usually a lot higher than the number of
feature vectors observed by an HMM, it is necessary to ensure that the
switching between the different AR models is not too fast. This is ensured by
forcing the model to stay in the same state for an integer multiple of a fixed number of time steps.
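The generative view of the SAR-HMM, including the restriction on how often the state may switch, can be sketched as follows. This is only an illustration and not the implementation of [14]: the function name `sample_sar`, the block length, the AR weights, and the transition probabilities are hypothetical, and the signal before the first sample is assumed to be zero.

```python
import random


def sample_sar(n_samples, ar_weights, sigmas, trans, block, seed=0):
    """Draw samples from a switching AR process whose state may only
    change every `block` samples."""
    rng = random.Random(seed)
    order = len(ar_weights[0])
    n_states = len(ar_weights)
    history = [0.0] * order  # assumed zero signal before the first sample
    state, states, samples = 0, [], []
    for t in range(n_samples):
        if t > 0 and t % block == 0:
            # A state transition is only allowed at block boundaries
            state = rng.choices(range(n_states), weights=trans[state])[0]
        # history[0] is the most recent sample, matching the first AR weight
        prediction = sum(c * v for c, v in zip(ar_weights[state], history))
        sample = prediction + rng.gauss(0.0, sigmas[state])
        history = [sample] + history[:-1]
        states.append(state)
        samples.append(sample)
    return samples, states


samples, states = sample_sar(
    n_samples=400,
    ar_weights=[[0.5, -0.3], [1.2, -0.8]],  # two stable second-order AR sets
    sigmas=[0.1, 0.05],
    trans=[[0.7, 0.3], [0.4, 0.6]],
    block=80,
)
```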
The training of the AR parameters is realised by
applying the EM algorithm. To infer the state posteriors $p(s_t \mid v_{1:T})$ given all samples $v_{1:T}$, a technique based on the forward-backward algorithm
is used. Due to the fact that an observation depends on preceding
observations (see Figure 6), the backward pass is more complicated for the
SAR-HMM than for a conventional HMM. To overcome this problem, a “correction
smoother” as derived in [65] is applied which means that the
backward pass computes the posterior $p(s_t \mid v_{1:T})$ by
“correcting” the output of the forward pass.
5.2.2. Autoregressive Switching Linear Dynamical Systems
To improve
noise robustness, the SAR-HMM can be embedded into an AR-SLDS to include an
explicit noise process as shown in [14]. The AR-SLDS interprets the
observed speech sample $v_t$ as a noisy
version of a hidden clean sample. Thereby, the clean signal can be obtained
from the projection of a hidden vector $h_t$ which has the
dynamic properties of a Linear Dynamical System as follows:
$$h_t = A(s_t)\, h_{t-1} + \eta_t^{H} \quad \text{with} \quad \eta_t^{H} \sim \mathcal{N}\left(0, \Sigma_H(s_t)\right).$$
The dynamics of the hidden
variable are defined by the transition matrix $A(s_t)$ which depends
on the current state $s_t$. Variations from pure linear state dynamics are
modelled by the Gaussian distributed hidden “innovation” variable $\eta_t^{H}$. Similar to the innovation variable used in (31)
for the SAR-HMM, $\eta_t^{H}$ does not model an independent additive noise source. To obtain the current observed
sample, the vector $h_t$ is projected
onto a scalar $v_t$ as
follows:
$$v_t = B\, h_t + \eta_t^{V} \quad \text{with} \quad \eta_t^{V} \sim \mathcal{N}\left(0, \sigma_V^{2}\right).$$
The variable $\eta_t^{V}$ thereby models
independent additive white Gaussian noise which is supposed to corrupt the
hidden clean sample $B h_t$. Figure 7 visualises the structure of the SLDS
modelling the dynamics of the hidden clean signal as well as independent
additive noise.
Figure 7: Dynamic Bayesian network structure of the
AR-SLDS.
The SLDS parameters $A(s_t)$, $B$, and $\Sigma_H(s_t)$ can be defined
in a way that the obtained SLDS mimics the SAR-HMM derived in Section 5.2.1
for the case $\sigma_V = 0$ (see [14]). This
has the advantage that in case $\sigma_V \neq 0$ a noise model
is included without having to train new models. Since inference calculation for
the AR-SLDS is computationally intractable, the “Expectation Correction”
algorithm developed in [66] is applied to reduce the complexity. In contrast to
the exact inference which requires $O(S^T)$ operations, the passes performed by the Expectation Correction
algorithm are linear in $T$.
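One common way to let a linear dynamical system reproduce an AR process is the companion form, in which the hidden vector stacks the most recent clean samples. The sketch below checks this equivalence numerically for the noise-free case. It is an assumption-laden illustration: the exact parameterisation used in [14] may differ, and the function names, AR weights, and innovation values are hypothetical.

```python
def companion(ar_weights):
    """Transition matrix whose first row holds the AR weights and whose
    remaining rows shift the previous samples down by one position."""
    order = len(ar_weights)
    rows = [list(ar_weights)]
    for i in range(order - 1):
        rows.append([1.0 if j == i else 0.0 for j in range(order)])
    return rows


def lds_step(A, h, innovation):
    """Hidden update h_t = A h_{t-1} + innovation (first component only)."""
    new_h = [sum(a * x for a, x in zip(row, h)) for row in A]
    new_h[0] += innovation
    return new_h


def project(h, obs_noise=0.0):
    """Observed sample: first component of h plus optional additive noise."""
    return h[0] + obs_noise


weights = [0.5, -0.3]
innovations = [1.0, 0.2, -0.4, 0.1]

# Direct AR recursion
ar_out, past = [], [0.0, 0.0]
for eta in innovations:
    v = weights[0] * past[0] + weights[1] * past[1] + eta
    ar_out.append(v)
    past = [v, past[0]]

# Same signal through the linear dynamical system with zero observation noise
A, h, lds_out = companion(weights), [0.0, 0.0], []
for eta in innovations:
    h = lds_step(A, h, eta)
    lds_out.append(project(h))
```

Passing a non-zero `obs_noise` to `project` adds the independent noise term on top of the unchanged clean dynamics, which is why no retraining of the AR parameters is needed in this construction.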
While the SAR-HMM has shown rather poor performance in
noisy conditions, the AR-SLDS achieves excellent recognition rates for speech
disturbed by white noise, as the observation noise variable $\eta_t^{V}$ incorporates an
additive white Gaussian noise (AWGN) model. In clean conditions, however, the
performance of HMM speech modelling in the feature domain cannot be reached by
the AR-SLDS, since time domain modelling is not as close to the principle of
human perception as the well-established MFCC features. Also for coloured
noise, the AR-SLDS cannot compete with feature domain approaches such as the
SLDM. Further, computational complexity is still very high for the AR-SLDS. The
Expectation Correction algorithm can reduce complexity from $O(S^T)$ to $O(T)$; however, for a speech utterance sampled at 16 kHz, the sequence length $T$ is 160 times
higher than for a feature vector sequence extracted every 10 milliseconds.
6. Experiments
In order to
compare the different speech signal preprocessing, feature enhancement, and
speech modelling techniques introduced in Sections 3 to 5 with respect to their
recognition performance in various noise scenarios, we implemented all of the
techniques in a noisy speech recognition experiment which will be outlined in the
following.
6.1. Speech Database
The digits
“zero” to “nine” as well as the letters “A” to “Z” from
the TI 46 Speaker Dependent Isolated Word Corpus [67] are
used as speech database for the noisy digit and spelling recognition task. The
database contains utterances from 16 different speakers (8 female and 8 male
speakers). For the sake of better comparability with the results presented in
[14], only the words which are spoken by male speakers are used.
For every speaker, 26 utterances were recorded per word class, of which 10
samples are used for training and 16 for testing. Consequently, the overall
digit training corpus consists of 800 utterances, while the digit test set
contains 1280 samples. The same holds for the spelling database, consisting of
2080 utterances for training and 3328 for testing.
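The corpus sizes follow directly from the number of male speakers, word classes, and utterances per word class. The short check below reproduces the arithmetic; the constant and function names are illustrative only.

```python
SPEAKERS = 8         # only the male speakers are used
TRAIN_PER_WORD = 10  # training utterances per speaker and word class
TEST_PER_WORD = 16   # test utterances per speaker and word class


def corpus_size(n_words):
    """Return (training, test) utterance counts for n_words word classes."""
    return (SPEAKERS * n_words * TRAIN_PER_WORD,
            SPEAKERS * n_words * TEST_PER_WORD)


digit_train, digit_test = corpus_size(10)    # "zero" to "nine"
letter_train, letter_test = corpus_size(26)  # "A" to "Z"
```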
6.2. Noise Database
Even though we
also considered babble and white noise scenarios, the main focus of this work
lies on designing a robust speech recogniser for an in-car environment. Thus,
great emphasis has been laid on simulating a wide spectrum of different noise
conditions that can occur in the interior of a car. In general, interior noise
can be roughly divided into four groups. The first one is wind noise, which is
generated by air turbulence at the corners and edges of the vehicle and
increases with the velocity. Another noise type is engine noise, which depends on
load and number of revolutions. The third noise group is caused by wheels,
driving, and suspension and is influenced by road surface and wheel type. Thus,
a rough surface causes more wheel and suspension noise than a smooth one.
Finally, buzz, squeak, and rattle noises generated by pounding or relative movement
of interior components of a vehicle have to be considered [68].
In line with existing in-car speech recognition
systems, the microphone is assumed to be mounted in the middle of the instrument panel.
Consequently, all masking noises occurring in the interior of a car have been
recorded at exactly this position. Figure 8 illustrates the different noise
sources. Note that the mouth-to-microphone transfer function was neglected
in the experiments in Section 6.3, since the masking effect of background
noise was found to be much stronger than the effect of convolutional noise. In
an additional experiment, the slight degradation of recognition performance
caused by convolving the speech signal with a recorded in-car impulse
response could be fully compensated by simple Cepstral Mean Subtraction.
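The reason why Cepstral Mean Subtraction can compensate a convolutional distortion is that a time-invariant channel that is short relative to the analysis window appears approximately as a constant offset in the cepstral domain. The sketch below illustrates this idealised case only; the feature values and the channel offset are hypothetical and the approximation itself is an assumption.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Hypothetical two-dimensional cepstral features of a clean utterance
clean = [[1.0, -0.5], [2.0, 0.0], [0.5, 1.5], [1.5, 0.2]]
channel = [0.8, -0.3]  # assumed constant cepstral offset of the channel
filtered = [[c + h for c, h in zip(f, channel)] for f in clean]

clean_cms = cepstral_mean_subtraction(clean)
filtered_cms = cepstral_mean_subtraction(filtered)
```

After normalisation, `clean_cms` and `filtered_cms` coincide up to rounding, since the constant offset is absorbed by the utterance mean.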
Figure 8: In-car speech and masking sound (top) and information
flow (bottom).
As interior noise masking varies depending on vehicle
class and derivatives [68], speech is superposed with noise from the four different vehicles
listed in Table 1.
Table 1: Considered vehicles.
Thus, a wide spectrum of car variations can be
covered. Not only the vehicle type but also the road surface influences the
characteristics of interior noise. Hence, three different surfaces in
combination with typical velocities have been considered as shown in Table 2.
The lowest excitation is produced by driving on a smooth city road at 50 km/h and
medium revolution (CTY). Thus, for this profile the noise caused by wind, engine,
wheels, and so forth is at its minimum. The next higher excitation is
measured for a highway drive at 120 km/h (HWY). In that case, wind noise is
several times higher than for a drive at 50 km/h. The worst and loudest sound in the
interior of a car is provoked by a road with big cobbles (COB). At 30 km/h, wind
noise can be neglected but the rough cobble surface causes dominant wheel and
suspension noise. Figure 9 shows the SNR histograms of the noisy speech
utterances for all four car types at each driving condition.
Table 2: Considered road surfaces and velocities.
Figure 9: SNR level histograms for noisy speech utterances.
In spite of SNR levels below 0 dB, speech in the noisy
test sequences is still well audible since the recorded noise samples are
lowpass signals with most of their energy in the frequency band from 0 to 500 Hz (see Figure 10). Consequently, there is no full overlap of the spectrum of
speech and noise. The extremely low SNR levels for the car noises (see Figure
9) are mainly caused by intense spectral components below the spectrum of human
speech (motor drone). Filtering out those spectral components did not
significantly affect recognition performance. Note that no A-weighting was
applied when estimating the SNR levels.
Figure 10: Long-term spectrum of the car noises COB, HWY, CTY
(Mini Cooper S) and the spectral characteristics of the vowel [i:] spoken by a
male speaker.
Apart from car noises (CAR), two further noise types
are used in our experiments: first, a mixture of babble and street noise (BAB)
at SNR levels 12 dB, 6 dB, and 0 dB, recorded in downtown Munich. This noise
type is relevant for in-car speech recognition performance when driving in
an urban area with open windows. Furthermore, additive white Gaussian noise
(WGN) has been used (SNR levels 20 dB, 10 dB, and 0 dB).
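For noise types that are added at fixed SNR levels, the noise has to be scaled relative to the speech power before superposition. The text does not state how this scaling was implemented, so the following is only a sketch under the assumption of a broadband, unweighted power ratio; the function names and the synthetic signals are hypothetical.

```python
import math


def power(x):
    """Mean power of a sample sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


def measured_snr_db(speech, noisy):
    """SNR in dB of a noisy signal, given the clean reference."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins for a speech and a noise recording of equal length
speech = [math.sin(0.3 * t) for t in range(200)]
noise = [0.2 * math.cos(1.7 * t) for t in range(200)]
noisy, gain = mix_at_snr(speech, noise, snr_db=6.0)
```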
Note that heating, ventilating, and air conditioning
(HVAC) noise was not examined as a further potential noise source that can occur
inside a car, since fan and defrost facilities were turned off during noise
recording. Although it is quite evident that such additional in-car noises can
further degrade speech recognition performance, we abstained from varying fan
and defrost settings as those noise types can be characterised as stationary
and are likely to not change the ranking of the individual noise compensation
strategies but rather result in a negative “performance offset.”
Likewise, the Lombard effect, which causes humans
to speak louder when background noise is present, was also not considered since
this would mostly result in a constant shift of the SNR histogram (Figure 9)
towards higher SNR levels, without affecting conclusions about the
effectiveness of the different denoising strategies.
6.3. Results
For every digit, a model was trained to build an
isolated word recogniser. In the case of HMM and HCRF, each model consists of
eight states with a mixture of three Gaussians per state. Thereby, clean
utterances were used for training. 13 Mel-frequency cepstral coefficients as
well as their first- and second-order derivatives were extracted. In addition,
the usage of PLP features instead of MFCC was evaluated. Attempting to remove
the effects of noise, various speech enhancement strategies as outlined in
Section 4 were applied: Cepstral Mean Subtraction, Mean and Variance
Normalisation, Histogram Equalisation, Unsupervised Spectral Subtraction, and
Advanced Front-End feature extraction. In most of the experiments, the
recognition rate for clean speech was around 99.9%. All parameters were tuned
to achieve the best possible recognition performance.
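Appending first- and second-order derivatives to 13 static coefficients yields 39-dimensional feature vectors. The sketch below uses the widely used regression formula for the derivatives; the window of two frames and the replication of edge frames are assumptions, since the text does not specify how the derivatives were computed, and the function names are illustrative.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a feature sequence.
    Frames beyond the sequence boundaries are replaced by the edge frame."""
    n, dim = len(frames), len(frames[0])
    norm = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Concatenate static, first-order and second-order coefficients."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Synthetic 13-dimensional static features that rise linearly over time
static = [[float(t + c) for c in range(13)] for t in range(10)]
features = add_dynamic_features(static)
```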
As can be seen in Table 3, for stationary lowpass
noise like the “CAR” and “BAB” noise types, the best average
recognition rate can be achieved when enhancing the speech features using a
global Switching Linear Dynamic Model for speech and a Linear Dynamic Model for
noise (see Section 4.2). Thereby, all available clean training sequences were
used to train the global SLDM which captures the dynamics of clean speech. The
speech model consisted of 32 hidden states. The utterance-specific noise model
consisted of a single Gaussian mixture component and was trained on the first
and last 10 frames of the noisy test utterance. To speed up the calculation, the
algorithm for speech enhancement was run with a restricted history parameter (see Section
4.2.4). Also for more demanding recognition tasks like the Interspeech
Consonant Challenge [69], SLDM feature enhancement was shown to increase
recognition rates for noisy speech. The technique cannot compete with
strategies using perfect knowledge of the local SNR of time-frequency
components in the spectrogram like oracle masks [70–72]; however, compared to the
Consonant Challenge HMM baseline recogniser [69], the SLDM approach can improve noisy speech
recognition rates by up to 174% [73].
Table 3: Mean isolated digit recognition rates (in %) for
different noise types, noise compensation strategies, and features (training on
clean data), sorted by mean recognition rate.
Applying Hidden Conditional Random Fields instead of
HMM for the classification of features enhanced by CMS did not result in a
better recognition rate.
For speech disturbed by white noise, the best
recognition rate (93.3%, averaged over the different SNR conditions) is reached
by the autoregressive Switching Linear Dynamical System explained in Section
5.2.2, where the noisy speech signal is modelled in the time domain as an
autoregressive process. As explained in Section 5.2.2, the AR-SLDS constitutes
the fusion of the SAR-HMM with the SLDS. The AR-SLDS used in the experiment is
based on a 10th order SAR-HMM with ten states. This concept is however not
suited for lowpass noise at negative SNR levels: for the “CAR” noise type
a poor recognition rate of 47.2%, averaged over all car types and driving
conditions, was obtained for AR-SLDS modelling. A reason for this is the
assumption in (36) which expects additive noise to have a flat spectrum.
In case an HMM recogniser without feature enhancement
is applied, PLP features perform slightly better than MFCC.
For white Gaussian noise, Table 4 compares the
recognition rates obtained in this work with the performance reported in
[14], using Unsupervised Spectral Subtraction, SAR-HMM and
AR-SLDS modelling. Note that we used only 10 digits in our experiment (“zero”
to “nine”), while [14] used 11 digits (including “oh”), which, together with
extensive parameter tuning, is likely the major reason why our SAR-HMM and
AR-SLDS performance is better.
Table 4: Isolated digit recognition rates (in %) for different
SNR levels (white Gaussian noise) and noise compensation strategies (training
on clean data); comparison between the results obtained in this work and the
results reported in [14].
Table 5 summarises the mean recognition rates of an HMM
recogniser without feature enhancement for three different training strategies:
training on clean data, mismatched conditions training, and matched conditions
training. Here, mismatched conditions training denotes the case in which training
and testing are done using speech sequences disturbed by the same noise type but
at unequal noise conditions (SNR levels and driving conditions, respectively). Matched
conditions training means training and testing with exactly identical noise
types and noise conditions. Whenever the test sequence is disturbed by noise,
mismatched conditions training outperforms a recogniser that had been trained
on clean data. However, the main drawback of this approach is that for clean
test sequences the mismatched conditions training strategy significantly
downgrades recognition rates since in this case the noise pattern that had been
learned during the training is missing when testing the recogniser. The results
for matched conditions training serve as an upper benchmark for noisy speech
recognition performance, as this strategy assumes perfect knowledge of the
noise properties. Note that since in the matched conditions experiment one
model was trained for every noise condition, this not only implies knowledge of
the noise characteristics (e.g., by considering GPS or velocity information)
but also higher memory requirements, as more than one model has to be stored.
In the in-car scenario, this would entail one model for every driving
condition, resulting in an increase of model size by a factor of four.
Table 5: Mean isolated digit recognition rates (in %) of an
HMM recogniser without feature enhancement for different noise types and
training strategies: matched conditions (MC) training, mismatched conditions
(MMC) training, and training with clean data.
The best MFCC feature enhancement methods were also
applied in the spelling recognition task (see Table 6). Again, for noisy test
data, SLDM performs better than conventional techniques like HEQ.
Table 6: Mean spelling recognition rates (in %) for different
noise types and noise compensation strategies (training on clean data).
7. Conclusion
In this
article, a wide range of different techniques to improve the performance of
automatic speech recognition in noisy surroundings has been implemented and
evaluated in a noisy in-car isolated digit and spelling recognition task. In
contrast to previous research, diverse cars and driving conditions resulting
in different spectral noise characteristics have been taken into account in
order to obtain reliable conclusions about the universality of recognition
performance. Thereby, four major approaches, affecting feature extraction,
feature enhancement, speech decoding, and speech modelling, have been
considered.
With the aim of approximating the speech recognition
performance of human perception in noisy conditions, PLP features were used as
speech representation, which leads to a relative error reduction of 18.6% (averaged
over all evaluated noise conditions) with respect to conventional MFCC.
Furthermore, we showed that feature enhancement methods based on spectral
subtraction and normalisation like Cepstral Mean Subtraction, Mean and Variance
Normalisation, Unsupervised Spectral Subtraction, or Histogram Equalisation are
able to partly remove the effects of stationary coloured noises as they occur
in the interior of a car.
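Relative error reductions such as the one quoted above relate the drop in error rate to the error rate of the baseline. A minimal sketch of this computation is given below; the function name is illustrative and the example rates are hypothetical, not values from the tables.

```python
def relative_error_reduction(baseline_rate, improved_rate):
    """Relative error reduction in percent.
    Both arguments are recognition rates in percent."""
    baseline_error = 100.0 - baseline_rate
    improved_error = 100.0 - improved_rate
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Hypothetical example: raising the recognition rate from 90% to 95%
# halves the error rate.
example = relative_error_reduction(90.0, 95.0)
```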
As a further approach to enhance speech features, a
global Switching Linear Dynamic Model was used to capture the dynamics of
speech enabling a model-based speech enhancement through joint speech and noise
modelling. This technique prevailed for all car noise types and reached the
best mean recognition rate of 96.9% for the noisy isolated digit recognition
task.
The usage of Hidden Conditional Random Fields as an
alternative model architecture could not outperform the conventional HMM.
However, embedding a Switching Linear Dynamical System into a Switching
Autoregressive HMM, and thereby modelling the raw signal in the time domain,
leads to the best recognition performance for speech corrupted with additive
white Gaussian noise.
Adapting the speech models by using noisy training
data to build the models could also improve noise robustness. While matched
conditions training is hardly possible in real-life applications since the
exact noise condition is not known a priori, mismatched conditions training,
which uses training sequences disturbed by the same noise type as in the test phase but at different noise conditions,
outperformed training on clean data with a relative error
reduction of 54.5%.
Apart from recognition performance, computational
complexity and possible fields of application also have to be considered when
designing a robust speech recogniser. While AFE and USS are more complex than
feature normalisation techniques such as CMS or MVN, they are still suited for
real-time applications. HEQ and SLDM feature enhancements achieve better
recognition rates but require more computational resources. Modelling the
speech signal in the time domain as done in the AR-SLDS experiment requires the
most computational power and is therefore not suited for most real-life
applications. For stationary noises, the SLDM is the most promising technique;
however, it relies on accurate voice activity detection.
To optimise existing denoising strategies, future
research effort could be spent on increasing the suitability of promising
concepts like SLDM feature enhancement for the in-car speech recognition task
by including discrete state transition probabilities or finding the optimum
compromise between an increment of the history parameter and computational
complexity. Furthermore, the AR-SLDS concept could be optimised for coloured
noise to improve recognition performance when applying autoregressive speech
modelling for in-car speech recognition. It would also be interesting to see how the
implemented denoising methods perform in a continuous speech recognition task
where, due to longer observation sequences, the parameters of a global SLDM as
well as the cumulative histogram for the HEQ method could be estimated more
precisely than in an isolated digit or spelling recognition experiment. Further
improvements in noise robustness could also be achieved by combining different
denoising concepts or by the application of other promising modelling concepts
like Long Short-Term Memory Recurrent Neural Networks.
Speech recognition in noisy environments remains
challenging; however, as shown in this article, spending effort on finding
accurate techniques for auditory modelling, feature enhancement, speech
modelling, and model adaptation can remarkably reduce the performance gap between
automatic speech recognition and human perception.
Acknowledgments
The authors would like to thank Jasha Droppo and
Bertrand Mesot for providing SLDM and AR-SLDS binaries. The research leading to
these results has received funding from the European Community's Seventh
Framework Programme (FP7/2007-2013) under Grant agreement no. 211486 (SEMAINE).