Abstract

Speech enhancement in a vehicle environment remains a challenging task because of the complex noise. This paper presents a feature extraction method that applies an interchannel attention mechanism frame by frame to learn spatial features directly from the multichannel speech waveforms. The spatial features of the individual signals learned by the proposed method are provided as input, so that a two-stage BiLSTM network is trained to perform adaptive spatial filtering with time-domain filters spanning the signal channels. The two-stage BiLSTM network is capable of extracting local and global features and achieves competitive results. Using scenarios and data based on car cockpit simulations, and in comparison with other methods that extract features from multichannel data, the results show that the proposed method performs significantly better in terms of SDR, SI-SNR, PESQ, and STOI.

1. Introduction

While driving, the speech signals recorded by a microphone are often corrupted by reverberation and background noise, such as wind noise, engine noise, and tire noise, leading to considerable degradation in speech quality, particularly at low signal-to-noise ratios (SNRs) [1]. Speech enhancement technology can improve the speech quality of the intercom system and the performance of the speech recognition system. Multichannel enhancement in vehicle scenarios uses microphone arrays, which are convenient and flexible for speech-enabled applications [2]. The multichannel structure can provide more spatial information from the interchannel data and better results than a single channel.

Although microphone array technology has been developed for a long time, multichannel speech enhancement remains a great challenge in the field of speech recognition. Existing methods can be divided into two categories: those based on the frequency domain and those based on the time domain. Researchers mostly use frequency-domain methods, which are based on short-time spectrum estimation. Chakrabarty and Habets [3] proposed a multichannel online speech enhancement method based on time-frequency masking. A convolutional recurrent neural network (CRNN) is used to estimate the mask, and the effects of the ideal ratio mask (IRM) and the ideal binary mask (IBM) on the results are discussed. The results show that the method is robust to different sound source angles. In [4], a multichannel speech enhancement system based on a deep neural network is proposed: the audio signal is first transformed into the frequency domain by the STFT, a time-frequency mask is estimated by a DNN, and multichannel Wiener filtering is performed using the power spectral densities of speech and noise. The experimental results show that the method is effective. A beamforming method different from traditional DNN approaches is proposed in [5]: the spectrum of each channel is mapped into a non-Euclidean space, phase information is used to improve real-time performance, and a graph neural network is trained end to end. Compared with existing methods, it achieves better experimental results. A time-domain beamforming method named FaSNet (Filter-and-Sum Network), suitable for low-latency applications, is proposed in [6]. The authors select a reference channel, compute the filters of the other channels from the reference channel, and then sum the filtered speech of all channels to obtain the denoised speech. The model is small, and its performance is better than that of traditional beamforming methods. In [7], a streaming speech enhancement system is proposed, which adopts the Wave-U-Net framework, adds temporal convolutions and an attention mechanism to the encoder-decoder structure, and explores a history caching mechanism. This method achieves almost the same noise reduction effect as the nonstreaming model. The time-domain convolutional denoising autoencoder (TCDAE) method is proposed in [8]. It learns the mapping between noisy speech waveforms and clean speech waveforms and effectively handles the signal delays between different channels. Compared with the traditional denoising autoencoder, the performance is significantly improved.

Compared with the single channel, the most significant advantage of the multichannel speech enhancement model is that it can obtain abundant interchannel information. Therefore, for the multichannel model, extracting the spatial features between channels more effectively is the key to achieving better performance. In [9], a multichannel convolution sum (MCS) is used to extract features between channels. In addition, also in [9] and inspired by the IPD feature [10], the interchannel convolution difference (ICD) feature is proposed, which performs a one-dimensional convolution subtraction on a pair of microphones. Based on GCC-PHAT, [6, 11] considered the normalized cross-correlation (NCC) method, which uses cosine similarity to compute the information between channels. All the above methods achieve clear performance improvements for multichannel speech enhancement. To address the problem of speech enhancement in the car cockpit, this paper proposes a novel method based on an interchannel attention mechanism frame by frame (IAF), which helps analyse the influence of each channel on the speech signal by using the characteristic information of the channel. Moreover, the proposed method also explores interchannel relationships and achieves a richer representation of the channel structure. It provides a new idea for multichannel speech enhancement in the vehicle environment and can also be applied to smart homes, teleconferencing, and other scenarios.

The remainder of this paper is organized as follows: Section 1 introduces the related research in this field. The structure of the multichannel speech enhancement model based on IAF is presented in Section 2. The algorithm's performance in the vehicle environment is evaluated in Section 3. The experimental results on several microphone arrays are analysed and discussed in Section 4, and Section 5 draws conclusions and points out future research directions.

2. Problem Formulation

The proposed method aims to obtain an accurate estimate of the features for all channels of a single time frame, given the input feature representation of the corresponding frame. The multichannel speech enhancement process for vehicle data is divided into four successive steps. First, spatial features are extracted by IAF from the multichannel data augmented with context information. Then, frame-level beamforming filters for all microphones are estimated by a well-trained two-stage BiLSTM model from the spatial features and from the original waveforms processed by one-dimensional convolution. Next, the filters are applied to the noisy speech of every channel, thereby obtaining the beamformed speech. Finally, the beamformed channels are summed to produce the denoised speech. The details are presented in the following sections. A block diagram of the proposed multichannel enhancement framework is shown in Figure 1.
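To make the four steps concrete, the following is a minimal PyTorch sketch of the overall forward pass; the module names `iaf` and `tsbilstm` and the FFT-based filtering are illustrative assumptions, not the authors' implementation.

```python
import torch

def enhance(frames, iaf, tsbilstm):
    """Hypothetical end-to-end pass over framed multichannel audio.

    frames:   (batch, n_mics, n_frames, frame_len) noisy frames with context
    iaf:      module that re-weights the channels frame by frame (Section 2.2)
    tsbilstm: module returning one time-domain filter per channel and frame (Section 2.3)
    """
    feats = iaf(frames)                          # step 1: learn spatial features
    filters = tsbilstm(feats)                    # step 2: frame-level filters, same shape as frames
    # step 3: apply the filters; shown here as framewise circular convolution via FFT,
    # a simplification of the time-domain filtering described in the paper
    spec = torch.fft.rfft(frames) * torch.fft.rfft(filters, n=frames.shape[-1])
    filtered = torch.fft.irfft(spec, n=frames.shape[-1])
    return filtered.mean(dim=1)                  # step 4: sum/average the beamformed channels
```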

2.1. Data Preprocessing

It is assumed that the input signal received by each microphone is divided into overlapping frames, and the frame corresponding to each microphone is represented as
$$x_n^t = [x_n(tH), x_n(tH+1), \ldots, x_n(tH+L-1)],$$
where the frame length is $L$, the frameshift is $H$, the total length of the speech signal is $N_s$, and the total number of frames is $T$; $t$ is the frame index and $n$ is the channel index, so $x_n^t$ denotes the signal vector of frame $t$ collected by microphone $n$.

Due to the different distances between each microphone and the sound source, there is a time delay between the signals received by the microphones. A context window is added so that the model can capture interchannel delays of the signal samples [12]. We add a group of contextual speech samples around $x_n^t$ and define the result as
$$\hat{x}_n^t = [x_n(tH-W), \ldots, x_n(tH+L-1+W)],$$
where $W$ is the size of the context window and $\hat{x}_n^t$ is the signal vector of microphone $n$ containing the context information at frame $t$. The input sequence to the network therefore contains both past and future samples around the current frame.
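As a concrete illustration, the NumPy sketch below frames a single-channel signal with a symmetric context window; the frame length, hop, and context size mirror the values given later in Section 3.1, and whether the 256-sample context applies per side or in total is an assumption made here.

```python
import numpy as np

def frame_with_context(x, frame_len=64, hop=32, context=256):
    """Split a single-channel signal into overlapping frames and attach `context`
    samples of past and future signal to each frame (zero-padded at the edges)."""
    padded = np.pad(x, (context, context))                    # so edge frames also have context
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([
        padded[t * hop : t * hop + frame_len + 2 * context]   # frame t plus its context
        for t in range(n_frames)
    ])
    return frames                                             # (n_frames, frame_len + 2 * context)
```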

2.2. Interchannel Attention Mechanism Frame by Frame

We calculate the corresponding weights of different parts of the channel by constructing the score function to describe the transmission characteristics of the signal in the channel. The principle of interchannel attention mechanism frame by frame is shown in Figure 2.

In order to compress the context information $\hat{x}_n^t$, average pooling is first performed along the frame-length dimension:
$$\bar{x}_n^t = \frac{1}{L+2W} \sum_{k=1}^{L+2W} \hat{x}_n^t(k),$$
where $\bar{x}_n^t$ is the average value of microphone $n$ at frame $t$. Then, the pooled values of all channels, $\bar{\mathbf{x}}^t = [\bar{x}_1^t, \ldots, \bar{x}_N^t]$, are fed into multiple fully connected layers:
$$\mathbf{z}^t = F_2\big(F_1(\bar{\mathbf{x}}^t)\big),$$
where $\bar{\mathbf{x}}^t$ is the microphone array feature at frame $t$, $F_1$ is a set of fully connected layers with parametric rectified linear unit (PReLU) activation functions, $F_2$ is a set of fully connected layers with sigmoid activation functions, and the output dimensions of $F_1$ and $F_2$ are [128, 64, 128] and [N], respectively. Then, $\mathbf{z}^t$ is fed into the softmax activation function:
$$\mathbf{a}^t = \mathrm{softmax}(\mathbf{z}^t),$$
where $\mathbf{a}^t$ is a vector whose elements sum to 1. The final output is obtained by multiplying $\hat{x}_n^t$ with the corresponding attention weight:
$$y_n^t = a_n^t\,\hat{x}_n^t,$$
where $y_n^t$ represents the speech feature sequence of the $t$-th frame in the $n$-th channel.

By applying the attention mechanism frame by frame to the speech signals of the multiple channels, the model can learn the characteristics of each channel and capture the spatial features between channels more accurately.
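A minimal PyTorch sketch of this mechanism follows, assuming the layer sizes [128, 64, 128] and [N] quoted above; tensor layouts and everything not stated in the text are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IAF(nn.Module):
    """Interchannel attention frame by frame: one weight per channel per frame."""
    def __init__(self, n_mics):
        super().__init__()
        # F1: fully connected layers with PReLU, output sizes [128, 64, 128]
        self.f1 = nn.Sequential(
            nn.Linear(n_mics, 128), nn.PReLU(),
            nn.Linear(128, 64), nn.PReLU(),
            nn.Linear(64, 128), nn.PReLU(),
        )
        # F2: fully connected layer with sigmoid, output size [N]
        self.f2 = nn.Sequential(nn.Linear(128, n_mics), nn.Sigmoid())

    def forward(self, x):
        # x: (batch, n_mics, n_frames, frame_len_with_context)
        pooled = x.mean(dim=-1)                                  # average pooling over the frame-length axis
        scores = self.f2(self.f1(pooled.transpose(1, 2)))        # (batch, n_frames, n_mics)
        weights = torch.softmax(scores, dim=-1)                  # weights per frame sum to 1 across channels
        return x * weights.transpose(1, 2).unsqueeze(-1)         # re-weight every channel's frame
```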

2.3. Two-Stage BiLSTM Network

The two-stage bidirectional LSTM (TsBiLSTM) is used to derive the beamformer, since the BiLSTM is well suited to estimating global features. The beamformer aims to improve the SNR without distorting the target speech.

Figure 3 shows the TsBiLSTM architecture employed in this work. We divide the data into blocks, use the BiLSTM network to obtain local and global features of the blocks and to establish the temporal relationships of the signal, and use residual connections to alleviate the vanishing gradient problem.

In this work, we combine the speech signal with the context information in the first stage. The observed signal can be expressed as
$$Y^t = [y_1^t, y_2^t, \ldots, y_N^t],$$
where $Y^t$ represents the speech features of frame $t$ and $Y = [Y^1, \ldots, Y^T]$ represents all the speech features. We then apply a one-dimensional convolution to $Y$:
$$E = \mathrm{Conv1D}(Y),$$
and divide $E$ into $S$ blocks of the same size. Each block is denoted $E_s$, and all the blocks are concatenated to form a four-dimensional tensor $D = [E_1, \ldots, E_S]$.

We reshape $D$ so that the intra-block (local) dimension is processed as the sequence dimension and feed it into the first BiLSTM:
$$U = \mathrm{BiLSTM}_1(D).$$

The output of the BiLSTM passes through a linear layer and a GroupNorm operation to produce $U'$. $U'$ is reshaped back to the shape of $D$ and added to $D$ through a residual connection, which reduces the problem of vanishing or exploding gradients, finally yielding $D' = D + U'$.

In the next stage, $D'$ is permuted so that the inter-block (global) dimension becomes the sequence dimension, fed into the second BiLSTM in the same way as the first BiLSTM block, and the output $D''$ is obtained. Because the signals are presented to the BiLSTM model in these two different block arrangements, the local and global features of the signals are obtained, respectively.

Then, the overlap-add operation is used to convert the segmented blocks back into the original sequence:
$$Q = \mathrm{OverlapAdd}(D''),$$
where $\mathrm{OverlapAdd}(\cdot)$ restores the partitioned data. Next, $Q$ is convolved in two dimensions with a convolution kernel of size one:
$$P = \mathrm{Conv2D}(Q).$$
We perform two one-dimensional convolution operations on $P$, apply the Tanh and Sigmoid activation functions, respectively, and multiply the results to obtain the filters for each channel:
$$h = \tanh\big(\mathrm{Conv1D}_1(P)\big) \odot \sigma\big(\mathrm{Conv1D}_2(P)\big),$$
where ⊙ denotes the Hadamard product and the Sigmoid branch acts as a gating mechanism that controls the output of the filter.
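The sketch below illustrates the block segmentation, the overlap-add restoration, and the Tanh/Sigmoid gated output described above; the 50% block overlap, shapes, and channel counts are assumptions, since the paper does not fix them here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment(e, block_len):
    """(batch, feat, time) -> (batch, feat, block_len, n_blocks), 50% block overlap assumed."""
    hop = block_len // 2
    e = F.pad(e, (hop, hop))
    blocks = e.unfold(-1, block_len, hop)            # (batch, feat, n_blocks, block_len)
    return blocks.transpose(2, 3).contiguous()

def overlap_add(blocks, length):
    """Inverse of `segment`: fold (batch, feat, block_len, n_blocks) back to (batch, feat, length).
    Overlapping regions are summed; normalize if an exact inverse of `segment` is required."""
    b, f, block_len, s = blocks.shape
    hop = block_len // 2
    out = torch.zeros(b, f, (s - 1) * hop + block_len, device=blocks.device)
    for i in range(s):                               # simple, readable overlap-add loop
        out[..., i * hop : i * hop + block_len] += blocks[..., i]
    return out[..., hop : hop + length]              # strip the padding added in `segment`

class GatedOutput(nn.Module):
    """Two parallel 1-D convolutions; the Tanh and Sigmoid branches are multiplied elementwise."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.tanh_conv = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.gate_conv = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, p):                            # p: (batch, in_ch, time)
        return torch.tanh(self.tanh_conv(p)) * torch.sigmoid(self.gate_conv(p))
```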

Figure 4 shows the structure of the BiLSTM block. The input is the feature vector of the noisy speech with dimension 64, which is fed into a BiLSTM layer with hidden dimension 128; the output dimension is 256 because the LSTM is bidirectional. The result is then passed through a linear layer of dimension 64, and the block output is obtained after a GroupNorm operation.
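A sketch of one such block with the dimensions given above (input 64, hidden 128, bidirectional output 256, linear projection back to 64, then GroupNorm), including the residual connection from Section 2.3; the number of GroupNorm groups is an assumption.

```python
import torch
import torch.nn as nn

class BiLSTMBlock(nn.Module):
    """One block of the two-stage network: BiLSTM -> Linear -> GroupNorm -> residual add."""
    def __init__(self, feat_dim=64, hidden=128, groups=1):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)      # 256 -> 64
        self.norm = nn.GroupNorm(groups, feat_dim)       # number of groups is an assumption

    def forward(self, x):
        # x: (batch, seq_len, feat_dim); seq runs over the intra- or inter-block dimension
        out, _ = self.bilstm(x)                          # (batch, seq_len, 2 * hidden)
        out = self.norm(self.proj(out).transpose(1, 2)).transpose(1, 2)
        return x + out                                   # residual connection
```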

2.4. Summation

Integrating the signals of multiple channels into one output signal is an important step in the multichannel speech enhancement problem. After the signal of each channel is passed through its channel filter, the results are summed and averaged to obtain the final enhanced frame:
$$\hat{s}^t = \frac{1}{N}\sum_{n=1}^{N} h_n^t * \hat{x}_n^t,$$
where $*$ denotes the convolution operation. Ultimately, the enhanced frames $\hat{s}^t$ are converted from the segmented representation back into the enhanced speech waveform by overlap-add.
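A NumPy sketch of this filter-and-sum step, in which each channel's frame is convolved with its estimated filter and the channels are then averaged; the shapes and the full-length convolution mode are illustrative assumptions.

```python
import numpy as np

def filter_and_sum(frames, filters):
    """Apply each channel's estimated time-domain filter and average the results.

    frames:  (n_mics, n_frames, frame_len)  noisy frames (with context)
    filters: (n_mics, n_frames, filt_len)   filters estimated by the TsBiLSTM network
    Returns: (n_frames, frame_len + filt_len - 1) beamformed frames.
    """
    n_mics, n_frames, frame_len = frames.shape
    filt_len = filters.shape[-1]
    out = np.zeros((n_frames, frame_len + filt_len - 1))
    for m in range(n_mics):
        for t in range(n_frames):
            out[t] += np.convolve(frames[m, t], filters[m, t])   # per-channel time-domain filtering
    return out / n_mics                                          # sum and average across channels
```

The beamformed frames would then be overlap-added back into a single waveform, as in the previous subsection.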

2.5. Loss Function

In training and evaluation, the scale-invariant source-to-noise ratio (SI-SNR) is used as the loss function (the negative SI-SNR is minimized):
$$s_{\mathrm{target}} = \frac{\langle \hat{s}, s\rangle\, s}{\|s\|^2}, \qquad e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}, \qquad \mathrm{SI\text{-}SNR} = 10\log_{10}\frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{noise}}\|^2},$$
where $\hat{s}$ is the denoised speech and $s$ is the pure speech signal.
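A standard PyTorch formulation of this loss (negated for minimization) is sketched below; the zero-mean normalization of both signals is a common convention assumed here.

```python
import torch

def si_snr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR, averaged over the batch. Shapes: (batch, samples)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # remove DC offset
    target = target - target.mean(dim=-1, keepdim=True)
    # project the estimate onto the target to obtain the scaled target component
    s_target = (torch.sum(estimate * target, dim=-1, keepdim=True) /
                (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)) * target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(torch.sum(s_target ** 2, dim=-1) /
                              (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
    return -si_snr.mean()
```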

3. Experiment Section

The speech enhancement tasks are evaluated on four kinds of microphone array structures that simulate the placement of microphones in the car. The speech source and noise source locations are shown in Figure 5, where the black circles represent the microphones, the green square represents the speech source, and the red five-pointed star represents the noise source. The microphone arrays are designed as follows:
(i) A uniform linear array of 2 microphones with an intermicrophone distance of 3 cm, located in the front of the car cockpit, as shown in Figure 5(a)
(ii) Two uniform linear 2-channel microphone arrays, each with an intermicrophone distance of 3 cm, located in the front and the middle of the car cockpit, respectively, as shown in Figure 5(b)
(iii) A uniform linear array of 4 microphones with an intermicrophone distance of 3 cm, located in the front of the car cockpit, as shown in Figure 5(c)
(iv) A distributed array of 4 microphones with an intermicrophone distance of 80 cm, located around the car cockpit, as shown in Figure 5(d)

Different microphone array structures reflect different spatial characteristics. In order to make the method independent of the spatial position of the target speech source, every microphone array position and source-array distance is covered by the training conditions.

3.1. Datasets Building

For training, we used 3000 randomly chosen speech utterances from the LibriSpeech [13] dataset, an open and well-studied dataset for speech enhancement, each 4 s long and sampled at 16 kHz; 500 of these utterances were used as a validation set. Volvo car noise [14] was added to the training data to simulate noisy speech in the car cockpit, with SNRs chosen randomly between −10 dB and −5 dB. Additionally, since the amount of noise data is small, spsquare noise [15] was also added as a noise source, again with SNRs chosen randomly between −10 dB and −5 dB. All data are divided into frames of 64 samples with 50% overlap, and the context window is 256 samples.

To simulate a car cockpit, we set the space size to 3.4 meters long, 1.8 meters wide, and 1.4 meters high. The vehicle cockpit impulse responses required to simulate real acoustic conditions are generated by the gpuRIR toolbox [16], with the reverberation time (T60) selected randomly from 0.1 to 0.3 seconds.
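For illustration, a minimal sketch of building one noisy microphone signal from a clean utterance, a simulated impulse response (e.g., produced by gpuRIR and passed in here as an array), and a noise recording at a target SNR; the helper name and the power-based scaling convention are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_mixture(clean, noise, rir, snr_db):
    """Convolve clean speech with a (simulated) RIR and add noise at `snr_db`."""
    reverberant = fftconvolve(clean, rir)[: len(clean)]      # reverberant speech at one microphone
    noise = noise[: len(reverberant)]                        # assume the noise recording is long enough
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise

# example: SNR drawn uniformly from [-10, -5] dB, as in Section 3.1
# mixture = make_mixture(clean, volvo_noise, rir, snr_db=np.random.uniform(-10, -5))
```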

3.2. Experiment Settings

The experiment aims to verify the generalization capability of the proposed method over different microphone arrays and to compare its performance with that of traditional beamformers. For a fair comparison, all models, including NCC, MCS, and ICD, are built on the same two-stage BiLSTM modules presented in Section 2.3 for each microphone channel. Each BiLSTM network has 128 hidden units, and the number of layers is set to 4. For MCS and ICD, the size of the convolution kernel is 64, the number of convolution kernels is 16, the stride is 2, and the dilation is 2, leading to the output from which the filter estimate for each microphone is obtained.

The BiLSTM network was trained using the Adam optimizer, with minibatches of 128 input signals and a learning rate of 0.001. Meanwhile, gradient clipping with an L2-norm threshold of 5 is used to prevent gradient explosion. During training, if the validation loss does not decrease for 10 consecutive epochs, training stops automatically. A dynamic warmup strategy [17] is used to adjust the learning rate during training: the model is warmed up with a small learning rate in the initial stage to increase training stability, and the learning rate is then gradually reduced with a decay rate of 0.98 every 2 epochs. The learning rate is thus a function of the number of training steps and a small set of warmup hyperparameters; the specific approach is similar to [18]. All the implementations were done in PyTorch.
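Since the exact schedule formula and its hyperparameter values are not reproduced above, the following is only a hedged sketch of a warmup-then-decay schedule of the kind described (warmup at a small learning rate, then a 0.98 decay every 2 epochs); `warmup_steps` and `steps_per_epoch` are placeholders.

```python
import torch

def make_scheduler(optimizer, warmup_steps, steps_per_epoch, decay=0.98, decay_every_epochs=2):
    """Linear warmup for `warmup_steps`, then multiply the LR by `decay` every `decay_every_epochs` epochs."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                  # warm up from a small learning rate
        epochs_done = (step - warmup_steps) // (decay_every_epochs * steps_per_epoch)
        return decay ** epochs_done
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# example usage with the settings from the text (Adam, lr = 0.001); warmup_steps is a placeholder
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = make_scheduler(optimizer, warmup_steps=1000, steps_per_epoch=500)
# call scheduler.step() after every training step
```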

4. Results and Discussion

Following common speech enhancement metrics, we adopt the average SI-SNR, SDR, PESQ, and STOI improvements to evaluate multichannel speech enhancement performance. For a more comprehensive assessment of speech quality, we also report the performance under different SNRs of speech and noise. The experimental results are summarized in Table 1, where the numbers highlighted in bold are the best scores among the models. The results indicate that the proposed method outperforms the other methods when tested at different SNRs, which verifies the effectiveness of the model. By assigning weight values to each channel frame by frame and using the attention mechanism to learn the feature representation between channels, the proposed method achieves the best improvement in terms of all four metrics. It learns from the magnitude spectrum and phase spectrum of the individual microphone signals and exploits the differences in the spatial characteristics of the speech and noise sources.

Among the four microphone array structures designed in the experiment, we obtain a 13.60 dB SI-SNR improvement with the 2-microphone structure at −10 dB SNR and a 14.76 dB SI-SNR improvement with the distributed 4-microphone array structure.

Another conclusion from the experimental results is that the array structures with four microphones perform better than those with two microphones, indicating that the more channels there are, the more feature information can be provided to the speech enhancement model. In addition, compared with the other structures, the 4-channel distributed microphone array has the best performance. Its SDR improvements are [15.24, 13.87] dB at SNR = −10 dB and −5 dB, respectively, while the improvements of the other structures are [14.16, 11.63], [14.79, 12.31], and [14.32, 12.79] dB, respectively. The distributed microphone array structure has an advantage in capturing the spatial characteristics of the entire cockpit because of the different locations of the speech source and the noise source, which helps train a better beamforming filter.

Figure 6 shows the speech spectrograms, including the clean speech spectrogram, the noisy speech spectrogram at SNR = −10 dB, and the spectrograms of the speech enhanced by the four methods. All four methods achieve good noise reduction. Comparing the residual noise energy in the boxed regions of the enhanced spectrograms, the method proposed in this paper has a clear advantage. At the same time, comparing the enhanced speech spectrogram with the clean speech spectrogram shows that the method does not damage the speech and preserves the integrity of the speech signal.

5. Conclusions

This work proposed an interchannel attention mechanism frame by frame (IAF) method, combined with a two-stage BiLSTM network, to learn spatial features directly from multichannel waveforms and thus address multichannel speech enhancement in the car cockpit. Experimental results show that the IAF method is more effective than the traditional NCC, MCS, and ICD methods at learning spatial features directly from multichannel speech waveforms. The proposed model with the distributed four-microphone array obtains the best enhancement performance in terms of SDR, SI-SNR, STOI, and PESQ. The results indicate that the method is suitable for different microphone array structures and has good robustness. This work provides valuable conclusions for improving multichannel speech enhancement performance in the vehicle cockpit. In future work, we will explore the effect of the position of the speech source on the performance of the proposed method.

Data Availability

In order to facilitate the further research of other researchers, the LibriSpeech data in this article can be found at http://www.openslr.org/12/, the Volvo car noise data can be found at http://spib.linse.ufsc.br/noise.html, and the spsquare noise data can be found at https://zenodo.org/record/1227121#.YP0sjo4zZhG.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the fund project of Education Department of Liaoning Province (nos. LJKZ0338 and LJ2020FWL001) and the Undergraduate Innovation and Entrepreneurship Training Project (no. 202110147019).