Abstract
Speech enhancement in wireless acoustic sensor networks requires the exchange of audio signals. Since the wireless communication often dominates the nodes’ energy budget, techniques for data exchange reduction are crucial. Adaptive quantization aims to optimize the bit depth of each exchanged signal according to its contribution to the speech enhancement performance. This enables the network to scale its energy and communication bandwidth requirements according to the current operating environment. The impact metric was previously proposed to predict the effect of quantization in linear minimum mean squared error (MMSE) estimation. We provide new insights into greedy adaptive quantization based on this impact metric. We achieve this by expanding the mathematical framework to include a new metric based on the gradient of the MMSE as a function of the quantization noise power. Using these tools, we show how the MMSE gradient naturally leads to a greedy algorithm and how the impact metric is a generalization of the gradient metric and a previously proposed metric. Besides, we validate the impact metric for adaptive quantization both in a simulated and in a real wireless acoustic sensor network deployed in a home environment, showing the energy savings achievable through greedy adaptive quantization.
1. Introduction
A wireless acoustic sensor network (WASN) is a collection of battery-powered sensor nodes where each node is equipped with a microphone or microphone array, a processing unit, and a wireless communication module [1]. The nodes are distributed over an area of interest with the goal of performing a signal processing task such as noise reduction or acoustic localization. The main advantage of a WASN over a single standalone microphone array is its extended coverage, which is made possible by placing many microphones over the area of interest. This typically translates into a better performance, as microphone array algorithms benefit from enhanced spatial diversity. Furthermore, the deployment of a WASN often yields a higher probability of having microphones close to a sound source, which is advantageous since these microphones will record signals with a high signal-to-noise ratio (SNR).
Nevertheless, WASNs pose several technical challenges that are not present in standalone microphone arrays, such as internode synchronization, delay management, communication bandwidth usage, and energy efficiency. The latter, energy efficiency, is crucial to allow the network to perform its task for a reasonable period of time, since nodes are mostly powered by batteries and hence have a tight energy budget. A significant effort has been made to classify the different approaches to improve energy efficiency in wireless sensor networks (WSNs), as the optimal techniques depend on the intended WSN application. A comprehensive taxonomy of these approaches can be found in [2], and a more recent survey in [3] also considers the importance of the different techniques for specific classes of applications of WSNs.
In this paper we focus on a speech enhancement application for a WASN, where the goal is to estimate a desired speech signal while suppressing interfering sound sources and noise. In particular we focus on the multichannel Wiener filter (MWF) [4–6], which is a multimicrophone noise reduction algorithm that produces a linear minimum mean squared error (MMSE) estimate of the desired speech component in the signal captured by one of the microphones. The algorithm does not rely on a priori knowledge of the microphone or sound source locations, which makes it suitable for a WASN since nodes are usually randomly deployed and may even be mobile (e.g., if a node is carried by a person, such as a mobile phone or a hearing aid).
1.1. Sensor Subset Selection
A substantial part of previous research on energy efficiency in WSNs has been focused on the sensor subset selection problem, which is aimed at using only the signals from those sensors (microphones, in the case of WASNs) that provide a significant contribution to the signal processing task at hand, while putting other sensors to sleep. This saves energy by avoiding the transmission of signals from sensors with low relevance and allows the communication bandwidth resources to be allocated to the transmission of the signals from the most useful sensors. The sensor subset selection problem is combinatorial and thus difficult to solve in general. Due to its importance, it has been the focus of extensive research, and several techniques have been proposed to tackle it. For an overview of these techniques, the reader is directed to [7]. Recent work on sensor selection can be found in [8, 9] and references therein. In [8] the authors investigate the sensor selection problem for parameter estimation in a WSN where the sensor measurements follow a nonlinear model, assuming that the measurements are independent random variables. The problem is formulated as a nonconvex optimization problem and solved through convex relaxation. In [9] the authors develop a more general framework where they consider correlated measurement noise and propose a greedy algorithm to solve the sensor selection problem based on the Fisher information matrix.
A different approach has been proposed to solve the sensor selection problem for signal estimation based on a greedy algorithm using the utility metric [10, 11]. The utility of a sensor signal is defined as the change in estimation performance when the sensor is removed from the estimation process and the corresponding estimator is subsequently reoptimized. The motivation is that the utility can be computed and tracked at a very low computational cost, which combined with the greedy approach allows performing sensor subset selection swiftly and at low complexity, even though the solution will generally be suboptimal. Besides, the algorithm is fully data-driven and does not require any prior knowledge of the underlying measurement model, such as the microphone and source positions or the acoustic transfer functions, which is generally not available in WASN applications. This priority on speed and low complexity is crucial for adaptive signal estimation, since the network needs to react rapidly to the changing signal conditions (e.g., sound sources moving in the case of a WASN) and has to avoid investing too much energy from the already limited budget of the nodes. This approach has been specifically applied to WASNs [12], and it has been extended to a distributed implementation of the MWF [13].
1.2. Adaptive Quantization
While sensor subset selection does indeed help to save energy and communication bandwidth, it forces the nodes into a binary behaviour; that is, they either transmit their signals at full resolution or they are put to sleep. One technique to provide a more flexible scaling of the estimation performance and the energy consumption of the network is adaptive quantization, where each sensor signal is assigned a variable bit depth to encode its signal samples according to its contribution to the estimation performance. By using this technique, nodes are able to spend more or less energy on data transmission according to the estimation performance required. From the point of view of information theory, this problem can be tackled using source coding techniques. A comprehensive overview of source coding for WASNs can be found in [14, 15], where the focus is directed towards theoretical results based on rate-distortion theory.
In [16], a pragmatic approach is taken, in which a generalized version of the utility metric referred to as the impact metric is introduced to predict the MMSE increase in the estimation due to the quantization noise. This allows modeling the effect of the quantization noise resulting from changing the bit depth of each sensor signal’s samples on the estimation performance. The impact metric can be used by a heuristic algorithm to gradually decrease the bit depth in each sensor signal until a target MMSE (or corresponding SNR) is met.
1.3. Contributions and Outline of the Paper
The goal of this paper is twofold. Our first goal is to provide some new insights on greedy adaptive quantization based on the impact metric from [16]. To this end, we expand the mathematical framework for adaptive quantization in linear MMSE estimation and we apply it in a WASN with a centralized processing architecture. We consider the MMSE as a function of the quantization noise power in each sensor signal, and based on this we define a new metric for adaptive quantization based on the gradient of the MMSE. We demonstrate how this MMSE gradient naturally gives rise to a greedy algorithm. We then show how the impact metric is in fact a generalization of this gradient metric, which then also motivates the use of a greedy algorithm using the impact metric. Besides, we explain how the utility metric for sensor subset selection [10, 11] can be viewed as another limit case of the impact metric. Finally, we discuss the theoretical advantages and disadvantages of each metric and propose a correction to improve the gradient metric.
The second goal of the paper is to validate the impact metric for adaptive quantization in a speech enhancement task in a simulated as well as in a real life WASN in a home environment. We compare the behaviour of the four metrics and show the superiority of the impact and the corrected gradient metrics over the gradient and utility metrics due to their inherent adaptation to the significance of each quantization bit. To conclude, we provide an estimation of the savings in transmission energy achievable through the use of the greedy adaptive quantization algorithm based on the aforementioned metrics.
The paper is structured as follows. In Section 2, we formulate the problem statement and signal model, we briefly review the multichannel Wiener filter for speech enhancement, and we introduce the quantization error model that is used throughout the paper. In Section 3 we model the effect of quantization noise in linear MMSE estimation and show how adaptive quantization can be performed based on four metrics derived from this model (utility, impact, gradient, and corrected gradient). In Section 4 we show experimental results of adaptive quantization for speech enhancement performed on real recordings from a WASN. Finally, we present the conclusions in Section 5.
2. Problem Statement
We consider a WASN composed of several nodes, each having one or more microphones, with $K$ microphones in total. The signal samples of the $k$th microphone signal are encoded, upon acquisition by the analog-to-digital converter, with a certain bit depth dictated by the hardware in use. We consider a centralized scheme for the network, where each node transmits its microphone signals to a fusion centre, which could be one of the nodes in the WASN or an external node with access to more computational power or energy resources. The fusion centre’s task is to obtain an estimate of the desired speech component present in one of the microphone signals, which will be referred to as the reference microphone signal (the reference microphone does not necessarily belong to the fusion centre; the microphone of any node can be selected to be the reference). This speech enhancement task is solved in the fusion centre through the use of a multichannel Wiener filter [4–6], which produces a linear MMSE estimate of the desired speech signal component in the reference microphone signal. We will give a brief review of the MWF in Section 2.2.
Our main focus will be the problem of reducing the bit depth of each individual microphone signal in the WASN according to its contribution to the speech enhancement performance. The bit depth reduction leads to a reduction in the required communication bandwidth and in the node’s required energy budget for wireless transmission, but it will also have an impact on the speech enhancement performance. Besides, the contribution of each node to the enhancement performance is subject to changes in the acoustic scenario, so we will focus on strategies with low computational complexity that allow the fusion centre to perform a quick decision on the desired bit depth assignment for each individual microphone. This enables each node at runtime to scale down the energy spent in wireless transmission according to the current operating environment.
An illustration of the problem is given in Figure 1, where a small network with two nodes and a fusion centre is depicted. The nodes quantize the signals of each individual microphone with the corresponding bit depth before transmission. The fusion centre performs the speech enhancement task using the transmitted quantized microphone signals (dotted lines) and takes a decision on the optimal bit depth for each communicated microphone signal (dashed lines).
In the remaining part of this section we introduce formally the signal model for the WASN, we briefly review the multichannel Wiener filter for speech enhancement, and we explain the quantization error model we will use throughout the rest of the paper.
2.1. Signal Model
We denote the set of microphones by $\mathcal{K} = \{1, \dots, K\}$. The signal captured by the $k$th microphone can be described in the short-time Fourier transform (STFT) domain as
$$y_k(t, \omega) = x_k(t, \omega) + n_k(t, \omega), \quad k \in \mathcal{K}, \tag{1}$$
where $t$ is the frame index, $\omega$ represents frequency, $x_k(t, \omega)$ is the desired speech signal component, and $n_k(t, \omega)$ is the undesired noise signal component. We assume that $x_k$ and $n_k$ are uncorrelated. We note here that $n_k$ contains all undesired sound signals, which may include speech from undesired speakers besides acoustic noise. For the sake of simplicity, we will omit the indices $t$ and $\omega$ throughout the rest of the paper, keeping in mind that all operations take place in the STFT domain unless explicitly stated otherwise.
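The fusion centre stacks all signals in the vector
$$\mathbf{y} = [y_1, y_2, \dots, y_K]^T. \tag{2}$$
The vectors $\mathbf{x}$ and $\mathbf{n}$ are defined in a similar manner, so the relationship $\mathbf{y} = \mathbf{x} + \mathbf{n}$ is satisfied.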
2.2. Multichannel Wiener Filter
In speech enhancement, the goal is to obtain an estimate of the speech component present in the microphone signal selected as the reference. We will focus on the multichannel Wiener filter to perform the speech enhancement task, and we will provide a brief summary in this section. For more information the reader is directed to [4–6].
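The multichannel Wiener filter is the linear estimator $\hat{\mathbf{w}}$ that minimizes the mean squared error (MSE)
$$J(\mathbf{w}) = E\{|x_{\mathrm{ref}} - \mathbf{w}^H \mathbf{y}|^2\}, \tag{3}$$
where $x_{\mathrm{ref}}$ is the desired speech component in the reference microphone signal, $E\{\cdot\}$ is the expectation operator, and the superscript $H$ denotes conjugate transpose. When the microphone signal correlation matrix $\mathbf{R}_{yy} = E\{\mathbf{y}\mathbf{y}^H\}$ is full rank, the solution to the minimization problem is given by
$$\hat{\mathbf{w}} = \mathbf{R}_{yy}^{-1} \mathbf{r}_{yx}, \tag{4}$$
where $\mathbf{r}_{yx} = E\{\mathbf{y} x_{\mathrm{ref}}^*\}$ and the superscript $*$ denotes complex conjugation. (In practice, the full-rank assumption is usually satisfied because of the presence of a noise signal component in each microphone signal that is independent of other microphone signals, such as thermal noise. If this is not the case, matrix pseudoinverses have to be used instead of matrix inverses.) Since we assume that $\mathbf{x}$ and $\mathbf{n}$ are uncorrelated, $\mathbf{r}_{yx}$ is given by $\mathbf{R}_{xx}\mathbf{e}_{\mathrm{ref}}$, where $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^H\}$ is the desired speech correlation matrix and $\mathbf{e}_{\mathrm{ref}}$ is the vector $[0 \, \cdots \, 0 \; 1 \; 0 \, \cdots \, 0]^T$, where the entry corresponding to the reference microphone signal is equal to one.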
The matrix $\mathbf{R}_{yy}$ can be estimated by temporal averaging, for instance, using a forgetting factor or a sliding window. Temporal averaging is not possible for $\mathbf{R}_{xx}$ since the desired speech signal components are not observable. In practice, the noise correlation matrix $\mathbf{R}_{nn} = E\{\mathbf{n}\mathbf{n}^H\}$ can be estimated during periods when the desired speech source is not active, as indicated by a voice activity detection (VAD) module. Since we assume that $\mathbf{x}$ and $\mathbf{n}$ are uncorrelated, it is then possible to use the relationship $\mathbf{R}_{xx} = \mathbf{R}_{yy} - \mathbf{R}_{nn}$ to obtain an estimate of $\mathbf{R}_{xx}$. However, this is prone to robustness issues created by oversubtraction, which can leave the estimated desired speech correlation matrix not positive semidefinite. These issues often arise at high frequencies, where the desired speech component may have very low power. To improve robustness in low SNR and nonstationary conditions, an implementation based on the generalized eigenvalue decomposition (GEVD) can be employed [17, 18].
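The minimum mean squared error (MMSE) can be obtained by plugging (4) into (3) to obtain
$$J(\hat{\mathbf{w}}) = P_{x_{\mathrm{ref}}} - \mathbf{r}_{yx}^H \mathbf{R}_{yy}^{-1} \mathbf{r}_{yx}, \tag{5}$$
where $P_{x_{\mathrm{ref}}} = E\{|x_{\mathrm{ref}}|^2\}$ is the power of the desired speech signal.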
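As an illustration of (4) and (5), the following minimal sketch computes the MWF and its MMSE for a single frequency bin. It assumes that the correlation matrices are already available; the rank-1 speech model, the vector `a`, and the function name `mwf_and_mmse` are hypothetical choices made only for this example.

```python
import numpy as np

def mwf_and_mmse(R_yy, R_xx, ref=0):
    """Multichannel Wiener filter and its MMSE for a single frequency bin."""
    r_yx = R_xx[:, ref]                     # r_yx = R_xx e_ref
    w = np.linalg.solve(R_yy, r_yx)         # solves R_yy w = r_yx, as in (4)
    P_x = R_xx[ref, ref].real               # power of the reference speech component
    mmse = P_x - np.vdot(r_yx, w).real      # P_x - r_yx^H R_yy^{-1} r_yx, as in (5)
    return w, mmse

# Toy scenario (assumption): one speech source observed by K microphones.
rng = np.random.default_rng(0)
K = 4
a = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # hypothetical transfer functions
R_xx = np.outer(a, a.conj())                               # rank-1 speech correlation matrix
R_nn = 0.5 * np.eye(K)                                     # uncorrelated noise of equal power
R_yy = R_xx + R_nn

w, mmse = mwf_and_mmse(R_yy, R_xx)
print(f"MMSE = {mmse:.4f}, reference speech power = {R_xx[0, 0].real:.4f}")
```

Solving the linear system instead of forming the explicit inverse is a numerical convenience and does not change the result.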
2.3. Quantization Error Model
We will consider uniform quantization of the time domain samples of each microphone signal $y_k$, prior to the transformation to the STFT domain. In practice, this means that the nodes transmit their time domain samples and the STFT is performed in the fusion centre. We discuss the possibility of quantizing the STFT coefficients directly prior to transmission in Section 3.4. This configuration would require each node to perform the STFT over its own microphone signals and transmit the frequency domain coefficients to the fusion centre.
The quantization of a real number $x \in [-A/2, A/2]$ with $b$ bits can be expressed as
$$Q(x) = \Delta_b \left( \left\lfloor \frac{x}{\Delta_b} \right\rfloor + \frac{1}{2} \right), \tag{6}$$
where
$$\Delta_b = \frac{A}{2^b}. \tag{7}$$
In practice, the parameter $A$ is given by the dynamic range of the analog-to-digital converter of the corresponding microphone. The quantization error, or noise, is then defined as
$$e = Q(x) - x. \tag{8}$$
The mathematical properties of the quantization noise have been the subject of extensive study [19–21], where it has been shown that the input signal and the quantization noise are uncorrelated under certain technical conditions on the characteristic function of the input signal. Under the same conditions, the mean squared error due to quantization is given by
$$E\{e^2\} = \frac{\Delta_b^2}{12}. \tag{9}$$
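The relation between the quantization step and the quantization error power can be checked numerically. The sketch below assumes a uniform mid-rise quantizer with $2^b$ levels over a range of width $A$ (one common form of uniform quantizer, chosen here for illustration) and a uniformly distributed input, for which the error power should be close to the squared step size divided by 12. The function name `quantize` and all parameter values are hypothetical.

```python
import numpy as np

def quantize(x, bits, A=2.0):
    """Uniform mid-rise quantizer with 2**bits levels on [-A/2, A/2)."""
    delta = A / 2**bits                                     # quantization step
    idx = np.floor(np.asarray(x) / delta)
    idx = np.clip(idx, -2**(bits - 1), 2**(bits - 1) - 1)   # saturate out-of-range samples
    return delta * (idx + 0.5)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200_000)          # input covering the full range A = 2

for bits in (4, 8, 12):
    e = quantize(x, bits) - x                # quantization error
    delta = 2.0 / 2**bits
    print(bits, np.mean(e**2), delta**2 / 12)
```

For real audio the agreement is typically good at high bit depths and degrades when only a few bits are used, in line with the technical conditions mentioned above.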
We consider that, for the $k$th microphone signal, the time domain samples of $y_k$ are quantized with $b_k$ bits according to (6) before being transmitted to the fusion centre. The quantization error can be expressed as
$$e_k[t, n] = Q(y_k[t, n]) - y_k[t, n], \tag{10}$$
where $n$ indexes the samples of frame $t$. The fusion centre performs the STFT and collects the results for each frequency $\omega$ and frame $t$ in the vector $\tilde{\mathbf{y}}$ given by
$$\tilde{\mathbf{y}} = \mathbf{y} + \mathbf{e}, \tag{11}$$
where $\mathbf{e}$ is the vector whose $k$th element is the quantization error corresponding to the $k$th microphone signal at frequency $\omega$. Note that all signals have been included in the quantization process. However, if the fusion centre is also equipped with microphones (e.g., it is a node of the WASN), these signals do not need to be transmitted and hence have a fixed quantization. In this case, the microphone signals from the fusion centre are removed from the adaptive quantization process, but they are still included in the estimation process.
Using the statistical properties of the quantization error [19–21], we will assume that every element of $\mathbf{e}$ is uncorrelated with every element of $\mathbf{y}$. Again, under certain technical conditions, the power spectrum of the quantization noise is white; that is, its power is evenly distributed across all frequencies [19]. Although these conditions are not always satisfied in practice, particularly for quantization with only a few bits, we will combine this property with (9) to approximate the quantization noise power at each frequency as
$$p_k = E\{|e_k|^2\} \approx L \, \frac{\Delta_{b_k}^2}{12}, \tag{12}$$
where $L$ is the length of the discrete Fourier transform (DFT) used to implement the STFT in practice. The factor $L$ in (12) appears as a consequence of the application to $e_k$ of Parseval’s theorem for the nonunitary DFT, given by
$$\sum_{n=0}^{L-1} |e_k[n]|^2 = \frac{1}{L} \sum_{m=0}^{L-1} |E_k[m]|^2, \tag{13}$$
where $E_k[m]$ is the $L$-point DFT corresponding to $e_k[n]$. The nonunitary definition of the DFT is given by
$$X[m] = \sum_{n=0}^{L-1} x[n] \, e^{-j 2\pi m n / L}, \tag{14}$$
where $x[n]$ is the input sequence, $j$ is the imaginary unit, and $X[m]$ is the resulting transformed sequence. If a factor of $1/\sqrt{L}$ is applied to the right-hand side of (14), the DFT becomes a unitary transformation and the factor $L$ is no longer needed in (12). In the rest of the paper we assume that the nonunitary DFT is used to implement the STFT, keeping in mind that the unitary DFT can be employed simply by rescaling (12).
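The scaling between the time domain error power and the per-bin noise power can be illustrated with a short simulation. The sketch below assumes an idealized white quantization error, drawn uniformly within half a quantization step, and rectangular frames without windowing or overlap; all parameter values are hypothetical. NumPy's `np.fft.fft` is nonunitary by default and unitary with `norm="ortho"`.

```python
import numpy as np

rng = np.random.default_rng(2)
L, frames, bits, A = 64, 4000, 8, 2.0
delta = A / 2**bits

# Idealized white quantization error (assumption): uniform in [-delta/2, delta/2].
e = rng.uniform(-delta / 2, delta / 2, size=(frames, L))

E_nonunitary = np.fft.fft(e, axis=1)                  # nonunitary DFT of each frame
E_unitary = np.fft.fft(e, axis=1, norm="ortho")       # unitary DFT (scaled by 1/sqrt(L))

p_nonunitary = np.mean(np.abs(E_nonunitary) ** 2)     # average power per frequency bin
p_unitary = np.mean(np.abs(E_unitary) ** 2)

print("nonunitary:", p_nonunitary, "expected:", L * delta**2 / 12)
print("unitary:   ", p_unitary, "expected:", delta**2 / 12)
```

With an analysis window, the per-bin power would additionally be scaled by the energy of the window, which this sketch does not model.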
3. Adaptive Quantization for the Multichannel Wiener Filter in a WASN
We now consider the effect of quantization noise on the estimation process described in the previous section. Our interest here is to study how changing the bit depth for the transmission of the microphone signal samples affects the operation of the MWF, in particular, how it affects the MMSE. The analysis of this effect will lead to a metric based on the gradient of the MMSE which, as we will show, naturally leads to a greedy adaptive quantization algorithm. We will then demonstrate how this gradient metric is a limit case of a recently proposed impact metric [16], which was already known to also generalize the utility metric proposed in [10, 11]. Besides, based on this reasoning, we propose a correction to improve the gradient metric for adaptive quantization. This analysis provides a motivation for applying a greedy algorithm based on any of these metrics, which allows dynamically changing, at any moment in time, the bit depth assigned to each microphone signal. In Section 4, we will demonstrate experimentally that the impact and the corrected gradient metrics outperform the gradient and utility metrics, due to their inherent adaptation to the difference in quantization levels corresponding to different bit depths.
3.1. Effect of Quantization on the Minimum Mean Squared Error
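The MWF based on the quantized microphone signal samples is obtained following (4) as
$$\hat{\mathbf{w}}_q = \mathbf{R}_{\tilde{y}\tilde{y}}^{-1} \mathbf{r}_{\tilde{y}x}, \tag{15}$$
where $\mathbf{R}_{\tilde{y}\tilde{y}} = E\{\tilde{\mathbf{y}}\tilde{\mathbf{y}}^H\}$ and $\mathbf{r}_{\tilde{y}x} = E\{\tilde{\mathbf{y}} x_{\mathrm{ref}}^*\}$. Using (11) and the assumptions stated in Section 2.3, we express $\mathbf{R}_{\tilde{y}\tilde{y}}$ as
$$\mathbf{R}_{\tilde{y}\tilde{y}} = \mathbf{R}_{yy} + \mathbf{R}_{ee}. \tag{16}$$
The quantization error correlation matrix $\mathbf{R}_{ee} = E\{\mathbf{e}\mathbf{e}^H\}$ is diagonal, with the $k$th element of the diagonal being $p_k$, where $p_k$ is defined in (12). (One could intuitively expect quantization to reduce the cross-correlation between the microphone signals. In the Appendix we consider a quantization model that includes this reduction and show that its effect on the MWF is equivalent to the one presented in Section 3.1.) As $\mathbf{e}$ is assumed to be uncorrelated with $\mathbf{x}$, the cross-correlation remains unchanged; that is,
$$\mathbf{r}_{\tilde{y}x} = \mathbf{r}_{yx}. \tag{17}$$
As explained in Section 2.2, $\mathbf{r}_{\tilde{y}x}$ can be computed as
$$\mathbf{r}_{\tilde{y}x} = (\mathbf{R}_{\tilde{y}\tilde{y}} - \mathbf{R}_{\tilde{n}\tilde{n}}) \, \mathbf{e}_{\mathrm{ref}}, \tag{18}$$
where $\mathbf{R}_{\tilde{n}\tilde{n}} = \mathbf{R}_{nn} + \mathbf{R}_{ee}$, which indeed confirms (17). Similarly to (5), we can now find the MMSE corresponding to $\hat{\mathbf{w}}_q$, given by
$$J(\hat{\mathbf{w}}_q) = P_{x_{\mathrm{ref}}} - \mathbf{r}_{\tilde{y}x}^H \mathbf{R}_{\tilde{y}\tilde{y}}^{-1} \mathbf{r}_{\tilde{y}x} \tag{19}$$
$$\phantom{J(\hat{\mathbf{w}}_q)} = P_{x_{\mathrm{ref}}} - \mathbf{r}_{yx}^H (\mathbf{R}_{yy} + \mathbf{R}_{ee})^{-1} \mathbf{r}_{yx}. \tag{20}$$
We highlight that $J(\hat{\mathbf{w}}_q)$ is a function of the quantization error powers $p_k$, which can be made explicit by rewriting the function as
$$J(\mathbf{p}) = P_{x_{\mathrm{ref}}} - \mathbf{r}_{yx}^H \left( \mathbf{R}_{yy} + \operatorname{diag}(\mathbf{p}) \right)^{-1} \mathbf{r}_{yx}, \tag{21}$$
where $\mathbf{p} = [p_1, \dots, p_K]^T$ is the vector of quantization error powers and where $\operatorname{diag}(\cdot)$ is the operator that generates a diagonal matrix with diagonal elements equal to the entries of the vector in its argument. Equation (21) is important because it defines the cost function that we will use as the basis for adaptive quantization, since it is the minimum mean squared error that can be obtained with a linear estimator (i.e., the MWF) after adding quantization noise to each microphone signal. We emphasize that (21) gives the MMSE when the MWF is first reoptimized using the quantized microphone signals, that is, based on (15), and not the mean squared error resulting from applying the original MWF (optimized for the nonquantized signals) to the quantized microphone signals.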
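The distinction drawn above between the reoptimized MMSE and the error of the original filter applied to quantized signals can be made concrete with a small numerical sketch. It assumes a single frequency bin, known correlation matrices, and a diagonal quantization noise correlation matrix; the function names, the toy scenario, and the noise powers in `p` are hypothetical.

```python
import numpy as np

def mmse_quantized(R_yy, r_yx, P_x, p):
    """MMSE after reoptimizing the MWF for quantization noise powers p."""
    R_q = R_yy + np.diag(p)                      # correlation matrix of the quantized signals
    w_q = np.linalg.solve(R_q, r_yx)             # reoptimized filter
    return w_q, P_x - np.vdot(r_yx, w_q).real

def mse_fixed_filter(R_yy, r_yx, P_x, p, w):
    """MSE when a given filter w is applied to the quantized signals without reoptimization."""
    R_q = R_yy + np.diag(p)
    return P_x - 2 * np.vdot(w, r_yx).real + np.vdot(w, R_q @ w).real

rng = np.random.default_rng(3)
K = 4
a = rng.standard_normal(K) + 1j * rng.standard_normal(K)   # hypothetical transfer functions
R_xx = np.outer(a, a.conj())
R_yy = R_xx + 0.5 * np.eye(K)
r_yx, P_x = R_xx[:, 0], R_xx[0, 0].real

p = np.array([0.0, 0.1, 0.4, 1.6])               # hypothetical quantization noise powers
w0, J0 = mmse_quantized(R_yy, r_yx, P_x, np.zeros(K))
w_q, J_q = mmse_quantized(R_yy, r_yx, P_x, p)
J_fixed = mse_fixed_filter(R_yy, r_yx, P_x, p, w0)

print(f"no quantization: {J0:.4f}, reoptimized: {J_q:.4f}, fixed filter: {J_fixed:.4f}")
```

The reoptimized MMSE lies between the MMSE without quantization noise and the error obtained with the fixed filter, since reoptimization can only reduce the error for the quantized signals.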
3.2. Gradient-Based Approach to Adaptive Quantization
The goal of adaptive quantization is to allocate a bit depth to each sensor which is smaller than (or at most equal to) an initial maximum bit depth. Since each bit depth reduction also reduces the speech enhancement performance, the goal becomes to find the bit depth allocation which uses the minimum total number of bits given a maximum tolerated MMSE. Equivalently, the problem could be stated as finding the lowest MMSE with a given total number of bits.
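The gradient $\nabla J(\mathbf{p})$ of the function $J(\mathbf{p})$ gives the direction of maximal increase of the MMSE for a given $\mathbf{p}$, that is, for a given bit depth allocation. To further reduce the total number of bits beyond the bit depth allocation corresponding to $\mathbf{p}$, $\mathbf{p}$ has to be changed to $\mathbf{p} + \Delta\mathbf{p}$, where $\Delta\mathbf{p}$ is constrained to have nonnegative entries. The corresponding MMSE increase for an infinitesimally small $\Delta\mathbf{p}$ is then given by the inner product of $\Delta\mathbf{p}$ and the gradient of $J(\mathbf{p})$. In order to compute this gradient, we will use the intermediate step
$$\frac{\partial J}{\partial \mathbf{R}_{ee}} = (\mathbf{R}_{yy} + \mathbf{R}_{ee})^{-1} \mathbf{r}_{yx} \mathbf{r}_{yx}^H (\mathbf{R}_{yy} + \mathbf{R}_{ee})^{-1}, \tag{22}$$
which follows from applying the identity [22]
$$\frac{\partial (\mathbf{a}^H \mathbf{X}^{-1} \mathbf{b})}{\partial \mathbf{X}} = -\mathbf{X}^{-H} \mathbf{a} \mathbf{b}^H \mathbf{X}^{-H} \tag{23}$$
together with the fact that $\mathbf{R}_{yy} + \mathbf{R}_{ee}$ is a Hermitian matrix. Equation (22) can be simplified using (15)–(17) to obtain
$$\frac{\partial J}{\partial \mathbf{R}_{ee}} = \hat{\mathbf{w}}_q \hat{\mathbf{w}}_q^H. \tag{24}$$
Since the matrix $\mathbf{R}_{ee}$ is diagonal, we can now find the gradient as the diagonal of the right-hand side term in (24); that is,
$$\nabla J(\mathbf{p}) = |\hat{\mathbf{w}}_q|^2, \tag{25}$$
where the operator $|\cdot|^2$ is applied elementwise to its argument.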
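To minimize the MMSE increase for an infinitesimally small $\Delta\mathbf{p}$ (for a fixed total amount of added noise power), the inner product $\Delta\mathbf{p}^T \nabla J(\mathbf{p})$ has to be minimized. However, every component of $\nabla J(\mathbf{p})$ is nonnegative and the vector $\Delta\mathbf{p}$ is also constrained to have nonnegative components. Hence the best choice for $\Delta\mathbf{p}$ is a vector whose components are all zero except the one corresponding to the minimum element of $\nabla J(\mathbf{p})$.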
This result shows that when adding a small amount of quantization noise, it should be added to a single microphone signal instead of dividing it over multiple microphone signals. This naturally leads to a greedy algorithm, where at each step the gradient is computed from the MWF using (25), after which its minimum element is identified and the bit depth for the corresponding microphone signal is reduced by $\beta$ bits. Note that the above reasoning has assumed the vector $\mathbf{p}$ to be a continuous variable; that is, each element of the vector can take any real value. However, the bit depth is a discrete variable and it determines the quantization noise power added to a signal. Hence, the smallest possible quantization power that can be added to a signal corresponds to reducing its bit depth by 1 bit, which is the recommended value for $\beta$ in order to avoid taking too large a step. This also avoids reducing the bit depth of one signal too quickly, which may be a poor choice compared to distributing the bit reduction over several signals. After removing a bit from the microphone signal with the smallest entry in the gradient vector, the MWF is reoptimized to the new bit depth assignment, and the gradient is recomputed. This process is continued until the MMSE exceeds a predefined threshold.
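The greedy procedure described above can be sketched for a single frequency bin as follows. This is a minimal illustration under several assumptions rather than the implementation evaluated in this paper: the unquantized correlation matrix is assumed to be known, the quantization noise power is modelled as growing with the squared quantization step, the unitary scaling (`L=1`) is used so that the toy correlation matrices stay of order one, and the names `noise_power`, `mmse`, and `greedy_gradient` are hypothetical.

```python
import numpy as np

def noise_power(bits, A=2.0, L=1):
    """Modelled quantization noise power per frequency bin (L=1: unitary DFT scaling)."""
    delta = A / 2.0 ** np.asarray(bits, dtype=float)
    return L * delta**2 / 12.0

def mmse(R_yy, r_yx, P_x, bits):
    """Reoptimized MWF and its MMSE for a given bit depth allocation."""
    R_q = R_yy + np.diag(noise_power(bits))
    w = np.linalg.solve(R_q, r_yx)
    return w, P_x - np.vdot(r_yx, w).real

def greedy_gradient(R_yy, r_yx, P_x, bits0, max_mmse, min_bits=1):
    """Remove one bit at a time from the signal with the smallest gradient entry."""
    bits = np.array(bits0, dtype=int)
    while True:
        w, _ = mmse(R_yy, r_yx, P_x, bits)
        grad = np.abs(w) ** 2                   # gradient of the MMSE w.r.t. the noise powers
        grad[bits <= min_bits] = np.inf         # signals that cannot lose more bits
        k = int(np.argmin(grad))
        if not np.isfinite(grad[k]):
            break                               # no bits left to remove
        trial = bits.copy()
        trial[k] -= 1
        _, J_trial = mmse(R_yy, r_yx, P_x, trial)
        if J_trial > max_mmse:
            break                               # threshold would be exceeded
        bits = trial
    return bits

rng = np.random.default_rng(4)
K = 4
a = rng.standard_normal(K) + 1j * rng.standard_normal(K)   # hypothetical transfer functions
R_xx = np.outer(a, a.conj())
R_yy = R_xx + 0.1 * np.eye(K)
r_yx, P_x = R_xx[:, 0], R_xx[0, 0].real

bits0 = [16] * K
_, J_start = mmse(R_yy, r_yx, P_x, bits0)
max_mmse = 1.1 * J_start                        # tolerate a 10% MMSE increase
bits = greedy_gradient(R_yy, r_yx, P_x, bits0, max_mmse)
print("bit depths:", bits, "total:", bits.sum())
```

The sketch stops just before the threshold would be exceeded, so the returned allocation always satisfies the MMSE constraint; other stopping rules, such as a target total number of bits, fit the same loop.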
3.3. Alternative Metrics for Adaptive Quantization
In this section, we will show how the gradient metric used in the previous section is a limit case of the impact metric, which has been used in [16] for adaptive quantization. This provides an intuitive explanation of why the greedy approach, which follows naturally from the gradient metric, also works well when using this impact metric, as will be demonstrated in Section 4.
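The impact metric from [16] was initially proposed as a generalization of the utility metric defined in [10, 11]. The utility $U_k$ of the $k$th microphone signal is defined as the increase in MMSE when $y_k$ is removed from the estimation [10]. The mathematical expression of this definition is given by
$$U_k = J(\hat{\mathbf{w}}_{-k}) - J(\hat{\mathbf{w}}), \tag{26}$$
where $\hat{\mathbf{w}}_{-k}$ is the reoptimized MWF obtained with all signals except $y_k$. Assuming the MWF $\hat{\mathbf{w}}$ is known, then the utility of $y_k$ is shown [10] to be equal to
$$U_k = \frac{1}{q_k} |\hat{w}_k|^2, \tag{27}$$
where $q_k$ is the $k$th element in the diagonal of $\mathbf{R}_{yy}^{-1}$ and $\hat{w}_k$ is the $k$th element of $\hat{\mathbf{w}}$.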
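The impact $I_k$ of the noise $e_k$ is defined as the increase in MMSE when the uncorrelated noise signal $e_k$ is added to $y_k$, while other microphone signals remain unchanged [16]. In mathematical terms the definition can be expressed as
$$I_k = J(\hat{\mathbf{w}}_{+k}) - J(\hat{\mathbf{w}}), \tag{28}$$
where $\hat{\mathbf{w}}_{+k}$ is the reoptimized MWF for $\mathbf{y} + \mathbf{e}$, as in (15), with $\mathbf{e} = [0 \, \cdots \, 0 \; e_k \; 0 \, \cdots \, 0]^T$. In [16] the impact is shown to be equal to
$$I_k = \frac{p_k}{1 + p_k q_k} |\hat{w}_k|^2, \tag{29}$$
where $q_k$ is again the $k$th element in the diagonal of $\mathbf{R}_{yy}^{-1}$, $\hat{w}_k$ is the $k$th element of $\hat{\mathbf{w}}$, and $p_k$ represents the power of the noise added to $y_k$, given by (12) for the case of quantization noise.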
To simplify further notation and the comparison between different metrics, we consider the gradient for the case $\mathbf{p} = \mathbf{0}$, where $\mathbf{0}$ is the zero vector, such that (25) is rephrased as $\nabla J(\mathbf{0}) = [G_1, \dots, G_K]^T$, where each element is given by
$$G_k = |\hat{w}_k|^2. \tag{30}$$
The comparison is valid for any $\mathbf{p}$; we choose this case purely to simplify the notation.
Although the impact (29), utility (27), and gradient (30) metrics predict a change in the minimum mean squared error, which implicitly requires reoptimizing the MWF, all three metrics can be calculated from the current MWF coefficients at almost no additional computational cost compared to the computation of $\hat{\mathbf{w}}$ itself.
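The following sketch illustrates how the three metrics can be obtained from the current filter and the diagonal of the inverse correlation matrix, and checks the utility and impact against their definitions by explicitly reoptimizing the filter. The closed forms used here follow from the matrix inversion lemma and a block matrix inverse, respectively; the toy scenario, the noise power `p`, and the function names are hypothetical.

```python
import numpy as np

def metrics(R_yy, r_yx, p):
    """Utility, impact and gradient of every channel from the current MWF."""
    R_inv = np.linalg.inv(R_yy)
    w = R_inv @ r_yx                      # current MWF
    q = np.diag(R_inv).real               # diagonal of the inverse correlation matrix
    w2 = np.abs(w) ** 2
    utility = w2 / q                      # MMSE increase when a channel is removed
    impact = p * w2 / (1.0 + p * q)       # MMSE increase when noise of power p is added
    gradient = w2                         # slope of the MMSE w.r.t. the added noise power
    return utility, impact, gradient

def mmse(R, r, P_x):
    return P_x - np.vdot(r, np.linalg.solve(R, r)).real

rng = np.random.default_rng(5)
K = 4
a = rng.standard_normal(K) + 1j * rng.standard_normal(K)   # hypothetical transfer functions
R_xx = np.outer(a, a.conj())
R_yy = R_xx + 0.5 * np.eye(K)
r_yx, P_x = R_xx[:, 0], R_xx[0, 0].real

p = np.full(K, 0.3)                       # hypothetical noise power per channel
U, I, G = metrics(R_yy, r_yx, p)

k = 2                                     # channel under test
J0 = mmse(R_yy, r_yx, P_x)
E_k = np.zeros((K, K))
E_k[k, k] = p[k]
impact_direct = mmse(R_yy + E_k, r_yx, P_x) - J0            # add noise, reoptimize
keep = [i for i in range(K) if i != k]
utility_direct = mmse(R_yy[np.ix_(keep, keep)], r_yx[keep], P_x) - J0   # remove, reoptimize

print(f"impact:  {I[k]:.6f} vs {impact_direct:.6f}")
print(f"utility: {U[k]:.6f} vs {utility_direct:.6f}")
```

In this sketch the explicit inverse is formed once because its diagonal is needed; in an adaptive implementation the inverse would typically be tracked recursively instead of being recomputed.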
By comparing (29) with (27) and (30), we see that both the gradient and the utility are limit cases of the impact when $p_k \to 0$ and $p_k \to \infty$, respectively. Although $p_k \to 0$ would obviously give an impact equal to zero, the relative differences between the impact metric for different $k$ become equal to those of the gradient metric.
These two limit cases can be interpreted as follows. For the utility, the interpretation is that removing the microphone signal $y_k$ from the estimation process is similar to adding an infinite amount of noise to $y_k$ ($p_k \to \infty$), making it completely useless, which corresponds to a removal of that channel. For the gradient, the distinction between the gradient and the impact is that the gradient characterizes the best linear approximation of the function $J(\mathbf{p})$, while the impact computes the actual MMSE increase produced by adding the error $e_k$ with power $p_k$. Since the gradient approximation is only valid in an infinitesimally small neighbourhood, it is only able to accurately capture the influence of $e_k$ on the MMSE for small values of $p_k$. Besides, note that the quantization noise power increases exponentially with each bit reduced, so the gradient becomes less accurate as the microphone signals are quantized with lower resolution. On the other hand, the impact metric accounts directly for $p_k$, which makes it inherently adaptive to the significance of each bit considered for removal. For low significance bits, the impact is close to the gradient. However, as the significance of a bit increases, the impact behaves more like the utility. By contrast, the gradient assumes that $p_k$ corresponding to a bit removal is the same for all $k$, or in other words it assumes that the search space is isotropic, which only holds true when all microphone signals have the same bit depth. This can be adjusted by making $\mathbf{p}$ in (21) a linear function of the resolution corresponding to the least significant bit, for example, $p_k = \alpha_k \bar{p}_k$, and taking the derivative with respect to $\alpha_k$. This would then provide a warped gradient vector
$$\tilde{\nabla} J = \operatorname{diag}(\bar{\mathbf{p}}) \, |\hat{\mathbf{w}}|^2, \tag{31}$$
where $\bar{\mathbf{p}} = [\bar{p}_1, \dots, \bar{p}_K]^T$. Note that this warped gradient is again an asymptotic case of the impact measure, if $p_k$ is substituted with $\alpha p_k$ in (29) and letting $\alpha \to 0$.
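The two limit cases can be illustrated numerically for a single channel. The sketch below assumes that the MMSE increase caused by added noise of power `p` takes the form `p * w2 / (1 + p * q)`, which follows from the matrix inversion lemma, and that the quantization noise power grows with the squared quantization step; the values of `w2` (squared magnitude of the filter coefficient) and `q` (diagonal element of the inverse correlation matrix) are hypothetical.

```python
def noise_power(bits, A=2.0, L=512):
    """Modelled quantization noise power per frequency bin for a given bit depth."""
    delta = A / 2.0**bits
    return L * delta**2 / 12.0

def impact(p, w2, q):
    """MMSE increase when noise of power p is added to one channel."""
    return p * w2 / (1.0 + p * q)

w2, q = 0.04, 2.0            # hypothetical |w_k|^2 and k-th diagonal element of R_yy^{-1}
utility = w2 / q             # limit of the impact for very large noise power

for bits in (16, 12, 8, 4, 1):
    p = noise_power(bits)
    linear = p * w2          # first-order prediction (gradient scaled by the noise power)
    print(f"{bits:2d} bits: impact={impact(p, w2, q):.3e}  "
          f"linear={linear:.3e}  utility={utility:.3e}")
```

At high bit depths the impact coincides with the first-order prediction, while at very low bit depths it saturates at the utility, because the channel then contributes almost nothing to the estimate.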
3.4. Frequency Domain Considerations
To conclude, we must turn our attention to the fact that all of the above is valid at each frequency $\omega$. This opens the possibility to assign a different bit depth to each frequency component of each microphone signal $y_k$.
In Section 2.3 we took the approach of performing quantization in the time domain. In order to select the signal from which a bit is to be removed, we need to choose a rule to combine each metric across all frequencies. We propose to perform a sum of the metrics across all frequencies. For instance, for the impact the combined metric would be given by
$$\bar{I}_k = \sum_{\omega} I_k(\omega). \tag{32}$$
For the utility, gradient, and warped gradient the combined metric is defined in a similar way. It is noted that one could as well use a weighted sum in (32), for example, based on speech intelligibility weights. We provide a summary of the greedy quantization algorithm based on any of the four metrics described so far in Algorithm 1.
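The combination rule across frequencies can be sketched as follows. The function assumes that a per-frequency metric has already been computed for every microphone signal and stored in an array of shape (microphones, frequency bins); the optional `weights` argument stands in for a weighted sum, and the function name, the `min_bits` guard, and the example values are hypothetical.

```python
import numpy as np

def select_microphone(metric, bits, weights=None, min_bits=1):
    """Pick the microphone whose combined metric over all frequencies is smallest.

    metric:  array of shape (K, F) with one metric value per microphone and frequency bin
    bits:    current bit depth of each microphone signal
    weights: optional per-frequency weights (plain sum if omitted)
    """
    metric = np.asarray(metric, dtype=float)
    if weights is None:
        weights = np.ones(metric.shape[1])
    combined = metric @ np.asarray(weights, dtype=float)   # (weighted) sum over frequency
    combined[np.asarray(bits) <= min_bits] = np.inf        # skip signals with no bits to spare
    return int(np.argmin(combined)), combined

metric = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.5, 0.5],
                   [4.0, 0.0, 0.0]])
k, combined = select_microphone(metric, bits=[8, 8, 8])
print("remove a bit from microphone", k, "combined metrics:", combined)
```

The same function can be reused for any of the four metrics, since only the values passed in `metric` change.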

However, strategies to allow the assignment of a different bit depth to each frequency component can be considered, as is commonly done in audio coding, to represent the most relevant frequency components with higher accuracy. Instead of assigning a different bit depth to every single frequency bin, frequency bins can also be grouped in a set of frequency bands $\mathcal{B}_j$, where $\mathcal{B}_j$ comprises all frequency bins $\omega$ such that $\omega_{j-1} < \omega \le \omega_j$, with $\omega_{j-1}$ and $\omega_j$ denoting the band edges. This means that every STFT coefficient of each microphone signal $y_k$ at the frequency band $\mathcal{B}_j$ is quantized following (6) with $b_{k,j}$ bits. The real and imaginary parts of each STFT coefficient are quantized independently. The corresponding metric can be computed in a similar way to (32) as
$$I_{k,j} = \sum_{\omega \in \mathcal{B}_j} I_k(\omega), \tag{33}$$
where $I_{k,j}$ is the impact corresponding to the $k$th microphone signal in the $j$th frequency band. For the utility, gradient, and warped gradient the combined metric is again defined in a similar way.
This configuration opens up several strategies to decide which frequency band and microphone signal will have its bit depth reduced in each iteration of the algorithm. For our discussion we consider the strategy of removing, in each iteration, one bit in each frequency bin assigned to the frequency band of the microphone signal with minimum $I_{k,j}$. This is the most conservative greedy strategy, which can be viewed as a limit case that will generally provide a better performance compared to greedier strategies where the bit depth is reduced in multiple channels and frequency bands simultaneously. It is noted that a more conservative greedy strategy comes at the cost of a larger number of required iterations to reach a predefined total number of bits. In Sections 4.1 and 4.2 we show the performance of this particular strategy applied to a speech enhancement scenario.
Note that, in every iteration, the bit depth in (out of ) frequency bins is reduced, which corresponds to a reduction of bits per time domain sample. This is less than the full bit per sample reduction achieved through time domain quantization, which shows that the proposed strategy for frequency domain quantization is more conservative than the strategy for time domain quantization.
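The band-wise strategy can be illustrated with a short sketch. The function names, the dictionary layout keyed by (signal, band) pairs, and the numerical values are assumptions made for illustration only; the per-band metric values would in practice come from the impact or one of the other metrics.

```python
from typing import Dict, Tuple

Pair = Tuple[int, int]  # (microphone signal index, frequency band index)


def select_channel_band(band_metric: Dict[Pair, float],
                        band_bits: Dict[Pair, int],
                        min_bits: int = 0) -> Pair:
    """Return the (signal, band) pair with minimum metric that can still lose a bit."""
    candidates = {kb: m for kb, m in band_metric.items() if band_bits[kb] > min_bits}
    return min(candidates, key=candidates.get)


def bits_removed_per_sample(bins_in_band: int, total_bins: int) -> float:
    """Equivalent reduction in bits per time domain sample for one iteration."""
    return bins_in_band / total_bins


# Hypothetical example: two signals, two bands, 8 bits everywhere.
band_bits = {(0, 0): 8, (0, 1): 8, (1, 0): 8, (1, 1): 8}
band_metric = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.05, (1, 1): 0.4}

chosen = select_channel_band(band_metric, band_bits)
band_bits[chosen] -= 1  # one bit is removed from every bin of the chosen band
reduction = bits_removed_per_sample(bins_in_band=4, total_bins=16)
```

In this example a single iteration removes a quarter of a bit per time domain sample, compared with a full bit per sample in the time domain strategy.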
Besides, it is important to mention that frequency bands do not influence each other in the sense that the bit depth reduction in one band will not affect the decision in the rest of the bands. In the case of nonuniform bands, where each frequency band spans a different number of frequency bins, a tradeoff with the transmission energy has to be considered, that is, removing a bit from a wider frequency band will introduce more quantization noise but will result in less energy spent in transmission since the total number of bits will be lower.
4. Experimental Results
In this section we discuss the results of several experiments designed to observe and characterize the performance of the greedy adaptive quantization algorithm based on the four metrics described in Section 3. We discuss experiments on two different audio datasets. In the first one, the audio signals captured by the microphones are obtained by simulating the acoustics of a room with the image method [23]. In the second one, the audio signals were recorded using a wireless acoustic sensor network set up in a real home environment in a house in Mol, Belgium, using nodes designed by researchers from the MICAS group of the Department of Electrical Engineering (ESAT) at KU Leuven. The details of each experiment are discussed in Sections 4.1 and 4.2. In all experiments the desired speaker audio consists of three sentences, spoken by a female speaker, from the TIMIT database [24]. The noise characteristics are described in the section corresponding to each experiment. The sampling frequency is 16 kHz. The audio processing is implemented in batch mode, where the correlation matrices required by the filter are estimated using samples over the entire length of the microphone signals. An ideal VAD is used to exclude the influence of speech detection errors. The audio signals are divided into frames using a Hann window with 50% overlap, and the STFT is implemented using a discrete Fourier transform (DFT). The multichannel Wiener filter is computed based on a GEVD of the estimated correlation matrices as in [17] since, as we mentioned in Section 2.2, this method is superior to the subtraction-based implementation.
In order to assess the changes in noise reduction and speech distortion due to the bit depth reduction, we use two figures of merit: the speech intelligibility weighted signal-to-noise ratio (SI-SNR) [25] and the speech intelligibility weighted spectral distortion (SI-SD) [6]. Both are based on the band importance function, which expresses the importance for intelligibility of each one-third octave band, identified by its centre frequency. The band importance values and the centre frequencies are defined in [26]. Each figure of merit is defined as a sum over the one-third octave bands weighted by the band importance function. For the SI-SNR, the weighted quantity is the SNR (in dB) in each one-third octave band. In order to account for quantization, the quantization noise in the input signals is obtained as the difference between each clean input signal and its quantized version. The resulting quantization error is added to the noise component of each microphone, and the sum is filtered to obtain the noise component in the output signal, which is then used to compute the noise power in each one-third octave band.
For the SI-SD, the quantity weighted by the band importance function is the average spectral distortion in each one-third octave band. It is computed from a function that relates the speech component at the output of the MWF to the frequency domain speech component at the reference microphone signal. A distortion value of 0 indicates undistorted speech, while larger values correspond to increased speech distortion. To account for quantization, the speech component at the output of the MWF is computed by first quantizing the speech component at each microphone with the corresponding bit depth and then applying the filter to the quantized speech components.
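The band importance weighting used by both figures of merit can be sketched as below. This is a simplified illustration under stated assumptions: the three-band importance weights are hypothetical placeholders (the actual values are those of [26]), the function names are ours, and the quantization noise power is assumed to be already expressed at the filter output and simply added to the noise power of each band.

```python
import math
from typing import Sequence


def band_snr_db(speech_power: float, noise_power: float,
                quant_noise_power: float = 0.0) -> float:
    """SNR in dB of one band, with quantization noise counted as additional noise."""
    return 10.0 * math.log10(speech_power / (noise_power + quant_noise_power))


def intelligibility_weighted(values: Sequence[float],
                             importance: Sequence[float]) -> float:
    """Sum of per-band values weighted by the band importance function."""
    if len(values) != len(importance):
        raise ValueError("one importance weight is required per band")
    return sum(w * v for w, v in zip(importance, values))


# Hypothetical three-band example (placeholder weights, not the values of [26]).
importance = [0.2, 0.5, 0.3]
speech = [100.0, 100.0, 100.0]
noise = [9.0, 10.0, 1.0]
quant_noise = [1.0, 0.0, 0.0]

snrs = [band_snr_db(s, n, q) for s, n, q in zip(speech, noise, quant_noise)]
si_snr = intelligibility_weighted(snrs, importance)
```

The same weighting function applies to the SI-SD if the per-band SNR values are replaced by the per-band average spectral distortion.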
4.1. Simulated Room Acoustics
Our first experiment is a study of the behaviour of the greedy algorithm for adaptive quantization using simulated room acoustics. The scenario consists of a room of dimensions 5 × 5 × 3 m, with a reverberation time of 0.2 s. In the room there are two babble noise sources [27] and one desired speech source. The WASN consists of four nodes, where each node is equipped with three omnidirectional microphones, such that the total number of microphone signals is 12. Independent white Gaussian noise was added to each microphone signal, with a power of about 1% of the power of the babble noise impinging on the microphones. A 2D diagram of the acoustic scenario is depicted in Figure 2. All sources are located at a height of 1.8 m, while the nodes are placed 2 m high. The inter-microphone distance at each node is 4 cm and the sampling rate is 16 kHz. The maximum bit depth was set to 16 bits. The broadband input SNR for every microphone lies between 0 dB and 5 dB. The acoustics of the room are modeled using a room impulse response generator, which allows simulating the impulse response between each source and each microphone using the image method [23]. The code is available online (https://www.audiolabs-erlangen.de/fau/professor/habets/software/rir-generator). The total duration of the signals is 20 seconds.
In Figures 3 and 4 we can see the SI-SNR and SI-SD at each iteration of the greedy adaptive quantization algorithm presented in Algorithm 1 based on the four metrics discussed. In this experiment the quantization is performed in the time domain, as explained in Section 2.3, such that each time domain sample of a microphone signal is quantized using the bit depth allocated to that signal. Note that both the SI-SNR and the SI-SD are plotted versus the average bit depth per sample and channel at each iteration, that is, the sum of the allocated bit depths divided by the number of microphone signals. In terms of SI-SNR, the impact metric performs better than both the utility and the gradient, as expected given its inherent adaptability to the significance of each bit at different bit depths. The same can be said about the warped gradient, which performs better than the uncorrected gradient and close to the impact thanks to the correction that accounts for the significance of each bit. In terms of distortion, there is no clear winner when the total number of bits is high. However, the impact and the warped gradient introduce the least distortion as the number of bits decreases.
We now turn our attention to quantization in the frequency domain, where each microphone signal has a bit depth allocated to each of its frequency bands, as explained in Section 3.4. The STFT coefficient at each frequency bin is quantized using the bit depth of the band that contains it. In each iteration, one frequency band of one microphone signal has its bit depth reduced by one. The selected pair is the channel and band with minimum impact (or corresponding metric). For this experiment we considered uniform frequency bands, each spanning the same number of frequency bins. The bit allocation of any band can be reduced to a minimum of 2 bits. If all bands of a microphone signal are assigned 2 bits, the signal is removed from the estimation process for subsequent iterations. In Figures 5 and 6 we can again see the SI-SNR and SI-SD at each iteration of the greedy adaptive quantization algorithm. The two figures of merit are plotted versus the average bit depth per sample and channel, now averaged over the frequency bands as well. We again observe the impact and the warped gradient performing better in terms of SI-SNR, which is consistent with our previous experiment. However, the decay in SI-SNR for the utility and the gradient is less pronounced, and the region where their performance is similar to that of the impact and the warped gradient is larger. In terms of speech distortion the results are also consistent with the previous experiment in the sense that there is no clear winner, although the impact seems to perform better as the number of bits decreases for this particular experiment.
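For readers who want to reproduce the horizontal axis of these plots, the following sketch shows a generic uniform quantizer together with the average bit depth per sample and channel. The quantizer is an assumption on our part (a mid-tread uniform quantizer with clipping over a hypothetical full-scale range) and may differ in detail from the quantizer defined in (6); signals with 0 bits are assumed to count as zeros in the average.

```python
import math
from typing import Sequence


def quantize_uniform(x: float, bits: int, full_scale: float = 1.0) -> float:
    """Mid-tread uniform quantizer over [-full_scale, full_scale) with clipping."""
    if bits <= 0:
        return 0.0  # a signal with no bits carries no information
    step = 2.0 * full_scale / (2 ** bits)
    level = math.floor(x / step + 0.5)  # round to the nearest quantization level
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lowest, min(highest, level)) * step


def average_bit_depth(bit_depths: Sequence[int]) -> float:
    """Average bit depth per sample and channel over all microphone signals."""
    return sum(bit_depths) / len(bit_depths)


q = quantize_uniform(0.3, bits=3)          # step of 0.25
clipped = quantize_uniform(0.99, bits=3)   # saturates at the highest level
avg = average_bit_depth([16, 8, 4, 0])
```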
4.2. Experiments on Real Recordings
In order to further compare the four metrics for greedy adaptive quantization, we turn our attention to an audio scenario where the signals are recorded using a real life wireless acoustic sensor network set up in a house in Mol, Belgium, consisting of 6 nodes with 4 microphones per node. A 2D schematic of the whole house can be seen in Figure 7, although only the living room was used for this experiment. The acoustic scenario consisted of one loudspeaker acting as the desired speaker (represented by the blue circle) and a kitchen fan (located in the top right corner of the living room in the 2D schematic) acting as the noise source. Only the nodes marked 1, 2, 3, 6, 7, and 8 were used for this experiment. The speech signal for the loudspeaker consisted of three sentences from the TIMIT [24] database, spoken by a female speaker. The total duration of the recording was 23 seconds.
The microphones employed were Sonion N8AC03 (analog), and the inter-microphone distance at each node was 5 cm. A picture of one node with the location of the microphones indicated is shown in Figure 8. The sampling frequency was 16 kHz, and the analog-to-digital converter of every node was configured to use a bit depth of 12 bits for acquisition. The microcontroller unit in each node is the Wonder Gecko EFM32WG980 from Silicon Labs [28], which is used for sampling and sending data to a Raspberry Pi 3 [29] via USB. The Raspberry Pi at each node is used to upload the audio samples to a USB drive. The nodes were synchronized once every second using a pulse that was sent through coaxial cable and triggered by a GPS/DCF receiver. The recorded audio signals were stored and subsequently processed using the MATLAB software as described at the beginning of Section 4. We implemented the processing offline to focus on the characterization of the performance of the bit depth reduction algorithm and the comparison of the different metrics using real audio data.
In Figure 9 we can see the SI-SNR of the output signal estimated by the MWF using the recorded audio signals. In this case, quantization was performed in the time domain. The SI-SNR of the input microphone signals lay between −16 and −7 dB. The noise power for the SI-SNR calculation was computed using the non-speech segments. The greedy adaptive quantization algorithm was stopped when the total number of bits used reached 20. It can be observed that the impact metric again outperforms the gradient and the utility metrics and provides a smoother way of downscaling the WASN performance, in agreement with the results from Section 4.1. Besides, the warped gradient performs very close to the impact thanks to the correction that accounts for the significance of each bit, again in agreement with the results from Section 4.1. We would like to note that the fact that the impact and the warped gradient outperform the gradient and the utility, as we can observe in both Figures 3 and 9, agrees with the theoretical discussion of Section 3.3, where we describe the limitations of each metric. The four metrics achieve a similar performance only in the high resolution regime, where the samples from every signal are encoded with a high bit depth and the bits removed have low significance.
Finally, we turn our attention again to quantization in the frequency domain, as explained in Section 3.4. We followed the same strategy as in the previous section, where we consider uniform frequency bands, each spanning the same number of frequency bins. In Figure 10 we can see the behaviour of the SI-SNR for this experiment, where a slower decay compared to the evolution in Figure 9 is observed. Although the impact outperforms the rest of the metrics, the four metrics diverge less from each other than in the time domain quantization shown in Figure 9. We note that for this experiment the warped gradient performs worse than the utility and the gradient.
4.3. Analysis of Energy Consumption
To conclude, we estimate the energy savings that can be achieved in communication by reducing the bit depth assignment of the microphone signals using the greedy adaptive quantization algorithm. This estimation is based on the power consumption of the WASN hardware nodes we used to record the audio signals. We employ a simplified model (37) for the average energy required to transmit a given number of bits from one node to the fusion centre: the number of bits is divided by the data rate in bits per second, which gives the transmission time, and the result is multiplied by the average power consumed by the radio module in active status. We note that (37) provides only an approximation of the required transmission energy, since it ignores factors such as the retransmission of lost packets. However, a detailed model for the transmission energy is outside the scope of this paper. The interested reader can find more advanced methods in [30].
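As a small worked example of this simplified model, the sketch below evaluates the transmission energy for the radio figures reported later in this section (41.8 mW in active mode at 1 Mbps). The function name and the example payload of one million bits are our own assumptions.

```python
def transmission_energy_joules(n_bits: float, rate_bps: float, power_watts: float) -> float:
    """Simplified model: transmission time (bits / rate) times active radio power."""
    return n_bits / rate_bps * power_watts


RATE_BPS = 1e6       # data rate of the radio module, from the text
POWER_W = 41.8e-3    # active power of the radio module, from the text

energy_per_bit = transmission_energy_joules(1, RATE_BPS, POWER_W)
energy_one_megabit = transmission_energy_joules(1_000_000, RATE_BPS, POWER_W)
energy_half = transmission_energy_joules(500_000, RATE_BPS, POWER_W)
```

Because the model is linear in the number of bits, every bit removed by the greedy algorithm translates directly into a proportional energy saving, apart from the packet overhead discussed next.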
We will first discuss the case where quantization is performed in the time domain; that is, the bit depth assigned to the microphone signal is equal for every frequency.
The number of bits needed to transmit an audio frame from a given microphone signal is given by (38) as the sum of two terms. The first term is the audio data, that is, the frame length in samples multiplied by the bit depth assigned to the microphone signal. The second term is the protocol overhead, that is, the length in bits of the headers containing protocol information multiplied by the number of packets necessary to fit the samples of the frame according to the network protocol rules.
The radio module of the nodes we used to acquire our audio recordings consists of an IEEE 802.15.4 standard compliant radio from Atmel (AT86RF233) in combination with an ARM Cortex M4 microcontroller. In active mode, the power consumption is 41.8 mW at a data rate of 1 Mbps. The packet in the IEEE 802.15.4 standard consists of 127 payload bytes and 6 header bytes [31]. The 127 bytes include 2 CRC bytes and 125 bytes of actual data plus headers originating from higher layers (e.g., IPv6 for the network layer and UDP for the transport layer). We assume that 25 bytes correspond to headers from higher layers. This leads to each packet carrying 33 bytes of overhead and a maximum of 100 bytes of data corresponding to audio samples. The number of packets necessary to transmit a frame of audio samples encoded with a given bit depth is then given by (39), that is, the total number of audio bits in the frame divided by the 800 data bits available per packet, rounded up to the nearest integer. As we have explained in Algorithm 1, when a signal is assigned 0 bits, it is removed from the estimation process for subsequent iterations. We are interested in the total energy spent in the transmission of one frame of samples per microphone signal included in the estimation process, which is given by (40) as a sum of per-signal energies, where each per-signal energy is computed using (37) and (38) and the sum runs over the subset of indexes of the microphone signals included in the estimation process. However, we also have to consider the messages that the fusion centre needs to send to the nodes in every iteration to inform them of which microphone signal will have its bit depth reduced. These messages are limited in size, since only the index of the signal whose bit depth needs to be reduced has to be communicated to the nodes. The length of one fusion centre packet in bits is given by (41) as the header length plus the payload, where we assume that the message contains one byte of payload. The energy spent in the transmission of these packets is related to the speed of refreshment of the bit depth allocation algorithm, that is, the rate at which the network performs the iterations required by the algorithm. 
We refer to this rate as the refreshment rate, which is given by the inverse of the number of frames the network waits between two consecutive iterations of the bit depth allocation algorithm. A value of 1 means that the bit depth allocation changes every frame, and a value of 0.5 means that it changes every two frames. Following (37), the average energy per frame required to transmit the fusion centre packet is given by (42) as the energy of one fusion centre packet multiplied by the refreshment rate. We can then modify (40) to include this term, so that the total energy spent by the network in the duration of one frame is given by (43). This expression involves the number of nodes in the network, which is included to account for the energy spent by the nodes in the reception of the packet. Note that it is implicitly assumed here that the energy spent in the reception of a packet is of the same order of magnitude as the energy spent in its transmission. This assumption is valid over short distances [32], which can be expected in the context of a WASN. A quick calculation of the ratio between the energy of the fusion centre packets and the energy of the audio transmission, for the parameter values considered and a header length of 264 bits (corresponding to 33 bytes), yields roughly 5%. While this is only an approximate energy model and other communication concerns may arise due to the speed of refreshment, such as the use of bandwidth or the need for retransmissions, from the point of view of energy we can conclude that even for fast rates, that is, one iteration per frame, the reduction of transmission energy is not jeopardized by the refreshment rate in most situations. In practice, the choice of the refreshment rate depends on the dynamics of the acoustic scenario; for example, in a scenario with moving sources a high rate may be needed to track the sources, while in a static scenario a lower rate can be sufficient.
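The per-frame energy accounting described above can be sketched as follows. The packet sizes, data rate, and radio power are taken from the text; everything else is an assumption for illustration, in particular the frame length of 256 samples, the network of 4 nodes with 12 signals at 16 bits, and the factor `(1 + n_nodes)` used to count one transmission by the fusion centre plus one reception per node, which may not match the exact form of (43).

```python
from typing import Sequence, Tuple

PAYLOAD_BITS = 100 * 8   # audio data bits per packet (from the text)
OVERHEAD_BITS = 33 * 8   # header, CRC and higher-layer overhead per packet (from the text)
RATE_BPS = 1_000_000     # data rate of the radio module (from the text)
POWER_W = 41.8e-3        # active power of the radio module (from the text)


def energy_j(n_bits: int) -> float:
    """Simplified energy model: transmission time times active power."""
    return n_bits / RATE_BPS * POWER_W


def packets_needed(n_samples: int, bit_depth: int) -> int:
    """Number of packets required to carry one frame (integer ceiling division)."""
    return -(-(n_samples * bit_depth) // PAYLOAD_BITS)


def bits_per_frame(n_samples: int, bit_depth: int) -> int:
    """Audio bits plus protocol overhead for one frame of one microphone signal."""
    return n_samples * bit_depth + packets_needed(n_samples, bit_depth) * OVERHEAD_BITS


def frame_energy(n_samples: int, bit_depths: Sequence[int],
                 n_nodes: int, refresh_rate: float) -> Tuple[float, float]:
    """Return (audio energy, control energy) in joules for one frame."""
    audio = sum(energy_j(bits_per_frame(n_samples, b)) for b in bit_depths if b > 0)
    fc_packet_bits = OVERHEAD_BITS + 8  # one byte of payload
    # Assumption: one transmission plus one reception per node, all at the same cost.
    control = refresh_rate * (1 + n_nodes) * energy_j(fc_packet_bits)
    return audio, control


audio_j, control_j = frame_energy(n_samples=256, bit_depths=[16] * 12,
                                  n_nodes=4, refresh_rate=1.0)
```

Evaluating `control_j / audio_j` for different bit allocations gives a quick feel for how the control overhead compares with the audio traffic as bits are removed.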
We now turn our attention to quantization with a different bit depth in each of the frequency bands. This leads to each microphone signal having a bit depth assigned to each frequency band. The number of bits needed for the transmission of one frame of complex STFT coefficients from a microphone signal can be calculated following (38), with the bit depth replaced by the average number of bits assigned to that microphone signal. This average is obtained from the bit depths of the individual bands, taking into account the number of frequency bins included in each band. The number of packets necessary is computed accordingly from the average number of bits. We note that, since each payload byte gives the fusion centre 256 combinations of channel and frequency band indexes, a packet of very similar length to the one considered in (41) can be used in this case to let the fusion centre inform the nodes of where to remove bits. While quantization in several frequency bands allows for extra granularity, the energy analysis shown above applies in a straightforward manner by considering the average number of bits in place of the single bit depth per microphone signal.
Finally, in Figure 11 the resulting SI-SNR (the same as in Figure 9) is plotted versus the total energy spent in transmission calculated from (43). Similarly, in Figure 12 we show the resulting SI-SNR (the same as in Figure 10) plotted versus the total energy spent in transmission, calculated following the energy analysis for frequency domain quantization shown above. These graphs illustrate the estimated transmission energy savings that can be achieved through the use of the greedy adaptive quantization algorithm. For time domain quantization, Figure 11 shows that the total transmission energy can be reduced to roughly one half without a meaningful loss in performance and to roughly one quarter for a small loss of 1 dB. For frequency domain quantization the savings are potentially higher, since the total transmission energy can be reduced to roughly one third without a meaningful loss in performance.
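The average number of bits per microphone signal can be illustrated as follows. We assume here that the average is the mean of the band bit depths weighted by the number of frequency bins in each band, which is our reading of the description above rather than a verbatim transcription of the corresponding equation; the function name and example values are hypothetical.

```python
from typing import Sequence


def average_band_bits(band_bits: Sequence[int], band_bins: Sequence[int]) -> float:
    """Bin-weighted average of the bit depths assigned to the bands of one signal."""
    if len(band_bits) != len(band_bins):
        raise ValueError("one bit depth is required per frequency band")
    total_bins = sum(band_bins)
    return sum(b * n for b, n in zip(band_bits, band_bins)) / total_bins


# Non-uniform bands: the wide band dominates the average.
avg_nonuniform = average_band_bits(band_bits=[8, 4], band_bins=[3, 1])
# Uniform bands: the weighted average reduces to the plain mean.
avg_uniform = average_band_bits(band_bits=[8, 4], band_bins=[2, 2])
```

This average can then be used wherever the time domain analysis uses the single bit depth of a microphone signal.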
5. Conclusions
We have provided a better understanding of adaptive quantization for speech enhancement in wireless acoustic sensor networks based on the previously proposed impact metric. We have done so by extending the mathematical framework of adaptive quantization in linear MMSE estimation, where we have proposed a metric based on the gradient of the MMSE and demonstrated how this metric naturally leads to a greedy approach. Moreover, we have shown that the impact metric is a generalization of the gradient metric, where the gradient is a limit case of the impact. We also propose a correction to improve the gradient metric by considering the significance of each quantization bit for different bit depths. Besides, the impact also generalizes a utility metric previously proposed for sensor subset selection. Through the use of a simulated and a real life environment we have assessed the superiority of the impact and the corrected gradient metrics over the gradient and the utility metrics due to their adaptability to the significance of each quantization bit. Besides, we have provided an estimation of the possible energy savings achievable through the use of the greedy adaptive quantization algorithm based on any of the studied metrics. In future work, an extension of this approach to a distributed speech enhancement algorithm will be explored, hence going beyond the centralized setting targeted in this work. Another important research direction will be the incorporation of psychoacoustic characteristics of human hearing to the bit depth allocation algorithm in order to improve the allocation in different frequency bands.
Appendix
The model for the effect of quantization noise on the MWF developed in Section 3.1 relies on the quantization noise being uncorrelated with the input microphone signals and with the desired speech signal components to establish equations (16) and (17). However, one might intuitively expect the quantization of a microphone signal to reduce its cross-correlation with the other microphone signals. This would lead to a decrease in the off-diagonal elements of the correlation matrix of the quantized microphone signals compared to the off-diagonal elements of the correlation matrix of the unquantized microphone signals.
This can be taken into account by using an alternative model for quantization in which (11) is substituted by (A.1), where the quantized microphone signals are additionally multiplied by the diagonal matrix defined in (A.2), whose diagonal elements are the scaling factors given in (A.3). Note that each factor rescales the corresponding quantized microphone signal to its original power, since quantization might be expected not to increase the microphone signal power. The corresponding microphone signal correlation matrix is then given by (A.4). As we can observe from (A.3) and (A.4), the off-diagonal elements of each column of this matrix are the off-diagonal elements of the same column of the original microphone signal correlation matrix multiplied by the corresponding scaling factor, while the elements on its main diagonal are equal to those of the original matrix. In summary, the alternative model represents the effect of quantization as a decrease in the cross-correlation between the microphone signals (hence the decrease in the off-diagonal elements), while their powers (given by the main diagonal elements) remain unchanged.
The cross-correlation between the rescaled quantized microphone signals and the desired signal can be obtained by using (A.1), where we have assumed that the quantization noise and the desired signal are uncorrelated. Following (5) and (19), we can express the MMSE obtained from the MWF computed based on the alternative model. Using (A.4) and (A.7), we find an expression that coincides with (19), which proves that both models yield the same MMSE. We can then conclude from the derivation presented above that modeling the effect of quantization noise through (11) or (A.1) leads to the same MMSE and thus to the same impact and gradient metrics. Therefore, there is no dilemma between the two models regarding the effect of the quantization of the microphone signals on the MWF.
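The equivalence argued above can be sketched compactly. The notation below is our own assumption for illustration and need not match the symbols used in (11), (19), and (A.1) to (A.7): $\mathbf{y}$ denotes the microphone signals, $d$ the desired signal with power $\sigma_d^2$, $\mathbf{e}$ the quantization noise with diagonal covariance $\boldsymbol{\Sigma}_e$ (uncorrelated with $\mathbf{y}$ and $d$), and $\mathbf{D}$ a real diagonal matrix with strictly positive entries. The specific choice of $d_k$ in the last line is also an assumption, chosen so that each rescaled signal recovers its original power.

```latex
\begin{align*}
% Additive model: quantized signals are y + e
\mathbf{R} &= E\{(\mathbf{y}+\mathbf{e})(\mathbf{y}+\mathbf{e})^{H}\}
            = \mathbf{R}_{yy} + \boldsymbol{\Sigma}_e, \qquad
\mathbf{r} = E\{(\mathbf{y}+\mathbf{e})\,d^{*}\} = E\{\mathbf{y}\,d^{*}\}, \\
J &= \sigma_d^{2} - \mathbf{r}^{H}\mathbf{R}^{-1}\mathbf{r}, \\[4pt]
% Rescaled model: quantized signals are D(y + e)
\tilde{\mathbf{R}} &= \mathbf{D}\,\mathbf{R}\,\mathbf{D}, \qquad
\tilde{\mathbf{r}} = \mathbf{D}\,\mathbf{r}, \\
\tilde{J} &= \sigma_d^{2} - \tilde{\mathbf{r}}^{H}\tilde{\mathbf{R}}^{-1}\tilde{\mathbf{r}}
           = \sigma_d^{2} - \mathbf{r}^{H}\mathbf{D}\,\mathbf{D}^{-1}\mathbf{R}^{-1}\mathbf{D}^{-1}\mathbf{D}\,\mathbf{r}
           = J, \\[4pt]
% Assumed scaling that restores the original power of each signal
d_k &= \sqrt{\frac{[\mathbf{R}_{yy}]_{kk}}{[\mathbf{R}_{yy}]_{kk} + \sigma_{e_k}^{2}}}
\;\Rightarrow\; [\tilde{\mathbf{R}}]_{kk} = [\mathbf{R}_{yy}]_{kk}.
\end{align*}
```

The cancellation in the fourth line holds for any invertible diagonal $\mathbf{D}$, which reflects the general fact that a linear MMSE estimate is unaffected by an invertible rescaling of its input signals.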
Disclosure
The scientific responsibility is assumed by the authors.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research work was carried out at the ESAT Laboratory of KU Leuven, in the frame of Research Fund KU Leuven BOF/STG14005 and CoE PFV/10/002 (OPTEC) and FWO nr. G.0931.14 “Design of Distributed Signal Processing Algorithms and Scalable Hardware Platforms for Energy-vs-Performance Adaptive Wireless Acoustic Sensor Networks” and IWT SBO project SINS: Sound Interfacing through the Swarm. The authors would like to thank Steven Lauwereins for the design of the WASN nodes and Gert Dekkers for setting up the WASN and helping in the measurement process.