Center for Studies on Electronics and Information Technologies, Central University of Las Villas, C. Camajuaní, km 5.5, Santa Clara, CP 54830, Cuba
Harmonics-to-noise ratios (HNRs) are affected by general aperiodicity in voiced speech signals. To specifically reflect a signal-to-additive-noise ratio, the measurement should be insensitive to other periodicity perturbations, like jitter, shimmer, and waveform variability. The ensemble averaging technique is a time-domain method which has been gradually refined in terms of its sensitivity to jitter and waveform variability and required number of pulses. In this paper, shimmer is introduced in the model of the ensemble average, and a formula is derived which allows the reduction of shimmer effects in HNR calculation. The validity of the technique is evaluated using synthetically shimmered signals, and the prerequisites (glottal pulse positions and amplitudes) are obtained by means of fully automated methods. The results demonstrate the feasibility and usefulness of the correction.
1. Introduction
When the source-filter model of speech production [1] is assumed in Type 1 [2] signals (no apparent bifurcations/chaos), the sources of periodicity perturbations in voiced speech signals can be divided in four classes [3]: (a) pulse frequency perturbations, also known as jitter, (b) pulse amplitude perturbations, also known as shimmer, (c) additive noise, and (d) waveform variations, caused either by changes in the excitation (source) or in the vocal tract (filter) transfer function. Vocal quality measurements have focused mainly in the first three classes (see [4] for a comprehensive survey of methods reported in the previous century). The findings of significant interrelations among measures of jitter, shimmer, and additive noise [5] raised the question on “whether it is important to be able to assign a given acoustic measurement to a specific type of aperiodicity” (page 457). This ability of a measurement to gauge a particular signal attribute, being insensitive to other factors, has been a persistent interest in vocal quality research.
Harmonics-to-Noise-Ratios (HNRs) have been proposed as measures of the amount of additive noise in the acoustic waveform. However, an HNR measure insensitive to all the other sources of perturbation is, if feasible, still to be found. Methods in both time and frequency (or transformed) domain do always have intrinsic flaws. Schoentgen [6] described analytically the effects of the different perturbations in the Fourier spectra of source and radiated waveforms. According to the derivations from his models, it is not possible to perform separate measurements of each type of perturbation by using spectral-based methods. Time domain methods have been criticized [7, 8] for depending on the correct determination of the individual pulse boundaries, among many other method-specific factors.
Yumoto et al. introduced a time-domain method for determining HNR [9], where the energy of the harmonic (repetitive) component is equal to the variance of a pulse “template” obtained as the ensemble average of the individual pulses. The energy of the noise component is calculated as the variance of the differences between the ensemble and the template (see (4) in Section 2).
The original ensemble-averaging technique has been criticized [10, 11] for its slow convergence with , the number of averaged pulses. The requirement of large facilitates the inclusion of slow waveform changes in the ensemble, which are incorrectly treated as noise by the method. The sensitivity of the method to jitter and shimmer has also been reported [5], and many approaches attempting to overcome these limitations have been proposed.
In [12] the need of averaging a large number of pulses is suppressed, by determining an expression which corrects the ensemble-average HNR.
Qi et al. used Dynamic Time Warping (DTW) [13] and later Zero Phase Transforms (ZPTs) [14] of individual pulses prior to averaging to reduce waveform variability (and jitter) influences in the template. For the same purpose the ensemble averaging technique was applied to the spectral representations of individual glottal source pulses in [3], where a pitch synchronous method allowed to account for jitter and shimmer in the glottal waveforms. However, the assumptions are valid only on glottal source signals; hence results are not applicable to vocal tract filtered signals. Functional Data Analysis (FDA) has also been used to perform the optimal time alignment of pulses prior to averaging [15].
Shimmer corrections to ensemble averages HNRs have received lesser attention than pulse duration (jitter) corrections, in spite of being a prerequisite for some of the mentioned jitter correction methods. DTW and FDA, for instance, depart from considering equal amplitude pulses to determine the required expansion/compression of the waveform duration. Besides, shimmer always increases the variability of the ensemble with respect to the template in the reported methods. A normalization of each individual pulse by its RMS value was proposed in [7] to reduce shimmer effects on HNR and was first used on a method that also accounted for jitter and offset effects in [16]. This pulse amplitude (shimmer) normalization can help in the time warping of the pulses and actually reduces the variance of the template in Yumoto’s HNR formula. However, it still yields only an approximate value of HNR.
In this paper, an analysis on the original ensemble average HNR formula in the presence of shimmer is performed, which results in a general form of Ferrer’s correcting formula [12] and allows the suppression of the effect of shimmer in HNR.
2. Ensemble-Averages HNR Calculation
The most widely used model for ensemble averaging assumes each pulse representation prior to averaging as a repetitive signal plus a noise term :
This representation has been used for source [3] and radiated signals [5, 9, 14, 16] as well as for both indistinctly [12, 15]. If we denote the glottal flow waveform as , the vocal tract impulse response as , the radiation at lips as , and the turbulent noise generated at the glottis as , the components of the pulse waveform in (1) can be expressed differently for the source and radiated signals. If (1) represents the excitation signal, then , and , while for radiated signals and [17], with the asterisk denoting the convolution operation. Some important differences between both alternatives are [17] as follows.
(i)HNR measured in the radiated signal differs from HNR in the glottal signal.(ii)Jitter in the glottal signal produces shimmer in the radiated signal.(iii)Additive White Gaussian Noise (AWGN) in the glottis (a rough approximation [18] frequently assumed) yields colored noise at the lips.In the general form of the ensemble average approach, if the noise term is stationary and ergodic and and are zero mean signals (the typical assumptions in the minimization of the mean squared error [12, 19, 20]) with variances and , the actual HNR for the set of pulses is
with denoting the expected value operation. The ensemble averaging method proposed by Yumoto et al. [9] is based on the use of a pulse template as an estimate of the repetitive component :
This approximation to is then used to obtain an estimate of according to (1), and both estimates are used in (2) to produce Yumoto’s HNR formula:
The bias produced in due to the use of (3) on its calculation and the terms needed to correct it are described in [12], where it is shown that
However, the model previously described neglects the effect of shimmer when the different replicas of the repetitive signal are of different amplitude.
3. Insertion of Shimmer in the Model
To account for shimmer, a variable can be added to the model in (1):
For this model, the actual HNR is
Using the original ensemble average procedure, the template yields
and its variance is
If is uncorrelated with or any such that , the second term between brackets in (9) as well as all the products in the third term where can be suppressed:
With the inclusion of shimmer in the model, the denominator in (4) is
To simplify further derivations, the letters , and are used to represent the four terms summed and squared in (11):
Using (12), (11) can be written as
where the last five terms between brackets can be suppressed, since for any . From the first five terms, it was already shown in [12] that
The summations of the other nonzero expected values ( and ) are examined as follows:
while
and using
equation (16) yields
Finally
since
then (19) results in
The sum of (15), (18), and (21) is
Now, substituting (14) and (22) in the denominator of (4) and (10) in the numerator gives
From (23) the ratio of signal and noise variances can be determined as
and the actual HNR given by (7) can be rewritten as
Equation (25) can be simplified by using a factor defined as
and HNR expressed as
According to (26), will be a positive number ranging from one (in the no-shimmer case, being all equal) to when a single pulse is a lot greater than all the others. The latter situation is not the case in voiced signals, where the largest shimmer almost never exceeds the 50% of the mean amplitude [2] in extremely pathological voices. Equation (27) is a generalization of Ferrer’s correcting formula [12] expressed in (5), being equal in the no-shimmer case ().
4. Experiment
The calculation of (27) requires the prior determination of both pulse boundaries and amplitudes. Pulse boundaries are usually determined by means of a cycle-to-cycle pitch detection algorithm (PDA). The determination of pulse amplitudes relies on the pitch contour detected by the PDA, and a comparison of several amplitude measures can be found in [21]. In practice, the detected pulse boundaries and amplitudes differ from the real ones, causing a reduction in the theoretical usefulness of (27). An additional deterioration can be expected in the presence of correlated noise, as should be the case in radiated speech signals.
To evaluate the effects of these deteriorations, synthetic voiced signals were used with known pulse positions, noise and shimmer levels. The synthesis procedure of the speech signal is described by (28):
where is the vocal tract impulse response, denotes the convolution operation, is the variable pulse amplitude, is the glottal flow waveform, is the pulse number, is the pitch period, and is the additive noise in the signal. The effect of lip radiation has been included as the first derivative operation present in . This synthesis procedure is similar to the one used in [12, 19, 21, 22], but using a more refined glottal excitation than an impulse train. In this case, a train of Rosemberg’s type B polynomial model pulses [23] was chosen; this alternative is used in [3, 24].
The discrete implementation of (28) was performed by setting a sampling frequency of 22050 Hz, a fundamental frequency of 150 Hz (yielding 147 samples per period), and , to produce an approximate of 2 seconds of synthesized voice. The was obtained as the impulse response of a five formant all-pole filter, with the same parameters used in [12, 19, 21, 22]. The glottal flow was generated using a rising time of and a falling time of ; the values which resulted in the most natural-sounding synthesis in [23].
The shimmer was controlled by changing the value of each pulse amplitude , obtained as , where is a random real value, uniformly distributed in the interval . Eight levels of shimmer were synthesized, using values of from 0% to 47.6% in steps of 6.8%, measured in percent of the unaltered amplitude , the same values as in [12, 21].
The estimates of HNR calculated were the original ensemble average formula by Yumoto given in (4), the correction for any number of pulses given in (5), and the removal of shimmer effects given by (27). The three HNR estimates were calculated using first the known pulse durations and amplitudes, and then using the positions given by a well-known PDA (the superresolution approach from Medan et al. [19]), and the amplitudes were calculated with Milenkovic’s formula [20] using the procedure described in [21].
A base level of noise was added to the signal, to avoid values near to zero in the denominator of in (4). The variance of the noise added was chosen to produce an actual (30 dB). Two types of noise were added: AWGN, in conformity with the assumptions of uncorrelated noise made on deriving (27), and a vocal tract filtered version, having some level of correlation which is most likely the case in radiated signals.
The HNR estimates were found for ensembles of two consecutive pulses () in the synthesized signals, and the overall HNR was found as the average of these pairwise HNR’s.
5. Results and Discussion
The average value for 100 realizations of the random variables involved (noise and shimmer) was found for each HNR estimation variant on each shimmer level. It is relevant to note that the PDA detected the pulse positions without any error (not even a sample), for all realizations and all levels of shimmer. For this reason, (4) and (5) produced the same results using both the known and the detected pulse positions. Equation (27) produced different results since it involves also the calculation of the amplitude ratios among pulses, which produced results different to the values used in the synthesis.
The results for the different methods facing both noise types are shown in Figure 1, and the discussion below is first centered in the AWGN and later in the effect of the correlation present in the vocal tract filtered noise.
Figure 1: Results for the different HNR estimation methods. HNRY (in triangles) is the original formula in [
9], HNRC (squares) the pulse number correction in [
12], HNRS (plus signs) the shimmer correction proposed here (using known pulse amplitudes), and HNRSr (circles) the shimmer correction using estimated pulse amplitudes. Dashed lines represent results with AWGN; solid lines and apostrophes represent vocal tract filtered AWGN. Horizontal dashed line at 30 dB represents true HNR.
AWGN
For the zero-shimmer level the results are as predicted: the original approach (HNRY) overestimates the actual HNR (30 dB), while the corrected approaches produce adequate and equivalent results. When shimmer appears, HNRC begins to fall in parallel with HNRY, while both approaches considering shimmer, HNRS and HNRSr, show superior performance, with their values less affected by the increasing levels of shimmer.
Two relevant facts are as follows.(i)Shimmer-corrected approaches (HNRS and HNRSr) are nevertheless deteriorated by the shimmer level.(ii)There is a better performance of HNRSr in comparison with HNRS, in spite of using estimated values for the pulse amplitudes.Both facts can be explained by the presence, in any pulse of the signal, of the decaying tails of previous pulses. This summation of tails adds differences to the pulses, interpreted as noise in the model and causing a reduction in the calculated HNR as the introduced shimmer increases. On the other hand, the summation of tails in one pulse is not completely uncorrelated with the summation of tails in the other. For this reason, the estimation of relative pulse amplitudes, based in the assumption of uncorrelated noise, produces amplitudes with an overestimation of the signal component, yielding a higher HNRSr than HNRS.
It is to be expected that in the presence of jitter HNRSr will perform worse, since pulse tails would not always be aligned with the adjacent pulse, and the correlation should be lower. The evaluation of the influence of jitter (as well of other levels of noise and their combinations) in the performance of the PDA and HNRSr would require extensive tests and is out of the scope of this paper.
Vocal tract filtered AWGN
When noise is not uncorrelated as assumed in the derivation of (27), a fraction of it is regarded as signal, incrementing HNR estimates (solid lines) in all variants with respect to the results with uncorrelated noise (dashed lines). A significant fact is that this overestimation is more relevant in HNRS (plus signs in Figure 1) than in HNRSr (circles). The correlated contributions of noise and shimmered tails add to what is considered signal by the model in HNRS, while in HNRSr this effect seems to be compensated by its related consequence in estimating pulse amplitudes with the same assumptions about noise and signal correlations.
In general, shimmer corrections with estimated amplitude contours (HNRSr, in circles in Figure 1) produce the closest estimates to the true HNR, which for these experiments would be the flat horizontal line at 30 dB shown in Figure 1.
6. Conclusions
The performed analysis shows that shimmer effects can be reduced in HNR estimations based in the ensemble-averages technique using similar assumptions than in [3, 20]. The requirements for the calculation of (27) (detection of pulse positions and amplitudes) can be performed with satisfactory results using available methods.
More tests should be performed considering more types of perturbations (different noise and jitter values, as well as their combinations) as well as different vocal tract configurations. However, the experiments in this paper were performed using configurations reported in other works, and based on the preliminary results shown, the proposed approach appears to be an alternative for the estimation of HNR in the time domain superior to previous ensemble averages techniques.
Acknowledgments
This research was partially funded by the Canadian International Development Agency Project Tier II-394-TT02-00 and by the Flemish VLIR-UOS Program for Institutional University Cooperation (IUC).