Abstract

The complex cepstrum vocoder is used to modify the speaker-specific characteristics of the source speaker’s speech to those of the target speaker. Low time and high time liftering are used to split the calculated cepstrum into vocal tract and source excitation parameters. The resulting mixed phase vocal tract and source excitation parameters, with finite impulse response, preserve the phase properties of the resynthesized speech frame. A radial basis function network is explored to capture the nonlinear mapping function for modifying the real and imaginary components of the complex cepstrum based vocal tract and source excitation of the speech signal. The state-of-the-art Mel cepstrum envelope and the fundamental frequency ($F_0$) are considered as the baseline representation of the vocal tract and the source excitation of the speech frame, respectively. A radial basis function network is used to capture the nonlinear relations between the Mel cepstrum envelopes of the source and target speakers. A mean and standard deviation approach is employed to modify the fundamental frequency ($F_0$). The Mel log spectral approximation filter is used to reconstruct the speech signal from the modified Mel cepstrum envelope and fundamental frequency. The proposed complex cepstrum based model is compared with the state-of-the-art Mel cepstrum envelope based voice conversion model using objective and subjective evaluations. The evaluation measures reveal that the proposed complex cepstrum based voice conversion system approximates the converted speech signal with better accuracy than the Mel cepstrum envelope based voice conversion model.

1. Introduction

A voice conversion (VC) system extracts the features of the source and target speakers’ speech and formulates a mapping function to modify the features of the source speaker’s speech such that the resynthesized speech sounds as if spoken by the target speaker [1]. Applications of VC include the personification of text to speech, design of multispeaker speech synthesis systems, audio dubbing, karaoke applications, security related systems, the design of speaking aids for speech impaired patients, broadcasting, and multimedia applications [2–4]. VC involves the transformation of speaker-specific characteristics such as vocal tract parameters, source excitation, and long term prosodic parameters into those of the desired speaker [5]. The vocal tract parameters are relatively more prominent for identifying speaker uniqueness than the source excitation [5].

Several methods have been reported in the literature to characterize the spectrum of the speech frame, namely, Formant Frequency (FF), Formant Bandwidth (FBW) [1], Linear Predictive Coefficients (LPC) [6], Reflection Coefficients (RC) [7], Log Area Ratio (LAR) [8], Cepstrum Coefficients [9], Mel cepstrum envelope (MCEP) [10], Wavelet Transform (WT) [11], and Mel generated spectra [12]. Line Spectral Frequency (LSF) [13, 14] is a direct mathematical transformation of LPC, which is especially attractive for representing the vocal tract as it smoothly traces the shape of formants and antiformants and overcomes the interpolation, quantization, and stability issues of the LPC. However, LP related features do not account for the nonstationary characteristics of the speech signal within a frame and therefore fail to analyze local speech events accurately [15]. Further, a very accurate approach, STRAIGHT [16], has also been proposed, but it requires heavy computation and is therefore inappropriate for real time applications. Another approach using Mel Frequency Cepstrum Coefficients (MFCC) has been proposed [17], which properly models both spectral peaks and valleys. However, the main drawback of MFCC synthesis is the loss of pitch and phase related information [17].

Conventional parametric speech production models such as LPC, the real cepstrum [18–20], and the Liljencrants-Fant (LF) model [21] are based on a minimum phase model with infinite impulse response [22]. In contrast, a completely different category of glottal flow estimation relies on the mixed-phase model of speech [22, 23]. According to this model, the speech signal is composed of both maximum phase (i.e., anticausal) and minimum phase (i.e., causal) components. The return phase of the glottal pulse and the vocal tract impulse response are minimum phase signals, whereas the open phase of the glottal flow is considered a maximum phase signal [24]. It has been shown in the literature that mixed phase models are appropriate for representing voiced speech [25]. The real cepstrum, being minimum phase, discards the glottal flow information of speech, whereas the complex cepstrum incorporates phase as glottal pulse information during speech synthesis [25]. The complex cepstrum representation of the speech signal allows noncausal modeling of short time speech frames, which is what is actually observed in natural speech [22–24]. The complex cepstrum performs well in speech synthesis and speech modeling [25, 26].

For the development of an appropriate transformation model, various mapping functions have been proposed in the literature, such as Vector Quantization (VQ) based code-book mapping [6] and Gaussian Mixture Model (GMM) based transformation models [3, 9, 10]. Fuzzy vector quantization [27] and a Speaker Transformation Algorithm using Segmental Code-book (STASC) have been proposed to overcome the limitations of the VQ based model [14]. In addition, Dynamic Frequency Warping (DFW) [28] has also been used for transformation of the spectral envelope. The GMM oversmoothing issue has been addressed via maximum likelihood estimators and hybrid methods [29]. The dynamic kernel partial least squares regression technique has also been applied [12] for spectral transformation. In fact, the relation between the vocal tract shapes of different speakers is highly nonlinear; to capture this nonlinearity, artificial neural networks have been explored in the literature [10, 11, 14, 18, 30].

In addition to the vocal tract, the source excitation contains vital speaker-specific characteristics [1, 3], so it is necessary to modify the excitation signal properly to synthesize the target speaker’s voice accurately [4]. Very few methods for excitation signal transformation have been discussed in the literature. One is residual copying, but the converted sound resembles a third speaker’s voice [31]. Another is residual prediction [3], which suffers from oversmoothing. To alleviate the oversmoothing problem of residual prediction, the residual selection method, the unit selection method [31], and a combination of residual selection and unit selection have also been explored in the literature [32]. An artificial neural network model has also been applied to modify the residual signal, but time domain residual transformation loses the correlation in the speech production model, which leads to distortion in the speech signal [12].

In this paper, the complex cepstrum vocoder is employed to model the vocal tract and source excitation of the speech. Low time and high time lifters are designed to separate the complex cepstrum into vocal tract and source excitation parameters with real and imaginary components. The radial basis function (RBF) based transformation model is used because of its fast training, computational efficiency, and interpolation property. The RBF based mapping functions are trained separately to capture the nonlinear relations for modifying the real and imaginary components of the cepstrum based vocal tract and source excitation of the source speaker’s utterances to those of the target speaker’s utterances. Similarly, the MCEP parameters of the source speaker’s utterances are modified according to the target speaker’s utterances using RBF. The fundamental frequency of the source speaker’s utterances is modified towards that of the target speaker using the mean and standard deviation approach [10]. The Mel log spectral approximation (MLSA) filter [33] is used to reconstruct the speech signal from the modified MCEP and fundamental frequency ($F_0$).

Finally, the performance of the proposed complex cepstrum based VC approach is compared with the MCEP [34] based VC approach using objective measures such as a performance index [3], formant deviation [14, 30], and spectral distortion [14]. The commonly used subjective measures, Mean Opinion Score (MOS) and ABX, are used to verify the quality and speaker identity of the converted speech signal.

This paper is organized as follows. Section 2 describes the complex cepstrum analysis with the low time and high time lifters used to extract the cepstrum based features of the vocal tract and excitation signals. Section 3 explains the proposed VC system based on the complex cepstrum and the state-of-the-art MCEP based VC system. Radial basis function based spectral mapping is described in Section 4. The experimental environment, database, objective measures (performance index, formant deviation, and spectrograms), and perceptual tests (Mean Opinion Score (MOS) and ABX) conducted with human listeners are presented in Section 5. Section 6 gives the overall conclusions of the paper.

2. Complex Cepstrum Analysis

According to the source-filter model of human speech production, the source signal excites the vocal tract, which generates the speech signal. Human speech is a two-sided, real, and asymmetric signal. Hence, a mixed phase Finite Impulse Response (FIR) system may be realized which preserves the phase related information and gives more accurate synthesized speech. From the signal processing point of view, the short time speech signal $s(n)$ can be considered as the linear convolution of the source excitation $e(n)$ with the impulse response of the vocal tract $h(n)$: $s(n) = e(n) * h(n)$. Applying the DTFT to the speech signal gives $S(\omega) = \sum_{n=-N}^{N} s(n)\, e^{-j\omega n}$, where $N$ is the order of the cepstrum, that is, the number of one-sided frequencies. The time domain convolution becomes a spectral multiplication of the vocal tract filter response $H(\omega)$ and the source excitation response $E(\omega)$, giving the short time speech spectrum $S(\omega) = E(\omega)\, H(\omega)$. Cepstral analysis transforms the product of the source excitation and vocal tract responses in the frequency domain into a linear combination of the two components in the cepstral domain. The analysis of the speech signal needs to separate the two components $E(\omega)$ and $H(\omega)$, so a logarithmic representation is used in the frequency domain to combine them linearly. Applying logarithmic compression to the complex spectrum gives $\log S(\omega) = \log[E(\omega)\, H(\omega)]$, which separates into two parts, $\log S(\omega) = \log E(\omega) + \log H(\omega)$. The log spectrum can also be decomposed into the sum of magnitude and phase components, $\log S(\omega) = \log|S(\omega)| + j \arg S(\omega)$. The imaginary part of the logarithmic spectrum is the unwrapped phase sequence [23]. Thus, the phase information is no longer ignored, which gives rise to the complex cepstrum. The result is a mixed phase system of finite impulse response (FIR) type, which is stable.
The cepstrum is defined as $c(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log S(\omega)\, e^{j\omega n}\, d\omega$, which, by the linearity of the inverse transform, can be written as $c(n) = c_e(n) + c_h(n)$. The log spectral components that vary rapidly with frequency form the high time component $c_e(n)$, and the log spectral components that vary slowly with frequency form the low time component $c_h(n)$ [20]. When the cepstrum is computed with a finite-length DFT, the result is a time-aliased version of $c(n)$; a DFT length that is large relative to the total number of cepstrum samples keeps the aliasing effect negligible.

Consider $c_h(n) = l_L(n)\, c(n)$ and $c_e(n) = l_H(n)\, c(n)$, where $c(n)$ represents the complex cepstrum of the speech frame, $l_L(n)$ is the low time lifter, and $l_H(n)$ is the high time lifter. In the deconvolution stage an appropriate value of the lifter index is chosen to separate the two components, namely, the fast changing excitation parameter $c_e(n)$ and the slowly changing vocal tract parameter $c_h(n)$. The windowed signal and the complex cepstrum with its magnitude and phase spectra are shown in Figure 1. The coefficient $c(0)$ is the speech signal energy, and the remaining coefficients $c(n)$ signify the magnitude and phase at quefrency $n$ in the spectrum. The vocal tract cepstrum $c_h(n)$ has coefficients with significant magnitudes at lower values of $n$, and the source excitation cepstrum $c_e(n)$ has relatively lower magnitude coefficients at higher values of $n$. Theoretically, the complex cepstrum, being mixed phase, results in a more accurate model of the speech signal than the minimum phase synthesis filter approach, which discards the glottal flow information content in the cepstrum [18]. The cepstrum values at quefrencies below zero represent the maximum phase (i.e., anticausal) response, whereas the values at quefrencies above zero can be considered as the minimum phase (i.e., causal) response, as shown in Figure 2. Mathematically, it can be modeled as $c(n) = c^{-}(n) + c(0)\,\delta(n) + c^{+}(n)$, where $c^{-}(n)$ is nonzero only for $n < 0$ and $c^{+}(n)$ is nonzero only for $n > 0$. The anticausal and causal cepstrum parts with the corresponding magnitude and phase spectra are shown in Figure 3. It has been observed that the logarithmic compression involved in the cepstrum analysis helps in obtaining the mixed phase response for both voiced and unvoiced signals.
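
The analysis and liftering described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the FFT length, the linear phase removal step, the synthetic test frame, and the lifter cutoff of 30 samples are all assumptions chosen for illustration.

```python
import numpy as np

def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum from log magnitude plus unwrapped phase (sketch)."""
    spectrum = np.fft.fft(frame, n_fft)
    magnitude = np.maximum(np.abs(spectrum), 1e-12)   # guard against log(0)
    phase = np.unwrap(np.angle(spectrum))
    # Remove the linear phase term (a pure delay) so the phase is periodic.
    delay = np.round(phase[n_fft // 2] / np.pi)
    phase = phase - np.pi * delay * np.arange(n_fft) / (n_fft // 2)
    log_spectrum = np.log(magnitude) + 1j * phase
    return np.real(np.fft.ifft(log_spectrum)), int(delay)

def lifter(cepstrum, cutoff):
    """Split a cepstrum (FFT ordering) into low time and high time parts."""
    n = len(cepstrum)
    index = np.arange(n)
    quefrency = np.minimum(index, n - index)          # |n| for both sides
    low = quefrency <= cutoff
    vocal_tract = np.where(low, cepstrum, 0.0)        # low time lifter
    excitation = np.where(low, 0.0, cepstrum)         # high time lifter
    return vocal_tract, excitation

# Synthetic voiced-like frame: pulse train through a decaying resonance (assumed).
pulses = np.zeros(400)
pulses[::100] = 1.0
response = 0.9 ** np.arange(50) * np.cos(0.3 * np.arange(50))
frame = np.hamming(400) * np.convolve(pulses, response)[:400]

cepstrum, delay = complex_cepstrum(frame)
vocal_tract, excitation = lifter(cepstrum, cutoff=30)  # hypothetical cutoff
```

Because the two lifters are complementary, `vocal_tract + excitation` reproduces the full cepstrum, which is what allows the two parts to be transformed separately and recombined.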

3. Voice Conversion Framework

In this section, the complex cepstrum based VC algorithm is proposed. The MCEP-MLSA based VC algorithm is also developed for comparing the performance with the proposed algorithm.

3.1. Proposed Complex Cepstrum Vocoder Based VC

The proposed algorithm is implemented in two distinct phases, (i) training and (ii) transformation, as depicted in Figure 4. In the training phase, the input speech signals of the source and target speakers are normalized and silence frames are removed. Each normalized speech frame is represented using homomorphic decomposition, which takes advantage of logarithmic scaling and the theory of convolution. The low time portion of the complex cepstrum can be approximated as the vocal tract impulse response (VT), whereas the high time portion of the complex cepstrum is considered as the source excitation (GE) of the speech frame. The length of the rectangular lifter is chosen with regard to the accuracy of the vocal tract model and the sampling frequency. Thus, the cepstrum frame is split into the vocal tract impulse response and the source excitation of the speech using low time and high time liftering, respectively. Even if the source and target speakers utter the same sentence, the lengths of their feature vector sequences may differ, so dynamic time warping (DTW) is used to align these feature vectors. Separate RBF based mapping functions are developed for modifying the cepstrum based real and imaginary components of the vocal tract and source excitation of the source speaker according to the target speaker.
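
The alignment step can be sketched as a basic dynamic time warping routine. The Euclidean frame distance and the unconstrained three-way step pattern are assumptions, since the paper does not specify the DTW variant; the toy one-dimensional features stand in for real cepstral vectors.

```python
import numpy as np

def dtw_path(source, target):
    """Return the (source_frame, target_frame) pairs on the minimum-cost path."""
    n, m = len(source), len(target)
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
            cost[i, j] = dist[i - 1, j - 1] + best_prev
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda step: step[0])
    return path[::-1]

# Toy feature sequences of different lengths (frames x dimensions).
source = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [1.0], [1.0], [2.0]])
path = dtw_path(source, target)
aligned_source = source[[i for i, _ in path]]
aligned_target = target[[j for _, j in path]]
```

After alignment, `aligned_source` and `aligned_target` have the same number of frames and can be used as input-output pairs for training the mapping function.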

In the transformation phase, which follows the training phase, the parallel utterance of the test speaker is preprocessed to derive the vocal tract and source excitation feature sets based on cepstral analysis. The test feature vectors are projected through the trained RBF to obtain the transformed feature vectors. The time domain features are computed by inverse transforming the complex cepstrum based parameters. The modified speech frame is reconstructed by convolving the transformed vocal tract and source excitation. A similar process is adopted for all remaining frames. The overlap and add method is used to resynthesize speech from the modified speech frames. Finally, the speech quality is enhanced by postfiltering the modified speech. Figure 4 depicts the training and testing phase details of the proposed approach. The speech resynthesized from the complex cepstrum has higher perceptual quality than the speech signal constructed from the real cepstrum.
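
The overlap and add resynthesis step can be sketched as follows. The frame length of 400 samples, the hop of 200 samples, and the periodic Hann window are assumptions made for this illustration; the paper does not state the analysis window or overlap used in the proposed system.

```python
import numpy as np

def split_frames(signal, frame_len, hop):
    """Cut a signal into overlapping frames (frames x frame_len)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def overlap_add(frames, hop):
    """Sum overlapping frames back into a single signal."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

frame_len, hop = 400, 200                      # assumed: 25 msec frames at 16 kHz, 50% overlap
# A periodic Hann window sums to one at 50% overlap.
window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(frame_len) / frame_len)
signal = np.random.default_rng(0).standard_normal(16000)
frames = split_frames(signal, frame_len, hop) * window
# In a VC system each frame would be modified here before resynthesis.
rebuilt = overlap_add(frames, hop)
```

With unmodified frames the interior of `rebuilt` matches the input, which is a useful sanity check before inserting the frame-level transformation.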

3.2. Baseline Mel Cepstral Envelope Vocoder Based VC

Figure 5 depicts a block diagram of a VC system using the baseline features. During the analysis step, the MCEPs are derived as spectral parameters and the fundamental frequency ($F_0$) is derived as the excitation parameter every 5 msec [10]. As discussed in the earlier section, the feature sets obtained from the source and target speakers usually differ in time duration. Therefore, the source and target speakers’ utterances are aligned using DTW. The RBF is trained on the aligned feature sets to capture the relation between the source and target speakers’ features for VC. The excitation feature ($F_0$) is obtained using the cepstrum method, which calculates the pitch period over a frame size of 25 msec, and each frame is represented by 25 MCEP features. Mean and standard deviation statistics are obtained from $F_0$ and used as the feature set. In the testing phase, the parallel utterances of the test speaker are used to obtain the feature vector with a procedure similar to that used for the training set feature vector. To produce the transformed feature vector, the test speaker feature vector is projected through the trained RBF model. In the synthesis stage, the transformed MCEP and $F_0$ are passed through the MLSA filter [10, 33, 35]. Postfiltering is applied to the transformed speech signal to improve its quality.
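
A mean and standard deviation transformation of $F_0$ can be sketched as below. Working in the log domain and treating frames with $F_0 = 0$ as unvoiced are assumptions of this sketch (a common convention), not details confirmed by the paper; the function names are hypothetical.

```python
import numpy as np

def f0_stats(f0):
    """Mean and standard deviation of log F0 over voiced frames (F0 > 0)."""
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_source, source_stats, target_stats):
    """Shift and scale the source log F0 contour to the target statistics."""
    mu_s, sigma_s = source_stats
    mu_t, sigma_t = target_stats
    converted = np.zeros_like(f0_source, dtype=float)
    voiced = f0_source > 0                      # unvoiced frames stay at zero
    converted[voiced] = np.exp(
        mu_t + (sigma_t / sigma_s) * (np.log(f0_source[voiced]) - mu_s)
    )
    return converted

# Toy contours in Hz; zeros mark unvoiced frames.
source_f0 = np.array([0.0, 100.0, 120.0, 0.0, 140.0])
target_f0 = np.array([200.0, 0.0, 220.0, 260.0])
converted_f0 = convert_f0(source_f0, f0_stats(source_f0), f0_stats(target_f0))
```

In practice the statistics would be estimated once from the training utterances of each speaker and then applied to every test utterance.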

4. Radial Basis Function Based VC

The RBF network is used to model the nonlinearity between the source and target speaker feature vectors [11]. It is a special case of a feedforward network which nonlinearly maps the input space to a hidden space, followed by a linear mapping from the hidden space to the output space. The network represents a map from a $p$-dimensional input space to a $q$-dimensional output space, written as $F: \mathbb{R}^p \to \mathbb{R}^q$. When a training dataset of input-output pairs $[x_n, t_n]$, $n = 1, 2, \ldots, N$, is applied to the RBF model, the mapping function is computed as $F_k(x) = \sum_{n=1}^{N} w_{nk}\, \Phi(\|x - x_n\|)$, where $\|\cdot\|$ is a norm, usually Euclidean, that computes the distance between the applied input $x$ and the training data point $x_n$, and $\Phi(\|x - x_n\|)$, $n = 1, \ldots, N$, is the set of arbitrary functions known as radial basis functions. The commonly considered form of $\Phi$ is the Gaussian function, defined as $\Phi(r) = \exp(-r^2 / (2\sigma^2))$. The RBF neural network learning process includes a training phase and a generalization phase. Training proceeds in two stages. In the first stage, the basis function parameters are optimized on the input dataset with the $k$-means algorithm in an unsupervised manner [11]. In the second stage, the hidden-to-output weight matrix is optimized in the least squares sense to minimize the squared error function $E = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{q} [F_k(x_n) - t_{nk}]^2$, where $t_{nk}$ is the desired value for the $k$th output unit when the input to the network is $x_n$. The weight matrix is determined as $W = \Phi^{\dagger} T$ with $\Phi^{\dagger} = (\Phi^T \Phi)^{-1} \Phi^T$, where $\Phi$ is the matrix of basis function activations of size ($N \times M$) for $M$ basis functions, $W$ is a matrix of size ($M \times q$), $\Phi^T$ is the transpose of $\Phi$, $\Phi^{\dagger}$ represents the pseudoinverse of $\Phi$, and $T$ denotes the target matrix of size ($N \times q$). The weight matrix can thus be calculated by a linear matrix inversion technique and used for mapping between the source and target acoustic feature vectors. Exact interpolation with RBFs suffers from two serious problems, namely, (i) poor performance on noisy data and (ii) increased computational complexity. These problems can be addressed by modifying two RBF parameters. The first is the spread factor $\sigma$, which is selected so that the individual RBFs are neither too wide nor too narrow.
The second is an extra bias unit, which is introduced into the linear sum of activations at the output layer to compensate for the difference between the mean over the data set of the basis function activations and the corresponding mean of the targets. Hence, the RBF network used for mapping is $F_k(x) = w_{0k} + \sum_{j=1}^{M} w_{jk}\, \Phi(\|x - \mu_j\|)$, where $\mu_j$ is the centre of the $j$th basis function. In this work, RBF neural networks are initialized and trained, and the best networks are retained to obtain the mapping between the cepstrum based acoustic parameters of the source and target speakers. The trained networks are used to predict the real and imaginary components of the vocal tract and source excitation of the target speaker’s speech signal. In the baseline approach, MCEP based feature matrices of order 25 are formed from the source and target utterances. An RBF network is trained on these matrices, and the best mapping function obtained is used to predict the MCEP parameters of the target speaker’s speech signal.
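
The two-stage RBF training described above can be sketched with NumPy alone. This is an illustrative sketch, not the authors' network: the plain k-means loop, the number of centres, and the spread heuristic $\sigma = d_{\max}/\sqrt{2M}$ (with $d_{\max}$ the largest distance between centres) are assumptions, and the toy sine mapping stands in for aligned source and target feature vectors.

```python
import numpy as np

def kmeans(data, k, n_iter=50, seed=0):
    """Plain k-means for choosing the basis function centres."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def design_matrix(x, centers, sigma):
    """Gaussian activations with an extra column of ones for the bias unit."""
    sq_dist = np.sum((x[:, None] - centers[None]) ** 2, axis=2)
    phi = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.hstack([phi, np.ones((len(x), 1))])

def train_rbf(x, y, n_centers):
    centers = kmeans(x, n_centers)
    pairwise = np.linalg.norm(centers[:, None] - centers[None], axis=2)
    sigma = pairwise.max() / np.sqrt(2.0 * n_centers)   # assumed spread heuristic
    # Least squares output weights via the pseudoinverse of the design matrix.
    weights = np.linalg.pinv(design_matrix(x, centers, sigma)) @ y
    return centers, sigma, weights

def predict_rbf(x, model):
    centers, sigma, weights = model
    return design_matrix(x, centers, sigma) @ weights

# Toy nonlinear mapping standing in for source-to-target features.
x_train = np.linspace(0.0, 2.0 * np.pi, 200)[:, None]
y_train = np.sin(x_train)
model = train_rbf(x_train, y_train, n_centers=10)
y_pred = predict_rbf(x_train, model)
```

For voice conversion, `x_train` and `y_train` would be the DTW-aligned source and target feature matrices, with one such model per feature stream (for example, the real and imaginary parts of the vocal tract and excitation).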

5. Experimental Results

In this paper, the RBF based mapping functions are developed using the CMU-ARCTIC corpus. The corpus consists of sets of 1132 phonetically balanced parallel utterances per speaker, sampled at 16 kHz. It includes two female speakers, CLB (US Female) and SLT (US Female), and five male speakers, AWB (Scottish Male), BDL (US Male), JMK (Canadian Male), RMS (US Male), and KSP (Indian Male) [36]. In this work, we use the parallel utterances of AWB (M1), CLB (F1), BDL (M2), and SLT (F2) in the speaker combinations M1-F1, F2-M2, M1-M2, and F1-F2. For each speaker pair, 50 parallel sentences of the source and target speakers are used for VC system training, and system evaluation is carried out on a separate set of 25 source speaker sentences. The performance of the homomorphic vocoder based VC system is compared with the state-of-the-art MCEP based VC system using different objective and subjective measures.

5.1. Objective Evaluation

The objective measures provide the mathematical analysis for determining the similarity index and quality inspection score between desired (target) and transformed speech signal. In this work, performance index, spectral distortion and formant deviation are considered as objective measures.

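The performance index ($P_{\mathrm{LSF}}$) is computed to obtain a normalized error measure for the different speaker pairs. The spectral distortion between the desired and transformed utterances, $D_{\mathrm{LSF}}(d, \hat{d})$, and the interspeaker spectral distortion between the desired and source utterances, $D_{\mathrm{LSF}}(d, s)$, are used for computing the measure. In general, the spectral distortion between signals $u$ and $v$, $D_{\mathrm{LSF}}(u, v)$, is defined as $D_{\mathrm{LSF}}(u, v) = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{1}{P}\sum_{j=1}^{P}\left(\mathrm{LSF}_u^{i,j} - \mathrm{LSF}_v^{i,j}\right)^2}$, where $N$ represents the number of frames, $P$ refers to the LSF order, and $\mathrm{LSF}_u^{i,j}$ is the $j$th LSF component in frame $i$. The $P_{\mathrm{LSF}}$ measure is given as $P_{\mathrm{LSF}} = 1 - \frac{D_{\mathrm{LSF}}(d, \hat{d})}{D_{\mathrm{LSF}}(d, s)}$. A performance index of $P_{\mathrm{LSF}} = 1$ indicates that the converted signal is identical to the desired one, whereas $P_{\mathrm{LSF}} = 0$ specifies that the converted signal is not at all similar to the desired one.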
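
A normalized performance index of this kind can be sketched as below. The sketch assumes the common definition in which the index is one minus the ratio of the desired-to-converted distortion to the desired-to-source distortion, with the distortion taken as the frame-averaged root mean square difference between LSF vectors; the random matrices are placeholders for time-aligned LSF features.

```python
import numpy as np

def lsf_distortion(lsf_a, lsf_b):
    """Frame-averaged RMS difference between two LSF matrices (frames x order)."""
    return np.mean(np.sqrt(np.mean((lsf_a - lsf_b) ** 2, axis=1)))

def performance_index(desired, converted, source):
    """1 means the converted features equal the desired ones; 0 means no gain over the source."""
    return 1.0 - lsf_distortion(desired, converted) / lsf_distortion(desired, source)

# Placeholder features: 10 frames of order-12 LSFs (assumed sizes).
rng = np.random.default_rng(0)
desired = rng.random((10, 12))
source = rng.random((10, 12))
converted = 0.5 * (desired + source)     # halfway between source and desired
index = performance_index(desired, converted, source)
```

Under these definitions a conversion that moves the source features halfway to the desired ones gives an index of about 0.5.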

In the computation of the performance index, four different converted samples, from the M1 to F1, F2 to M2, F1 to F2, and M1 to M2 combinations, are considered. The comparative performance of the complex cepstrum based VC algorithm and the MCEP based VC algorithm is shown in Table 1. The results show that the complex cepstrum based VC performs better than the MCEP based VC algorithm.

Along with the performance index, other objective measures, namely, deviation ($D_i$), root mean square error (RMSE), and correlation coefficient ($\gamma_{X,Y}$), are also calculated for the different speaker pairs. The deviation parameter is defined as the percentage variation between the actual ($x_i$) and predicted ($y_i$) formant frequencies derived from the speech frames, and the results report the percentage of test frames within a specified deviation. The deviation is calculated as $D_i = \frac{|x_i - y_i|}{x_i} \times 100$. The root mean square error is calculated as a percentage of the average of the desired formant values obtained from the speech segments: $\mathrm{RMSE} = \frac{\sqrt{\frac{1}{N}\sum_{i} |x_i - y_i|^2}}{\bar{x}} \times 100$. The error $e_i = x_i - y_i$ is the difference between the actual and predicted formant values, $N$ is the number of observed formant values of the speech frames, and the parameter $d_i = e_i - \mu$ is the error in the deviation, that is, the departure of the error from its mean $\mu$. The correlation coefficient $\gamma_{X,Y}$ is determined from the covariance $V_{X,Y}$ between the target ($x$) and the predicted ($y$) formant values and the standard deviations $\sigma_X$ and $\sigma_Y$ of the target and the predicted formant values, respectively. The parameters $\sigma$ and $\gamma_{X,Y}$ are calculated using $\sigma = \sqrt{\frac{1}{N}\sum_i d_i^2}$, $\gamma_{X,Y} = \frac{V_{X,Y}}{\sigma_X \sigma_Y}$, and $V_{X,Y} = \frac{1}{N}\sum_i (x_i - \bar{x})(y_i - \bar{y})$. The objective measures, namely, deviation, root mean square error, and correlation coefficient, for M1-F1 and F2-M2 are obtained for the MCEP based VC algorithm and shown in Table 2. Similarly, Table 3 shows the measures obtained for the proposed VC system. From the tables it can be observed that the RMSE between the desired and predicted acoustic space parameters is lower for the proposed model than for the baseline model. However, RMSE alone does not always give strong information about the spectral distortion. Consequently, scatter plots and spectral distortion are additionally employed as objective evaluation measures. The scatter plots of the first, second, third, and fourth formant frequencies for the MCEP based VC and complex cepstrum based VC models are shown in Figures 6 and 7, respectively.
The figures show that the formants predicted by the complex cepstrum based VC lie closer to the formants of the desired speech frames than the formants predicted by the MCEP based VC. The clusters obtained using the complex cepstrum based VC are more compact and more diagonally oriented than those obtained using the MCEP based VC. A perfect prediction would place all the data points in the scatter plot along the diagonal. The compact clusters obtained for the proposed method indicate its ability to capture the formant structure of the desired speaker.
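
The formant-based measures can be sketched as follows. The sketch assumes the usual definitions: deviation as the absolute error relative to the desired value in percent, RMSE expressed as a percentage of the mean desired value, and the Pearson correlation coefficient. The deviation thresholds and the toy formant values are hypothetical.

```python
import numpy as np

def formant_metrics(desired, predicted, thresholds=(2, 5, 10, 20)):
    """Percentage of frames within each deviation, RMSE in percent, and correlation."""
    desired = np.asarray(desired, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    deviation = np.abs(desired - predicted) / desired * 100.0
    within = {th: 100.0 * np.mean(deviation <= th) for th in thresholds}
    rmse_percent = np.sqrt(np.mean((desired - predicted) ** 2)) / desired.mean() * 100.0
    correlation = np.corrcoef(desired, predicted)[0, 1]
    return within, rmse_percent, correlation

# Toy desired and predicted formant values in Hz.
desired_formants = [500.0, 1500.0, 2500.0, 3500.0]
predicted_formants = [505.0, 1450.0, 2500.0, 3600.0]
within, rmse_percent, correlation = formant_metrics(desired_formants, predicted_formants)
```

Here `within[5]` is the percentage of frames whose predicted formant lies within 5% of the desired value, which corresponds to one column of a deviation table.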

The transformed formant patterns for a specific frame of the source and target speech signals are obtained using both the complex cepstrum and MCEP based VC models and are shown in Figures 8(a) and 8(b), respectively. Figure 8(a) shows that the transformed formant pattern closely follows the corresponding target pattern, whereas Figure 8(b) shows that the predicted formant pattern closely follows the target pattern only for the lower formants.

Figure 9(a) shows the normalized frequency spectrogram of desired and transformed speech signals obtained from M1 to F1 and F2 to M2 of complex cepstrum based VC model. Similarly, Figure 9(b) shows the spectrogram for M1 to F1 and F2 to M2 for the MCEP based VC model. It has been observed that the dynamics of the first three formant frequencies in both the algorithms are closely followed in the target and the transformed speech samples.

5.2. Subjective Evaluation

The effectiveness of the algorithm is also evaluated using listening tests. These subjective tests are used to determine the closeness between the transformed and target speech samples. The mapping functions are developed using 50 parallel utterances of the source and target speakers. Twenty-five synthesized speech utterances obtained from the mapping function for inter- and intragender speech conversion, together with the corresponding target utterances, are presented to twelve listeners. The listeners are asked to evaluate the relative performance in terms of voice quality (MOS) and speaker identity (ABX) against the corresponding source and target speaker speech samples. The MOS is rated on a scale of 1 to 5, where rating 5 specifies an excellent match between the transformed and target utterances, rating 1 indicates a poor match, and the other ratings indicate intermediate levels. The ratings given to each set of utterances are used to calculate the MOS for the speaker combinations M1 to F1, M1 to M2, F1 to F2, and F2 to M2; the results are presented in Table 4. The dissimilarity in vocal tract length and intonation patterns between genders is the major reason for the variation in the MOS results for source and target utterances of different genders. The ABX (A: source, B: target, X: transformed speech signal) test is also performed using the same set of utterances and speakers. In the ABX test, the listeners are asked to judge whether the unknown speech sample X sounds closer to the reference sample A or B. The ABX is a measure of identity transformation. A higher ABX percentage indicates that the transformed speech lies closer to the target utterance. The results of the ABX test are also shown in Table 4.

6. Conclusion

A VC algorithm based on the complex cepstrum, which preserves the phase related information content of the synthesized speech, is presented. A mixed phase system is designed to yield a better transformed speech signal than minimum phase systems. The vocal tract and excitation parameters of the speech signal are obtained with the help of low time and high time liftering. Radial basis functions are explored to capture the nonlinear mapping function for modifying the real and imaginary parts of the vocal tract and source excitation of the source speaker’s speech to those of the target speaker’s speech. In the baseline VC algorithm, the MCEP method is used to represent the vocal tract, whereas the fundamental frequency ($F_0$) represents the source excitation. The RBF based mapping function is used to capture the nonlinear relationship between the MCEPs of the source and target speakers, and the statistical mean and standard deviation are used for transformation of the fundamental frequency. The proposed complex cepstrum based VC is compared with the MCEP based VC using various objective and subjective measures. The evaluation results reveal that the complex cepstrum based VC performs slightly better than the MCEP based VC model in terms of speech quality and speaker identity. The reason may be the fluctuation of the MLSA filter parameters, which have limited margins in the Padé approximation; the filter may become momentarily unstable when the parameters vary rapidly. By contrast, the complex cepstrum with finite impulse response is always stable.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.