Abstract
Voice biometrics is one kind of physiological characteristics whose voice is different for each individual person. Due to this uniqueness, voice classification has found useful applications in classifying speakers’ gender, mother tongue or ethnicity (accent), emotion states, identity verification, verbal command control, and so forth. In this paper, we adopt a new preprocessing method named Statistical Feature Extraction (SFX) for extracting important features in training a classification model, based on piecewise transformation treating an audio waveform as a timeseries. Using SFX we can faithfully remodel statistical characteristics of the timeseries; together with spectral analysis, a substantial amount of features are extracted in combination. An ensemble is utilized in selecting only the influential features to be used in classification model induction. We focus on the comparison of effects of various popular data mining algorithms on multiple datasets. Our experiment consists of classification tests over four typical categories of human voice data, namely, Female and Male, Emotional Speech, Speaker Identification, and Language Recognition. The experiments yield encouraging results supporting the fact that heuristically choosing significant features from both time and frequency domains indeed produces better performance in voice classification than traditional signal processing techniques alone, like wavelets and LPCtoCC.
1. Introduction
Unlike fingerprints, iris, retina, and facial feature, our voice is a kind of bodily characteristics that is useful in speaker identification but it remains relatively unexplored. Compared to other bodily features, voice is dynamic and complex, in the sense that a speech can be spoken in different languages, different tones, and in different emotions. Voice biometrics plays a central role in many biometrics applications such as speaker verification, authentication, and access control management. Furthermore voice classification potentially can apply to interactivevoiceresponse system for detecting the moods and tones of customers, thereby guessing if the calls are of complaints or complement, for example. More examples of voice classification have been described in our previous work [1] which attempted to classify voice data by using hierarchical timeseries clustering methods. The clustering method only separates voice data into distinct groups without knowing the labels of the groups. Voice classification method trains and tests voice data into classes of known labels.
Voice classification has been studied intensively in the biometrics research community using digital signal processing methods. The signatures of the voice are expressed in numeric values in the frequency domain. There lie considerable challenges in attaining high accuracy in voice classification given the dynamic nature in the speech data, not only the contents within but also the diversity of human vocals and different ways of speeches. In this paper we tackle the classification challenges by modeling human voices as timeseries in the form of stochastic signals. In contrast to deterministic signals that are rigidly periodic, stochastic signals are difficult to be modeled precisely by mathematical functions due to uncertainty in the parameters of the computational equations. Timeseries of voice data are nonstationary, with their statistical characteristics change over time when spoken. As far as human voice is concerned, almost all of them are stochastic and nonstationary, meaning that their statistics are time dependent or time varying.
Given such temporal data properties, human voice that is acquired continually from the time domain would be in the form of random timeseries that often has a single variable (amplitude in loudness) over time. It is believed that the statistical characteristics are changing over time during a speech but they may form some specific patterns, so some inherent information can be derived from the timeseries that are useful for classification. Specifically we adopt a recent preprocessing methodology, called Statistical Feature Extraction (SFX) [2], that can effectively transform a univariate timeseries voice data to a multivariate data while capturing the informative characteristics of the timeseries. It is known that conventional data mining models can be deployed for classifying data with only multiple attributes. Previous work by other researchers who utilized wavelet transformation essentially converted temporal data to the representation of frequency domain format. For voice classification in this paper, elements of both time domain and frequency domain are used for obtaining the statistical characteristics of the timeseries, and subsequently subject to model learning for classification that can be generically implemented by most of the available classification algorithms.
Simulation experiments are carried out over four representative types of voice data or speeches being digitized for validating the efficacy of our proposed voice classification approach based on SFX and metaheuristic feature selection. This type of feature selection will find the optimal subset of features for inducing the classification model with the highest accuracy. The four types of testing data are deliberately chosen with the purpose of covering a wide range of possible voice classification applications, such as Female and Male (FM), Emotional Speech (ES), Speaker Identification (SI), and Language Recognition (LR). Given the multiattributes which are derived from the original timeseries via the preprocessing step, feature selection (FS) techniques could be applied prior to training a classification model. Our results indicate that superior performance could be achieved by using SFX and FS together over the original timeseries for voice classification. The improvements are consistent over the four testing datasets with respect to the major performance indicators.
The rest of the paper is structured as follows: The previous works on classifying voice data are reviewed in Section 2; specifically their timeseries transformation and feature extraction techniques are highlighted. Our proposed voice classification model which converts timeseries voice data to its encoded vector representation via statistical and spectral analysis is described in detail in Section 3. A set of comparative experiments is performed by using four kinds of voice datasets, and they are reported in Section 4. Results that reinforce the efficacy of our new approach are shown in Section 5. The performance evaluation is allrounded by considering accuracy, Kappa statistic, precision, recall, measure, ROC area under curve, and time cost for each dataset. Section 6 concludes this research work and suggests some future works.
2. Related Work
Human voice is stochastic, nonstationary, and bounded in frequency spectrum; hence some suitable features could be quantitatively extracted from the voice data for further processing and analysis. Over the years, different attempts have been made by previous researchers who used a variety of timeseries preprocessing techniques as well as the core classification algorithms for extracting acoustic features from the raw timeseries data. Their performances, however, vary.
2.1. Feature Extraction on Voice Data
Some useful features selected for the targeted acoustic surveillance are [3] weighted average delta energy (), LPC spectrum flatness (), FFT spectrum flatness (), zero crossing rate (), harmonicity (), midlevel crossing rate (), and peak and valley count rate (). The classifier model used by the authors is the sliding window Hidden Markov Model (HMM). They obtained an average error rate at the range of 5%–20%. Peeters discovered more detailed acoustic features for sound description [4]. These features can be roughly grouped into temporal, energy, spectral, harmonic, perceptual, and various features. The limitation is the expensive time and space costs of computation for such full kind of feature extraction.
In the research community of signal processing, the most widely used methods for voice/speech feature extraction are Linear Prediction Coding or Linear Prediction Coefficient (LPC), Cepstral Coefficient or Cepstrum Coefficient (CC), and Mel Frequency Cepstral Coefficient (MFCC). LPC consists of finding a timebased series of pole infinite impulse response (IIR) filters whose coefficients better adapt to the formants of a speech signal. The main idea behind LPC is that a sample of speech can be approximated as a linear combination of past speech samples [5]. The methods for calculating LPCs include covariance method, autocorrelation (Durbin) method, lattice method, inverse filter formulation, spectral estimation formulation, maximum likelihood method, and inner product method [6].
As a general practice of pattern recognition, the final predictor coefficients are never applied because of the high variance. Instead, cepstral coefficients [7] are introduced for transforming the LPC predictor coefficients to those with more robust property. Cepstral coefficients are the inverse Fourier transform representation of the log magnitude of the spectrum. The cepstral series represents a progressive approximation of the envelope of the signal [8]. MFCC offers the best performance within six coefficients (the other five coefficients are Linear Prediction Coefficient, Linear Prediction Cepstral Coefficient, Linear Frequency Cepstral Coefficient, and Reflection Coefficient) [9]. MFCC divided the speech into frames (typically 20 ms for each frame), applied Discrete Fourier Transformation over every frame, retained the logarithm of the amplitude spectrum, smoothed the spectrum, and applied Discrete Cosine Transform [10]. Several modified MFCC methods are shown having better performance in some cases. One of them is weighted MFCC. To reduce the dimensions of feature vector while still retaining the advantages of delta and double delta features, the weighted MFCC coefficients equal the sum of MFCC coefficients, times Delta features and, times double Delta features, where and are weights in real numbers [11]. An enhanced technique for feature recognition using Improved Features for Dynamic Time Warping (DTW) was applied as a classifier; the accuracy was between 85% and 98%. Zhou et al. designed a new KullbackLeibler distance (KLD) based weighting Perceptual Linear Prediction (PLP) algorithm for MFCC. The KLD is defined as the distance of two continuous functions; it is a measure between reality distribution and approximating model . The weight is the reciprocal of this distance [12]. The word error rate was below 25%.
Similar to LPC and MFCC, PLP modifies the shortterm spectrum of the speech by several psychophysically based transformations. The basic steps of PLP contain spectral analysis, criticalband spectral resolution, equalloudness preemphasis, intensityloudness power law, autoregressive modeling, and practical considerations [13]. But PLP is vulnerable when spectral values are modified by the frequency response of the communication channel. Thus, by employing relative spectra filtering of log domain coefficients (RASTA), we make PLP more robust to these distortions [14].
Tsrrneo Nitta used multiple mapping operators to extract topological structures, hidden in time spectrum patterns. Linear algebra is the main technique. KarhunenLoeve transformation and linear discriminant analysis were the feature extraction methods [15]. The error rate was lower than 30%. Lee et al. proposed a new feature extraction method called independent component analysis (ICA). The purpose of an ICA network is to calculate and extract independent components from speech segment by training. Meanwhile, the weight matrix holds the basic function coefficients from the speech segment. One assumption of ICA is that the observation is the linear combination of the independent components [16]. The error rate was 5% at most.
Our proposed method uses both statistical and spectral analysis for extracting all the possible features. Subsequently it selects useful features via a metaheuristic search. The qualified features are then used to reduce the vector dimensionality of training instances for building a classification model. The features from the temporal domain contain richer statistical information than only local maxima and local minima. Our method rides on the observed current trend of fusing information from both time and frequency domains. The merit is that a nonlinear relationship is represented by the spectrum of a spectrum, so only the useful features from the frequency domain in addition to other strong statistical features from the timedomain are encoded into the multidimensional vector which of course is limited in space. Besides, residual and volatility are introduced and embedded into voice classification to produce superior classification result.
2.2. Data Mining Algorithms for Voice Classification
Some recent research tapped on the power of data mining algorithms for performing voice classification in various applications. For instance, a new method is proposed by the research team of Lee et al. [17], for prescribing personalized medicine using vocal and facial features. It is a constitution diagnostic method based solely on the individual’s physical characteristics, irrespective of psychological traits, characteristics of clinical medicine, and genetic factors. They used Support Vector Machine (SVM) on a software package called LIBLINEAR (L2loss SVM dual type) for doing voice classification.
As a contribution to telemedicine in home telemonitoring, Maunder et al. [18] investigated the possibility of automatically detecting the sound signatures of activities of daily living of an elderly patient using nonintrusive and reliable methods. A Gaussian mixture model (GMM) classifier was used to differentiate sound activities. Their experiments yielded encouraging results; with recognition accuracies in the range 70% to 100% can be consistently obtained using different microphonepair positions, under all but the most severe noise conditions.
For biomedical applications, Chenausky et al. made an important contribution in acoustic analysis of Parkinson’s disease (PD) speech [19]. The speech of 10 PD patients and 12 normal controls was analyzed for syllable rate and variability, syllable length patterning, vowel fraction, voiceonset time variability, and spirantization. These were normalized by the controls’ standard deviation to represent distance from normal and combined into a composite measure. A feedback device that was developed from these findings could be useful to clinicians adjusting deep brain stimulation (DBS) parameters, as a means for ensuring they do not unwittingly choose DBS settings which impair patients’ communication.
In our previous work in [1], surveyed several approaches have been studied, such as Artificial Neural Networks (ANN), Support Vector Machines (SVMs), Hidden Markov Models (HMMs), and Gaussian Mixture Models (GMMs). They have been used for training up a classification model with predefined voice samples for voice recognition. A summary of the techniques by which majority of research works used was shown in [1]. In particular, an approach by using unsupervised clustering was described in [1], where priori labeled samples are not required, and the characteristic groupings will be dedicated by the samples themselves. Voiceprints who share similar features will be placed into distinctive groups that represent some labels about the speakers. Subsequently a decision tree (classifier) can be built after studying and confirming the characteristic groups.
Above all the methods a forementioned, encoding techniques from the frequency domains are used as sole features for modeling the voice samples. A single classification algorithm was used specifically for conducting the validation experiment in the literature. In this paper, we advocate combining features from both time and frequency domains, for a throughout coverage of all the voice data characteristics. Then feature selection is used to reduce the dimensionality of the training samples. This way, a minimum subset of relevant features is ensured, and they could be applied into most types of classification models without any limit of a specific type.
3. Proposed Method in Constructing a Voice Classification Model
The SFX preprocessing methodology that is adopted in our research is efficient. Its main merit lies in its ability to transform voice data from onedimensional to multidimensional features. The SFX technique could possibly fit into a standard data mining process, like the one shown in Figure 1. The training dataset in a form of timeseries get converted to multidimensional vectors via the preprocessing process, ready to be used for training a classification model. Given a large dimensionality, ensemble feature selection could be applied over the converted multidimensional vectors for refining the accuracy by retaining only some selected relevant features. In our case, a metaheuristic search method seems to perform very well given its efficient stochastic optimization. Its operational nature is dynamic, suitable for choosing features on the fly, considering that voice data could be potentially continuous.
The model construction process is just a standard classification model learning in data mining; for example, a decision tree is built by creating decision paths that map the conditions of the attribute values, as seen from the training samples, to the predicted classes. Once a classifier is trained by processing through the whole training dataset, it is ready to classify new unseen testing samples, and its performance can be measured. The feature selection process is generalized enough to be an ensemble where the winner takes all. During calibration, several feature selection algorithms are put into test, and the best performing one in our case is Feature Selection with Wolf Search Algorithm (FSWSA) [20]. The other unique contribution by this paper is the extraction of features from the timeseries via piecewise transformation, in addition to the metaheuristic feature selection algorithm. The main difference between our innovation and the others is highlighted in red in Figure 1. We zoom into the details of preprocessing describing the operational flow from data perspective in Figures 2 and 3, respectively, for SFX with and without FS.
In a nutshell, the preprocessing methodology SFX is a way of transforming a twodimensional timeseries (amplitude versus time) into a multidimensional feature vector that has all the essential attributes sufficient to characterize the original timeseries voice data. Information is taken from two domains, frequency and time, based on the original timeseries. Thus there are two groups of preprocessing techniques being used here, namely, LPCtoCC encoding (from the frequency domain), Descriptive Statistics of both whole and piecewise, and Dynamic Time Wrap (from the time domain). It is believed that having features obtained from both domains would yield an improved accuracy from the trained classification model due to thorough consideration of the characteristics, hence the representative features, from both domains.
Effectively the preprocessing methodology SFX transforms a matrix of original timeseries to a set of training instances which have specific attribute values for building a classification model. Assume (shown in Figure 2 after the wave read process) is an archive of timeseries, with each row containing a th timeseries , and is an ordered sequence of variables such that where is the length of the timeseries over different time points and is the number of instances in the data archive .
is then to be transformed to a structured training dataset in which each row is an instance that is defined by a finite number of attributes , such that ) where and . is the th attribute in , is a vector of known target values of ; thus is the th target value to which the attribute values of are able to map. The target labels are assumed to be known priori in (supervised learning), and their values are just carried over from to , instance for instance, by the same order of .
The attributes , however, are obtained from the dual timefrequency domains which can be briefly grouped as where the instance is made of two components that are derived from frequency and time domains, respectively. From the frequency domain alone attributes are extracted, attributes taken from the time domain, and .
3.1. Feature Extraction from the Frequency Domain
Linear Prediction Coefficients to Cepstral the Coefficients, or Linear Prediction Coding to Cepstrum Coefficients (LPCtoCC) is selected as the main feature extraction method from the frequency domain in our case. The common production process of human voice contains the following steps of voice generation: the lungs expel air up, acting as the initial step of voice production. Then the air goes into the trachea, passing through the larynx. The larynx is a boxlike organ and has two membranes named vocal folds. The voice is actually produced by the vibration of those vocal folds [21]. The acoustic theory of voice production assumes the voice production processes to be a linear system. The output of a linear system is produced based on the linear combination of its previous outputs and current and previous inputs [22]. It is the reason that LPC is chosen here for the purpose of encoding the voice data.
Linear prediction calculates future values of a signal in discrete time format based on a linear function of previous samples. It is always called linear prediction coding, which is a common tool widely used in speech processing for representing the spectral envelope of a signal in compressed form [23].
The original timeseries voice data is windowed by multiplying a windowing sequence via a hamming method, such that where is the window size. It predicts the next values of points as a linear combination of previous points’ values. The predicted points with a th order of prediction are as follows: where is linear predictor coefficients of the th order. Figure 4 shows a sample of the predictor coefficients.
The problem of value setting of prediction order determines the characteristics of the vocal filter. If is too low, then key areas of resonance will disappear; if is too high, then characteristics of source are missed. Two complex conjugate poles are needed for characterizing correct formants. Thus, in the signal bandwidth, should be two times of formants number. Suppose is the signal’s sampling frequency, and is usually determined as follows: where is the compensation for glottal rolloff and predictor flexibility, which is normally set to be 2 or 3 [24]. The sampling frequency is usually 10 kHz, so the value of is approximately 12 to 13.
The prediction error generated by this estimate method is the difference between the actual and the predicted values: and we define the error metric for the multidimensional signals as The expected value of the squared error is minimized, yielding the following equation: where is the autocorrelation sequence of signal .
The autocorrelation sequence can then be represented as a matrix in the format of where is a vector that contains elements of , and is the vector of predictor coefficients that holds , for . is known as a Toeplitz Matrix with the size of from which the predictor coefficients can be calculated by inverting the matrix , . Then the predictor coefficients can be used to derive the cepstrum coefficients, , for , which are the required output of LPCtoCC. The cepstrum is defined as the inverse DFT of the magnitude of the DFT of a signal: where is discrete Fourier transform and is inverted discrete Fourier transform.
When a windowed frame is applied on voice data , the cepstrum is The transformation steps are shown clearly in Figure 5.
The cepstrum has a lot of advantages such as orthogonality, compactness, and sourcefilter separation; meanwhile the LPC coefficients are much more susceptible to the precision of numerical numbers, which are less robust than cepstrum coefficients [25]. Thus it is often desirable to transform LPC into CC : Above all, the transformation converts the original timeseries to a linear prediction coefficient vector defined by and then converts this vector to a cepstrum coefficient vector defined by . The cepstrum coefficient vector is ready to form a part of the descriptive features, as where .
3.2. Feature Extraction from the Time Domain
Here we have a feature set that is characterized by a collection of attribute extracted from the timeseries of the voice raw data with respect to the time domain. The statistical attribute extraction method has been commonly used by many researchers in the area of digital signal processing, biosignal analysis, and so forth.
3.2.1. Descriptive Statistics
The extracted statistical features include the following statistics: Mean, Standard Deviation, 1st Quartile, 2nd Quartile, 3rd Quartile, Kurtosis, Interquartile Range, Skewness, RSS (residual sum of squares), Standard Deviation of Residuals, Mean Value of Volatilities, and Standard Deviation of Volatilities. Suppose is a raw voice data with sampling points, is the residual array, and is the volatility array.
Mean: Standard deviation: Quartiles: (see Figure 6).
Kurtosis: A standard normal distribution has the Kurtosis value of three. As the result, the next definition of kurtosis is widely used and it is often known as excess kurtosis: Interquartile range: Skewness: In the statistical analysis of the timeseries data, Autoregressive Moving Average models (ARMA) describes a stationary stochastic process based on two polynomials, one for the Autoregression (AR) and the other for Moving Average (MA) [26]. With the parameter settings this model is usually notated as where is the order of the AR part and is the order of the MA part.
Now we introduce another model for characterizing and modeling observed timeseries: autoregressive conditional heteroskedasticity (ARCH) model. So that in the model, at any time point in this sequence, it will have a characteristic variance.
If an ARMA model is supposed for the build of error variance, then the model is a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model [27]. With the parameter settings this model is usually referred to as the where is the order of the GARCH terms and is the order of the ARCH terms :
We set the parameters of GARCH model with standard values such as the following. Distribution = “Gaussian”; variance Model = “GARCH”; (model order of ) = “1”; (model order of ) = “1”; (autoregressive model order of an model) = “1”.
RSS: Standard deviation of residuals: Mean value of volatilities: Standard deviation of volatilities:
3.2.2. Dynamic Time Warping Distance
Though descriptive statistics may give us the overall summary of timeseries data and characterize a general shape of timeseries data, they may not be able to capture the precise trend movements which are also known as the patterns of evolving lines. In particular we are interested in distinguishing the timeseries which belong to one specific class from those that belong to another class. The difference of trend movements can be estimated by a technique called Dynamic Time Warping.
Dynamic Time Warping (DTW) is an algorithm for measuring similarity between two timeseries in the situation that both have similar shapes but they vary in time step or speed rate. DTW has been applied to many data objects like video, voice, audio, and graphics. Actually, DTW can explain and deal with any ordered set of data points by the format of linear combination [28].
In theory, DTW is most suitable for voice wave patterns because exact matching for such patterns often may not occur, and voice patterns may vary slightly in the time domain. DTW finds an optimal match between two sequences that allows for compressed sections of the sequences. In other words it allows some flexibility for matching two sequences that may vary slightly in speed or time. The sequences are “warped” nonlinearly in the time dimension to determine a measure of their similarity independent of certain nonlinear variations in the time dimension. Particularly suitable DTW is for matching sequences that may have missing information or various lengths, on condition that the sequences are long enough for matching.
Suppose that , represents an instance in timeseries archive , the number of instances in is . , means each class label to which every instance belongs, where is the number of class labels. , is the th target value to which the attribute values of are able to map. , is the number of target values in each class . For any in timeseries archive , the DTW distance of to its own class is defined as Note that the count upper limit is because the DTW distance between and itself is 0 by the definition (they have the exactly same shape). The DTW distance of to another class to which it does not belong is So the number of distance attributes equals the number of , that is, how many classes in total. These distance attributes compose a member of features in , which represents the extracted features of a whole timeseries raw data in time domain. Figure 7 visually illustrates this concept of distance in DTW computation.
3.2.3. Piecewise Transformation
So far along the timedomain, statistics are extracted from the whole piece of the timeseries as well as the similarity in terms of distance between the test timeseries and the mean of its peer group. For a finer level of information, a piecewise transformation is applied which is called Piecewise Linear Function (PLF). A continuous timeseries is converted into a collection of linear segments when PLF is applied on it. The purpose of this compressed expression method is to approximate a polynomial curve into a vector of finite dimensional Euclidean space that consists of quantitative values.
This is the key part of the research work because it contains our new contribution. Inspired by the financial analysis of stock market, residual and volatility are firstly imported in the application field of voice classification. Like historical volatility for one or more stocks over some specified trading days, we also believe that certain patterns of someone’s speech are involved in residual and volatility.
Each sentence is read by wavread function in MATHLAB into a one dimension array as illustrated in Figure 2. The starting and ending points of every timeseries data are just the same as the beginning and ending points of each array, which means that all information is used without any redundancy. The depth of segmentation can be selected arbitrarily but sufficiently by the user. In our experiments, the average length of a sentence is ten words, and each word has a peak correspondingly. The mean length of the sampled timeseries array is 100 k points. Without compromising the resolution and the complexity of feature space, we choose to be 20, thus we can cut a peak into two parts which represents up and down gradients. Then the continuous timeseries voice data is partitioned equally into 20 pieces.
In our experiment, we try to keep the length of every spoken sentence the same, being almost 10 k points after sampling. The number of segmentations is also 20, so each piece maintains at nearly 5 k sampling points. For each segment of the timeseries, certain statistics that describe the trend and dynamics of the movement are extracted into the feature vector, that is, . An example of the timeseries segmentation in normal and stretched view is shown in Figure 8.
(a)
(b)
Using this piecewise method, the features that are being extracted are statistics of each partition of the timeseries. Table 1 shows a list of all statistics that can potentially be harvested from 20 partitions of a particular timeseries. The definitions of the statistics parameters then follow.
For each segment , , is the number of points each segment contains, that is, .
Gradient of : where is RSS of : Standard deviation of residuals of : Mean value of volatilities of : Standard deviation of volatilities of : The model for residual and volatility is also selected as GARCH model, where the parameters of GARCH model are configured the same as previously: mentioned Distribution = “Gaussian”; Variance Model = “GARCH”; (model order of ) = “1”; (model order of ) = “1”; (autoregressive model order of an model) = “1”.
A calibration test is used to determine the optimal choice of the length of each piece (interval) such that the highest classification accuracy can be obtained. Different numbers of intervals have been tried continually for piecewise transformation, extracting the corresponding attributes and running the classifiers. As the results shown in Figure 9, it was found that using 20 segments of each length yields the highest classification accuracy. The test was done preliminarily without FS and the results are averaged over all parameters.
4. Experiment
In order to compare the effectiveness of the proposed timeseries preprocessing method with the other existing methods, we test them on four different voice/speech datasets using nearly twenty popular and traditional classification algorithms in data mining.
4.1. Data Description
Four representative types of voice data are tested by the simulation experiments; they are Female and Male (FM) Dataset, Emotional Speech (ES) Dataset, Speaker Identification (SI) Dataset, and Language Recognition (LR) Dataset.
4.1.1. Data Sources
FM. The FM dataset is downloaded from School of Information Technology and Electrical Engineering (ITEE), University of Queensland, Australia, called VidTIMIT AudioVideo Dataset [29]. The dataset is made up of audio recordings of recited short sentences from 43 volunteers, among which 19 are females and 24 are males. It is from the test section of TIMIT corpus that all those sentences were selected. 10 sentences for every speaker. The first two sentences are all the same for each speaker, with the remaining eight that differ according to every individual. Here only the audio data is concerned and video data is discarded.
ES. The ES dataset comes from the database of German emotional speech, developed at the Technical University, Institute for Speech and Communication, Department of Communication Science, Berlin, with Professor Sendlmeier. It was funded by the German Research Association DFG (research project SE 462/31) [30]. The aim of the database is to examine acoustical correlates of emotional speech. It is comprised of seven basic emotions (anger, happiness, sadness, fear, disgust, boredom, and neutral) and only four major emotions are taken for the purpose of simplification. Ten professional native German actors with balance gender distribution (5 for each) produced these emotional speeches, which containing 10 sentences with 5 short sentences and 5 longer ones.
SI. The SI dataset is taken from the PDA speech database, owned by Yasunari Obuchi in March 2003, Carnegie Mellon University (CMU). The recording was done by CMU students and staff [31]. There recording was done by using one PDA with four small microphones mounted around and one big microphone in the record room. The type of that big microphone was an Optimus Nova 80 closetalk microphone. The type of small ones was Panasonic WM55DC2 and they were mounted using a mockup shown below. There are 16 speakers and each read about 50 sentences.
LR. The LR dataset is generated through an approach called speech synthesis. The speech synthesizer software used here is Microsoft TexttoSpeech engine with many expansion packages [32]. Sentences of English, Cantonese, and Mandarin were widely selected from the area of frequently used daily conversations, daily news, educational reports, stories, scientific articles, ancient proses, poems and poetries, and so forth.
4.1.2. Data Formats
The voice data is in the format of twodimensional timeseries, with an amplitude value in sound that varies over time; examples are given in Figures 8(a) and 8(b). The sampling rate or frequency of wave read process is 10 kHz. Group distributions of distinctive datasets are given in Table 2. The FM dataset has only two classes, which is the simplest classification task in data mining. The rest of datasets contain more than two classes that make the classification task more difficult. The numbers of attributes or features for every dataset and instances for training and testing are listed in Table 3.
4.1.3. Data Visualization
Visualization of parts of each group of the datasets, FM, ES, SI, and LR is displayed in Figures 10(a) to 10(l). Inspecting by just naked eyes, one can see some distinctive differences between the waveforms of different classes.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(j)
(k)
(l)
Multidimensional (MD) visualization of each group of those datasets is shown in Figures 11(a) to 11(b). Again, by just visual inspection, it can be observed that the voice data between different classes are apparently distinctive in the FM group and in the LR group. Common sense tells us that female speakers and male speakers have distinguishing vocal tones. Speeches of different languages also can be differentiated easily, as each language has its unique vowels and phonics. In contrast, the voice data of 16 unique speakers have certain overlaps in their feature values; this implies that some speakers share similar voices which are not something very uncommon in real life. The voice data in the emotion groups are highly mixed together by the feature values. That shows the potential computational difficulty in classification between voices of different emotions.
(a)
(b)
(c)
(d)
4.1.4. Algorithms Used in Comparison
Our experiments are performed by using popular and standard classification algorithms (with their default parameters applied) over the four sets of the abovementioned voice data that are being handled by four preprocessing methods. A total of 20 classification algorithms are being used. The justification is that we try to test the generality of our voice classification model without being attached to any specific classification algorithm. In other words, the design of the voice classification model should be generic enough, and its efficacy should be independent from the choice of classifier. While the focus of the voice classification model is centered at the preprocessing steps which leverage the features from both time and frequency domains followed by feature selection for reducing the feature space dimension, classification algorithms can become flexible plugandplay in our model design. The standard classification algorithms used in our experiments are well known in data mining research community as well as available in Weka (http://www.cs.waikato.ac.nz/ml/weka/), and they are listed in Table 4.
The four preprocessing methods used for comparison are as follows.
LPCtoCC. Only the cepstrum coefficients are used as the encoding result of timeseries voice data. Meanwhile, the LPC coefficients are ignored in final attributes set.
Wavelet. Only the 50largest Harr wavelet coefficients are taken as converting the sequence from time domain to frequency domain. The number of decomposition level of Harr wavelet transform is 3.
SFX. Statistical Feature Extraction (SFX) converts the timeseries voice data to a whole set of attributes with both frequency and time domains, using a collection of feature methods described in Section 3.
SFX + FS. Statistical Feature Extraction + Feature Selection (SFX + FS) is exactly the same as SFX except that the full set of features or attributes were filtered by using different feature reduction methods. Note that it is an ensemble feature selection method, using multiple models to obtain the best performance. Two facts are considered: mean accuracy and time cost. The compensation is made between time and accuracy, which means that we prefer a little bit lower accuracy and more on acceptable time cost. The optimal one was chosen as the final FS method.
WSA. Wolf Search Algorithm (WSA) is a bioinspired heuristic optimization algorithm [33]. It naturally balances scouting the problem space in random groups (breadth) and searching for the solution individually (depth). The pseudocode of WSA is given in Pseudocode 1.

ChiSquare. In statistics, the purpose of chisquare () test is to measure the independence of two events and . From the knowledge of probability and statistics, we know that two events are independent if the probability equation has the following relationships: and or equivalently. In feature selection, let occurrence of the term be event and occurrence of the class be event . We then rank values based on the following quantity [34, 35]: where is the whole set of observations, is the frequency actually found in , and is the expected one. At the same time, means that the document contains term , means that the document does not contain , means that the document is in class , and means that the document is not in .
CFS. An essential assumption is made before going directly into the discussion of Correlation Feature Selection (CFS). It is that good feature subsets always have highly corresponding features, whereas there are uncorrelated features among the rest of them [36]. On the basis of that, CFS starts its work and evaluates features. The merit containing features for a specific feature subset is where represents the average value of all (classification to feature) correlations, and is the mean value of all (feature to feature) correlations. Then CFS is defined as follows: where and variables are correlations just like the aforementioned.
MRMR. Maximum Relevance is normally referred to as subsets of data identified by feature selection which are relevant to the parameters. There often exist relevant but redundant components in those subsets. MRMR, known as Minimum Redundancy Maximum Relevance, however, attempts to detect those redundant subsets, find them out, and delete them. Example application fields of MRMR are but not limited to cancer diagnosis, face detection, autoresponse, and speech recognition.
Suppose , , and to be probabilistic density functions of two random variables and ; then their mutual information is defined as [37] The nature of feature selection in mutual information model is to find a feature set containing features , which also have the largest dependency on the target class . This is the definition of Max Dependency:
MaxRelevance and MinRedundancy are Out of the chosen popular feature selection algorithms that are put into test in the calibration process, we can see that WSA which is a metaheuristic FS algorithm consistently is having superior performance, except for the Speaker Identification dataset which is known for its overlaps in feature values. The testing results are shown in full in Table 5. The computing environment is on a PC workstation, with Windows 7 Enterprise Edition, 64 bits, Intel Core i7 CPU, and 8 GB RAM.
5. Results and Analysis
The objective of our experiments is to compare the performance of those four preprocessing methods on four kinds of voice datasets when a collection of data mining classifiers are applied. Our performance evaluation covers four main aspects: accuracy comparison of datasets; accuracy comparison of preprocessing methods; and overall averaged performance comparison.
Twenty popular classification algorithms were used on FM and LR datasets, which is regarded as a representative set of commonly used classifiers. However, the classifier of LibSVM could not be applied on ES and SI due to their formats. Some attribute data contain infinitely small values. Results from some classifiers are not available because of the time limitation: it takes too much time for them to build a classification model when the number of attributes gets very large. As such, LibSVM is excluded from experiments involving ES and SI. NBTree and Conjunctive Rule are excluded from experiments over the dataset SI. For feature selection, the algorithm candidate that yields the highest accuracy is used in the subsequent experiments.
5.1. Accuracy Comparison of Datasets
The accuracy of the classification result is the most significant criterion for evaluating the performance. It is defined as the percentage of correctly classified instances over the total number of instances. This section shows total accuracies of four preprocessing methods on each voice dataset. Four sets of accuracy results and box plots for different dataset are presented in Figures 12(a) to 12(d).
(a)
(b)
(c)
(d)
From the aforementioned figures we find that the first two preprocessing methods, which are wavelet and LPCtoCC, yielded a relatively nonstationary accuracy result on all four datasets. For LR dataset, wavelet method generated better result than LPCtoCC. Conversely, LPCtoCC was better for FM, ES, and SI. Recalling from Section 4.1.1, we know that only the LR dataset is synthetic, which was produced by a TexttoSpeech engine. LPCtoCC, known as a common voice encoding method, has a problem in obtaining the more realistic components: there are many transition frames that the LPC model fails to sort correctly [38]. Such inaccuracy of the model might be due to annoying artifacts like buzzes and tonal noises. So the performance was relatively worse.
Meanwhile, SFX and SFX + FS showed relatively more stable results than the first two. They really improved the accuracy a lot. By a contrast of SFX and SFX + FS, after feature selection, the main range () of accuracy distribution became narrower and the accuracy results increased.
More evident comparison result is given when the accuracies are averaged out and placed together side by side in a bar chart in Figure 13.
An interesting phenomenon is observed from Figure 13—the accuracy fell a little after SFX compared to LPCtoCC over FM dataset. However, from the methodology of SFX, we know that cepstral coefficients are involved in the attributes of SFX. This indicates that the classification accuracy may decrease when the number of attributes increases due to the redundancy of those unnecessary features [39]. FM has only binary classes; the performances of the preprocessing methods differ very little compared to those in other datasets that have multiple classes. In particular, SI has 16 different classes; the differences of performance between the preprocessing methods become obvious.
Another considerable fact is also derived from Figure 12 on LR dataset—Wavelet seemed to have a better performance than what LPCtoCC did. Besides the drawback of LPC encoding method, we can also consider other reasons. The inherent frequency of one’s speech is an important acoustic feature for identifying different individuals. Other necessary features may include behavioral patterns (such as voice pitch and speaking style) and human anatomy patterns (like the shape of throat). Remember that the result of LPCtoCC only contains 10 cepstral coefficients, and the number of target groups to be classified is 16. It contains too few information for correct classification and wavelet provides relatively sufficient features.
Considering the number of classes in each dataset together with the accuracy result, we can find that the accuracy of binary targets classification (FM) is higher than multiple targets classification (ES) and (SI) for the frequencydomain encoding methods. For the timedomain methods like SFX and SFXFS, good accuracy still can be attained in multiclass classification as in SI where the frequencydomain methods underperform.
Multiclass classification categorizes instances into more than two classes, whereby a hypothesis is constructed to make sure that discriminates can be distinguished between a fixed set of classes. An assumption is made before that, which is closed set and good distribution. If all possible instances belonging to each case fall into one of the classe, and each class contains statistically representative instances, then the performance of classification is good enough. For now, the boundary of every emotion in ES dataset is not clear (which is already shown in Figure 11(b)), so it does not meet the condition of closed set, and the result is worse than FM. For SI and LR, the features of each individual and language are discriminative enough to tell all classes apart, meaning that they are well distributed, so the results are better than FM.
5.2. Accuracy Comparison of Preprocessing Methods
This section shows the accuracies of four datasets when every preprocessing method is applied on them, respectively. Four sets of accuracy results and radar charts by different preprocessing methods are shown in Figures 14(a) to 14(d).
(a)
(b)
(c)
(d)
It can be seen that in general the classification algorithms produce consistent results when wavelet and LPCtoCC preprocessing methods are used. These almost allrounded accuracy results are displayed in Figures 14(a) and 14(b). Comparatively, SFX and SFX + FS yield a jagged outline for the curves of accuracy results in the radar chart, which can be seen in Figures 14(c) and 14(d). Overall, Wavelet and LPCtoCC show lower average accuracy than those in SFX and SFX + FS. Some classifiers produce exceptionally perfect accuracy on all the four datasets after statistical feature extraction and feature selection are applied. They are LMT and Multilayer Perceptron.
The classifier model generated from LMT is a single tree with different shapes on basis of various types of training data. If the data type is numeric, then a binary tree will be built with splits on those attributes; if the type is nominal, then a multisplit tree is the consequence. But the same thing is that the leaves are each logistic regression model which is quite capable for analysis of dataset with dependent features and bounded magnitudes of timeseries. The algorithm is guaranteed that only relevant attributes are selected [40]. The result is much more intelligible and reasonable than a committee of multiple trees on voice classification. So under such kind of circumstance, LMT offers a better result than other tree classifiers.
Multilayer Perceptron is a standard algorithm for any supervised learning task in data mining. The result is relatively better than any other classifiers, achieving almost 100% accuracy but the time cost is higher and sometimes unacceptable. However, some classifiers produce low accuracy, for instance, Naïve Bayes. Based on Bayes’ theorem with strong independence assumptions, Naïve Bayes acts as quite a simple classifier and it gets very widely adopted in many classification situations. But sometimes the relation between any pair of attributes is always dependent and the distribution of features is unknown in advance; thus the performance of such a simple probabilistic classifier is bad and unstable.
5.3. Overall Averaged Performance Comparison
For a throughout performance evaluation, performance consideration of other parameters is considered as well; these include Kappa, Precision, Recall, F1, and ROC, which are commonly used in assessing the quality of the classification models in data mining. These performance indicators are briefly described as follows. The performance results pertaining to these indicators are averaged over all the four datasets and all the 20 classification algorithms. They are then shown in Section 5.3.6 together with the comparison of time cost.
5.3.1. Kappa Statistic
Kappa statistic is widely used to measure variability between multiple observers. The meaning of Kappa statistic is how often multiobservers agree in terms of their interpretations. When two or more evaluators are checking the same data, Kappa statistic is assessed to show an agreement of evaluators when the same data categories are correctly assigned. As well known, simple agreement just between yes and no is poor because of the property of chance and arbitrary. That is why Kappa statistic is introduced and it is preferred [41]. The definition of Kappa statistic is given as follows: where is the relative observed agreement among raters and is the hypothetical probability of chance agreement. When the application is classification, the measure of chance between the classification results and the true classes (labeled categorical data class) is assessed by Kappa statistic. It reflects the reliability of the evaluation of our classifier. Table 6 is the general criterion of evaluating Kappa statistic [42]. A comparison of different voice datasets and different preprocessing methods, in terms of average Kappa statistic, is shown in Figure 15. Wavelet method is relatively unstable in datasets of FM, ES, and SI. The Kappa statistics for LPCCC method are almost the same across different datasets. SFX without FS, however, underperformed when compared to LPCCC in FM and ES datasets which are relatively simple. SFXFS shows its superiority in Kappa statistics in all datasets.
5.3.2. Precision
In pattern recognition and data mining, precision is the fraction of relevantly retrieved instances. In the situation of classifications, the terms positive and negative describe the classifier’s prediction results, and the terms true and false refer to whether the prediction results correspond to the fact or not [43]. This is illustrated by Table 7.
Precision is defined as: Precision is concisely defined as “of all the instances that were classified into a particular class, how many were actually belonged to that class?” In classification task, a perfect precision score for a particular class means that every instance classified into that class does indeed belong to that class (but it says nothing about the number of instances from that class that were not classified correctly). As shown in Figure 16, for example, SFXFS when applied on LR dataset has the maximum precision score 0.88—that means 88% of the instances that are classified into a particular indeed belong to that class. SFXFS for SI has precision score 0.85, for ES has only 0.64, and for FM has 0.73. Wavelet method was unacceptable for all datasets except LR, for it has merely 0.59, 0.42, and 0.25 precision scores, respectively. The comparison with respect to precision scores is shown in Figure 16.
5.3.3. Recall
In pattern recognition and data mining, recall is defined as the fraction of relevantly retrieved instances. We can infer that the same part of both precision and recall is relevance, based on which they all make a measurement. Usually, precision and recall scores are not discussed in isolation and the relationship between them is inverse, indicating that one increases and the other decreases. Recall is defined as In a classification task, recall is a criterion of the classification ability of a prediction model to select labeled instances from training and testing datasets. A recall of score 1.0 means that each instance from that particular class is labeled to this class and all are predicted correctly, and none shall be left out [44]. Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e., the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been). The recall scores defined loosely as “of all the instances that are truly of a particular class, how many did we classify them into that class?” For example, as shown in Figure 17, 86% of instances are classified into the classes and they actually belonged to those classes. Inversely 14% is missed out. Again, the recall scores for Wavelet method are comparatively low except in the LR dataset it exceeds that of LPCtoCC method. Having a low recall score means the classifier is conservative. SFXFS is outperforming the rest of the methods in terms of recall scores. The comparison is shown in Figure 17.
5.3.4. Measure
measure is the harmonic mean of precision and recall, that is, It is also known as balanced score or measure in tradition, because recall and precision are equally weighted. The general formula for measure is As mentioned before, precision and recall scores should be taken into account simultaneously because they have a strong relation essentially. Consequentially, both are combined into a single measure, which is measure. Other complicated combinations of precision and recall include but are not limited to the weighted harmonic mean of precision and recall (), and the geometric mean of regression coefficients, and Informedness and Markedness (Matthews correlation coefficient [45]). In our experiments, we only concern measure. measure is a derived effectiveness measurement. The resultant value is interpreted as a weighted average of the precision and recall. The best value is 1 and the worst is 0. Figure 18 shows a comparison of average measure for different voice datasets and different preprocessing methods. SFXFS shows superior score in datasets SI and LR; it duels with LPCtoCC in simple datasets like FM and ES.
5.3.5. ROC
A Receiver Operating Characteristic (ROC) is generated by plotting True Positive Rate (TPR) verse False Positive Rate (FPR) with many value settings of threshold. It is a graphical plot which illustrates the performance of sensitivity and specificity. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate. A ROC space is defined by FPR and TPR as and axes, respectively, with the coordinate representing the best prediction result. The areaundercurve (AUC) statistic of ROC is commonly used in machine learning and data mining community for model comparison. The AUC is an equivalent and simple replacement of ROC curve.
ROC is useful for gaining insight into the decisionmaking ability of the model—how likely is the classification model to accurately predict the respective classes? The AUC measures the discriminating ability of a classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for datasets with unbalanced target distribution (one target class dominates the other). A comparison in terms of ROC AUC which is normalized to for different voice datasets and different preprocessing methods is shown in Figure 19. Again, they show similar performance results to those in measures. SFX + FS perform equally well in SI dataset and LR dataset with 0.94 AUC; it is slightly higher than SFX and LPCtoCC in FM and ES datasets. Wavelet has the lowest AUC in all datasets except LR where it is better than that of LPCtoCC.
5.3.6. Aggregated Results
The final results that are averaged and aggregated, from the individual results tested by using different datasets and different classification algorithms, are shown as follows. We compare in particular various preprocessing methods against a collection of performance indicators, as in Table 8.
From Table 8, we can reach a conclusion that SFX with FS is indeed the most suitable preprocessing method for all types of voice datasets. It has a higher value across all performance indicators than the rest of the preprocessing methods.
The accuracy and CPU time are evaluated across different feature selection algorithms; the averaged results together with the amount of attributes before and after FS are shown in Table 9.
In Table 9, the first three FS algorithms have been widely used, and the last one is recently proposed by Fong [20]. WSA gives the second fewest number of attributes after feature selection, highest classification accuracy, and a compromising time cost with 31 seconds minimum. So to some extent WSA is a good choice of feature selection if time requirement is not a concern in training up a voice classification model. WSA is done at the cost of incurring extra time in doing the heuristic optimization on the feature subset.
Table 9 shows the overall averaged time cost of each process step applied on different datasets. Piecewise transformation and DTW need much longer time than the other processes due to the computational complexity. The time consumption by piecewise transformation is relatively long especially for complex datasets like SI and LR. Statistic measures are computed for each segment (20x) for each timeseries. WSA works as a stochastic iteration model, which progressively refines the performance and is superior to the other three FS methods but comes at a certain time cost. In contrast the classification model construction times in general are very short, with an average of less than two seconds. Please see Table 10. The total time required for preprocessing voice data for classification ranges from slightly less than an hour to four hours and eighteen minutes, depending on the choice of preprocessing algorithms and complexity of the datasets. Be reminded that the reference of time consumption shown here is for training a classifier based on the given training set; once a classifier is trained, the testing is very fast that takes almost no time. Therefore, a system designer can choose the best performing algorithms in terms of accuracy and other performance quality indicators if the voice classification application is not prone to frequent update of training dataset (that means no need to build the classification model over again), and of course vice versa this implies.
6. Conclusion and Future Works
Human voice is referred to as one of the bodily vital signs that could be measured, recorded, and analyzed as fluctuations of amplitude of sound loudness. Voice classification constitutes to a number of biometrics techniques of which the theories have been formulated, studied, and implemented in practical applications. Traditional classification algorithms from data mining domain, however, require the input of training data to be formatted in a data matrix where the columns represent features/attributes that characterize the voice data, and the rows are the instances of the voice data. Each record must have a verdict known as predicted class for training data. In the literature, mainly the characteristics of voice data are acquired from the frequency domain, for example, LPC, cepstral coefficients, and MFCC. Those popular preprocessing methods have demonstrated significant advantages in transforming voice data which is in the form of timeseries to signatures in the frequency domain. There exist possibilities that some useful attributes can be harvested from the time domain considering the temporal patterns of voice data that are supposedly distinctive from one another. A challenge to overcome is its expensive computational cost of time and large search space in the time domain.
Considering the stochastic and nonstationary nature of human voice, a hybrid data preprocessing methodology is adopted in voice classification in this paper, where combined analysis from both frequency and time domain is included. In particular, a time domain feature extraction technique called Statistics Feature Extraction (SFX) is presented. SFX utilizes piecewise transformation that partitions a whole timeseries into segments and statistics features are extracted subsequently from each piece. Simulation experiments were conducted on classifying four types of voice data, namely, Female and Male, Emotional Speech, Speaker Identification, and Language Recognition into different groups by using SFX and its counterparts (SFX and Feature Selection). The results showed that SFX is able to achieve a higher accuracy in the classification models for the four types of voice data.
The contribution is significant as the new preprocessing methodology can be adopted by fellow researchers that will enable them to build more accurate voice classification model. Besides, the feature selection result proves that a metaheuristic feature selection algorithm called Wolf Search (WSA) can achieve a global optimal feature subset for highest possible classification accuracy. As there is no free lunch in the world, WSA costs considerable amount of computational time.
The precision of piecewise transformation segmentation can be one of the future works. If the number of segments is too large (low resolution in timeseries modeling), then it will lead to the low accuracy of feature extraction; if the window is too small (with very refined resolution), then a lot more computational costs are incurred. Although calibration was done beforehand for calculating the ideal segment length for subsequent processing, this again contributes to extra processing time, and the calibrated result may need to be refreshed should the natures of the voice data evolve. Some dynamic and incremental methods are opted for solving this calibration problem for estimating the correct length of segments. Furthermore the segment lengths can be variables that cope with the level of fluctuation of the voice data, dynamically.
Acknowledgments
The authors are thankful for the financial support from the research grant “Adaptive OVFDT with Incremental Pruning and ROC Corrective Learning for Data Stream Mining,” Grant no. MYRG073(Y2L2)FST12FCC, offered by the University of Macau, FST, and RDAO.