Abstract

This paper proposes the use of a similarity measure based on information theory called correntropy for the automatic classification of pathological voices. By using correntropy, it is possible to obtain descriptors that aggregate distinct spectral characteristics for healthy and pathological voices. Experiments using computational simulation demonstrate that such descriptors are very efficient in the characterization of vocal dysfunctions, leading to a success rate of 97% in the classification. With this new architecture, the classification process of vocal pathologies becomes much simpler and more efficient.

1. Introduction

In the past few decades, medicine has received noteworthy contributions, which have allowed for great advancement in medical activity within various contexts, for example, improvement of surgery techniques, description of the human genome, or even assistance in medical diagnosis. In particular regarding medical diagnosis, digital signal processing techniques have been employed recently as an efficient, noninvasive, and low-cost tool to analyze vocal signals, with the aim of detecting and classifying alterations in the production of sounds that may be associated with larynx pathologies [1]. The human voice is an important communication tool and any inadequate behavior may have deep implications in an individual’s social and professional life.

Acoustic analysis is a complementary technique to methods based on the direct inspection of the vocal folds, which may reduce the frequency of invasive exams, since such measurements are able to reveal important physiological characteristics of the vocal tract [2]. In general, most of the techniques for the analysis of dysfunctions available in the literature employ several sorts of preprocessing stages in order to extract useful characteristics for the classification of abnormal patterns and to improve classifier performance [2–11].

The existing systems for detection and classification of laryngeal pathologies are phoneme-dependent: the training step uses fixed phonemes, for example, /a/ [2, 3], and diagnosis can be done with any vowel that can be phonated comfortably. Various techniques have been proposed for the extraction of signal features in order to improve the performance of automatic pathological speech detection systems. A method based on MPEG-7 low-level audio descriptors is proposed in [3] for the extraction of features that can be classified by a support vector machine (SVM). Similarly, SVM classifiers have also been used with feature extraction based on the wavelet transform, nonlinear time series analysis, and information theory [4–8]. Mel-frequency cepstral coefficients and linear prediction cepstral coefficients are used as acoustic features in [9], associated with a strategy that combines classifiers based on the Gaussian mixture model, the hidden Markov model (HMM), and SVM. In [10], feature extraction is achieved by using eight measures derived from nonlinear dynamic analysis (correlation dimension, four entropy measures, Hurst exponent, the largest Lyapunov exponent, and the first minimum of the mutual information function), with quadratic discriminant analysis (QDA) as the classifier. The nonlinear Teager-Kaiser energy operator has been used for feature extraction in [11] with a classifier based on a multilayer perceptron (MLP) neural network.

Thus, it is possible to state that all aforementioned methods employ classifiers with complex architectures, for example, MLP, SVM, and HMM, and a complex feature extraction stage, which can restrict their application or even make the technique unfeasible in real-time systems. In addition, most research works address the discrimination between healthy and pathological voices, but not between pathology types.

This paper introduces a new low-complexity approach for the detection and classification of laryngeal pathologies. The system uses an information-theoretic measure known as correntropy to characterize vocal pathologies by means of spectral characteristics and higher-order statistical moments of the vocal signal. The system is characterized by high success rates and low computational complexity, associated with a simple classification stage based on Euclidean distance. The influence of the kernel size on the classification hit rate is analyzed, since it is the only free parameter of the method. Besides the classification between healthy and pathological voices, it is possible to define which pathology is most likely to exist in the voice under analysis.

The paper is organized as follows. Section 2 presents the main concepts related to correntropy. The proposed architecture is described in Section 3. In Section 4, the performance results of the proposed technique are discussed, while Section 5 presents the final considerations.

2. Correntropy

The extraction of information from a database is a frequent and relevant problem in various applications involving signal processing. Within this context, statistical measures of similarity such as correntropy may be successfully used for the extraction of information and data characterization.

Correntropy is a generalization of the correlation measure between random signals, since it is able to extract both second-order and higher-order statistical information from the analyzed signals [12]. In the past few years, this concept has been successfully applied in the solution of various engineering problems, such as time series modeling [13], nonlinearity tests [14], object recognition [15], independent component analysis [16], and automatic modulation classification [17]. Although correntropy is similar to correlation by definition, recent studies have demonstrated that its performance is superior when dealing with nonlinear or non-Gaussian systems without any significant increase of computational cost [12].

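According to [12], correntropy is a similarity measure between random signals $\{x_t, t \in T\}$, where $t$ represents time and $T$ is the set of interest indices, defined as

$$V(t_1, t_2) = E\left[\kappa(x_{t_1}, x_{t_2})\right], \tag{1}$$

where $E[\cdot]$ is the expectation operator and $\kappa(\cdot,\cdot)$ corresponds to any symmetric positive-definite function. It is observed that correntropy is defined as the joint expectation of $\kappa(x_{t_1}, x_{t_2})$. In this work, $\kappa$ is a Gaussian kernel given as

$$\kappa_\sigma(x_{t_1}, x_{t_2}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_{t_1} - x_{t_2})^2}{2\sigma^2}\right), \tag{2}$$

where $\sigma^2$ is the kernel variance and $\sigma$ is referred to as the kernel size. The kernel size may be interpreted as the resolution for which correntropy measures similarity in a space with characteristics of high dimensionality [12].

In fact, by applying a Taylor series expansion to the correntropy measure, (1) can be rewritten as [12]

$$V(t_1, t_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n \sigma^{2n} n!} E\left[(x_{t_1} - x_{t_2})^{2n}\right]. \tag{3}$$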

It is possible to notice that (3) defines the sum of infinite moments of even order, which therefore includes its own conventional covariance. Consequently, correntropy contains information of infinite statistical moments involving its data.

It is interesting to observe that the kernel size in (3) is a parameter that weights the influence of the second-order and higher-order moments. For sufficiently large kernel sizes, the second-order moments are dominant and the measure gets closer to the correlation.
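The weighting role of the kernel size can be illustrated numerically. The following is a minimal sketch, assuming synthetic Gaussian data and hypothetical helper names (`gaussian_kernel`, `correntropy`, `second_order_term`) that are not part of the original work. It compares a sample estimate of correntropy with the Taylor series truncated after the second-order moment, for a small and a large kernel size.

```python
import numpy as np

def gaussian_kernel(d, sigma):
    """Gaussian kernel evaluated on the difference d = x_t1 - x_t2."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def correntropy(x, y, sigma):
    """Sample-mean estimate of E[kappa(x, y)]."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(d, sigma)))

def second_order_term(x, y, sigma):
    """Taylor series truncated after n = 1: only the second-order moment remains."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    scale = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return float(scale * (1.0 - np.mean(d**2) / (2.0 * sigma**2)))

# Synthetic signals, used for illustration only (an assumption, not the voice data).
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

for sigma in (0.5, 50.0):
    v = correntropy(x, y, sigma)
    rel_gap = abs(v - second_order_term(x, y, sigma)) / v
    print(f"sigma={sigma}: relative gap to second-order truncation = {rel_gap:.2e}")
```

For the large kernel size the truncated series is practically indistinguishable from the full measure, whereas for the small one the higher-order moments contribute substantially.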

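When only a finite amount of data is available, that is, $\{x[n]\}_{n=0}^{N-1}$, it is possible to define the estimator of the autocorrentropy function as [18]

$$\hat{V}[m] = \frac{1}{N-m} \sum_{n=m}^{N-1} \kappa_\sigma\left(x[n], x[n-m]\right). \tag{4}$$

The autocorrentropy with an estimator defined by (4) does not ensure zero mean even when the input data are centralized, due to the nonlinear transforms produced by the kernel. However, a centralized estimator for correntropy has been defined in [18], which is given as

$$\hat{U}[m] = \frac{1}{N-m} \sum_{n=m}^{N-1} \kappa_\sigma\left(x[n], x[n-m]\right) - \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} \kappa_\sigma\left(x[n], x[k]\right). \tag{5}$$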
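The estimators (4) and (5) can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions: zero-based sample indexing, normalization of each lag by the number of available sample pairs, and hypothetical function names (`autocorrentropy`, `centered_autocorrentropy`). It is not the authors' MATLAB code.

```python
import numpy as np

def gaussian_kernel(d, sigma):
    """Gaussian kernel evaluated on the difference d between two samples."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def autocorrentropy(x, max_lag, sigma):
    """Average of the kernel over all sample pairs that are m samples apart."""
    x = np.asarray(x, dtype=float)
    n = x.size
    v = np.empty(max_lag + 1)
    for m in range(max_lag + 1):
        # x[m:] holds x[n] and x[:n - m] holds x[n - m], for n = m .. N-1
        v[m] = gaussian_kernel(x[m:] - x[:n - m], sigma).mean()
    return v

def centered_autocorrentropy(x, max_lag, sigma):
    """Autocorrentropy minus the mean of the kernel over all sample pairs."""
    x = np.asarray(x, dtype=float)
    # N x N matrix of pairwise kernel values; memory grows quadratically with N
    pairwise = gaussian_kernel(x[:, None] - x[None, :], sigma)
    return autocorrentropy(x, max_lag, sigma) - pairwise.mean()

# Toy periodic signal: samples two lags apart are identical.
toy = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
print(autocorrentropy(toy, max_lag=2, sigma=1.0))
print(centered_autocorrentropy(toy, max_lag=2, sigma=1.0))
```

The pairwise term requires an N x N matrix, so for long signals it may be preferable to accumulate it in blocks.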

This work explores the spectral properties of centralized autocorrentropy in voice signals in order to detect and classify vocal pathologies. Specifically, the spectral descriptors of healthy and pathological voices used in the classification are extracted from the analyzed signals by means of the correntropy spectral density (CSD) function, defined as [19]

$$P(\omega) = \sum_{m=-\infty}^{\infty} U[m]\, e^{-j\omega m}, \tag{6}$$

where $U[m]$ is the centralized autocorrentropy at lag $m$ and $\omega$ is the digital frequency given in radians. The CSD function may be considered a generalization of the power spectral density (PSD) of a signal.
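In practice the sum in (6) has to be truncated. The sketch below assumes that only lags 0 to M of the centralized autocorrentropy are available, that the sequence is even in the lag (so the sum reduces to cosines), and that the spectrum is sampled on a uniform grid over [0, π). The function name `csd_from_autocorrentropy` and the grid are illustrative choices, not taken from the original work.

```python
import numpy as np

def csd_from_autocorrentropy(u, n_freq):
    """Truncated CSD from one-sided lags u[0..M], assuming u[-m] = u[m].

    Evaluates u[0] + 2 * sum_{m=1..M} u[m] * cos(omega * m)
    at n_freq digital frequencies uniformly spaced in [0, pi).
    """
    u = np.asarray(u, dtype=float)
    lags = np.arange(1, u.size)
    omega = np.pi * np.arange(n_freq) / n_freq
    return u[0] + 2.0 * np.cos(np.outer(omega, lags)) @ u[1:]

# A sequence that is nonzero only at lag 0 yields a flat spectrum.
print(csd_from_autocorrentropy([1.0, 0.0, 0.0], n_freq=4))
```

Because the lag sequence is assumed even, the result is real valued, which keeps the later distance computation simple.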

One of the advantages of using correntropy measures in the classification of vocal signals lies in the robustness of such measures against impulsive noise. This is due to the use of the Gaussian kernel in (5), which is close to zero, that is, $\kappa_\sigma(x[n], x[n-m]) \approx 0$, when $x[n]$ or $x[n-m]$ is an outlier. In addition, the correntropy function extracts information of higher-order statistical moments present in the data, thus potentially increasing the classification efficiency of the proposed technique.

3. Proposed Architecture

The classification method of vocal pathologies proposed in this paper is composed of two stages. The first stage is characterized by the extraction of descriptors for the voice signals based on the CSD defined in (6). The second stage is responsible for the classification of voices by means of a simple Euclidean distance metric.

Since the analyzed voice signals may vary in amplitude, the normalization of all signals in the database is performed. Then, the signals are clustered into two sets: one set with healthy voice signals and another one with pathological voice signals, which contain voices with edema and nodules. After the calculation of the CSD for all signals in each set, the average value of CSD is calculated for each set. The average CSDs of the sets are used as descriptors for the healthy and pathological voices. This procedure is depicted in Figure 1. An equivalent methodology is applied to obtain the descriptors for the voices with edema and nodules. This procedure is detailed in Figure 2.

The CSDs of the healthy and pathological voices and the respective descriptors stored for the classification stage are shown in Figures 3 and 4. It is possible to observe in Figure 3 that the average CSDs in the healthy voices have correntropy and frequency values that are different from those in the average CSDs of pathological voices. On the other hand, it may be observed in Figure 4 that the average CSDs for voices with edema and nodules have different correntropy values, even though the frequencies are similar. In order to decrease computational cost and improve the success rate, the descriptors of different classes considered in this work are constituted by the first fifty samples because there is a clear distinction involving the respective average CSDs of each class within this interval.
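The descriptor extraction described above can be sketched as follows. The text does not state how the amplitude normalization is performed, so peak normalization is assumed here purely for illustration. The function names `normalize` and `class_descriptor` are hypothetical, and the CSD of each training voice is assumed to be already available as a row of an array.

```python
import numpy as np

def normalize(x):
    """Scale a signal to unit peak amplitude (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def class_descriptor(csds, n_keep=50):
    """Average the CSDs of one class and keep only the first n_keep samples."""
    csds = np.asarray(csds, dtype=float)
    return csds[:, :n_keep].mean(axis=0)

# Two made-up CSD rows standing in for the training voices of one class.
toy_csds = np.vstack([np.arange(60.0), np.arange(60.0) + 2.0])
descriptor = class_descriptor(toy_csds)
print(descriptor.shape, descriptor[:3])
```

The same function would be called once per class (healthy, pathological, edema, nodules), each time with the CSDs of the corresponding training subset.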

The classification architecture proposed in this work is presented in Figure 5. Initially, the voice signal to be classified is normalized and its CSD is obtained from (6). Then, the Euclidean distance between the voice CSD and each descriptor for the classes defined as healthy and pathological voices is calculated. The closest descriptor according to the Euclidean distance criterion defines the class of the signal. If the analyzed voice is classified as pathological, it is necessary to apply a new classification stage in order to distinguish between an edema and a nodule.
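The two-stage decision rule can be expressed compactly. The sketch below assumes that the descriptors are stored in dictionaries keyed by class label and that the CSD of the voice under test has already been truncated to the descriptor length. The names `nearest_descriptor` and `classify_voice`, and the toy vectors, are illustrative only.

```python
import numpy as np

def nearest_descriptor(csd, descriptors):
    """Return the label whose descriptor has the smallest Euclidean distance to csd."""
    labels = list(descriptors)
    dists = [np.linalg.norm(np.asarray(csd) - np.asarray(descriptors[k])) for k in labels]
    return labels[int(np.argmin(dists))]

def classify_voice(csd, stage1, stage2):
    """Stage 1: healthy vs. pathological. Stage 2 (only if pathological): edema vs. nodule."""
    label = nearest_descriptor(csd, stage1)
    if label == "pathological":
        return nearest_descriptor(csd, stage2)
    return label

# Made-up three-sample descriptors, for illustration only.
stage1 = {"healthy": np.array([1.0, 0.0, 0.0]), "pathological": np.array([0.0, 1.0, 1.0])}
stage2 = {"edema": np.array([0.0, 1.0, 0.8]), "nodule": np.array([0.0, 1.0, 1.4])}

print(classify_voice(np.array([0.9, 0.1, 0.0]), stage1, stage2))
print(classify_voice(np.array([0.1, 1.0, 1.3]), stage1, stage2))
```

Each decision costs one distance computation per descriptor, which is the source of the low complexity of the classification stage.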

4. Experiments and Results

The proposed architecture has been validated by computer simulation performed in MATLAB. The architecture performance was evaluated according to the Monte Carlo method. For each experiment, at least 100 trials were used. The set of voices for each class is divided into two subsets: one to extract the descriptors and another for the classification test. Cross-validation is used in this case, where 50% of the voices are used for the extraction of descriptors and the remaining 50% are used for test purposes. The evaluation of the architecture is based on the average success rate as well as its maximum, minimum, and standard deviation.
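The evaluation protocol can be sketched as repeated random 50/50 splits. The code below is an assumption-laden illustration in Python rather than the MATLAB code used in the experiments: the split is drawn independently per class in each trial, the descriptors are the class means of the training half, and the function name `monte_carlo_success` and the synthetic features are hypothetical.

```python
import numpy as np

def monte_carlo_success(features, labels, n_trials=100, seed=0):
    """Repeat random per-class 50/50 splits and collect the test success rates."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    rates = []
    for _ in range(n_trials):
        train = np.zeros(labels.size, dtype=bool)
        for c in classes:
            idx = rng.permutation(np.flatnonzero(labels == c))
            train[idx[: idx.size // 2]] = True  # half of each class builds the descriptor
        descriptors = np.stack(
            [features[train & (labels == c)].mean(axis=0) for c in classes]
        )
        test_x = features[~train]
        dists = np.linalg.norm(test_x[:, None, :] - descriptors[None, :, :], axis=2)
        predicted = classes[dists.argmin(axis=1)]
        rates.append(np.mean(predicted == labels[~train]))
    rates = np.asarray(rates)
    return {"mean": rates.mean(), "min": rates.min(), "max": rates.max(), "std": rates.std()}

# Synthetic, well-separated features standing in for CSD descriptors.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 0.1, (10, 5)), rng.normal(5.0, 0.1, (10, 5))])
labs = np.array(["healthy"] * 10 + ["pathological"] * 10)
print(monte_carlo_success(feats, labs, n_trials=20))
```

The returned dictionary mirrors the four statistics used to report the results: average, minimum, maximum, and standard deviation of the success rate.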

4.1. Database

The database used for this work was developed by Massachusetts Eye and Ear Infirmary (MEEI) Voice and Speech [20]. This database has been widely used in international research works, in acoustic analysis of disordered voices, and in the discrimination between healthy and pathological voices. It contains the sustained pronunciation of vowel “a” with 116 files from distinct speakers, where 53 are healthy voices, 43 are voices affected by edema, and 20 are voices with nodules. All used signals have the duration from 1 to 3 seconds, sampling frequency of 25 kHz, and resolution of 16 bits.

4.2. Experiments

Initially, 43 voices affected by edema and 20 voices with nodules are joined in one database called pathological. The first experiment is characterized by extracting the success rate between healthy and pathological voices. Then the success rate is assessed among 43 voices with edema and 20 voices with nodules.

The only adjustable parameter in the architecture is the variance of the Gaussian kernel, that is, the kernel size. Many systems use a heuristic known as Silverman’s rule to determine such variance [21]. However, this work has employed a numerical evaluation method to determine the influence of the kernel size on the accurate classification rate for the proposed architecture. The performed experiments are represented in Figures 6 and 7.
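For reference, the commonly used form of Silverman's rule of thumb can be written in a few lines. This is the heuristic that the present work does not adopt. The sketch assumes the standard form 0.9 min(s, IQR/1.34) N^(-1/5), where s is the sample standard deviation and IQR is the interquartile range. The function name `silverman_kernel_size` is illustrative.

```python
import numpy as np

def silverman_kernel_size(x):
    """Silverman's rule of thumb for a Gaussian kernel bandwidth."""
    x = np.asarray(x, dtype=float)
    std = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(std, (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-0.2)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
print(silverman_kernel_size(sample))
```

The rule depends only on the spread and length of the data, not on the classification task, which is one motivation for evaluating the kernel size numerically instead.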

From the aforementioned experiments, it is possible to determine suboptimal values for the kernel width associated with each class of vocal disease. The kernel size was tested by using a logarithmic scale with values ranging from 0.01 to 10. The kernel adjustment has provided an effective mechanism to eliminate outliers. The right choice of its size may increase the success rate considerably. The amount of samples is also a fundamental factor in the classification rates, according to the results given in Figures 6 and 7.
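A numerical sweep of this kind can be sketched as a grid search on a logarithmic scale. The number of grid points is not given in the text, so 31 points are assumed here. The scoring function is a made-up placeholder for the success rate of the classifier, and `sweep_kernel_size` is a hypothetical name.

```python
import numpy as np

def sweep_kernel_size(score_fn, low=0.01, high=10.0, num=31):
    """Evaluate score_fn on a logarithmic grid of kernel sizes and return the best one."""
    grid = np.logspace(np.log10(low), np.log10(high), num)
    scores = np.array([score_fn(s) for s in grid])
    return grid, scores, grid[int(np.argmax(scores))]

def toy_score(sigma):
    """Placeholder score peaking near 0.77; a real run would return a success rate."""
    return -abs(np.log10(sigma) - np.log10(0.77))

grid, scores, best = sweep_kernel_size(toy_score)
print(f"best kernel size on the grid: {best:.3f}")
```

In an actual experiment the placeholder would be replaced by a function that runs the Monte Carlo evaluation for the given kernel size and number of samples.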

The results shown in Figure 6 demonstrate that, independently of the kernel size, the classifier for healthy and pathological voices yields unsatisfactory results when only a few samples are used. However, when the kernel size is equal to 0.77 and about 1,000 signal samples are used, the classifier achieves a success rate of 93%.

One of the goals of this work is to reduce the complexity associated with the architecture, which is why a classifier based on the Euclidean distance is used. However, such a classifier is sensitive to the sample size, so it is necessary to optimize this parameter. Thus, it is possible to state that using more than 1,000 samples affects the success rates between healthy and pathological voices.

The success rate in the classification between voices with edema and nodules varied from 65% (when the kernel size is 0.31 and the number of samples is 700) to 98% (when the kernel size is 1.77 and 1,300 samples are adopted), according to Figure 7.

After adjusting the architecture with the adequate values for the kernel size and number of samples, the correntropy measure was investigated regarding its capability of classifying and characterizing the statistical independencies of the considered voice signals. Accordingly, correlation is adopted as a reference measure since it can be seen as a particular case of correntropy and is very often used in classification problems. The obtained results are presented in Tables 1 and 2. The average success rate is determined considering 100 trials.

Based on Table 1, it is possible to observe that the architecture based on correntropy is able to distinguish between healthy and pathological voices with a success rate between 91.22% and 95.65% and average rate of 93.81%, while the standard deviation is 1.30%. On the other hand, when an architecture based on correlation is used, the success rate of the classifier is reduced considerably. Besides, it can be stated that the standard deviation of the recognition rate for the proposed architecture is much lower if compared to that for the architecture based on correlation, thus indicating higher reliability of the classifier with correntropy.

From Table 2, it can be seen that the recognition rate between the voices with edema and nodules is considerably higher in the architecture based on correntropy. Once again, the success rate of the proposed architecture is within an acceptable range of values and with low standard deviation.

4.3. Comparison with Existing Methods

Several methods for the detection of vocal pathologies have been proposed in the literature. With the aim of better assessing the classifier developed in this work, the obtained results are compared with those regarding the works in [6, 8, 10], which were also obtained using the MEEI database [20].

The work in [6] considers a transform initially applied to the voice signal to obtain a space of smaller dimension by using singular value decomposition, that is, higher-order singular value decomposition (HOSVD). In the classification stage, a mutual information measure and an SVM are used. The success rate is around 94.1% with a confidence interval of 0.28%.

The classification architecture in [8] is developed from eleven characteristics extracted by means of nonlinear time series analysis, where two are based on conventional nonlinear statistics, two others are based on the analysis of recurrence and fractal scaling, and the rest are obtained from different estimations of the entropy. The achieved success rate is 98% by using an SVM and Gaussian mixture models (GMM) in the classification stage.

Information measures are employed in [10], for example, Shannon entropy, correlation entropy, approximate entropy, Tsallis entropy, Hurst exponent, maximal Lyapunov exponent, and the first minimum of the mutual information function, in addition to LPC (linear prediction coding) coefficients [10]. In the classification process, a quadratic discriminant analysis (QDA) is applied with a success rate equal to about 96.50%.

Thus, it is possible to state that all aforementioned methods employ a complex stage for the extraction of characteristics, with the calculation of a large set of variables that, in general, are sent to a neural network for classification purposes.

On the other hand, the architecture proposed in this paper uses only one extractor defined by the CSD and also a very simple classification stage based on Euclidean distance. Therefore, the introduced strategy presents low computational complexity, which implies simple implementation in real-time embedded systems. Besides, the proposed system presents a high success rate, that is, about 97%.

5. Conclusion

This paper has presented a novel method for the automatic classification of pathological voices based on correntropy spectral density (CSD). It has been demonstrated that CSD is adequate to characterize dynamic interdependencies among the voice signal samples, being able to extract distinct characteristics between healthy and pathological voices. Among the main characteristics of such a method, it is worth mentioning that the classification stage becomes simpler by the use of Euclidean distance, which effectively reduces its computational complexity. From the obtained results, it has been shown that the proposed classifier presents a high recognition rate, which is achieved after a simple adjustment of the kernel size employed by the feature extractor. The proposed method can be used as a valuable tool by researchers and speech pathologists. Future work aims at the development of experiments using other databases and also the implementation of an online diagnosing system.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.