Abstract

Disordered voices are frequently assessed by speech pathologists using perceptual evaluations. This can lead to problems caused by the subjective nature of the process and by the influence of external factors that compromise the quality of the assessment. To increase the reliability of the evaluations, the design of automatic evaluation systems is desirable. With that in mind, this paper presents an automatic system which assesses the Grade and Roughness level of the speech according to the GRBAS perceptual scale. Two parameterization methods are used: one based on the classic Mel-Frequency Cepstral Coefficients (MFCCs), which has already been used successfully in previous works, and another derived from modulation spectra. For the latter, a new group of parameters is proposed, named Modulation Spectra Morphological Parameters: MSC, DRB, LMR, MSH, MSW, CIL, PALA, and RALA. In the methodology, PCA and LDA are employed to reduce the dimensionality of the feature space, and GMM classifiers are used to evaluate the ability of the proposed features to distinguish the different levels. Efficiencies of 81.6% and 84.7% are obtained for Grade and Roughness, respectively, using modulation spectra parameters, while MFCCs achieved 80.5% and 77.7%. The obtained results suggest the usefulness of the proposed Modulation Spectra Morphological Parameters for automatic evaluation of Grade and Roughness in the speech.

1. Introduction

With the aim of diagnosing and evaluating the presence of a voice disorder, clinicians and specialists have developed different assessment procedures [1] such as exploration using laryngoscopic techniques, acoustic analysis, or perceptual evaluations. The latter is widely used by clinicians to quantify the extent of a dysphonia. Some well-known perceptual evaluation procedures are the Buffalo Rating Voice Profile [2], Consensus Auditory Perceptual Evaluation of Voice (CAPE-V) [3], and GRBAS [4]. The main problem with perceptual analysis is the high intra/interrater variability [5, 6] due to the subjectivity of the assessment, in which the experience of the evaluator, his/her physical fatigue, mental condition, and some other factors are involved. Hence, means such as acoustic analysis based on signal processing might be valuable in clinical scenarios, providing objective tools and indices which can directly represent the level of affection or at least help clinicians to make a more reliable and less subjective perceptual assessment. This noninvasive technique can complement and even replace other invasive methods of evaluation.

Besides, most of the many improvements in the field of speech signal processing are aimed at areas such as speech or speaker recognition. Many of these advances are being transferred to biomedical applications for clinical purposes; some recent examples are related to different uses such as telemonitoring of patients [7], telerehabilitation [8], or clinical-support systems [9]. However, a substantial amount of research remains to be done for further enhancements. Roughly speaking, most of the studies in this field can be divided into three main categories: the first one is focused on developing automatic detectors of pathological voices [10–14] capable of categorizing voices as normal or pathological; the second group works with classifiers of pathologies [11, 15, 16], which consist in determining the speech disorder of the speaker using the acoustic material; and the third and last group aims to evaluate and assess the voice quality [8, 17–23]. The present study can be framed in the third category, highlighting the fact that its main goal is the development of new parameterization methods.

The essential common characteristic of all the automatic systems found in the literature is the need to extract a set of parameters from the acoustic signal to accomplish a further classification task. Regarding these parameters, some works use amplitude perturbations such as Shimmer or Amplitude Tremor Intensity Index (ATRI) [24–26] as input features while others are centered on frequency perturbations using Jitter [8, 25, 26], frequency and cepstral-based analysis [8, 13, 14, 16, 17, 27, 28], Tremor Intensity Index (TRI) [24, 26], or Linear Predictive Coding (LPC) [29]. Noise-based parameters [30, 31] and nonlinear analysis features [12, 18, 25, 32] are likewise widely used in this kind of automatic detector. Moreover, other varieties of feature-extraction techniques such as biomechanical attributes or signatures can be applied for the same purposes [10].

Focusing on the third of the aforementioned categories of detectors, those assessing the quality of voice, some are employed for simulating a perceptual evaluation such as GRBAS. For instance, several classification methods were used in [19, 20] to study the influence of the voice signal bandwidth in perceptual ratings and automatic evaluation of the GRBAS Grade (G) trait using cepstral parameters (Linear Frequency Spectrum Coefficients and Mel-Frequency Spectrum Coefficients). Efficiencies up to 80% were obtained using Gaussian Mixture Model (GMM) classifiers and leave-x-out [33] cross-validation techniques. Similar parameterization methods were used in [9] to automatically evaluate with a Back-and-Forth Methodology in which there is feedback between the human experts that rated the database and the automatic detector, and vice versa. In [22] a group of 92 features comprising different types of measurements such as noise, cepstral, and frequency parameters, among others, was used to detect GRBAS Breathiness (B). After a reduction to a four-dimensional space, an efficiency of 77% was achieved using a 10-fold cross-validation scheme. The authors of [34] carried out a statistical study of acoustic measures provided by two commonly used analysis systems, the Multidimensional Voice Analysis Program by Kay Elemetrics and Praat [35], obtaining good correlations for and traits. In [21] Mel-Frequency Cepstral Coefficients (MFCCs) were utilized, obtaining 68% and 63% of efficiency for the Grade and Roughness (R) traits, respectively, using Learning Vector Quantization (LVQ) methods for the pattern recognition stage but without any type of cross-validation technique. The review of the state of the art reports that only [36] has used the same database and perceptual assessment used in the present study. The mentioned work proposed a set of complexity measurements and GMM to emulate a perceptual evaluation of all GRBAS traits, but its performance does not surpass for or .

In general, results seldom exceed 75% of efficiency; hence, there is still room for enhancement in the field of automatic voice quality evaluation. Thus, new parameterization approaches are needed and the use of the Modulation Spectrum (MS) emerges as a promising technique. MS provides a visual representation of sound energy spread along the acoustic and modulation axes [37, 38], supplying information about perturbations related to amplitude and frequency modulation of the voice signal. Numerous acoustic applications use these spectra to extract features from audio signals; some examples can be found in [39–42]. Although there are a few publications centered on the characterization of dysphonic voices using this technique [11, 12, 23, 43], it can be stated that MS has not been studied deeply in the field of the detection of voice disorders and especially as a source of information to determine a patient’s degree of pathology. Some of the referred works have used MS to simulate an automatic perceptual analysis but, to the best of our knowledge, none of them offers well-defined parameters with a clear physical interpretation; they rely instead on transformations of MS which are not easily interpretable, limiting their application in clinical practice.

The purpose of this work is to provide new parameters obtained from MS in a more reasoned manner, making them more comprehensible. The use of this spectrum and associated parameters as support indices is expected to be useful in medical applications since they provide easy-to-understand information compared to others such as MFCC or complexity parameters, for instance. The new parameterization proposed in this work has been used as the input to a classification system that emulates a perceptual assessment of voice following the GRBAS scale in the G and R traits. These two traits have been selected over the other three (Asthenia (A), Breathiness (B), and Strain (S)) since their assessment seems to be more reliable. De Bodt et al. [5] point out that G is the least ambiguously interpreted and that R has an intermediate reliability in its interpretation. These conclusions are coherent with those presented in [44, 45]. Similar findings are revealed in [6] which considers as one of the most reliable traits when using sustained vowel as source of evaluation. It is convenient to specify that each feature of the GRBAS scale ranges from 0 to 3, where 0 indicates no affection, 1 slightly affected, 2 moderately affected, and 3 severely affected voice regarding the corresponding trait. Thus, evaluating according to this perceptual scale means developing different 4-class classifiers, one for each trait.

In this work, the results obtained with the proposed MS-based parameters are compared with a classic parameterization used to characterize voice in a wide range of applications: Mel-Frequency Cepstral Coefficients [46]. MFCCs have been used for speech and speaker recognition purposes for the last two decades, and many works use these coefficients to detect voice pathologies with a good outcome.

The paper is organized as follows: Section 2 develops the theoretical background of modulation spectra features. Section 3 introduces the experimental setup and describes the database used in this study. Section 4 presents the obtained results. Lastly, Section 5 presents the discussions, conclusions, and future work.

2. Theoretical Background

2.1. Modulation Spectra

This study proposes a new set of parameters based on MS to characterize the voice signal. MS provides information about the energy at modulation frequencies that can be found in the carriers of a signal. It is a three-dimensional representation where the abscissa usually represents modulation frequency, the ordinate depicts acoustic frequency, and the applicate, acoustic energy. This kind of representation allows observing different voice features simultaneously, such as the harmonic nature of the signal and the modulations present at the fundamental frequency and its harmonics. For instance, the presence of tremor, understood as low frequency perturbations of the fundamental frequency, is easily noticeable since it implies a modulation of pitch, a usual effect of improper activity of the laryngeal muscles. Other modulations associated with fundamental or harmonic frequencies could indicate the presence of a dysfunction of the phonatory system. Some examples can be found in [11].

To obtain the MS, the signal is filtered using a short-time Fourier transform (STFT) filter bank whose output is used to detect amplitude and envelope. This outcome is finally analyzed using the FFT [47], producing a matrix $E$ where MS values at any point can be represented as $E(f_a, f_m)$. The columns of $E$ (fixed $f_m$) are modulation frequency bands, and the rows (fixed $f_a$) are acoustic frequency bands. Therefore, $a$ can be interpreted as the index of acoustic bands and $m$ as the index of modulation bands, while $f_a$ and $f_m$ are the central frequencies of the respective bands. Since the values of $E$ have real and imaginary parts, the original matrix can be represented using the modulus $|E|$ and the phase $\arg(E)$ of the spectrum. Throughout this work, the MS has been calculated using the Modulation Toolbox library version 2.1 [48]. Different configurations can be used to obtain $E$, where the most significant degrees of freedom are the use of coherent or noncoherent (Hilbert envelope) [49] modulation, the number of acoustic bands, and the acoustic and modulation frequency ranges. The three-dimensional phase unwrapping techniques detailed in [50] are used to solve the phase ambiguity problems which appear when calculating $\arg(E)$.

Figure 1 shows an example of MS extracted from two different voices, in which the voice of a patient with gastric reflux, edema of the larynx, and hyperfunction exhibits a more spread modulation energy in comparison to a normal voice.

However, one of the principal drawbacks of MS is that it provides a large amount of information that cannot be easily processed automatically due to limitations of the existing pattern recognition techniques and of the available voice disorder databases. In this sense, MS matrices have to be processed to obtain a more compact but sufficiently precise representation of the speech segments. Thus, after obtaining the MS, some representative parameters are extracted to feed a further classification stage. With this in mind, a new group of Morphological Parameters based on MS is proposed in this work: centroids [51] (MSC), dynamic range per band (DRB), Low Modulation Ratio (LMR), Dispersion Parameters (CIL, PALA, and RALA), Contrast (MSW), and Homogeneity (MSH). All these parameters use the MS modulus as input source, except the last two, which also use the phase.

2.1.1. Centroids (MSC) and Dynamic Range per Band (DRB)

Centroids provide cues about the acoustic frequency that represents the central energy or the energy center at each modulation band. To obtain MSC, the MS modulus is reduced to an absolute number of modulation bands usually ranging from 4 to 26, each containing information about the modulation energy in that band along the acoustic frequency axis. Once the reduced MS is computed, centroids are calculated following an expression in which $f_a$ and $f_m$ represent the central frequency of the acoustic and modulation bands, respectively, and $f_0$ is the pitch frequency.

As an example, Figure 2 depicts a representation of the MSC extracted from an MS.

Once MS is reduced to a small number of modulation bands, the dynamic range is calculated for every band (DRB) as the difference between the highest and the lowest levels in the band. These parameters provide information about the flatness of the MS depending on the modulation frequency.
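The reduction and the two per-band features described above can be illustrated with a short sketch. It is not the expression used in this work: it assumes that the centroid is the modulus-weighted mean of the acoustic frequency in each reduced modulation band, normalized by the pitch frequency, and that the dynamic range is measured in dB. The function names, the averaging used to reduce the bands, and the small constant `eps` are hypothetical choices made only for illustration.

```python
import numpy as np

def reduce_modulation_bands(ms_modulus, n_bands):
    """Average adjacent modulation columns so that n_bands columns remain.

    ms_modulus has shape (acoustic bands, modulation bins).
    Averaging is an assumption; other pooling rules are possible.
    """
    groups = np.array_split(np.arange(ms_modulus.shape[1]), n_bands)
    return np.stack([ms_modulus[:, g].mean(axis=1) for g in groups], axis=1)

def centroids_and_dynamic_range(ms_modulus, acoustic_freqs, pitch_hz, n_bands, eps=1e-12):
    """Return (MSC, DRB), one value per reduced modulation band."""
    acoustic_freqs = np.asarray(acoustic_freqs, dtype=float)
    reduced = reduce_modulation_bands(np.asarray(ms_modulus, dtype=float), n_bands)

    # Assumed centroid: modulus-weighted mean acoustic frequency, in units of pitch.
    weights = reduced / reduced.sum(axis=0, keepdims=True)
    msc = (acoustic_freqs[:, None] * weights).sum(axis=0) / pitch_hz

    # Dynamic range: highest minus lowest level in each band (assumed in dB).
    level_db = 20.0 * np.log10(reduced + eps)
    drb = level_db.max(axis=0) - level_db.min(axis=0)
    return msc, drb
```

With a perfectly flat spectrum, every centroid falls at the mean acoustic frequency and every dynamic range is 0 dB, which matches the interpretation of DRB as a flatness indicator.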

2.1.2. Low Modulation Ratio (LMR)

LMR, expressed in dB, is the ratio between the energy in the first modulation band at the acoustic frequency of the pitch, $E_0$, and the global energy in all modulation bands covering at least from 0 to 25 Hz at the same acoustic frequency, $E_{25}$. These bands are represented in Figure 3. Its calculation is carried out according to the following expression:

$$\mathrm{LMR} = 10 \log_{10} \left( \frac{E_0}{E_{25}} \right),$$

where both energies are evaluated at the acoustic band of index $a_p$, which includes the pitch frequency, and $E_{25}$ accumulates the modulation bands up to index $m_{25}$, the modulation band including 25 Hz.

The 0–25 Hz band has been selected to represent all possible cases of tremor and low frequency modulations around pitch frequency [52, 53].
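As a rough illustration of the LMR definition, the following sketch computes the ratio from an MS modulus matrix. It assumes that energy is the squared modulus, that the pitch band is the acoustic band whose central frequency is closest to the pitch, and that the 0–25 Hz range includes every modulation band whose central frequency does not exceed 25 Hz. These choices and all names are assumptions, not details taken from this work.

```python
import numpy as np

def low_modulation_ratio(ms_modulus, acoustic_freqs, modulation_freqs,
                         pitch_hz, max_mod_hz=25.0):
    """LMR in dB for one frame.

    ms_modulus: array of shape (acoustic bands, modulation bands).
    modulation_freqs: ascending central frequencies, starting at 0 Hz.
    """
    ms_modulus = np.asarray(ms_modulus, dtype=float)
    acoustic_freqs = np.asarray(acoustic_freqs, dtype=float)
    modulation_freqs = np.asarray(modulation_freqs, dtype=float)

    # Acoustic band assumed to contain the pitch: the closest central frequency.
    a_p = int(np.argmin(np.abs(acoustic_freqs - pitch_hz)))
    # Number of modulation bands with central frequency <= max_mod_hz.
    n_low = int(np.searchsorted(modulation_freqs, max_mod_hz, side="right"))

    energy = ms_modulus[a_p, :] ** 2      # assumed energy: squared modulus
    e_0 = energy[0]                       # first modulation band
    e_25 = energy[:n_low].sum()           # all bands from 0 to 25 Hz
    return 10.0 * np.log10(e_0 / e_25)
```

Because the first band is part of the 0–25 Hz sum, this sketch never returns a positive value; values close to 0 dB indicate little low frequency modulation around the pitch.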

2.1.3. Contrast and Homogeneity

Representing MS (modulus or phase) as two-dimensional images reveals that pathological voices usually have more complex distributions. Images related to normal voices are frequently more homogeneous and present less contrast, as can be seen in Figure 1. Accordingly, Homogeneity and Contrast are used as two MS features since they provide information about the existence of voice perturbations.

Homogeneity is computed using the Bhanu method, as stated in [54], with MSH being the MS Homogeneity value, obtained from the modulation spectra computation (modulus or phase) at each point $(f_a, f_m)$ and from the average value in a window centered at the same point.

Contrast is computed using a variation of the Weber-Fechner contrast relationship method, as stated in [54], which relates the MS value at each point $(f_a, f_m)$ to the values at its vertical and horizontal adjacent points. The global MSW is the sum of the contrast over all points divided by the total number of points, which normalizes the value.

The MS used to calculate MSH and MSW at each point of the matrix is represented in Figure 3.

2.1.4. Dispersion Parameters

As MS differs between normal and pathological voices, changes in the histograms of the MS modulus reflect the effects of a dysfunction in a patient’s voice. A quick look at the MS shows that voices with high G and R traits usually have a larger number of points with levels above the average value of the MS modulus. The level of these points can be interpreted as the dispersion of the energy present in the central modulation band (0 Hz) towards the side bands with respect to the case of a normal voice.

With this in mind, three Morphological Parameters are proposed to measure such dispersion effect: Cumulative Intersection Level (CIL), Normalized Number of Points above Linear Average (PALA), and Ratio of Points above Linear Average (RALA). CIL is the intersection between the increasing and decreasing cumulative curves of the histogram. The histogram is computed from the MS modulus in logarithmic units (dB). As shown in Figure 4, CIL tends to be higher in pathological than in healthy voices. In that case, the difference is 19 dB. On the other hand, PALA is the number of points in the MS modulus which are above the average (linear units) divided by the total number of points of the MS. RALA is quite similar to PALA, but in this case it represents the ratio between the number of points in the MS modulus which are above the average and the number of points which are below this average, instead of the total number of points. Calculation of PALA and RALA is detailed in the following expressions:

$$\mathrm{PALA} = \frac{NA}{NT}, \qquad \mathrm{RALA} = \frac{NA}{NB},$$

where NA is the number of points above the MS modulus average $\overline{|E|}$, NB the number of points below $\overline{|E|}$, and NT the total number of points in $|E|$.

In the cases in which the number of points above linear average increases, the difference between PALA and RALA increases too as the denominator in PALA stays constant and the denominator in RALA decreases. Figure 5 represents these points in a healthy and a pathological voice. It is noticeable that, as expected, the MS of dysphonic voices presents more points above the modulus average.
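The two ratios can be sketched in a few lines, following their verbal definition: points are counted above and below the linear average of the MS modulus. The function name is hypothetical, and returning infinity when no point lies below the average is an assumption made only to keep the sketch well defined.

```python
import numpy as np

def pala_rala(ms_modulus):
    """Return (PALA, RALA) for an MS modulus matrix in linear units."""
    ms_modulus = np.asarray(ms_modulus, dtype=float)
    average = ms_modulus.mean()

    n_above = int(np.count_nonzero(ms_modulus > average))   # NA
    n_below = int(np.count_nonzero(ms_modulus < average))   # NB
    n_total = ms_modulus.size                                # NT

    pala = n_above / n_total
    # Assumption: RALA is reported as infinity when no point lies below the average.
    rala = n_above / n_below if n_below > 0 else float("inf")
    return pala, rala
```

For a matrix with a single dominant point, such as `[[1, 1], [1, 9]]`, one of four points exceeds the average, so PALA is 1/4 while RALA is 1/3, which shows how the two denominators separate the values.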

3. Experimental Setup

3.1. Database

The Kay Elemetrics Voice Disorders Database recorded by the Massachusetts Eye and Ear Infirmary Voice Laboratory (MEEI) was used for this study [55] due to its commercial availability. The database contains recordings of the phonation of the sustained vowel /ah/ (53 normal, 657 pathological) and utterances corresponding to continuous speech during the reading of the “Rainbow passage” (53 normal, 661 pathological). The sampling frequency of the recordings is 25 kHz with a bit depth of 16 bits. From the original number of speakers recorded in the database, a first corpus of 224 speakers was selected according to the criteria found in [56], henceforward named the original subset. The utterances corresponding to the sustained vowel and the continuous speech recordings were used to rate G and R for each patient according to the GRBAS scale. The degree of these traits has been estimated three times by two speech therapists. One of them evaluated the whole database once, and the second one performed the assessment twice in two different sessions. In this study, only the sustained vowels are considered. With the aim of obtaining more consistent labels, two reduced subsets of 87 and 85 audio files for G and R, respectively, were considered. Those files are chosen from the initial corpus of 224 recordings by selecting only those whose labeling was in total agreement for the three assessments, making up the G and R agreement subsets. This reduction was performed to avoid modeling the inter/intrarater variability inherent to the process of assigning perceptual labels to each speaker. In any case, all tests were performed for the three subsets to provide evidence about such reduction. Some statistics of the database are shown in Table 1.

With the aim of sharing relevant information and promoting a more reliable comparison of techniques and results, the names of the recordings extracted from the MEEI corpus that were used for this study, along with their respective G and R levels, are included in Appendix, Table 6.

3.2. Methodology

One of the purposes of this work is to test a new source of parameters to characterize voice perturbations by replicating clinicians’ G and R perceptual evaluations. To quantify the contribution of this new approach, a baseline parameterization has been established for comparison with the novel one. Consequently, all tests are performed using the parameters of the baseline system (MFCCs) and the MS Morphological Parameters. A large number of tests were carried out to find the best setting, modifying the number of centroids or the frame duration among other degrees of freedom.

The methodology employed in this paper is shown in Figure 6, and each of its stages is explained next. Basically, it is the classical supervised learning arrangement, which can be addressed using either classification or regression techniques. For the sake of simplicity and to concentrate on the novel parameterization approach, a simple Gaussian Mixture Model (GMM) classification back-end was employed to recognize the presence of the perturbations in the voice signal which presumably would produce high levels of G and R during perceptual analysis.

3.2.1. Characterization

Two parameterization approaches are considered in this study: MFCCs and MS Morphological Parameters. The MFCCs are the basis of the baseline system and were used for comparison due to their wide use in speech technology applications.

The MFCCs are calculated following a method based on the human auditory perception system. The mapping between the real frequency scale (Hz) and the perceived frequency scale (mels) is approximately linear below 1 kHz and logarithmic for higher frequencies. In this work MFCCs are estimated using a nonparametric FFT-based approach. Coefficients are obtained by calculating the Discrete Cosine Transform (DCT) over the logarithm of the energy in several frequency bands. The bandwidth of each band depends on the central frequency of its filter: the higher the frequency, the wider the bandwidth. To obtain these parameters, a typical setup of 30 triangular filters and cepstral mean subtraction was used. Their computation is carried out over speech segments framed and windowed using Hamming windows with 50% overlap. The duration of the frames ranges from 20 to 100 ms in 20 ms steps. For the sake of comparison, the number of MFCCs ranges from 10 to 22 coefficients. The 0th-order cepstral coefficient is removed.
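The frequency warping behind the filter bank can be sketched as follows. The constants 2595 and 700 belong to one widely used mel formula and are an assumption here, since the text does not state which variant was employed; the 0–12.5 kHz span used in the usage note is likewise only an illustrative choice based on the 25 kHz sampling rate.

```python
import math

def hz_to_mel(f_hz):
    """Common mel mapping: near-linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_centres(n_filters, f_min_hz, f_max_hz):
    """Central frequencies (Hz) of n_filters filters equally spaced in mels."""
    low, high = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (k + 1)) for k in range(n_filters)]
```

Calling `mel_filter_centres(30, 0.0, 12500.0)` yields centres whose spacing in Hz grows with frequency, which is the reason why the filters become wider at higher frequencies.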

Regarding the MS Morphological Parameters, each signal is also framed and windowed using Hamming windows with 50% overlap. The window lengths are varied in the range of 20–200 ms in 20 ms steps. The feature vector extracted from MS is composed of the following: MSC, DRB, LMR, MSW, MSH, CIL, PALA, and RALA. The number of bands used to obtain the centroids and dynamic range features is varied with a step size of 2. Considering that MSW and MSH provide two features each (one for the modulus and another for the phase), the feature vector corresponding to each frame ranges from 20 to 44 values before using data reduction techniques. Both coherent and noncoherent (Hilbert envelope) modulation were tested separately. The acoustic frequency span [0–12.5 kHz] is divided into 128 bands, and the maximum modulation frequency is varied from 70 to 500 Hz to allow different configurations during tests.

In addition, the first derivative ($\Delta$) and second derivative ($\Delta\Delta$), representing the speed and acceleration of the changes of every characteristic, are added to the features in order to include interframe attributes [46]. The calculation of $\Delta$ and $\Delta\Delta$ was carried out employing finite impulse response filters, using a length of 9 samples to calculate $\Delta$ and 3 in the case of $\Delta\Delta$.
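A minimal sketch of such derivative features is given below. It assumes the usual regression-style antisymmetric FIR filter, repetition of the first and last frames at the edges, and that the second derivative is obtained by applying the 3-sample filter to the first derivative. None of these details is specified in the text, so they should be read as assumptions.

```python
import numpy as np

def delta(features, filter_length):
    """Frame-wise derivative of features with shape (n_frames, n_features).

    filter_length must be odd (9 and 3 are the lengths mentioned in the text).
    """
    features = np.asarray(features, dtype=float)
    half = filter_length // 2
    n_frames = features.shape[0]
    # Repeat edge frames so every frame has `half` neighbours on each side.
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")

    out = np.zeros_like(features)
    for k in range(1, half + 1):
        ahead = padded[half + k: half + k + n_frames]
        behind = padded[half - k: half - k + n_frames]
        out += k * (ahead - behind)
    return out / (2.0 * sum(k * k for k in range(1, half + 1)))

def add_derivatives(features):
    """Stack static, first-derivative, and second-derivative features."""
    d1 = delta(features, 9)
    d2 = delta(d1, 3)       # assumption: second derivative computed from the first
    return np.hstack([np.asarray(features, dtype=float), d1, d2])
```

For a feature that grows by one unit per frame, the first derivative is 1 away from the edges and the second derivative is 0, as expected from a slope estimator.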

All these features are used to feed a subsequent classification phase in two different ways depending on the test: some experiments are accomplished using features as they are obtained, and others use a reduced version to relieve the curse of dimensionality effect.

In the dimensionality reduction stage, PCA [57] and LDA [58] techniques are used varying the dimension of the feature vectors used for classification. In the case of LDA, all feature vectors are reduced to a 3-dimensional space. Concerning PCA, reduction ranges from 80 to 95%. With respect to these techniques, only the training data set is used to obtain the models which are employed to reshape all the data: training and test data sets. This process is repeated for every iteration of the GMM training-test process carried out for validation. The dimensionality reduction is applied for both MS Morphological Parameters and MFCCs features with and without derivatives separately.
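The key point of this stage, fitting the projection on the training data only and then applying it to both sets, can be sketched with a plain SVD-based PCA. Interpreting a reduction of 80 to 95% as the fraction of dimensions removed is an assumption, and the function names are hypothetical.

```python
import numpy as np

def components_to_keep(n_features, reduction):
    """Dimensions kept for a given reduction fraction (e.g. 0.8 removes 80%)."""
    return max(1, int(round(n_features * (1.0 - reduction))))

def fit_pca(train, n_components):
    """Estimate the PCA mean and projection basis from training data only."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(data, mean, components):
    """Project any data set (training or test) with the trained model."""
    return (np.asarray(data, dtype=float) - mean) @ components.T
```

In a cross-validation loop, `fit_pca` would be called once per fold on the training files, and `apply_pca` would then transform both the training and the held-out data, so that no information from the test file leaks into the projection.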

3.2.2. Validation

Following the characterization, a Leave-One-Out (LOO) cross-validation scheme [33] was used for evaluating the results. In this scheme one file is considered for testing and the remaining files of the database are used as training data, generating what is called a fold. As a result, there are as many folds as files, and each of them provides a classification accuracy. The global result for a certain parameterization experiment is the average of the results in all folds. In spite of its higher computational cost, this cross-validation technique has been selected instead of less costly alternatives such as $k$-fold [59] due to its suitability in view of the reduced number of recordings contained in the agreement subsets.
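The fold generation and the averaging of results can be sketched as follows. The callback `train_and_predict` is a hypothetical placeholder for the whole chain of dimensionality reduction, GMM training, and decision, and the sketch assumes one decision per file.

```python
def leave_one_out(n_files):
    """Yield (training indices, test index) for every fold."""
    for test_index in range(n_files):
        train_indices = [i for i in range(n_files) if i != test_index]
        yield train_indices, test_index

def loo_efficiency(labels, train_and_predict):
    """Average accuracy over all folds.

    train_and_predict(train_indices, test_index) must return the label
    predicted for the held-out file (hypothetical callback).
    """
    hits = 0
    for train_indices, test_index in leave_one_out(len(labels)):
        predicted = train_and_predict(train_indices, test_index)
        hits += int(predicted == labels[test_index])
    return hits / len(labels)
```

With 87 files, this produces 87 folds, each training on 86 files, which explains the higher computational cost compared with a scheme that uses only a handful of folds.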

3.2.3. Classification

The features extracted during the parameterization stage are used to feed the classifier, which is based on the Gaussian Mixture Model (GMM) paradigm. Given a data vector $\mathbf{x}$ of dimension $d$ resulting from the parameterization stage, a GMM is a model of the probability density function defined as a finite mixture of $M$ multivariate Gaussian components of the form:

$$p(\mathbf{x} \mid \Theta) = \sum_{i=1}^{M} \lambda_i \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),$$

where $\lambda_i$ are scalar mixture weights, $\mathcal{N}(\cdot)$ are Gaussian density functions with mean $\boldsymbol{\mu}_i$ of dimension $d$ and covariances $\boldsymbol{\Sigma}_i$ of dimension $d \times d$, and $\Theta = \{\lambda_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}$ comprises the abovementioned set of parameters that defines the class to be modeled. Thus, for each class to be modeled (i.e., values of the G and R perceptual levels: 0, 1, 2, or 3), a GMM is trained. $\Theta$ is estimated using the expectation-maximization (EM) algorithm [60]. The final decision about the class that a vector belongs to is taken by establishing, for each pair of classes $i$, $j$, a threshold over the likelihood ratio (LR), which in the logarithmic domain is given by

$$\mathrm{LR} = \log p(\mathbf{x} \mid \Theta_i) - \log p(\mathbf{x} \mid \Theta_j).$$

The threshold is fixed at the Equal Error Rate (EER) point.
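The scoring step can be illustrated with a small sketch that evaluates the log-likelihood of one vector under two already trained GMMs and takes their difference. Diagonal covariances are assumed purely to keep the example short; the text does not state which covariance structure was used, and the function names are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | model) for a diagonal-covariance GMM.

    weights: (M,), means: (M, d), variances: (M, d) with positive entries.
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = x - means
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
    log_comp = np.log(weights) + log_norm - 0.5 * (diff ** 2 / variances).sum(axis=1)

    # Log-sum-exp over components for numerical stability.
    peak = log_comp.max()
    return peak + np.log(np.exp(log_comp - peak).sum())

def log_likelihood_ratio(x, model_i, model_j):
    """Log-domain likelihood ratio between two class models."""
    return gmm_log_likelihood(x, *model_i) - gmm_log_likelihood(x, *model_j)
```

A vector is assigned to class i rather than class j when this ratio exceeds the threshold chosen for that pair of classes.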

In this stage, the number of Gaussian components of the GMM was varied from 4 to 48. The assessment of the classifier was performed by means of efficiency and Cohen’s Kappa Index ($\kappa$) [61]. This last indicator provides information about the agreement between the results of the classifier and the clinician’s perceptual labeling.
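Both indicators can be derived from a confusion matrix, as in the following sketch. It uses the standard unweighted definition of Cohen's kappa, with rows taken as the perceptual labels and columns as the classifier decisions; whether a weighted variant was used in this work is not stated, so the unweighted form is an assumption.

```python
def efficiency(confusion):
    """Fraction of correctly classified items (trace over total)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def cohen_kappa(confusion):
    """Unweighted Cohen's kappa for a square confusion matrix."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = efficiency(confusion)
    # Agreement expected by chance from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)
```

Unlike efficiency, kappa discounts the agreement that a classifier would reach by chance, which matters when the classes are unevenly represented.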

4. Results

The best results obtained for each type of test can be observed in Table 2, which arranges the outcomes according to the type of characterization, dimensionality reduction, and database subset used. All tests were performed using the aforementioned sets of the database with and without PCA and LDA techniques. Table 3 shows the outcomes when adding first and second derivatives to the original parameterizations before dimensionality reduction. All results are expressed in terms of efficiency and Cohen’s Kappa Index. For the sake of simplicity, only results obtained with the third labeling of the original subset are shown, corresponding to the columns for that labeling in Appendix, Table 6.

Concerning the G trait, the absolute best result (81.6%) is obtained in the agreement database, using MS + $\Delta$ in 140 ms frames, 22 centroids, Hilbert envelope, 240 Hz as maximum modulation frequency, dimensionality reduction through PCA (93% reduction), and a GMM with 4 components. For MFCCs, the best results are obtained using MFCCs + $\Delta$ + $\Delta\Delta$, 22 coefficients, PCA, 20 ms frames, and a GMM with 8 components.

Relating to R, as expected, the absolute best result (84.7%) is also obtained in the agreement database, using MS + $\Delta$ + $\Delta\Delta$ calculated in 100 ms frames, 14 centroids, Hilbert envelope, 240 Hz as maximum modulation frequency, dimensionality reduction through LDA, and a GMM with 16 components. For MFCCs, the best results are obtained using MFCCs + $\Delta$ + $\Delta\Delta$, 22 coefficients, PCA, 20 ms frames, and a GMM with 48 components.

Table 4 shows confusion matrices for MFCC and MS Morphological Parameters as the sum of the confusion matrices obtained at each of the test folds. They are calculated using the mentioned configurations that led to the best results.

5. Conclusion and Discussions

This study presents a new set of parameters based on MS, developed to characterize perturbations of the human voice. The performance of these parameters has been tested with an automatic system that emulates a perceptual assessment according to the G and R features of the GRBAS scale. The proposed automatic system follows a classical supervised learning setup, based on GMM. The outcomes have been compared to those obtained with a baseline setup using the classic MFCCs as input features. Dimensionality reduction methods such as LDA and PCA have been applied to mitigate the curse of dimensionality effects induced by the size of the corpus used to train and validate the system. The best results are obtained with the proposed MS parameters, providing 81.6% and 84.7% of efficiency and 0.73 and 0.76 Cohen’s Kappa Index for G and R, respectively, in the agreement subset. Bearing in mind Altman’s interpretation of Cohen’s index [62], shown in Table 5, the agreement can be considered “good”, almost “excellent.” Likewise, most errors made by the system correspond to adjacent classes, as can be deduced from the confusion matrices represented in Table 4. It is noticeable that in many cases the second class (level 1 in traits G and R) is not detected properly, and the main reason may be the lack of subjects of class 2 (level 1 in G and R) in the corpus used. The fact that GMM classifiers were trained with a small quantity of class 2 frames with respect to the other classes explains the higher percentage of errors obtained for this class. In order to solve this problem in future works it might be necessary to use classification techniques for imbalanced data [63]. Another possible reason for the mismatching of intermediate classes (G and R equal to 1 or 2) is that these are the least reliable levels in GRBAS perceptual assessment, as described by De Bodt et al. [5].

In reference to the outcomes obtained with features without dimensionality reduction, results are better for the agreement subsets using MS Morphological Parameters. Moreover, when applying LDA to the MS feature space, an absolute improvement of 9% is obtained for R in comparison to MFCCs, leading to the best absolute outcome obtained and suggesting that the MS Morphological Parameters are in some sense linearly separable. As a starting point, most of the agreement subset tests were also performed with what we have called the original subset (224 files) using the three available label groups separately: one of them generated by one of the speech therapists and the other two created by the other specialist in two different sessions. In these cases, in spite of having a higher number of files and a more class-balanced database, results barely exceed 60% of efficiency. This indicates that the consistency of the database labeling (i.e., removing the noise introduced during the labeling process due to intra- and interrater variability) is crucial to obtain more accurate results. An interesting conclusion is that further studies should utilize only consistent labels obtained in agreement with several therapists and in different evaluation sessions.

In order to search for evidence that the selected cross-validation technique is not influencing the results by producing corpus-adjusted trained models, most of the tests were repeated using a 6-fold cross-validation technique as an exploratory experiment. Almost the same maximum efficiencies were obtained in all cases, with a difference of around ±1%, suggesting that the selected cross-validation technique is not producing corpus-adjusted trained models.

Regarding the use of the first and second derivatives (Δ and ΔΔ), they improve performance mainly when using MFCCs in 20 ms frames for one of the two traits. This suggests that derivatives provide relevant extra information related to the short-term variations that occur in pathological voices [64]. In the rest of the cases the improvements are limited; therefore, the influence of derivatives in G and R detection systems should be studied in detail in future work.
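
For reference, derivative features of this kind are commonly computed with a regression over neighboring frames. The sketch below uses that common formula with edge frames repeated at the boundaries; the window width, the function name, and the formula itself are assumptions, since the exact procedure followed in this work is not restated here.

```python
def delta(frames, width=2):
    """Regression-based time derivative of a sequence of feature vectors.

    frames is a list of equal-length lists of floats, one per frame.
    For frame t and coefficient d the result is
        sum_{n=1..width} n * (c[t+n][d] - c[t-n][d]) / (2 * sum n^2),
    with the first and last frames repeated beyond the boundaries.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out


# Hypothetical one-coefficient feature track, for illustration only.
static = [[float(t)] for t in range(6)]
first = delta(static)    # first derivative
second = delta(first)    # second derivative
# Extended feature vector per frame: static + first + second derivatives.
extended = [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]
```

Appending both derivatives triples the dimensionality of the feature vector, which is one reason why their benefit has to be weighed against the size of the training corpus.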

Comparing this work with the other studies mentioned in Section 1, the results with MFCCs are consistent with those obtained in [17, 21, 36], although the methodologies followed in them differ from the one proposed in this study. As stated in Section 1, previous studies seldom exceed 75% efficiency. Considering the G and R traits, only [19, 20] surpass that value, achieving 80% for one trait.

Despite the promising results, an accurate comparison with the studies found in the state of the art is difficult since, as stated in [65], different works tend to use different corpora and methodologies, and results are unfortunately dependent on the corpus used for training and validation. Furthermore, in those cases in which different studies use the same database, the labeling is usually different. In this sense, the definition of a standardized database with a consistent and known labeling would lead to comparable results. For this reason, with the aim of providing the scientific community with a benchmark labeling and promoting a more solid comparison of future techniques and studies, the labeling of the G and R features used in this work has been included in Table 6. Despite its known limitations [65], the fact that the MEEI database is commercially available to researchers is also an advantage in this sense.

On the other hand, other approaches such as [11, 23] have already demonstrated that MS is a good source of information to detect pathological voices or to perform an automatic pathology classification. The main difference with respect to Markaki’s approach is that in this study MS is used to evaluate the speech according to a 4-level scale in two different features of the speech: Grade and Roughness. In addition, the parameters used in the present study are less abstract and have an easier physical interpretation, opening the possibility of using them in a clinical setting.

In spite of the good figures, MS has a weakness which could make it a nonviable parameterization in some applications: computational cost. Depending on the configuration and frequency margins, calculating an MS matrix can take around 400 times longer than calculating MFCCs on the same signal frame.

Regarding future work, all MS parameters must be studied and adjusted separately to find the frequency margins of operation that optimize results. In addition, the use of the proposed MS Morphological Parameters in combination with other features, such as complexity and noise measurements or cepstral-based coefficients, to characterize GRBAS traits would be advisable. Moreover, the study of regression methods like Support Vector Regression [66] and of other feature selection techniques such as the Least Absolute Shrinkage and Selection Operator (LASSO) [67] is of interest. With respect to the classification stage, the stratification of the speakers according to their sex, age, or emotional state could increase performance, as suggested in [68]. For this purpose, an a priori categorization of the speaker’s characteristics using hierarchical methods might be used to simplify the statistical models that underlie the automatic assessment of the quality of speech.

In summary, the results suggest that the proposed MS Morphological Parameters are an objective basis to help clinicians assess Grade and Roughness according to the GRBAS scale, reducing uncertainty and making the assessment easier to replicate. It would be advisable to study the synthesis of a new parameter that combines the proposed MS Morphological Parameters and is suitable for therapists and physicians. In view of the experiments carried out in this work, there is evidence suggesting that the use of these parameters provides better results than the classic MFCCs, traditionally used to characterize voice signals. On the other hand, their main drawback is the initial difficulty of applying the proposed MS-based parameters to the study of running speech.

Appendix

Table 6 contains the perceptual assessment of the GRBAS Grade and Roughness traits for 224 recordings of the MEEI corpus [55]. G1 and R1 are the assessments of Therapist 1. The remaining evaluations were performed by Therapist 2 in two different sessions. Numbers in bold represent the agreement subsets, while the complete set of evaluations constitutes the original subsets.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors of this paper have developed their work under the grant of the project TEC2012-38630-C04-01 from the Ministry of Economy and Competitiveness of Spain and Ayudas para la realizacion del doctorado (RR01/2011) from Universidad Politecnica de Madrid, Spain.