Abstract

Due to low intra- and interrater reliability, perceptual voice evaluation should be supported by objective, automatic methods. In this study, text-based, computer-aided prosodic analysis and measurements of connected speech were combined in order to model perceptual evaluation of the German Roughness-Breathiness-Hoarseness (RBH) scheme. 58 connected speech samples (43 women and 15 men; mean age 48.7 years, standard deviation 17.8 years) containing the German version of the text “The North Wind and the Sun” were evaluated perceptually by 19 speech and voice therapy students according to the RBH scale. For the human-machine correlation, Support Vector Regression with measurements of the vocal fold cycle irregularities (CFx) and the closed phases of vocal fold vibration (CQx) of the Laryngograph and 33 features from a prosodic analysis module were used to model the listeners’ ratings. The best human-machine results for roughness were obtained from a combination of six prosodic features and CFx (, ). These correlations were approximately the same as the interrater agreement among human raters (, ). CQx was one of the substantial features of the hoarseness model. For hoarseness and breathiness, the human-machine agreement was substantially lower. Nevertheless, the automatic analysis method can serve as the basis for a meaningful objective support for perceptual analysis.

1. Introduction

Voice is a perceptual phenomenon, and perceptual evaluation is therefore regarded as a gold standard for voice assessment [1, 2]. Hence, perception-based methods are the basis for the evaluation of voice pathologies in clinical routine, although they are too inconsistent among single raters to establish a standardized and unified classification [3, 4]. Against this background of methodological shortcomings, simple rating criteria for perceptual evaluation have been established. Five of them have been combined to form the GRBAS scale [5] (grade, roughness, breathiness, asthenia, and strain). However, the choice of criteria has been criticized: asthenia (A) and breathiness (B) correlated very highly with each other in a study by Nawka et al., and the evaluation of the strain (S) criterion showed a much higher variation than the other criteria. For these reasons, that working group developed a reduced version of GRBAS, the Roughness-Breathiness-Hoarseness (RBH) evaluation scheme [6]. It has become an established means for perceptual voice assessment in German-speaking countries.

Automatic, that is, computer-based, assessment may be helpful as an objective support for the subjective evaluation, since it avoids the problem of intra- and interrater variation. Perception experiments are often applied to spontaneous speech, standard sentences, or standard texts. Regarding automatic analysis, Maryn et al. reported that 18 out of 25 reviewed studies examined sustained vowels exclusively, four only speech, and three both vowels and speech [7]. For the analysis of speech, mostly one sentence of the English “rainbow passage” was used. Speech recordings have the advantage that they contain onsets, variations of F0, and pauses [8]. The impression of roughness, for instance, is influenced by the vowel onset fragments [9]. In general, hoarseness is more present and perceptible in long vowels, especially in open vowels, vowels in voiced context, vowels after glottal closure, or in strained vowels [10]. Hence, perceptual evaluation of a vowel and speech can only be adequately compared when the entire vowel with onset is evaluated [11, 12]. For automatic evaluation, some researchers recommend examining only the stable part of an isolated vowel [13], but following these recommendations means that a substantial portion of persons whose phonation is highly irregular cannot be evaluated at all. In particular, the rapid movements of the articulatory organs that are essential for the production of efficient speech require methods of analysis that go beyond the sole use of sustained vowels [14]. In order to diminish this problem, the Laryngograph has been designed to allow vocal fold closure to be monitored, most notably giving a basis for the measurement of aspects of vocal fold vibration which occur during voiced sounds [15].

In order to achieve a more global analysis of speech, the analysis of speech samples should be extended to methods that do not only evaluate voiced sounds. Unvoiced sounds, words, the speaking rate, the duration and position of pauses within spoken phrases, the fundamental frequency and loudness, and their variations also contribute to the complex phenomenon of speech. The analysis of these aspects has been the subject of our working group's research in the field of automatic speech processing and understanding (identification of what was said and what it means) and also in automatic evaluation of voice and speech disorders (computer-based analysis of voice quality and speech properties, such as intelligibility). This analysis is achieved by a program package called the prosody module [16–18]. The goal of this work is to identify a computer-based equivalent for the subjective ratings of roughness, breathiness, and hoarseness from speech recordings, which are representative for communication by voice. This is achieved by means of the Laryngograph and prosodic analysis. Both systems of measurement are completely independent from each other.

Binary classification in the two classes “normal speech” and “pathologic speech” was not the goal of this study. Instead, the continuum of degrees of pathology and the continuum of human ratings were supposed to be modeled.

The questions addressed are the following.

How does the combination of prosodic analysis and Laryngograph measurements correspond with the perception-based RBH evaluation by “trained” listeners?

How do the results change when the Laryngograph measurements are left out or used as the only features for modeling the listeners’ ratings?

2. Materials and Methods

2.1. Samples

58 speech samples (43 samples of female and 15 samples of male voices) were used in this study. The age of the persons was between 12.2 and 81.9 years and the average age was 48.7 years with a standard deviation of 17.8 years. The age distribution is shown in Figure 1. The speech samples were recorded at the Medical University Hannover, Department of Phoniatrics and Pedaudiology, within an interval of three months. For each person, only the set of recordings acquired during the first visit to the clinic was used. The collection of samples was supposed to be representative, so no further selection was made. For this reason, the database contained deviated voices and also “normal” voices (Table 1). The most frequent pathology was dysphagia (). The subjects were examined by experienced laryngologists and phoniatricians following the standard protocol of the European Laryngological Society [19].

The speech samples contained connected speech, namely, the standard text “Der Nordwind und die Sonne” (“The North Wind and the Sun”) [20] which is frequently used in medical speech evaluation in German-speaking countries. The version used for this study consisted of 109 words. The recordings were made with components of the Laryngograph system [21]. The headset of the system was placed at a distance of 10 cm in front of the reader’s mouth. The speech data were recorded with a sampling frequency of 44.1 kHz and a 16 bit amplitude resolution. For automatic speech analysis, the data were resampled with a 16 kHz sampling frequency. In order to obtain the other Laryngograph measurements, two electrodes were placed superficially on either side of the neck of the subject at the level of the larynx, and a constant amplitude high-frequency voltage (3 MHz) was applied. This setup was chosen in order to ensure conditions which are usual in clinical applications.

The study has respected the principles of the World Medical Association (WMA) Declaration of Helsinki on ethical principles for medical research involving human subjects. All patients had given written consent to the anonymized use of their data for research purposes before the recordings.

2.2. Perceptual Evaluation

The perceptual evaluation of the text recordings according to clinical standards was done by 19 speech and voice therapy students (3rd year female students, study course on speech therapy at the Fresenius University of Applied Sciences, Idstein, Germany) using the RBH scale [6]. The students had learned about the RBH scheme from the beginning of their education. In the third year, they have sufficient theoretical and practical knowledge about voice evaluation, the ability to interpret larynx-related diagnoses, and practical experience, since they have also undergone practical training including therapy lessons by themselves under supervision.

Before the listening task, detailed instruction was given to the students by the study tutors. During the task, however, no further information was given. The raters listened to each speech sample once. This was sufficient since the duration of one recording was 46 seconds on average. Between two samples, there was a pause to note down the results. The students were not allowed to discuss their impression with the other raters.

For one speech sample, each of the RBH criteria, that is, roughness, breathiness, and hoarseness, can be evaluated on a 4-point scale where “0” means “absent” and “3” means “high degree.” Originally it was believed that hoarseness is distinct from the other two categories, roughness and breathiness [22]. The RBH scheme instead assumes that hoarseness is a superclass of them [23]. In order to capture the fact that hoarseness is the superclass, the rating value H must usually be at least as high as R and B. For this study, however, this latter rule was not applied, and the students were told to evaluate hoarseness on the 4-point scale just by their impression of the replayed speech. This procedure has already been performed in several other studies in Germany [24–26].

2.3. Laryngograph Measurements

The Laryngograph measures the time and degree of contact between the vocal folds by the application of two electrodes which are placed on the neck. The electroglottogram serves as the basis for the computation of several measures. Two of them have been used in this study and will be explained below. Although the voiced excitation of the vocal tract is a complex activity, it has two main time-dependent characteristics. The first one is derived from the duration of excitation of the vocal tract, when the closure of the vocal folds produces its main acoustic signal; the second one relates to the period during which the vocal folds are effectively closed [21]. The fundamental frequency (F0) is usually estimated from short-time windows and based on average values from several vocal fold cycles, which may also be fragmented at the boundaries of the analysis window. A period-synchronous analysis is more exact since it takes into account only full cycles and can also consider period-to-period variations that are often of perceptual importance. These variations of the period frequency values Fx are denoted as CFx in the Laryngograph software. Another measure that provides information about perceived voice quality is CQx, the variation of the contact phase Qx. The latter is directly related to the ratio of the closed phase of vocal fold vibration to the total period of time between two successive epochs of excitation [21]. In this study, CFx and CQx were used in combination with prosodic features to describe voice quality. Both values are given in percent.
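The idea of period-synchronous analysis can be illustrated with a minimal sketch. It assumes that the vocal fold closure instants (in seconds) and the contact duration of each cycle have already been extracted from the electroglottogram. The function names and the 10% threshold are hypothetical choices for illustration; they are not the definitions of CFx and CQx used by the Laryngograph software.

```python
def cycle_frequencies(closure_times):
    """Per-cycle frequency Fx in Hz from successive closure instants (seconds)."""
    return [1.0 / (t1 - t0) for t0, t1 in zip(closure_times, closure_times[1:])]


def contact_quotients(closure_times, contact_durations):
    """Per-cycle contact quotient Qx in percent: contact time divided by period."""
    periods = [t1 - t0 for t0, t1 in zip(closure_times, closure_times[1:])]
    return [100.0 * c / p for c, p in zip(contact_durations, periods)]


def irregularity_percent(values, rel_threshold=0.10):
    """Percentage of consecutive cycle pairs whose values differ by more than
    rel_threshold relative to the first value of the pair (assumed criterion)."""
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return 0.0
    irregular = sum(1 for a, b in pairs if abs(b - a) > rel_threshold * a)
    return 100.0 * irregular / len(pairs)
```

Applying `irregularity_percent` to the Fx sequence gives a CFx-like value and applying it to the Qx sequence gives a CQx-like value, both in percent. Only complete cycles enter the computation, which is the point of the period-synchronous approach.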

2.4. Prosodic Features

The computation of the prosodic features is independent from the Laryngograph. A speech recognition system [27] detects the spoken words and their positions in the speech recordings. Then the prosodic analysis module [16] computes a vector of prosodic features for each word. There are three basic groups of features. Duration features represent word and pause durations. Energy features contain information about maximum and minimum energy, their respective positions in the word, the energy regression coefficient, and the mean square error. Similarly, the F0 features, based on the detected fundamental frequency, comprise information about the extreme F0 values and their positions, voice onset and offset with their positions, and also the regression coefficient and the mean square error of the F0 trajectory. Duration, energy, and F0 values are stored as absolute and as normalized values. The basic features are computed in different contexts, that is, in intervals containing a single word or pause only or a word-pause-word interval. In this way, 33 features were computed for each word (see Table 2) [17, 28, 29].

Besides the 33 local features per word, 15 “global” features were computed for intervals of 15 words length each. They were derived from jitter (fluctuations of F0), shimmer (fluctuations of intensity), and the number of detected voiced and unvoiced sections in the speech signal [28]. They covered the means and standard deviations of jitter and shimmer, the number, length, and maximum length of voiced and unvoiced sections, the ratio of the numbers of voiced and unvoiced sections, the ratio of the length of the voiced sections to the length of the signal, and the same for unvoiced sections. The last feature was the standard deviation of F0.
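As an illustration of what jitter and shimmer measure, the following sketch computes one common textbook variant: the mean absolute difference between consecutive values, relative to the mean value. This is an assumption made for illustration only. The prosody module may use a different algorithm, and the function names are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the
    mean of all values, in percent."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_value = sum(values) / len(values)
    return 100.0 * (sum(diffs) / len(diffs)) / mean_value


def jitter_percent(periods):
    """Cycle-to-cycle fluctuation of the fundamental period."""
    return relative_perturbation(periods)


def shimmer_percent(amplitudes):
    """Cycle-to-cycle fluctuation of the peak amplitude."""
    return relative_perturbation(amplitudes)
```

A perfectly regular voice yields 0% for both measures; alternating periods of 10 ms and 12 ms yield a jitter of about 18%.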

The listeners gave ratings for the entire text. In order to receive also one single value for each feature that could be compared to the human ratings, the average of each prosodic feature over the entire recording served as final feature value.
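The averaging step can be sketched as follows, assuming the per-word features are available as one dictionary per word. The data layout and the function name are illustrative assumptions; the feature names in the usage note are taken from the text.

```python
def average_features(word_features):
    """Average each feature over all words of one recording.

    word_features: list of dicts, one per word, mapping feature name -> value.
    Returns one dict with a single averaged value per feature.
    """
    totals = {}
    counts = {}
    for features in word_features:
        for name, value in features.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}
```

For example, two words with `DurNormW` values 1.0 and 3.0 yield a recording-level `DurNormW` of 2.0, which can then be compared with the single rating given for the entire text.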

2.5. Support Vector Regression

A Support Vector Machine (SVM) performs a binary classification based on a hyperplane separation between two class areas in a multi-dimensional feature space. SVMs can also be used for Support Vector Regression (SVR) [30]. The general idea of regression is to use the element vectors of the training set to approximate a function which tries to predict the target value of a given vector of the test set. In this study, the sequential minimal optimization algorithm (SMO) [30] of the Weka toolbox [31] was applied for this purpose. The automatically computed prosodic features and the CFx and CQx values served as the input features for the regression, and the perceptually assessed RBH scores served as the target values. For each of R, B, and H, one separate regression was computed.

In order to find the best subset of the computed features to model the subjective ratings, a correlation-based feature selection method ([32], pp. 59–61) was applied in a 10-fold cross-validation manner. The features with the highest ranks were then used as the input for the SVR.
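The following sketch illustrates the general idea of ranking features inside a cross-validation loop. It is a simplified stand-in, not the correlation-based feature selection of [32]: it ranks each feature only by the absolute value of its Pearson correlation with the ratings and ignores redundancy between features. The fold assignment, the parameter `k_top`, and all function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def rank_features(X, y, names):
    """Feature names sorted by decreasing |correlation| with the target y."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in X]
        scores[name] = abs(pearson(column, y))
    return sorted(names, key=lambda name: -scores[name])


def cv_selection_counts(X, y, names, k_top=2, folds=10):
    """Count how often each feature is among the k_top ranked features
    when the ranking is repeated on the training part of each fold."""
    counts = {name: 0 for name in names}
    n = len(y)
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        for name in rank_features(X_train, y_train, names)[:k_top]:
            counts[name] += 1
    return counts
```

Features that are selected in most or all folds are the more stable candidates for the regression input.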

2.6. Human-Machine Correlation

Statistical analysis was performed using Weka and in-house programs. The interrater reliability for the entire rater group was measured using Krippendorff’s α [33]. Many studies use Cronbach’s α, but this measure eliminates the influences of different tendencies in rating since the mean values are neglected. In order to examine human-machine correlation, the automatic measurement for each rating criterion of each recording was compared to the average value of the 19 raters’ evaluation. The correlations between different measurements and rating criteria were computed using Pearson’s correlation coefficient r and Spearman’s rank-order correlation coefficient ρ. Other measures, like Cohen’s κ or Krippendorff’s α, were not used for this purpose due to the different domains of human and machine evaluation. This means, for instance, that continuous intervals of the prosodic features or the Laryngograph values would have to be mapped to the discrete values of the RBH components, which is another possible source of error [34].
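For reference, the two correlation coefficients can be computed as in the following minimal sketch. Spearman's coefficient is obtained as Pearson's coefficient on the ranks, with tied values receiving their average rank. The function names are illustrative and are not taken from the in-house programs used in the study.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank-order correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because the averaged ratings of 19 listeners are nearly continuous, both coefficients can be applied directly to the pairs of averaged rating and regression output without mapping either of them to discrete classes.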

3. Results

3.1. Perceptual Data

The average values for the perceptual rating criteria are given in Table 3. The data showed a broad range of persons with minimal values of , , and , respectively, to maximum values for , , and around 2. A large variety in the evaluation results was observed within the rater group as well (Figures 2–4). The interrater values for the 19 listeners were for roughness, for breathiness, and for hoarseness (Table 3). Correlations between the rating criteria are given in Table 4. The criteria roughness and breathiness correlate only moderately with each other. The strongest correlation is between breathiness and hoarseness (, ).

3.2. Human-Machine Correlation

The correlations between the perceptual evaluation and the automatic measurements after the SVR are given in Table 5. The best set for roughness () achieves (). It contains the duration of a word-pause-word interval (DurNormWPW), the mean and minimum F0 within a word (F0MeanW, F0MinW), mean jitter and shimmer averaged on 15-word sections (MeanJitter, MeanShimmer), the number of sections detected as voiced (#+Voiced), and CFx. Without CFx, only () is reached (set w/o CFx). The duration feature can also be left out without changing the correlations significantly (sets and w/o CFx). The same feature is in the best set for breathiness modeling (), which, however, was far less successful in modeling the reference with (). Still, this correlation is highly significant. Neither CFx nor CQx is included in the breathiness model. For hoarseness, there are four different results, denoted to . The best correlation is () for a combination of word duration (DurNormW), the voice offset position within single words (F0OffPosW), the normalized energy within words (EnNormW), the “global” number of voiced sections in the recording (#+Voiced), and the ratio between the numbers of voiced and unvoiced sections (RelNum+/−Voiced). CQx is also essential for the best feature set for hoarseness. Without CQx, the set reaches only human-machine correlations of about 0.35; with CFx instead of CQx, the highest values are below 0.5. Figures 5–7 show the perceptual evaluations, that is, the average of the 19 raters, and the regression values of the SVR for the best feature sets.

Table 6 shows the human-machine correlations for combinations of CFx and CQx only. These two measures can model the perceptual impression of hoarseness moderately (, ), while they are only weakly correlated with roughness and breathiness. The distribution of these measurements is shown in Figures 8–10.

4. Discussion

The Laryngograph is an established means of voice evaluation [14, 35]. The main purpose of this study was to determine the correlation between the German RBH evaluation scheme and a combination of text-based prosodic features and measurements from the Laryngograph. The best combination of features yielded a human-machine correlation for roughness of (). The interrater correlation for one rater against the average of all others was (). Hence, the automatic analysis can evaluate roughness as reliably as an “average” rater from the group of the 19 speech and voice therapy students. For hoarseness, the automatic method reached almost the same correlation with the reference as the listeners among themselves. Only the breathiness rating could not be modeled satisfactorily. Additionally, dropping one of the feature sets from the automatic evaluation leads to significantly worse results.
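The comparison of one rater against the average of all others can be sketched as follows. The layout of the ratings (one list of scores per rater, in the same sample order) and the function names are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_rating(ratings):
    """Average score per sample over all raters (the human reference)."""
    return [sum(column) / len(column) for column in zip(*ratings)]


def one_vs_rest_correlations(ratings):
    """For each rater, correlate that rater's scores with the mean
    score of all remaining raters.

    ratings[r][s] is the score of rater r for sample s.
    """
    n_raters = len(ratings)
    n_samples = len(ratings[0])
    result = []
    for r in range(n_raters):
        others_mean = [
            sum(ratings[o][s] for o in range(n_raters) if o != r) / (n_raters - 1)
            for s in range(n_samples)
        ]
        result.append(pearson(ratings[r], others_mean))
    return result
```

Averaging the values returned by `one_vs_rest_correlations` gives a single human-human figure that can be set against the correlation between the regression output and `mean_rating`.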

For the modeling of roughness, the duration of a word-pause-word interval (DurNormWPW) may contribute to the most successful set of features because the anatomic alterations, which are the reason for the deviated voice, may also cause a greater speaking effort. This effect has been shown for substitute voices of laryngectomized persons [17], and it might also be valid for the data in this study. The contribution of DurNormWPW to the regression sum is, however, very small.

The impact of the F0 values F0MinW and F0MeanW can be explained by the properties of the F0 detection algorithm, which makes a voiced-unvoiced decision first. On all of the 16 ms speech frames that were classified as voiced, the program performed F0 detection. The algorithm by Bagshaw et al. [36] that was used for the task is very robust against distortions. However, noisy speech may result in octave errors, that is, instead of the real fundamental frequency the double, triple, or half of the actual value is found. More “noisy” speech influences the F0 trajectory and thus the correlation with the subjective results [18].

A similar case is the relevance of text-based jitter and shimmer for the model of the roughness evaluations. Both are well-known detectors for voice problems, and the number of segments in the recording which were detected as voiced corresponds with these findings. If a voice is very irregular, then the number of segments detected as voiced by the prosody module will be very low. A difficulty for the comparison of these results with other studies, however, is that the terms “jitter” and “shimmer” disguise a plethora of different algorithms, across many different software vendors and research groups [37]. Many studies give no algorithm details. Additionally, irregularity measures from sustained, isolated vowels and running speech cannot be directly compared due to coarticulatory effects and differences in voice onset and offset.

In this study, the CFx value also appeared to be essential for the good human-machine correlation for roughness. When it was missing, the correlation dropped to (). CFx is also related to variations of F0, but it is period-synchronous instead of being based on fixed-length windows. On the one hand, that is an advantage over the traditional computation of jitter. On the other hand, the low correlation between CFx and jitter values (Table 7) indicates that both contain important but independent information.

Breathiness can be modeled only weakly by the available features. While the human-human correlation was (), the maximum for the automatic analysis was (). Here, the duration of a word-pause-word interval contributes very strongly. The reason may be that the continuous leaking of air at the glottis leads to longer or more frequent pauses.

The contribution of the F0 value at voice onset (F0OnsetW) may again be based upon octave errors by the F0 detection algorithm. So far, it is not clear why only the beginning of voiced sections causes a noticeable effect. There may be a connection to changes in the airstream between the beginning and end of words or phrases. It may have its reason in the high speaking effort in the dysphonic voice which leads to more irregularities, especially in these positions, but this has to be confirmed by more detailed experiments on larger and homogeneous databases.

The influence of the normalized energy in the breathiness model was only relevant when it was measured within one word (EnNormW) and not in a word-pause-word interval. Hence, breathing noise in pauses does not contribute to the result, although the duration of the pauses may be important, as pointed out above. The sign of the weighting factor (−0.247) is negative, so the breathier the voice is, the weaker it is and the higher the human evaluation is.

Jitter is also an important factor for the evaluation of breathiness; however, not all authors of other studies agree [38, 39]. Shimmer shows only a very low contribution, but the standard deviation of shimmer within longer text passages, that is, the fluctuations of the fluctuations of energy, seems to be characteristic for breathiness.

Neither CFx nor CQx were in the optimal set for breathiness evaluation.

For hoarseness, many features were in the best subsets that were also relevant for roughness and breathiness. This supports the assumption of Nawka et al. that hoarseness is a superclass of the other two criteria [6], although the students did not evaluate the data with this rule in mind explicitly. The feature set modeling the raters’ decisions best reached a correlation of () to that reference; the interrater correlation was (). Like for breathiness, the duration is important, but only on single words, not on word-pause-word intervals. Replacing the feature with the latter variant yields much worse correlations (Table 5, column ), as did using the word-based feature for modeling roughness.

The normalized energy within words (EnNormW) is, like for breathiness, another important feature. Replacing it with the word-pause-word variant (EnNormWPW) was not successful (Table 5, columns and ).

The average of jitter contributes to the hoarseness model even more than to the two other categories.

The position of the voice offset within a word (F0OffPosW), which did not occur in the roughness and breathiness modeling, is a nonnegligible factor for hoarseness evaluation. This has already been detected in a previous study with chronically hoarse persons who were evaluated by five voice experts [18]. The reason is very probably the detection algorithm and its decisions regarding voiced and unvoiced sections again.

Shimmer was not relevant for hoarseness at all in the results, although it showed contributions to the regression sum of roughness and breathiness. This supports, in contrast to Nawka’s assumption, the hypothesis that hoarseness may be more than just the superclass of the other categories.

As with roughness, the number of sections that are classified as voiced (#+Voiced) is important for hoarseness evaluation. Additionally, the ratio of the numbers of voiced and voiceless sections (RelNum+/−Voiced) supports the results.

The high correlation of perceptual B and H evaluations shows that for the evaluation of overall hoarseness the raters were closer to the breathiness rating than to the roughness rating. This is in contrast to another study of our group, where roughness and hoarseness had a higher correlation () [34]. For that study, however, the restriction that H must be at least as high as R and B was applied, and only five speech therapists with several years of experience in voice evaluation had rated the data. In this new study, there was also a large variety of ratings among the 19 listeners. Therapists with many years of practical experience may show less disagreement [40], but since the raters of our study had undergone almost three years of practical education before, we believe that they had already developed a rather stable personal model of voice evaluation. The influence of these factors on our particular data has to be examined in future work.

The automatic modeling of the hoarseness and especially the breathiness ratings was not as successful as for roughness. The set of available measures and prosodic features was not sufficient to depict the various ratings of the large rater group satisfyingly so far. Nevertheless, the method presented here may be the basis for a meaningful objective support and an addition to perceptual analysis in clinical practice. Another important advantage of the presented method is that it does not just classify voices into one of the two categories “normal” and “pathologic.” For quantification of a communication disorder in clinical use, this is not sufficient. Instead, the experiments provided regression formulae which can be used to translate the obtained measures onto the whole range of perceptual ratings.

A complete match of subjective and automatic evaluation was not expected. On the one hand, disagreement on which acoustic properties or measures represent which perceptual impression may still be present; on the other hand the automatic assessment can only be based on a stimulus which for perceptual evaluation is further processed within the listener. Hence, the sources of information for both methods are different. The process of perception may evaluate more or different information than the automatic methods. Additionally, there is also some possible improvement for the technical methods which is part of future work. As an example, the speech recognition module, which is supposed to provide the word hypotheses graph for the computation of the prosodic features, can be improved by adaptive methods to enhance the phoneme models for distorted speech [41]. For these reasons, we regard this study as a pilot study. Furthermore, the automatic evaluation is not supposed to be a full replacement for the subjective assessment, but an additional source of information which yields reproducible results.

5. Conclusions

Combined prosodic and Laryngograph-based analysis corresponds as well with the average perception-based roughness evaluation as the members of a group of trained raters do among themselves, on a clinically representative group of patients with a broad distribution of voice pathology. It can serve as an additional source of knowledge or an objective guideline in clinics where perceptual evaluations are usually performed by a single person only.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This study was partially funded by the Else Kröner-Fresenius-Stiftung, Bad Homburg v.d.H., Germany, under Grant no. 2011_A167.