Abstract

To facilitate the development of speech perception, speech-language pathologists have to teach a subject with hearing loss the differences between two syllables by manually enhancing acoustic cues of speech. However, this process is time-consuming and difficult. Thus, this study proposes an objective approach to automatically identify the regions of spectral distinctiveness between two syllables, which can then be used for speech-perception training. To accurately represent the characteristics of speech, mel-frequency cepstrum coefficients are selected as analytical parameters. The mismatch between two syllables in the time domain is handled by dynamic time warping. Further, a filter bank is adopted to estimate the components in different frequency bands, which are also represented as mel-frequency cepstrum coefficients. The spectral distinctiveness in different frequency bands is then estimated by using Euclidean metrics. Finally, a morphological gradient operator is applied to automatically identify the regions of spectral distinctiveness. To evaluate the proposed approach, the identified regions are manipulated and the manipulated syllables are then assessed with a closed-set speech-perception test. The experimental results demonstrate that the identified regions of spectral distinctiveness are useful in speech perception and can therefore help speech-language pathologists in speech-perception training.

1. Introduction

Hearing loss seriously degrades a subject’s speech perception, thereby affecting the development of articulation ability. It then reduces speech intelligibility and affects speech and language development, learning, and communication. Assistive listening devices such as hearing aids or cochlear implants can help subjects with hearing loss utilize their residual hearing and develop their speech perception [1–7]. To facilitate this process, speech-language pathologists (SLPs) have to provide speech-perception training that increases a subject’s ability to distinguish one syllable from another. In clinical practice, SLPs manually enhance the distinguishable acoustic cues and use them extensively in speech-perception training. However, this is a time-consuming and expensive process. Thus, speech-perception training would benefit if the regions of spectral distinctiveness between two syllables could be identified automatically.

When a speech wave propagates on the basilar membrane, it is characterized as time-spectral patterns. Then, unique perceptual cues, which are the basic units for speech perception, can be identified. Therefore, the relation between the acoustic cues and perceptual units is a key problem for speech perception [8–10]. In the last decade, this relation has been examined [11–18], and the results show that the main factors of acoustic cues are duration, stress, and spectral distinctiveness.

SLPs generally increase the duration and stress of a syllable to teach a subject how to distinguish one syllable from another. Duration and stress can be manipulated simply by speech processing techniques [19]. However, the spectral distinctiveness between two syllables is very difficult to identify. In clinical practice, SLPs have to pronounce a syllable repeatedly while raising the volume of part of the syllable, and manually enhancing the spectral distinctiveness of a syllable in this way is a complicated task. To identify the regions of spectral distinctiveness, Li et al. proposed a psychoacoustic method in three dimensions: time, frequency, and intensity [20]; still, it is a time-consuming process. Moreover, it is difficult to apply to other languages. Hence, automatically identifying the regions of spectral distinctiveness is very important for the speech-perception training of subjects with hearing impairment.

In this study, an objective approach to identify the regions of spectral distinctiveness is proposed. The mel-frequency cepstrum coefficients (MFCCs) are selected as analytical parameters and used to represent the characteristics of the acoustic signal. The mismatch between two syllables in the time domain is handled by dynamic time warping, which yields an optimal matching condition. To accurately estimate the spectral similarity, a filter bank is applied to find the spectral components of different frequency bands. For the speech signal in each frequency band, the MFCCs are also extracted to represent the acoustical characteristics. According to the optimal matching condition, the spectral distinctiveness of each frequency band between two syllables can be estimated by using Euclidean metrics. Finally, a morphological gradient operator is applied to automatically identify the regions of spectral distinctiveness. Moreover, to evaluate the accuracy of the identified regions of spectral distinctiveness, an acoustic cue manipulation is proposed in this study.

The rest of this paper is organized as follows. Section 2 describes the objective approach to identify spectral distinctiveness, including feature extraction, spectral distinctiveness estimation, and spectral distinctiveness identification, and also introduces the acoustic cue manipulation. Section 3 describes a series of experiments that examine the performance of our approach. Finally, conclusions are drawn in Section 4, along with recommendations for future research.

2. Materials and Methods

In this section, the proposed objective approach to identify the regions of spectral distinctiveness between two syllables (as shown in Figure 1) is introduced. Firstly, the MFCCs are extracted from the input speech signals and the filtered speech signals. Secondly, the distance between the MFCCs of two syllables is computed and used to find the consonant-vowel boundary. This approach also adopts dynamic time warping to find an optimal matching condition between the two input syllables. Thirdly, according to the optimal matching condition, the spectral distinctiveness of each frequency band is estimated by using the Euclidean metric. Finally, a morphological gradient operator is applied to automatically identify the regions of spectral distinctiveness. To examine the proposed approach, an acoustic cue manipulation is also proposed to manipulate the regions of spectral distinctiveness. These procedures are described in turn in the following subsections.

2.1. Feature Extraction

Analytical parameters that accurately represent a speech signal play an important role in the objective measurement of spectral distinctiveness. Since MFCCs have been widely used in speech processing, especially speech recognition [21], they are suitable for representing not only a speech signal but also speech signals in different frequency bands. The procedure to extract MFCCs from a speech signal is as follows:
(1) taking the Fourier transform of frames windowed from the input speech signal;
(2) mapping the powers of the spectrum onto the mel scale, which is defined as
$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \tag{1}$$
where $f$ is the frequency (Hz) in the linear domain;
(3) using triangular overlapping windows to get the power spectrum in the mel scale;
(4) taking the logs of the powers at each of the mel frequencies, which are denoted as $S(n)$;
(5) taking the discrete cosine transform of the mel log powers, which is defined as
$$c(k) = w(k) \sum_{n=1}^{N} S(n) \cos\left(\frac{\pi (2n-1)(k-1)}{2N}\right), \quad k = 1, \ldots, N, \tag{2}$$
where $N$ is the length of the window and $w(k)$ is defined as
$$w(k) = \begin{cases} \dfrac{1}{\sqrt{N}}, & k = 1, \\ \sqrt{\dfrac{2}{N}}, & 2 \le k \le N; \end{cases} \tag{3}$$
(6) finally, MFCCs are composed of the amplitudes of the resulting spectrum, $c(k)$.
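Steps (2) and (5) of the procedure can be illustrated with a minimal Python sketch. It assumes the widely used mel mapping 2595 log10(1 + f/700) and an orthonormal type-II discrete cosine transform; the function names `hz_to_mel`, `mel_to_hz`, and `dct_ii` are hypothetical and are not taken from the original implementation, which may differ in scaling.

```python
import math


def hz_to_mel(f_hz):
    """Map a linear frequency in Hz onto the mel scale (assumed 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, useful for placing triangular filter edges."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def dct_ii(log_powers, num_coeffs):
    """Orthonormal DCT-II of the mel log powers; keeps the first num_coeffs terms."""
    n = len(log_powers)
    coeffs = []
    for k in range(num_coeffs):
        # Scale factor: 1/sqrt(N) for the first coefficient, sqrt(2/N) otherwise.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            value * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, value in enumerate(log_powers)
        )
        coeffs.append(scale * total)
    return coeffs
```

A flat log-power vector carries no spectral shape, so all of its energy ends up in the first coefficient and the remaining coefficients vanish, which is a quick sanity check for the transform.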

In Mandarin, a syllable can be decomposed into an INITIAL and a FINAL. INITIALs consist of consonants or semivowels, and FINALs consist of vowels or vowels plus one of the two nasal sounds. Thus, two syllables, represented as $A$ and $B$, are used to estimate the regions of spectral distinctiveness. Let $X = \{x_1, \ldots, x_I\}$ and $Y = \{y_1, \ldots, y_J\}$, respectively, represent the MFCCs of $A$ and $B$ in this study, where $I$ and $J$ are the numbers of frames.

In addition, a filter bank is adopted to separate the input signal into multiple components, which carry the acoustical characteristics of the individual frequency bands. For the $k$th frequency band, the corresponding MFCCs of $A$ and $B$ are also extracted and denoted as $X^k$ and $Y^k$, respectively.

2.2. Spectral Distinctiveness Estimation

To estimate the spectral distinctiveness between two syllables, their mismatch in the time domain must first be handled; the difference in each frequency band can then be estimated by using the Euclidean metric. Therefore, the dynamic time warping algorithm is adopted to compare two sequences, $X = \{x_1, \ldots, x_I\}$ and $Y = \{y_1, \ldots, y_J\}$; the plane spanned by $X$ and $Y$ is considered as a distance matrix $D$, which is written as
$$D = [d(i, j)], \quad 1 \le i \le I,\; 1 \le j \le J, \tag{4}$$
where $d(i, j)$ is the Euclidean distance between $x_i$ and $y_j$. The matching condition indicating the correspondence between the time axes of $X$ and $Y$ can be represented as a sequence of lattice points on the plane and written as
$$H = \{h_1, h_2, \ldots, h_P\}, \tag{5}$$
where $h_p = (i_p, j_p)$ is the $p$th matching pair in $H$. $H^*$ denotes the best matching condition, and the dynamic time warping algorithm is described in Algorithm 1.

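Step 1. Initialization
  DTW(1, 1) = d(1, 1)
  Path(1, 1) = (0, 0)
  For i = 2 to I
    DTW(i, 1) = d(i, 1) + DTW(i − 1, 1)
    Path(i, 1) = (i − 1, 1)
  End For
  For j = 2 to J
    DTW(1, j) = d(1, j) + DTW(1, j − 1)
    Path(1, j) = (1, j − 1)
  End For
Step 2. Iteration
  For i = 2 to I
    For j = 2 to J
      DTW(i, j) = d(i, j) + min{DTW(i − 1, j), DTW(i − 1, j − 1), DTW(i, j − 1)}
      Path(i, j) = the point among (i − 1, j), (i − 1, j − 1), (i, j − 1) that attains this minimum
    End For
  End For
Step 3. Backtracking and Termination
  The optimal (minimum) distance is DTW(I, J).
  The optimal matching path H* is found by simple backtracking from Path(I, J).
  H* = { }
  (a, b) = (I, J)
  While (a, b) ≠ (0, 0) Do
    H* = H* ∪ {(a, b)}
    (a, b) = Path(a, b)
  End While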

In the dynamic time warping algorithm, the variable Path(a, b) stores the predecessor on the path that reaches lattice point (a, b) with the minimum accumulated distance, and that minimum accumulated distance is stored in the variable DTW(a, b). Since the durations are quite different for each INITIAL, the boundary condition is ignored in this study. Hence, in step 1, the path of each lattice point in the first row and first column goes through its previous lattice point. To stop the backtracking when the optimal path is traced, Path(1, 1) is set to (0, 0). In addition, the monotonicity and continuity conditions are imposed on the matching condition in step 2. This means that the search space of lattice point $(i, j)$ includes three lattice points: $(i-1, j)$, $(i-1, j-1)$, and $(i, j-1)$. In step 3, the optimal matching condition is found by backtracking from the last lattice point $(I, J)$.
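The three steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original implementation: indices are 0-based, the stop marker (0, 0) of the pseudocode is replaced by `None`, ties among the three predecessors are broken in favor of the diagonal, and the names `distance_matrix` and `dtw` are hypothetical.

```python
import math


def distance_matrix(x, y):
    """Euclidean distance between every frame of x and every frame of y."""
    return [[math.dist(xi, yj) for yj in y] for xi in x]


def dtw(dist):
    """Return (minimum accumulated distance, optimal path as 0-based (i, j) pairs)."""
    rows, cols = len(dist), len(dist[0])
    acc = [[0.0] * cols for _ in range(rows)]
    path = [[None] * cols for _ in range(rows)]

    # Step 1: first column and first row go through their previous point.
    acc[0][0] = dist[0][0]
    for i in range(1, rows):
        acc[i][0] = dist[i][0] + acc[i - 1][0]
        path[i][0] = (i - 1, 0)
    for j in range(1, cols):
        acc[0][j] = dist[0][j] + acc[0][j - 1]
        path[0][j] = (0, j - 1)

    # Step 2: choose the cheapest of the three admissible predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            best = min(candidates, key=lambda p: acc[p[0]][p[1]])
            acc[i][j] = dist[i][j] + acc[best[0]][best[1]]
            path[i][j] = best

    # Step 3: backtrack from the last lattice point until the stop marker.
    matches = []
    node = (rows - 1, cols - 1)
    while node is not None:
        matches.append(node)
        node = path[node[0]][node[1]]
    matches.reverse()
    return acc[rows - 1][cols - 1], matches
```

When one frame of the first sequence appears twice in the returned path, it is matched with two frames of the second sequence, which is the situation in which a frame would be duplicated to lengthen a syllable.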

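According to the optimal matching condition $H^*$, a distinguishable matrix for syllable $A$, $M^A$, can be estimated as
$$M^A = [m^A(k, i)], \quad 1 \le k \le K,\; 1 \le i \le I, \tag{6}$$
where $K$ is the number of filter bands and $m^A(k, i)$ is the distance for frame $i$ at the $k$th frequency band. $m^A(k, i)$ can be defined as
$$m^A(k, i) = \frac{1}{n_i} \sum_{j:\,(i, j) \in H^*} \left\| x_i^k - y_j^k \right\|, \tag{7}$$
where $x_i^k$ and $y_j^k$ are the MFCC vectors of the $k$th frequency band and $n_i$ is the number of frames in $Y$ that are matched with frame $i$ in $X$.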
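A minimal Python sketch of this per-band estimate is given below. It assumes that the distance for a frame is the average Euclidean distance over all frames of the other syllable matched to it, that the matching path uses 0-based (i, j) pairs, and that the band-wise MFCCs are stored as nested lists indexed by band and then by frame; the name `band_distinctiveness` is hypothetical.

```python
import math


def band_distinctiveness(x_bands, y_bands, path):
    """Return a K x I matrix of per-band, per-frame distances for the first syllable.

    x_bands[k][i] and y_bands[k][j] are MFCC vectors of band k;
    path is a list of matched (i, j) frame pairs.
    """
    num_frames = len(x_bands[0])

    # Collect, for each frame i, the frames j matched with it.
    matched = [[] for _ in range(num_frames)]
    for i, j in path:
        matched[i].append(j)

    matrix = []
    for xk, yk in zip(x_bands, y_bands):
        row = []
        for i in range(num_frames):
            dists = [math.dist(xk[i], yk[j]) for j in matched[i]]
            # Average over the matched frames; 0.0 if a frame is unmatched.
            row.append(sum(dists) / len(dists) if dists else 0.0)
        matrix.append(row)
    return matrix
```

Averaging rather than summing keeps frames that are matched with several frames from being over-weighted, which is a design choice of this sketch.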

2.3. Spectral Distinctiveness Identification

Once the spectral distinctiveness between two syllables has been estimated, the regions of spectral distinctiveness should be identified from the distinguishable matrix $M$. The grayscale morphological gradient operator is a powerful and fast technique for both contour detection and region-based segmentation [22]. Thus, it can be used to detect the regions of spectral distinctiveness. To obtain the regions of spectral distinctiveness from $M$, the morphological gradient operator is used, which in its standard form is defined as
$$G_i(M) = (M \oplus B_i) - (M \ominus B_i), \tag{8}$$
where $i$ is the scale of the morphological gradient operator and $B_i$ denotes the group of square structuring elements.

In (8), the symbols $\oplus$ and $\ominus$ are the grayscale dilation and grayscale erosion, which for an image $f$ are defined as follows:
$$(f \oplus b)(s, t) = \max\{ f(s - x, t - y) \mid (s - x, t - y) \in D_f,\ (x, y) \in D_b \},$$
$$(f \ominus b)(s, t) = \min\{ f(s + x, t + y) \mid (s + x, t + y) \in D_f,\ (x, y) \in D_b \}, \tag{9}$$
where $b$ is a flat structuring element. $D_f$ and $D_b$ are the domains of $f$ and $b$, respectively. According to (9), the grayscale opening and closing can then be derived as
$$f \circ b = (f \ominus b) \oplus b, \qquad f \bullet b = (f \oplus b) \ominus b. \tag{10}$$
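The grayscale dilation, erosion, and gradient can be sketched in Python for a flat square structuring element. This is a single-scale illustration under stated assumptions: the structuring element has side 2·`half` + 1, windows are clipped at the matrix border, and regions are taken where the gradient reaches a user-chosen threshold. The function names are hypothetical, and the original operator may combine several scales.

```python
def _window_extreme(image, half, pick):
    """Apply pick (max or min) over a square window clipped to the image domain."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = pick(values)
    return out


def dilate(image, half=1):
    """Grayscale dilation with a flat square structuring element."""
    return _window_extreme(image, half, max)


def erode(image, half=1):
    """Grayscale erosion with a flat square structuring element."""
    return _window_extreme(image, half, min)


def morphological_gradient(image, half=1):
    """Dilation minus erosion: large where values change within the window."""
    dilated, eroded = dilate(image, half), erode(image, half)
    return [[d - e for d, e in zip(dr, er)] for dr, er in zip(dilated, eroded)]


def threshold_regions(gradient, threshold):
    """Boolean mask of entries whose gradient reaches the threshold."""
    return [[g >= threshold for g in row] for row in gradient]
```

Applied to a distinguishable matrix, the mask marks the time-frequency cells around which the distance changes sharply; the threshold is a free parameter of this sketch.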

2.4. Acoustic Cues Manipulation

The accuracy of the identified regions of spectral distinctiveness is examined in this subsection. To this end, the power of the spectrogram in these regions is manipulated to check whether one syllable can be converted into another. Thus, a speech modification procedure based on the short-time Fourier transform (STFT) is proposed to analyze a speech sound and then synthesize an enhanced speech sound [23].

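Let $s(n)$ denote the speech signal at sample times $n$. A Hamming window $w(n)$ is used to divide $s(n)$ into $N$-point overlapping frames, which can be written as
$$s_m(n) = w(n)\, s(n + mR), \quad 0 \le n \le N - 1, \tag{11}$$
where $R$ is the step size and $w(n)$ is defined as $w(n) = 0.54 - 0.46 \cos\left(2\pi n / (N - 1)\right)$. Therefore, the resulting STFT coefficients can be derived as
$$S_m(k) = \sum_{n=0}^{N-1} s_m(n)\, e^{-j 2\pi k n / N}. \tag{12}$$
To improve the accuracy of modification, the windowed speech is zero-padded before performing the Fourier transform.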

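The region of spectral distinctiveness is then modified by multiplying the STFT coefficients $S_m(k)$ in that region by a specific gain $g$. Specifically, $g = 0$ indicates feature removal, $0 < g < 1$ corresponds to feature attenuation, and $g > 1$ represents feature enhancement. Thus, the modified speech spectrum can be written as
$$\hat{S}_m(k) = \begin{cases} g\, S_m(k), & (m, k) \text{ in the region of spectral distinctiveness}, \\ S_m(k), & \text{otherwise}. \end{cases} \tag{13}$$
Generally, the gain is expressed in dB as
$$G = 20 \log_{10} g. \tag{14}$$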
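The gain step can be sketched in Python as follows. The sketch assumes that the gain acts on the amplitude of the STFT coefficients, so that a gain in dB converts as 20 log10 of the linear gain, and that the region is given as a Boolean mask over frames and frequency bins; the names `db_to_gain`, `gain_to_db`, and `apply_region_gain` are hypothetical.

```python
import math


def db_to_gain(gain_db):
    """Convert a gain in dB to a linear amplitude factor (assumed 20*log10 form)."""
    return 10.0 ** (gain_db / 20.0)


def gain_to_db(gain):
    """Convert a positive linear amplitude factor to dB."""
    return 20.0 * math.log10(gain)


def apply_region_gain(spectrum, region, gain_db):
    """Scale the coefficients inside the masked region and leave the rest unchanged.

    spectrum[m][k] is the (real or complex) coefficient of frame m, bin k;
    region[m][k] is True where the coefficient belongs to the region.
    """
    gain = db_to_gain(gain_db)
    return [
        [value * gain if region[m][k] else value for k, value in enumerate(frame)]
        for m, frame in enumerate(spectrum)
    ]
```

Under this convention a gain of 6 dB roughly doubles the amplitude inside the region, and negative dB values attenuate it.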

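According to the modified spectrum $\hat{S}_m(k)$, the single-frame signal is recovered by applying an inverse Fourier transform, which is defined as follows:
$$\hat{s}_m(n) = \frac{1}{N} \sum_{k=0}^{N-1} \hat{S}_m(k)\, e^{j 2\pi k n / N}. \tag{15}$$
Finally, an overlap-add synthesis is used to generate the modified speech in the time domain, which can be written as
$$\hat{s}(n) = \sum_{m} \hat{s}_m(n - mR). \tag{16}$$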
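The analysis and synthesis chain can be sketched with a direct (slow) discrete Fourier transform in Python. This is an illustration under stated assumptions rather than the original implementation: no zero-padding is applied, and the overlap-add output is divided by the accumulated window so that an unmodified spectrum reproduces the input, which is a normalization added here for the sketch. The names `hamming`, `dft`, `stft`, and `overlap_add` are hypothetical.

```python
import cmath
import math


def hamming(n_points):
    """Symmetric Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def dft(frame, inverse=False):
    """Direct DFT (or inverse DFT with 1/N scaling) of one frame."""
    n = len(frame)
    sign = 1 if inverse else -1
    out = [
        sum(frame[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def stft(signal, n_points, step):
    """Window the signal into overlapping frames and transform each frame."""
    window = hamming(n_points)
    return [
        dft([window[t] * signal[start + t] for t in range(n_points)])
        for start in range(0, len(signal) - n_points + 1, step)
    ]


def overlap_add(frames, n_points, step, length):
    """Inverse-transform each frame, overlap-add, and undo the window weighting."""
    window = hamming(n_points)
    out = [0.0] * length
    norm = [0.0] * length
    for m, spectrum in enumerate(frames):
        frame = dft(spectrum, inverse=True)
        for t in range(n_points):
            out[m * step + t] += frame[t].real
            norm[m * step + t] += window[t]
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, norm)]
```

Scaling selected coefficients between `stft` and `overlap_add` yields the manipulated signal; a fast Fourier transform would replace `dft` in any practical setting.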

3. Results and Discussions

To evaluate the proposed approach, a closed-set speech-perception test on stop consonants was performed in this study. The speech stimuli, including the syllables /da, ga, ka, ba, pa, ta/, were chosen from the University of Pennsylvania’s Linguistic Data Consortium (LDC) corpus LDC2005S22. The detailed experimental results are presented as follows.

3.1. Results of Manipulating /ta/ and /ka/

In this subsection, the syllables /ta/ and /ka/ are used to illustrate the results of the proposed approach. First, the regions of spectral distinctiveness were manually identified to check the results of our approach. The spectrograms of /ta/ and /ka/ are shown in Figure 2. /ta/ has a high-frequency burst above 4 kHz (marked with a black rectangle), and /ka/ has a low-frequency burst at about 1 kHz (marked with a black rectangle). These two regions should be very important for distinguishing /ta/ from /ka/.

Second, the results of the distance matrix and the dynamic time warping algorithm are examined. The distance matrix estimated from /ta/ and /ka/ is shown in Figure 3(a). Because the FINALs of /ta/ and /ka/ are the same, the distances between the speech segments of the FINALs are very small. The distances between the speech segments of the FINALs and those of the INITIALs are much larger. Further, the speech segments of the INITIALs of /ta/ and /ka/ are also quite different. An optimal path clearly exists along the 45-degree diagonal. As shown in Figure 3(b), this optimal matching path is detected by the dynamic time warping algorithm, which demonstrates that the mismatch between two syllables in the time domain can be handled by our approach. Besides, the first frame and the third frame of /ka/ each match with three frames of /ta/; thus, the first frame and the third frame of /ka/ should each be duplicated three times when /ka/ is manipulated into /ta/. This also shows that our approach can be applied to correctly increase the duration of a syllable.

Third, the spectral distinctiveness between /ta/ and /ka/ measured by our approach is validated against the manually identified results. With the optimal matching path (shown in Figure 3(b)), the MFCCs of /ta/ and /ka/ in different frequency bands were used to estimate the spectral distinctiveness. The results were normalized in the time domain and are shown in Figure 4. A comparison of Figures 4 and 2 shows that the estimated distinctiveness closely follows the differences visible in the spectrograms. By selecting a suitable threshold for the morphological gradient operator, the regions of spectral distinctiveness for /ta/ and /ka/ can be identified (shown in Figure 5); they are very similar to the expected regions (shown in Figure 2).

Finally, the regions of spectral distinctiveness were manipulated so that their accuracy could be examined by listeners. According to the identified regions in Figure 5, the acoustic cue manipulation was applied to modify /ta/ and /ka/. /ta/ and /ka/ were converted to /ka/ (denoted as /ta → ka/) and /ta/ (denoted as /ka → ta/), respectively. The spectrograms of /ta → ka/ and /ka → ta/ are shown in Figure 6. Comparing Figure 6(a) with Figure 2(b), the spectral energy at about 1 kHz is relatively decreased. Comparing Figure 6(b) with Figure 2(a), the spectral energy above 4 kHz is increased. As a result, the manipulated /ta/ and /ka/ are heard as /ka/ and /ta/, respectively. The identified regions therefore play an important role in distinguishing one syllable from another.

3.2. Experimental Results of Subject Evaluation

In this subsection, the results for the manipulated syllables are used to examine the identified regions of spectral distinctiveness. Seven males and three females (college students, age about 30 years) participated in this study. Each token, with and without manipulation, was randomly presented to each subject 5 times. The speech stimuli were played at the most comfortable level (around 70 dB SPL) for the listeners. The gain in (14) was set to 3 dB, 6 dB, 9 dB, and 12 dB. After each presentation, subjects responded to the stimulus by clicking on one of two buttons labeled with syllables. The detailed results in recognition rate (%) are shown in Table 1. The experimental results show that the recognition rate is over 86%. Moreover, the average recognition rates of the manipulated syllables are 89.53%, 91.27%, 92.87%, and 92.80% for gains of 3 dB, 6 dB, 9 dB, and 12 dB, respectively. The best recognition rate is achieved when the gain is set to 9 dB; with the larger gain of 12 dB the rate drops slightly, which suggests that speech intelligibility becomes distorted.

To objectively compare these results, the syllables without manipulation were also tested, and the results are shown in Table 2. The average recognition rate is 94.93%, which is very similar to that of the syllables with manipulation. This means that a syllable can be heard as another syllable when these regions of spectral distinctiveness are manipulated. Hence, the identified regions of spectral distinctiveness play an important role in speech perception. SLPs can then apply the identified regions of spectral distinctiveness to help a subject with hearing loss increase his or her ability to distinguish one syllable from another, thereby facilitating the process of speech-perception training.

4. Conclusions

In this study, an objective approach is proposed to identify the regions of spectral distinctiveness between two syllables. The MFCCs are appropriate for representing not only the speech signal but also the speech components in different frequency bands. In addition, the use of dynamic time warping overcomes the mismatch between two speech signals in the time domain. According to the optimal matching condition, the spectral distinctiveness of each frequency band between two syllables is estimated by using Euclidean metrics. The regions of spectral distinctiveness are then identified by the morphological gradient operator. The experimental results demonstrate that the identified regions play an important role in distinguishing one syllable from another. In the future, the regions of spectral distinctiveness should be automatically enhanced and extensively used in speech-perception training, which could reduce the workload of SLPs and facilitate the process of developing speech perception.

Acknowledgment

The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this work under Contract NSC 102-2221-E-218-001.