Abstract

In spoken word recognition, one of the crucial tasks is to identify the vowel phonemes. This paper describes an Artificial Neural Network (ANN) based algorithm developed for the segmentation and recognition of the vowel phonemes of the Assamese language from words containing those vowels. A Self-Organizing Map (SOM), trained with various numbers of iterations, is used to segment a word into its constituent phonemes. A Probabilistic Neural Network (PNN), trained with clean vowel phonemes, is then used to recognize the vowel segment among the six SOM-segmented phonemes. An important aspect of the proposed algorithm is that it validates the recognized vowel by checking its first formant frequency. The first formant frequency of every Assamese vowel is predetermined by estimating the pole (formant) locations from the linear prediction (LP) model of the vocal tract. The proposed algorithm shows a high recognition performance in comparison to conventional Discrete Wavelet Transform (DWT) based segmentation.

1. Introduction

Most languages, including Assamese, have a fixed number of vowel phonemes. Vowel sounds play the most significant role in the production of different words. To develop an effective speech recognition system, it is important to recognize the vowel phonemes first, and the first step towards vowel recognition is to segment the vowel phonemes from the word. Speech segmentation is usually performed using constant time blocks, for example, fixed length windows [1]. But constant segmentation risks losing information about the exact phoneme boundary. A more satisfactory approach is Discrete Wavelet Transform (DWT) based speech segmentation. Wavelet decomposition is used as a segmentation technique in various biomedical applications such as automatic EEG artifact removal, fetal electrocardiogram (fECG) extraction, muscle activation detection, and so forth; a few such works are reported in [2–6]. In speech segmentation, DWT can easily extract speech parameters that take into account the properties of the human hearing system. However, the success rate obtained by DWT based segmentation can be improved. This work presents a novel vowel segmentation algorithm based on two different Artificial Neural Network (ANN) structures, one unsupervised and one supervised. A Self-Organizing Map (SOM) trained with six different iteration counts is used to segment the word whose vowel parts are to be recognized. The trained SOM provides six sets of weights, and any one of the six weight vectors thus obtained may represent the vowel phoneme. Which particular weight vector is the vowel segment is identified by pattern matching with a few two-class Probabilistic Neural Network (PNN) blocks, trained with clean vowel phonemes uttered by five female and five male speakers. The recognized vowel phoneme is validated by checking its first formant frequency (F1). The F1 values for all the vowels of the Assamese language are calculated a priori by locating the poles (formants) of the linear prediction (LP) model of the vocal tract. The sample Assamese words and clean vowel sounds used in this work are recorded from ten native Assamese speakers, male and female. Assamese, a language of North-East India that is demanding in both pronunciation and speaking rate, is considered a suitable case study for speech researchers.

This paper provides a comparative analysis between vowel segmentation based on conventional DWT and the proposed ANN aided approach. Experimental results prove the superiority of the proposed method. The rest of the description is organized as follows. Section 2 briefly provides the phonemical details of the Assamese language. Sections 3 and 4 give brief accounts of the linear prediction model of speech and the DWT based vowel segmentation, respectively. The proposed SOM based segmentation algorithm and the PNN and F1 based recognition part are described in Sections 5 and 6, respectively. The results and related discussion are included in Section 7. Section 8 concludes the description.

2. Certain Phonemical Features of Assamese Language

Assamese is an Indo-Aryan language that originated from the Vedic dialects, and it is therefore a sister of all the northern Indian languages. Although the exact nature of the origin and growth of the language is yet to be fully established, it is supposed that, like other Aryan languages, Assamese was born from the Apabhraṃśa dialects that developed from the Māgadhī Prakrit of the eastern group of Sanskritic languages [7]. While retaining certain features of its parent Indo-European family, it has acquired many unique phonological characteristics that make Assamese speech distinctive and hence call for a study exclusively directed towards the development of a speech recognition/synthesis system in Assamese [8].

There are twenty-three consonant and eight vowel phonemes in standard colloquial Assamese. The Assamese vowel phoneme table, obtained from [7], is shown in Table 1. The eight vowels present three different types of contrasts. First, they present an eight-way contrast in closed syllables, and in open syllables when /i/ or /u/ does not follow in the next immediate syllable with the intervention of a single consonant other than a nasal. Second, they show a six-way contrast in open syllables with /i/ occurring in the immediately following syllable with the intervention of any single consonant except the nasals, or except with nasalization. Finally, they show a five-way contrast in open syllables when /u/ occurs in the immediately following syllable with a single consonant intervening [7].

3. Linear Prediction Model of Speech

The speech signal is produced by the action of the vocal tract over the excitation coming from the glottis. Different conformations of the vocal tract produce different resonances that amplify frequency components of the excitation, resulting in different sounds [9]. The resonance frequencies resulting from a particular configuration of the articulators are instrumental in forming the sound corresponding to a given phoneme and are called the formant frequencies of the sound [10].

The linear predictive model is based on a mathematical approximation of the vocal tract. At a particular time $n$, the speech sample $s[n]$ is represented as a linear combination of the $p$ previous samples, $\hat{s}[n] = \sum_{k=1}^{p} a_k s[n-k]$. In the LP model of speech, each complex pole pair corresponds to a second order resonator. The resonance frequency of each pole is associated with a peak in spectral energy or a formant candidate. The pole radius is related to the concentration of local energy and the bandwidth of a formant candidate [11].

The source filter model of the vocal tract system can be represented by a discrete time linear time invariant filter. The short-time frequency response of the linear system simulates the frequency shaping of the vocal tract system, and since the vocal tract changes shape relatively slowly, it is reasonable to assume that the linear system response does not vary over time intervals on the order of 10 ms or so. Thus, it is common to characterize the discrete time linear system by a system function of the form

$$H(z) = \frac{\sum_{k=0}^{q} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \tag{1}$$

where the filter coefficients $\{a_k\}$ and $\{b_k\}$ change at a rate on the order of 50 to 100 times/s. The poles of the system function that lie close to the unit circle create resonances that model the formant frequencies [10]. Over short-time intervals, the linear system as given by (1) can be described by an all-pole system function of the form

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}. \tag{2}$$

The major advantage of this model is that the gain parameter $G$ and the filter coefficients $\{a_k\}$ can be estimated in a very straightforward and computationally efficient manner by the method of linear predictive analysis.

The linear prediction error sequence $e[n] = s[n] - \sum_{k=1}^{p} a_k s[n-k]$ has the form of the output of an FIR linear system whose system function is

$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}. \tag{3}$$

Thus, the prediction error filter, $A(z)$, is an inverse filter for the system $H(z)$, that is,

$$H(z) = \frac{G}{A(z)}. \tag{4}$$

According to (4), the zeros of $A(z)$ are the poles of $H(z)$. Therefore, if the model order $p$ is chosen judiciously, then it can be expected that roughly $p/2$ of the roots will be close in frequency to the formant frequencies [10].

The pole locations of the prediction polynomial can be calculated by solving for the roots of the equation $A(z) = 0$. Each pair of complex roots is used to calculate the corresponding formant frequency and bandwidth. A complex root pair $z_i = r_i e^{\pm j\theta_i}$ and the sampling frequency $f_s$ are transformed to the formant frequency $F_i$ and 3 dB bandwidth $B_i$ by the following pair of equations [11]:

$$F_i = \frac{\theta_i}{2\pi} f_s, \tag{5}$$

$$B_i = -\frac{f_s}{\pi} \ln r_i. \tag{6}$$
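
As an illustration of this procedure, the following is a minimal Python sketch of formant estimation from the LP model, assuming an 8 kHz speech frame stored in a NumPy array. The helper names, analysis parameters, and the candidate-pruning thresholds (90 Hz minimum frequency, 400 Hz maximum bandwidth) are our own illustrative choices, not values from the paper.

    import numpy as np

    def lpc_coefficients(frame, order):
        # Autocorrelation-method LPC via the Levinson-Durbin recursion;
        # returns the inverse-filter coefficients [1, a1, ..., ap] of A(z).
        n = len(frame)
        r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k
        return a

    def formants(frame, fs=8000.0, order=8):
        # Formant candidates from the roots of the prediction polynomial
        # A(z): F_i = (theta_i / 2*pi) * fs, B_i = -(fs / pi) * ln(r_i).
        a = lpc_coefficients(frame, order)
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]            # one per conjugate pair
        freqs = np.angle(roots) * fs / (2.0 * np.pi)
        bands = -np.log(np.abs(roots)) * fs / np.pi
        keep = (freqs > 90.0) & (bands < 400.0)      # drop extraneous poles
        return np.sort(freqs[keep])

The first element of the returned array serves as the F1 estimate used by the recognition algorithm of Section 6.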

4. The DWT Based Segmentation

The DWT provides a time-scale representation of the signal to be analyzed by passing it through filters with different cut-off frequencies at different scales. Wavelets can be realized by iteration of filters with rescaling. The resolution of the signal, which is a measure of the amount of detailed information in the signal, is determined by the filtering operations, and the scale is determined by upsampling and downsampling operations [12–14]. In this work, we have performed a six-level decomposition of the speech signal, which covers the frequency band of the human voice.

Figure 1 shows the DWT segmentation block. The DWT segmented phonemes are stored as Dw1, Dw2, Dw3, Dw4, Dw5, and Dw6, which are later used for recognizing the vowel phonemes by the recognition algorithm (Section 6). At each level, the signal is reconstructed by setting all wavelet coefficients to zero except those of the level at which the signal is to be reconstructed. It is observed that when the decomposed speech signal is reconstructed at the various levels, a different part of the signal is obtained at each level. The work uses the Daubechies wavelet as the mother wavelet function, with its four associated decomposition and reconstruction orthogonal wavelet filters. For a smaller order of the Daubechies wavelet, we get a shorter wavelet and better time resolution; however, the frequency response of a low order wavelet has many sidelobes. By increasing the order, we get a smoother version of the mother wavelet, which is better for analyzing the voiced signal. Therefore, following [12], we choose a Daubechies wavelet of order 10.
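
A minimal sketch of this six-level db10 decomposition, using the PyWavelets package, is given below; the function and variable names are illustrative, and the segments are returned following the paper's Dw1–Dw6 naming.

    import numpy as np
    import pywt

    def dwt_segments(signal, wavelet="db10", levels=6):
        # Decompose, then rebuild the signal at each level by zeroing
        # every coefficient array except the one being reconstructed.
        coeffs = pywt.wavedec(signal, wavelet, level=levels)
        segments = []
        for i in range(1, len(coeffs)):              # detail bands cD6..cD1
            masked = [np.zeros_like(c) for c in coeffs]
            masked[i] = coeffs[i]                    # keep only this level
            segments.append(pywt.waverec(masked, wavelet)[:len(signal)])
        return segments                              # Dw1 .. Dw6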

5. The Proposed SOM Based Segmentation Algorithm

SOM has a special property of effectively creating spatially organized "internal representations" of the various features of input signals and their abstractions [15]. SOMs can also be considered a data visualization technique; that is, they reveal some underlying structure of the data [16]. This idea is used in our vowel segmentation technique. The weight vectors obtained by training a one-dimensional SOM with the LPC features of a word containing the vowel to be segmented are used in the work. By training the same SOM for various numbers of iterations, we get different weight vectors, each of which is considered a segment of a different phoneme constituting the word. The work covers all the Assamese vowels. The SOM weight vector extraction algorithm is summarized in the block diagram of Figure 2, and the algorithm for a particular iteration count is stated in Algorithm 1. The SOM weight vectors thus extracted are stored as SW1, SW2, SW3, SW4, SW5, and SW6. The SOM's role is to provide segmentation boundaries for the phonemes; here, six different segmentation boundaries are obtained from six separate sets of weight vectors. The weight vectors thus obtained are used by the PNN to recognize the vowel phoneme, as described in Section 6.

(1) Input: Spoken word $s$ of size $1 \times S$, sampling
    frequency $f_s$, duration $T$ seconds
(2) Preprocess the signal using the preprocessing
    algorithms described in Section 7.1
(3) Initialize $p$, the order of linear prediction
(4) Find the coefficients of a $p$th-order linear predictor FIR filter,
    $\hat{s}[n] = a_1 s[n-1] + a_2 s[n-2] + \cdots + a_p s[n-p]$,
    that predicts the current value of the real-valued
    preprocessed time series $s$, based on past samples
(5) Store the resulting LPC feature vectors $a(k)$
(6) Take a topology map whose neurons are arranged in
    a hexagonal pattern
(7) Initialize the weights $w_j$ to small random numbers
(8) Initialize the learning parameter $\eta$ and the neighborhood $N_c$
(9) for $k = 1$ to $K$ (the iteration count) do
      pick $a(k)$
      find the winning neuron as $i(x) = \arg\min_j \|a(k) - w_j\|$
      update the synaptic vectors of the winning cluster,
        $w_j(k+1) = w_j(k) + \eta(k)\,[a(k) - w_j(k)]$
      update $\eta$, $N_c$
    end for
(10) Store the updated weight as SW$_i$, where $i = 1, \ldots, 6$
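
The sketch below illustrates Algorithm 1 in Python under simplifying assumptions: the LPC features are computed with the lpc_coefficients helper from Section 3, the map is reduced to a small one-dimensional layer of six neurons, and the neighborhood shrinking step is omitted for brevity. The frame sizes, learning rate schedule, and neuron count are our own illustrative choices.

    import numpy as np

    def lpc_features(signal, order=8, frame=256, hop=128):
        # One LPC coefficient vector a(k) per analysis frame of the word.
        feats = []
        for start in range(0, len(signal) - frame, hop):
            a = lpc_coefficients(signal[start:start + frame], order)
            feats.append(a[1:])                  # drop the leading 1 of A(z)
        return np.array(feats)

    def som_weights(features, n_iter, n_neurons=6, eta0=0.5, seed=0):
        # Competitive training of a one-dimensional SOM for n_iter steps.
        rng = np.random.default_rng(seed)
        w = 0.01 * rng.standard_normal((n_neurons, features.shape[1]))
        for k in range(n_iter):
            x = features[rng.integers(len(features))]      # pick a(k)
            win = int(np.argmin(np.linalg.norm(w - x, axis=1)))
            eta = eta0 * np.exp(-k / n_iter)               # decaying rate
            w[win] += eta * (x - w[win])                   # update winner
        return w

    # Six weight sets from six training lengths (Section 7.3):
    # SW = [som_weights(F, n) for n in (50, 100, 1000, 1500, 2000, 3000)]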

6. The Recognition Algorithm

Four PNNs trained with clean vowel phonemes are used to identify the vowel phoneme from the six segments obtained by DWT or SOM based segmentation. An advantage of the PNN is that it is guaranteed to approach the Bayes optimal decision; the Bayesian decision criterion is the optimal classification procedure if the pdfs of the classes to be discriminated are known a priori. The PNN is also among the fastest networks, since its learning process requires only localized parameter adaptations with a single pass through the training set [17]. Here, a two-class PNN problem is considered: four PNNs, named PNN1, PNN2, PNN3, and PNN4, are each trained with two clean vowel phonemes; that is, the output classes of PNN1 are /i/ and /u/, the output classes of PNN2 are /e/ and /o/, and so forth. These four PNNs are used sequentially to identify the segmented vowel phonemes. Clean vowels recorded from five male and five female speakers are used as the inputs in the input layer of the PNN and provided to each neuron of the PNN pattern layer. The PNN learning algorithm is stated in Algorithm 2.

(1) Statement: Classify input vowel patterns $x$ into
two categories of vowel, VOWEL-A and VOWEL-B
(2) Initialize: Smoothing parameter $\sigma$
(determined from observation of successful learning)
(3) Output of each pattern unit, $z_i = x \cdot w_i$
($w_i$ is the weight vector)
(4) Find the neuron activation by performing a nonlinear operation of the form
            $g(z_i) = \exp[(z_i - 1)/\sigma^2]$
(5) Sum all the $g(z_i)$ for category VOWEL-A and do the same for category VOWEL-B
(6) Take a binary decision for the two summation outputs, with variable weight $C$ given by
            decide VOWEL-A if $\sum_i g(z_i^A) \geq C \sum_j g(z_j^B)$, where $C = \dfrac{h_B l_B}{h_A l_A}\cdot\dfrac{n_A}{n_B}$,
where,
$h_A$ and $h_B$ = a priori probabilities of occurrence of patterns from VOWEL-A and VOWEL-B, respectively,
$l_A$ and $l_B$ = losses associated with a wrong decision,
$n_A$ and $n_B$ = number of patterns in
VOWEL-A and VOWEL-B, respectively, which is 10 for both categories
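
The following is a minimal Python sketch of one such two-class PNN, assuming the patterns are feature vectors of equal length. The Gaussian kernel used here replaces the dot-product activation of Algorithm 2, which is equivalent for unit-norm patterns; the default priors, losses, and smoothing parameter are illustrative.

    import numpy as np

    def pnn_classify(x, patterns_a, patterns_b, sigma=0.1,
                     ha=0.5, hb=0.5, la=1.0, lb=1.0):
        # Parzen-window class likelihoods with a Gaussian kernel,
        # followed by the weighted Bayes decision of Algorithm 2.
        def class_sum(patterns):
            d2 = np.sum((np.asarray(patterns) - x) ** 2, axis=1)
            return np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))
        na, nb = len(patterns_a), len(patterns_b)  # 10 per class here
        fa = ha * la * class_sum(patterns_a) / na
        fb = hb * lb * class_sum(patterns_b) / nb
        return "VOWEL-A" if fa >= fb else "VOWEL-B"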

A two-step decision is taken by the recognition algorithm: first, the vowel segment is matched against the PNN patterns; then its first formant frequency F1 is checked to verify that it lies in the predetermined range. The PNN and F1 based recognition algorithm for a particular vowel /i/ is stated in Algorithm 3. The F1 of every phoneme is pre-estimated as explained in Section 7.2.

(1) Input: Speech $s$ of size $1 \times S$, sampling frequency $f_s$, duration $T$ seconds
(2) Preprocess the signal using the preprocessing algorithms described in Section 7.1
(3) Obtain Dw1, Dw2, Dw3, Dw4, Dw5, Dw6 using the DWT
    based segmentation described in Section 4, or obtain SW1, SW2, SW3, SW4, SW5, SW6 using the SOM weight
    vector extraction algorithm described in Section 5
(4) Find the first formant frequency of Dw1, Dw2, Dw3, Dw4,
Dw5, Dw6 and store as $F_{D1}, F_{D2}, F_{D3}, F_{D4}, F_{D5}, F_{D6}$, or find the first formant frequency of SW1, SW2, SW3, SW4,
SW5, SW6 and store as $F_{S1}, F_{S2}, F_{S3}, F_{S4}, F_{S5}, F_{S6}$
(5) Load PNN1
(6) Decide VOWEL-A
if Dw1 = VOWEL-A and $F_{D1}$ = F1 of vowel /i/
else if
Dw2 = VOWEL-A and $F_{D2}$ = F1 of vowel /i/
else if
Dw3 = VOWEL-A and $F_{D3}$ = F1 of vowel /i/
else if
Dw4 = VOWEL-A and $F_{D4}$ = F1 of vowel /i/
else if
Dw5 = VOWEL-A and $F_{D5}$ = F1 of vowel /i/
else if
Dw6 = VOWEL-A and $F_{D6}$ = F1 of vowel /i/
else Decide
"Not Assamese vowel phoneme /i/".
or
Decide VOWEL-A
if SW1 = VOWEL-A and $F_{S1}$ = F1 of vowel /i/
else if
SW2 = VOWEL-A and $F_{S2}$ = F1 of vowel /i/
else if
SW3 = VOWEL-A and $F_{S3}$ = F1 of vowel /i/
else if
SW4 = VOWEL-A and $F_{S4}$ = F1 of vowel /i/
else if
SW5 = VOWEL-A and $F_{S5}$ = F1 of vowel /i/
else if
SW6 = VOWEL-A and $F_{S6}$ = F1 of vowel /i/
else Decide
"Not Assamese vowel phoneme /i/".
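
A compact Python sketch of this two-step decision, reusing the helpers sketched in the earlier sections, is shown below. The mean LPC vector used as the PNN input and the F1 interval for /i/ are illustrative assumptions; the paper's measured F1 ranges are those of Table 2.

    # Pre-estimated F1 band per vowel (Section 7.2); placeholder values.
    F1_RANGES = {"/i/": (250.0, 450.0)}

    def recognize(segments, vowel, patterns_a, patterns_b, fs=8000.0):
        # Step 1: PNN pattern match; step 2: first-formant validation.
        lo, hi = F1_RANGES[vowel]
        for seg in segments:                     # Dw1..Dw6 or SW1..SW6
            feat = lpc_features(seg).mean(axis=0)
            cand = formants(seg, fs)
            f1 = cand[0] if len(cand) else 0.0
            if (pnn_classify(feat, patterns_a, patterns_b) == "VOWEL-A"
                    and lo <= f1 <= hi):
                return vowel
        return "Not Assamese vowel phoneme " + vowel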

7. Experimental Details

The work is carried out as per the flow diagram of Figure 3 and covers only the vowel phonemes. The experimental speech samples are recorded from five female speakers and five male speakers. At first, clean vowel phonemes of 2 seconds duration are recorded at 8000 samples/second and 16 bits/sample, which results in a total of 10 broad sets of speech signals. In the second phase, two-letter Assamese words are recorded with some other female and male speakers. These words are later segmented to obtain the first phoneme. The following sections describe the various experimental steps and results.

7.1. Preprocessing

The preprocessing of the speech signal consists of two operations: smoothing of the signal by median filtering and removal of the silent part by a threshold method. Although the speech signals are recorded in a noise free environment, the presence of some unwanted spikes is observed. Therefore, a median filtering operation is performed on the raw speech signals so that the vowel segmentation does not suffer from any unwanted frequency components [18].

The smoothed signal contains both speech and nonspeech parts. The nonspeech or silent part occurs in a speech signal due to the time spent by the speaker before and after uttering the speech, and this time information is redundant for vowel segmentation. The silent part ideally has zero intensity, but in practice it is observed that, even after smoothing, the silent part of the speech signal has an amplitude of about 0.04. Our preprocessing algorithm uses this value as a threshold, as shown in Algorithm 4. Thus a pure signal containing only the necessary speech part is obtained.

(1) Input: Speech $s$ of size $1 \times S$, sampling frequency $f_s$,
duration $T$ seconds
(2) Output: The speech part of the signal of new size $1 \times S'$,
duration $T'$
(3) for $n = 1$ to $S$ do
         for $m = -w$ to $w$ do
(4)          collect the window samples $s[n+m]$ and set $x[n]$ to their median
(5) end for
(6) end for
(7) Initialize threshold $th \leftarrow 0.04$
(8) $n_1 \leftarrow 0$ (start of the speech part)
(9) $n_2 \leftarrow S$ (end of the speech part)
(10) Initialize $n \leftarrow 1$
(11) for $n = 1$ to $S$ do
if $|x[n]| > th$ then
if $n_1 = 0$ then $n_1 \leftarrow n$; $n_2 \leftarrow n$
end if
end for
(12) Initialize output index $k \leftarrow 1$
(13) for $n = n_1$ to $n_2$ do
if $|x[n]| > th$ then $y[k] \leftarrow x[n]$, $k \leftarrow k + 1$
else discard $x[n]$ as silence
end if
end for
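
A minimal Python equivalent of this preprocessing stage is sketched below; it trims the leading and trailing silence, assuming a normalized-amplitude signal. The median filter kernel size is an illustrative choice, while the 0.04 silence threshold follows the text above.

    import numpy as np
    from scipy.signal import medfilt

    def preprocess(signal, threshold=0.04, kernel=5):
        # Median smoothing to remove spikes, then removal of the silent
        # part whose amplitude stays below the empirical threshold.
        x = medfilt(np.asarray(signal, dtype=float), kernel_size=kernel)
        voiced = np.flatnonzero(np.abs(x) > threshold)
        if voiced.size == 0:
            return x[:0]                         # nothing but silence
        return x[voiced[0]:voiced[-1] + 1]       # trim silent ends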

7.2. Estimation of Formant Frequency

The formant frequencies of all eight vowels of the Assamese language are estimated using the LP model of the spectral envelope of the vocal tract. In LP analysis, an all-pole prediction filter models the vocal tract, and the angular positions of the poles of the filter give the formant frequencies [9]. A total of eight clean vowel phonemes and eight segmented vowel phonemes are used, from which an average range of the first and second formants for each vowel phoneme is estimated. The vocal tract spectral envelope for the vowel /i/ spoken by a female speaker is shown in Figure 4.

In LPC, the formant center frequencies and bandwidths can be determined accurately by factoring the predictor polynomial. Since the predictor order $p$ is chosen a priori in LP analysis, the maximum possible number of complex pole pairs that can be obtained is $p/2$. Extraneous poles are easily isolated in LP analysis, since their bandwidths are often very large compared to the formant bandwidths [10, 19]. Thus, the predictor order is an important factor in formant estimation using LP analysis. Five or six vocal tract resonances are enough to represent the formant locations. Another important factor in choosing the prediction order is the sampling frequency of the speech signal, as explained in Section 3 [10]. Keeping these two facts in mind, the predictor order chosen for this work is 8. This means that 8 values are needed: 7 reflection coefficients and the gain.

Table 2 presents the first two formant frequencies of all the vowels. Figure 5 shows a plot of F2 versus F1, which shows that the so-called vowel triangle can be visualized from the estimated formant values.

7.3. Segmentation and Classification Results

As described in Sections 4 and 5, the segmentation is carried out twice using two different methods: first using the conventional DWT based method, and second using the SOM based method. The segmented phonemes are then checked one by one, by pattern matching with the trained PNNs, to find which particular segment represents the vowel phoneme. It is observed that the recognition success rate for the various vowels increases markedly (from 89.6% to 96.27%) with the SOM based segmentation technique. Table 3 summarizes this performance difference. The SOM is trained for six different iteration counts, which provides an equal number of decision boundaries: for 50 training sessions a segment named SW1 is obtained (Figure 2), and similarly the other five segments are obtained for 100, 1000, 1500, 2000, and 3000 sessions. This is somewhat similar to the segmentation carried out using DWT. The DWT provides six levels of decomposition with a faster processing time than the SOM; in the case of the SOM, considerable time is spent in training. But the segments provided by the SOM have better resolution, which subsequently helps the PNN to provide an improved discrimination capability. Table 4 shows the success rates for the various vowels. The computational time of the SOM based algorithm is higher in comparison to the DWT based method, but the recognition rate of the SOM based method is quite satisfactory. With distributed processors and better programming techniques, the computational time of the proposed algorithm can be further reduced.

Another possible approach to phoneme segmentation is the empirical mode decomposition (EMD) technique, which separates time series data into intrinsic oscillations based on local, temporal, and structural characteristics of the data. The EMD technique is an important alternative to traditional time series analysis methods like wavelet analysis [20–22]. Comparing the results obtained from the proposed approach with those obtained from EMD is a suggested future direction for the present work.

8. Conclusion

This paper describes a hybrid neural framework based algorithm for the segmentation and recognition of Assamese vowel phonemes from two-letter Assamese words containing these vowels. Although the recognition time of the proposed algorithm is higher, it shows a distinct advantage in success rate over the DWT based approach. By using distributed processing, the computational time can be further reduced. By developing similar algorithms to recognize the other constituent phonemes, including the initial one, a complete speaker independent speech recognition system can be developed exclusively for the Assamese language.