Abstract

Generally speaking, the performance of a spoken term detection system degrades significantly when there is a mismatch between the acoustic model and spontaneous speech. This paper presents an improved spoken term detection strategy that integrates a novel phoneme confusion matrix and an improved word-level minimum classification error (MCE) training method. The first technique is introduced to improve the spoken term detection rate, while the second is adopted to reject false accepts. On Mandarin conversational telephone speech (CTS), the proposed methods reduce the equal error rate (EER) by 8.4% relative.

1. Introduction

In recent years, there has been an increasing trend towards the use of spoken term detection systems in real-world applications. In such systems, it is desirable to achieve the highest possible spoken term detection rate while minimizing the number of false spoken term insertions. Unfortunately, most speech recognition systems fail to perform well when speakers have a regional accent. Particularly in China, the diversity of Mandarin accents is great and still evolving.

Pronunciation variation has therefore become an important topic. Normally, a confusion matrix is adopted to achieve a higher recognition rate in a speech recognition system. In [1], a confusion matrix is adopted in a spoken document retrieval system, and retrieval performance is improved by exploiting phoneme confusion probabilities. The work in [2] introduces an accent adaptation approach in which a syllable confusion matrix is adopted. Similar approaches are discussed in [3].

The quality of the confusion matrix has a strong influence on the performance of spoken term detection. Building on these traditional approaches, we propose an improved method of generating a phoneme confusion matrix.

MCE is one of the main approaches in discriminative training [4]. In [5], MCE is used to optimize the parameters of the confidence function in a large vocabulary continuous speech recognition (LVCSR) system. The work in [6] introduces MCE into spoken term detection. In this paper, we present an improved MCE training method for calculating spoken term confidence.

The remainder of the paper is structured as follows. Section 2 introduces our baseline system. Section 3 discusses the phoneme confusion matrix based on the confusion network. An improved MCE training method is presented in Section 4. The experiments are reported and discussed in Section 5, and Section 6 draws conclusions from the proposed research.

2. Baseline System

In our baseline system, the search space is generated from all Chinese syllables rather than from the spoken terms themselves. Phoneme recognition is performed without any lexical constraints. Given a spoken input, our decoder outputs the 1-best phoneme sequence, and a phoneme confusion matrix is then used to extract spoken terms.

The main steps of generating the phoneme confusion matrix are as follows [2]; a small sketch of the final counting step is given after the list.
(1) Canonical pin-yin level transcriptions of the accented speech data are obtained first.
(2) A standard Mandarin acoustic recognizer whose output is a pin-yin stream is used to transcribe the accented speech data.
(3) With the help of dynamic programming (DP), these recognized pin-yin level transcriptions are aligned to the canonical pin-yin level transcriptions.
(4) Insertion and deletion errors are discarded, and only substitution errors are considered. Each pin-yin can be divided into two phonemes. Given a canonical phoneme ph_m and an aligned hypothesis ph_n, the confusion probability is computed as

P(ph_n | ph_m) = \frac{count(ph_n | ph_m)}{\sum_{i=1}^{N} count(ph_i | ph_m)},  (2.1)

where count(ph_n | ph_m) is the number of times ph_n is aligned to ph_m, and N is the total number of phonemes in the dictionary.
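To make step (4) concrete, the following Python sketch accumulates the counts of (2.1) from already-aligned substitution pairs. The function name, the `aligned_pairs` layout, and the `phoneme_set` argument are our own illustrative assumptions, not the original implementation.

```python
from collections import defaultdict

def build_confusion_matrix(aligned_pairs, phoneme_set):
    """Estimate P(ph_n | ph_m) of (2.1) from DP-aligned substitution pairs.

    aligned_pairs : iterable of (ph_m, ph_n) pairs, canonical phoneme vs. aligned
                    hypothesis; insertions and deletions are assumed to have been
                    discarded already.
    phoneme_set   : the N phonemes of the dictionary.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ph_m, ph_n in aligned_pairs:
        counts[ph_m][ph_n] += 1

    matrix = {}
    for ph_m in phoneme_set:
        total = sum(counts[ph_m][ph_i] for ph_i in phoneme_set)  # denominator of (2.1)
        matrix[ph_m] = {
            ph_n: (counts[ph_m][ph_n] / total) if total > 0 else 0.0
            for ph_n in phoneme_set
        }
    return matrix
```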

With the 1-best phoneme sequence and the confusion matrix, similarities between phonemes are computed. For each spoken term, the corresponding phonemes are first looked up in the pronunciation dictionary. Then, a sliding window is used to align the phonemes of the spoken term with the 1-best phoneme sequence. The step of the sliding window is set to two because each Chinese syllable consists of two phonemes. An example of searching for "gu zhe" is given in Figure 1.

Given a term φ_1, let φ_2 be the aligned 1-best phoneme sequence. The similarity between them, denoted Sim(φ_1, φ_2), is

Sim(φ_1, φ_2) = \frac{1}{N} \log \left( \prod_{i=1}^{N} P(β_i | α_i) \right),  (2.2)

where α_i and β_i are the i-th phonemes of φ_1 and φ_2, respectively, and N is the number of phonemes in φ_1. A sketch of the sliding-window search using this score is given below.
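The following sketch illustrates the sliding-window search scored with (2.2). The names (`similarity`, `search_term`), the dictionary-of-dictionaries layout of the confusion matrix, the probability floor, and the detection `threshold` are illustrative assumptions.

```python
import math

def similarity(term_phones, hyp_phones, conf_matrix, floor=1e-6):
    """Sim of (2.2): average log confusion probability P(beta_i | alpha_i)."""
    log_sum = 0.0
    for a, b in zip(term_phones, hyp_phones):
        log_sum += math.log(max(conf_matrix[a].get(b, 0.0), floor))
    return log_sum / len(term_phones)

def search_term(term_phones, one_best, conf_matrix, threshold):
    """Slide over the 1-best phoneme sequence with a step of two phonemes
    (one Chinese syllable) and return the windows whose score passes the threshold."""
    hits = []
    n = len(term_phones)
    for start in range(0, len(one_best) - n + 1, 2):
        window = one_best[start:start + n]
        score = similarity(term_phones, window, conf_matrix)
        if score >= threshold:
            hits.append((start, score))
    return hits
```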

The spoken term detection rate improves significantly with the help of the confusion matrix, but at the same time the number of false accepts increases. An effective confidence measure is therefore needed to reject false hypotheses. In this paper, word confidence is calculated with the catch-all model [5]. The confidence score for a hypothesized phoneme ph_i is estimated by

CM(ph_i) = \frac{1}{e[i] - b[i] + 1} \sum_{n=b[i]}^{e[i]} \log p(q(n) | o(n)) = \frac{1}{e[i] - b[i] + 1} \sum_{n=b[i]}^{e[i]} \log \frac{P(o(n) | q(n)) P(q(n))}{P(o(n))},  (2.3)

where b[i] is the start time of ph_i, e[i] is its end time, q(n) is the Viterbi state sequence, and o(n) is the corresponding acoustic observation.

Deriving word-level scores from phoneme scores is a natural extension of the recognition process. We adopt the arithmetic mean in the logarithmic scale, so the spoken term confidence CM_pos is defined as

CM_pos(w) = \frac{1}{m} \sum_{i=1}^{m} CM(ph_i),  (2.4)

where m is the number of phonemes in w.
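A minimal sketch of (2.3) and (2.4), assuming the per-frame log scores log p(q(n) | o(n)) from the catch-all model have already been computed and stored in a list indexed by frame; the function names are ours.

```python
def phoneme_confidence(frame_log_scores, b, e):
    """CM(ph_i) of (2.3): average of the frame-level log scores over [b, e]."""
    return sum(frame_log_scores[b:e + 1]) / (e - b + 1)

def term_confidence(phoneme_confidences):
    """CM_pos(w) of (2.4): arithmetic mean of the phoneme confidences."""
    return sum(phoneme_confidences) / len(phoneme_confidences)
```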

3. Confusion Matrix Based on Confusion Network

As described above, the confusion matrix is generated from the 1-best hypothesis. However, there is a conceptual mismatch between the decoding criterion and the evaluation of confusion probabilities: given an input utterance, a Viterbi decoder finds the best sentence, but this does not ensure that each phoneme is the optimal one. In this paper, we propose an improved method of generating the confusion matrix. Instead of the 1-best phoneme hypothesis, we take hypotheses from a confusion network (CN) [7].

As Figure 2 shows, a CN is composed of several branches. For illustration, the top four hypotheses in each branch are given. The corresponding canonical pin-yin stream is also presented in Figure 2.

Experiments show that the syllable error rate (SER) of the CN is far lower than that of the 1-best sequence. Based on this observation, we believe that the CN provides more useful information. In this paper, we attempt to use the n-best hypotheses of each branch. First, the canonical pin-yin level transcriptions are formatted into a simple CN. Then, recognizer output voting error reduction (ROVER) is adopted to align the two CNs. Finally, we select specific branches to generate the confusion matrix. Given a canonical phoneme ph_m, only branches including ph_m are considered. A sequence of class labels α(k) is defined as

α(k) = 1 if ph_m ∈ the k-th branch, and α(k) = 0 if ph_m ∉ the k-th branch.  (3.1)

Then, (2.1) can be rewritten as

P(ph_n | ph_m) = \frac{\sum_{k=1}^{C} α(k) count_k(ph_n | ph_m)}{\sum_{i=1}^{N} \sum_{k=1}^{C} α(k) count_k(ph_i | ph_m)},  (3.2)

where count_k denotes the counts collected in the k-th branch, C is the number of branches in the CNs of the training data, and N is the number of phonemes in the dictionary. A sketch of this accumulation is given below.
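The accumulation of (3.1)-(3.2) could be sketched as follows, assuming the ROVER alignment has already produced, for each canonical phoneme, the list of n-best hypotheses of the aligned recognized-CN branch; the data layout and function name are illustrative, not the original implementation.

```python
from collections import defaultdict

def cn_confusion_matrix(aligned_branches, phoneme_set):
    """Estimate P(ph_n | ph_m) per (3.1)-(3.2) from ROVER-aligned CN branches.

    aligned_branches : list of (ph_m, branch) pairs, where ph_m is the canonical
                       phoneme and branch is the list of n-best phoneme hypotheses
                       in the aligned recognized-CN branch.
    Only branches that contain the canonical phoneme ph_m contribute (alpha(k) = 1).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ph_m, branch in aligned_branches:
        if ph_m not in branch:          # alpha(k) = 0: skip this branch
            continue
        for ph_n in branch:             # every hypothesis in the branch is counted
            counts[ph_m][ph_n] += 1

    matrix = {}
    for ph_m in phoneme_set:
        total = sum(counts[ph_m][p] for p in phoneme_set)
        matrix[ph_m] = {p: (counts[ph_m][p] / total) if total > 0 else 0.0
                        for p in phoneme_set}
    return matrix
```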

An alternative method is also attempted in this paper: a maximum probability rule can be applied when calculating the confusion probability. Only branches in which ph_m is the hypothesis with the maximum probability are considered. We define

β(k) = 1 if ph_m is the phoneme with the maximum probability in the k-th branch, and β(k) = 0 otherwise.  (3.3)

Then, (3.2) can be rewritten as

P(ph_n | ph_m) = \frac{\sum_{k=1}^{C} β(k) count_k(ph_n | ph_m)}{\sum_{i=1}^{N} \sum_{k=1}^{C} β(k) count_k(ph_i | ph_m)}.  (3.4)
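Under the same assumptions, the maximum probability rule of (3.3) only changes which branches are allowed to contribute. A hypothetical helper, assuming each branch also carries confusion-network posteriors for its hypotheses:

```python
def branch_selected_by_max_rule(ph_m, branch_posteriors):
    """beta(k) of (3.3): 1 only if ph_m has the maximum posterior in the branch.

    branch_posteriors : dict mapping each hypothesis phoneme in the k-th branch
                        to its confusion-network posterior probability (assumed layout).
    """
    if not branch_posteriors:
        return 0
    best = max(branch_posteriors, key=branch_posteriors.get)
    return 1 if best == ph_m else 0
```

In the accumulation sketch above, the membership test `if ph_m not in branch` would simply be replaced by this check.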

4. MCE with Block Training

The work in [5] proposed a word-level MCE training technique for optimizing the parameters of the confidence function. In [6], a revised scheme is implemented in the spoken term detection scenario. In this paper, we attempt to improve the MCE training method proposed in [6].

According to the update equations in [6], sequential training is used to update the parameters; that is, the parameters of the triphones are modified after each training sample. This does not match the optimization framework of MCE well. We adopt block training instead: the parameters are modified once using all samples, averaged over the block. The weighted mean confidence measure of W is defined as

CM(W) = \frac{1}{N_W} \sum_{i=1}^{N_W} \left( a_{ph_i} CM(ph_i) + b_{ph_i} \right),  (4.1)

where N_W is the number of phonemes in W and a_{ph_i}, b_{ph_i} are phoneme-dependent scale and offset parameters.

The procedure of block training is as follows; a small sketch of the loss computation is given after the list.
(1) The misclassification measure is defined as

d(W) = (CM(W) - C) × Sign(W),  (4.2)

where C is the confidence threshold and Sign(W) is defined as

Sign(W) = 1 if W is incorrect, and Sign(W) = -1 if W is correct.  (4.3)

(2) A smooth zero-one loss function is given by

l(W) = \frac{1}{1 + \exp(-γ d(W))}.  (4.4)

(3) The parameter estimation is based on minimizing the expected loss, which, for a training block of size M, is defined as

\bar{l}(W) = E(l(W)) = \frac{1}{M} \sum_{j=1}^{M} l(W_j).  (4.5)
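A small sketch of the quantities in (4.2)-(4.5); `gamma` is the slope of the sigmoid, `C` the confidence threshold, and the `(cm_w, is_correct)` sample layout is an illustrative assumption.

```python
import math

def sign(is_correct):
    """Sign(W) of (4.3): -1 for a correct hit, +1 for a false accept."""
    return -1.0 if is_correct else 1.0

def loss(cm_w, is_correct, C, gamma=1.0):
    """Smoothed zero-one loss l(W) of (4.2)-(4.4)."""
    d = (cm_w - C) * sign(is_correct)
    return 1.0 / (1.0 + math.exp(-gamma * d))

def expected_loss(samples, C, gamma=1.0):
    """Expected loss (4.5) over a block of (CM(W), is_correct) samples."""
    return sum(loss(cm, ok, C, gamma) for cm, ok in samples) / len(samples)
```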

The generalized probabilistic descent (GPD) algorithm is used to minimize the expected loss (4.5) [8]:

\frac{\partial \bar{l}(W)}{\partial a_{ph_i}} = \frac{1}{M} \sum_{j=1}^{M} \frac{\partial l(W_j)}{\partial a_{ph_i}} = \frac{γ}{M_{ph_i}} \sum_{j=1}^{M_{ph_i}} \frac{1}{N_j} K(W_j) CM_j(ph_i),

\frac{\partial \bar{l}(W)}{\partial b_{ph_i}} = \frac{1}{M} \sum_{j=1}^{M} \frac{\partial l(W_j)}{\partial b_{ph_i}} = \frac{γ}{M_{ph_i}} \sum_{j=1}^{M_{ph_i}} \frac{K(W_j)}{N_j},

\frac{\partial \bar{l}(W)}{\partial C} = \frac{1}{M} \sum_{j=1}^{M} \frac{\partial l(W_j)}{\partial C} = -\frac{γ}{M} \sum_{j=1}^{M} K(W_j),  (4.6)

where M_{ph_i} is the number of samples that contain the phoneme ph_i, N_j is the number of phonemes in W_j, and CM_j(ph_i) is the confidence of ph_i in W_j. K(W_j) is defined as

K(W_j) = l(W_j) (1 - l(W_j)) Sign(W_j).  (4.7)

Finally, we obtain the revised update equations

\tilde{a}_{ph_i}(n+1) = \tilde{a}_{ph_i}(n) - ε_n \frac{\partial \bar{l}(W)}{\partial a_{ph_i}} \exp(\tilde{a}_{ph_i}(n)),

b_{ph_i}(n+1) = b_{ph_i}(n) - ε_n \frac{\partial \bar{l}(W)}{\partial b_{ph_i}},

C(n+1) = C(n) - ε_n \frac{\partial \bar{l}(W)}{\partial C},  (4.8)

where ε_n is the learning step size at iteration n.
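One block iteration of (4.6)-(4.8) could be sketched as follows. Each sample is assumed to carry the per-phoneme confidences CM_j(ph_i) and a correctness label; the log-domain parameterization a_ph = exp(ã_ph) is suggested by the exp term in (4.8). All names and the data layout are ours, not the original implementation.

```python
import math
from collections import defaultdict

def block_mce_update(samples, a_tilde, b, C, gamma=1.0, eps=0.01):
    """One GPD block update following (4.6)-(4.8).

    samples : list of (phone_scores, is_correct), where phone_scores is a list
              of (phoneme, CM_j(ph)) pairs for the hypothesized term W_j.
    a_tilde : dict of log-domain scale parameters (a_ph = exp(a_tilde[ph])).
    b       : dict of per-phoneme offsets.
    C       : global confidence threshold.
    """
    M = len(samples)
    grad_a, grad_b, grad_C = defaultdict(float), defaultdict(float), 0.0
    M_ph = defaultdict(int)   # number of samples containing each phoneme

    for phone_scores, is_correct in samples:
        N_j = len(phone_scores)
        # weighted mean confidence CM(W_j) of (4.1)
        cm_w = sum(math.exp(a_tilde[ph]) * cm + b[ph] for ph, cm in phone_scores) / N_j
        s = -1.0 if is_correct else 1.0                        # Sign(W_j) of (4.3)
        l = 1.0 / (1.0 + math.exp(-gamma * (cm_w - C) * s))    # l(W_j) of (4.4)
        K = l * (1.0 - l) * s                                  # K(W_j) of (4.7)
        for ph, cm in phone_scores:
            grad_a[ph] += K * cm / N_j
            grad_b[ph] += K / N_j
        for ph in {ph for ph, _ in phone_scores}:
            M_ph[ph] += 1
        grad_C += K

    for ph in grad_a:                                          # gradients of (4.6)
        grad_a[ph] *= gamma / M_ph[ph]
        grad_b[ph] *= gamma / M_ph[ph]
    grad_C *= -gamma / M

    for ph in grad_a:                                          # updates of (4.8)
        a_tilde[ph] -= eps * grad_a[ph] * math.exp(a_tilde[ph])
        b[ph] -= eps * grad_b[ph]
    C -= eps * grad_C
    return a_tilde, b, C
```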

5. Experiments

We conducted experiments with our real-time spoken term detection system. The acoustic model is trained on the train04 corpus, collected by the Hong Kong University of Science and Technology (HKUST).

5.1. Experimental Data Description

The test data is a subset of the development data (dev04), also collected by HKUST. A total of 20 conversations are used for our evaluation. 100 words are selected as the spoken term list, including 75 two-syllable words and 25 three-syllable words.

The confusion matrices adopted in this paper are generated from a 100-hour Mandarin CTS corpus. The word-level MCE training set is a subset of the train04 corpus; 865,667 terms are extracted for training, including 675,998 false accepts and 189,669 correct hits.

5.2. Experiment Results

The detection error tradeoff (DET) curve is used in this paper to evaluate spoken term detection performance. The false acceptance (FA) rate corresponds to the case in which an incorrect word is accepted, and the false rejection (FR) rate to the case in which a correct word is rejected:

FA = (num. of incorrect words labelled as accepted) / (num. of incorrect words),
FR = (num. of correct words labelled as rejected) / (num. of keywords × hours of test set × C),  (5.1)

where C is a factor that scales the dynamic ranges of FA and FR to the same level; in this paper, C is set to 10. Recognition rates (RA) are also computed, obtained as

RA = (num. of correct words labelled as accepted) / (total num. of recognized words).  (5.2)
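For reference, the metrics of (5.1) and (5.2) amount to the following simple arithmetic; the argument names are ours.

```python
def detection_metrics(n_incorrect_accepted, n_incorrect,
                      n_correct_rejected, n_keywords, hours,
                      n_correct_accepted, n_recognized, C=10.0):
    """FA and FR of (5.1) and RA of (5.2)."""
    fa = n_incorrect_accepted / n_incorrect
    fr = n_correct_rejected / (n_keywords * hours * C)
    ra = n_correct_accepted / n_recognized
    return fa, fr, ra
```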

To assess how much more information the CN provides than the 1-best pin-yin sequence, the syllable error rates (SERs) of both the CN and the pin-yin sequence are given in Table 1. The SER of the CN drops significantly as the pruning beam is reduced.

Table 2 summarizes the recognition rates obtained with the different confusion matrices. With the n-best hypotheses of the CN, the recognition rates improve markedly. When the maximum probability rule is applied, the recognition rate reaches 82.0%.

To evaluate the performance of the methods proposed in this paper, the EERs of the different methods are listed in Table 3.

As can be seen from Table 3, the improved confusion matrices provide a clear EER reduction of up to 3.9% relative. MCE with block training is superior to sequential training, achieving a relative EER reduction of 1.7%. When the two methods are used together, we obtain a further improvement of 8.4% relative compared with the baseline system.

6. Conclusions

To describe how accent-specific pronunciations differ from those assumed by the standard Mandarin recognition system, a phoneme confusion matrix is adopted. Unlike the traditional algorithm, a confusion network is used to generate the confusion matrix, which improves the recognition rate of the spoken term detection system. Moreover, a revised MCE training method is presented in this paper. Experiments show that it performs clearly better than sequential training.