Abstract
A proven method for achieving effective automatic speech recognition (ASR) in the presence of speaker differences is to perform
acoustic feature speaker normalization. More effective speaker normalization methods are needed that require only
limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract
length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a
novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on the fly
within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end
processing with PMVDR and VTLN requires two separate warping phases, whereas the proposed BISN
method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp
simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces
computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed on (i)
an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the word error rate
(WER) by 24% relative, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER reduction is 9%, both
measured against the baseline speaker normalization method.