﻿<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>EURASIP Journal on Audio, Speech, and Music Processing</title><link>http://www.hindawi.com</link><description>The latest articles from Hindawi Publishing Corporation</description><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright><item><title>Beamforming under Quantization Errors in Wireless Binaural Hearing Aids</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/824797</link><description>Improving the intelligibility of speech in different environments is one of the main objectives of
hearing aid signal processing algorithms. Hearing aids typically employ beamforming techniques using multiple microphones for this task. In this paper, we discuss a binaural beamforming scheme that uses signals from the hearing aids worn on both the left and right ears. Specifically, we analyze the effect of a low bit rate wireless communication link between the left and right hearing aids on the performance of the beamformer. The scheme is comprised of a generalized sidelobe canceller (GSC) that has two inputs: observations from one ear, and quantized observations from the other ear, and whose output is an estimate of the desired signal. We analyze the performance of this scheme in the presence of a localized interferer as a function of the communication bit rate using the resultant mean-squared error as the signal distortion measure.</description><Author>Sriram Srinivasan, Ashish Pandharipande, and Kees Janse</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/148967</link><description>A proven method for achieving effective automatic speech recognition (ASR) due to speaker differences is to perform
acoustic feature speaker normalization. More effective speaker normalization methods are needed which require
limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract
length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a
novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on the fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that, in conventional front-end
processing with PMVDR and VTLN, two separate warping phases are needed, whereas the proposed BISN
method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp
simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces
computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i)
an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate
(WER) by 24&amp;#37;, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9&amp;#37;, both
relative to the baseline speaker normalization method.</description><Author>Umit H. Yapanel and John H. L. Hansen</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Voice-to-Phoneme Conversion Algorithms for Voice-Tag Applications in Embedded Platforms</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/568737</link><description>We describe two voice-to-phoneme conversion algorithms for speaker-independent voice-tag creation specifically targeted at applications on embedded platforms. These algorithms (batch mode and sequential) are compared in speech recognition experiments where they are first applied in a same-language context in which both acoustic model training and voice-tag creation and application are performed on the same language. Then, their performance is tested in a cross-language setting where the acoustic models are trained on a particular source language while the voice-tags are created and applied on a different target language. In the same-language environment,  both algorithms either perform comparably to or significantly better than the baseline where utterances are manually transcribed by a phonetician. In the cross-language context, the voice-tag performances vary depending on the source-target language pair, with the variation reflecting predicted phonological similarity between the source and target languages. Among the most similar languages, performance nears that of the native-trained models and surpasses the native reference baseline.</description><Author>Yan Ming Cheng, Changxue Ma, and Lynette Melnar</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item><item><title>Quality Enhancement of Compressed Audio Based on Statistical Conversion</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/462830</link><description>Most audio compression formats are based on the idea of low bit rate transparent encoding. As these types of audio signals are starting to migrate from portable players with inexpensive headphones to higher quality home audio systems, it is becoming evident that higher bit rates may be required to maintain transparency. We propose a novel method that enhances low bit rate encoded audio segments by applying multiband audio resynthesis methods in a postprocessing stage. Our algorithm employs the highly flexible Generalized Gaussian mixture model which offers a more accurate representation of audio features than the Gaussian mixture model. A novel residual conversion technique is applied which proves to significantly improve the enhancement performance without excessive overhead. In addition, both cepstral and residual errors are dramatically decreased by a feature-alignment scheme that employs a sorting transformation. Some improvements regarding the quantization step are also described that enable us to further reduce the algorithm overhead. Signal enhancement examples are presented and the results show that the overhead size incurred by the algorithm is a fraction of the uncompressed signal size. Our results show that the resulting audio quality is comparable to that of a standard perceptual codec operating at approximately the same bit rate.</description><Author>Demetrios Cantzos, Athanasios Mouchtaris, and Chris Kyriakakis</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item><item><title>Automatic Music Boundary Detection Using Short Segmental Acoustic Similarity in a Music Piece</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/480786</link><description>The present paper proposes a new approach for detecting music boundaries, such as the boundary between music pieces or the boundary between a music piece and a speech section for automatic segmentation of musical video data and retrieval of a designated music piece. The proposed approach is able to capture each music piece using acoustic similarity defined for short-term segments in the music piece. The short segmental acoustic similarity is obtained by means of a new algorithm called segmental continuous dynamic programming, or segmental CDP. The location of each music piece and its music boundaries are then identified by referring to multiple similar segments and their location information, avoiding oversegmentation within a music piece. The performance of the proposed method is evaluated for music boundary detection using actual music datasets. The present paper demonstrates that the proposed method enables accurate detection of music boundaries for both the evaluation data and a real broadcasted music program.</description><Author>Yoshiaki Itoh, Akira Iwabuchi, Kazunori Kojima, Masaaki Ishigame, Kazuyo Tanaka, and Shi-Wook Lee</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Measurement Combination for Acoustic Source Localization in a Room Environment</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/278185</link><description>The behavior of time delay estimation (TDE) is well understood and therefore attractive to apply in acoustic source localization (ASL). A time delay between microphones maps into a hyperbola. 
Furthermore, the likelihoods for different time delays are mapped into a set of weighted nonoverlapping hyperbolae in the spatial domain. Combining TDE functions from several microphone pairs results in a spatial likelihood function (SLF) which is a combination of sets of weighted hyperbolae. Traditionally, the maximum SLF point is considered as the source location but is corrupted by reverberation and noise. Particle filters utilize past source information to improve localization performance in such environments. However, uncertainty exists on how to combine the TDE functions. Results from simulated dialogues in various conditions favor TDE combination using intersection-based methods over union. The real-data dialogue results agree with the simulations, showing a 45&amp;#37; RMSE reduction when choosing the
intersection over union of TDE functions.</description><Author>Pasi Pertil&amp;#228;, Teemu Korhonen, and Ari Visa</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Fast Noise Compensation and Adaptive Enhancement for Speech Separation</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/349214</link><description>We propose a novel approach to improve adaptive decorrelation filtering- (ADF-) based speech source
separation in diffuse noise. The effects of noise on system adaptation and separation outputs are handled
separately. First, fast noise compensation (NC) is developed for adaptation of separation filters, forcing
ADF to focus on source separation; next, output noises are suppressed by speech enhancement. By
tracking noise components in output cross-correlation functions, the bias effect of noise on the system
adaptation objective function is compensated, and by adaptively estimating output noise autocorrelations,
the speech separation output is enhanced. For fast noise compensation, a blockwise fast ADF (FADF) is
implemented. Experiments were conducted on real and simulated diffuse noises. Speech mixtures were
generated by convolving TIMIT speech sources with acoustic path impulse responses measured in a
real room with reverberation time T60=0.3&amp;#x2009;second. The proposed techniques significantly improved separation performance and phone recognition accuracy of ADF outputs.</description><Author>Rong Hu and Yunxin Zhao</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Real-Time Perceptual Simulation of Moving Sources: Application to the Leslie Cabinet and 3D Sound Immersion</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/849696</link><description>Perception of moving sound sources obeys different brain processes from those mediating the localization of static sound events. In view of these specificities, a preprocessing model was designed, based on the main perceptual cues involved in the auditory perception of moving sound sources, such as the intensity, timbre, reverberation, and frequency shift processes. This model is the first step toward a more general moving sound source system, including a system of spatialization. Two applications of this model are presented: the simulation of a system involving rotating sources, the Leslie Cabinet and a 3D sound immersion installation based on the sonification of cosmic particles, the Cosmophone.</description><Author>R. Kronland-Martinet and T. Voinier</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>On a Method for Improving Impulsive Sounds Localization in Hearing Defenders</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/274684</link><description>This paper proposes a new algorithm for a directional aid with hearing defenders. Users of existing hearing defenders experience distorted information, or in worst cases, directional information may not be perceived at all. The users of these hearing defenders may therefore be exposed to serious safety risks. 
The proposed algorithm improves the directional information for the users of hearing defenders by enhancing impulsive sounds using interaural level difference (ILD). This ILD enhancement is achieved by incorporating a new gain function. Illustrative examples and performance measures are presented to highlight the promising results. By improving the directional information for active hearing defenders, the new method is found to serve as an advanced directional aid.</description><Author>Benny S&amp;#228;llberg, Farook Sattar, and Ingvar Claesson</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Tango or Waltz?: Putting Ballroom Dance Style into Tempo Detection</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/846135</link><description>Rhythmic information plays an important role in Music Information Retrieval. Example applications include automatically annotating large databases by genre, meter, ballroom dance style or tempo, fully automated D.J.-ing, and audio segmentation for further retrieval tasks such as automatic chord labeling. In this article, we therefore provide an introductory overview over basic and current principles of tempo detection. Subsequently, we show how to improve on these by inclusion of ballroom dance style recognition. We introduce a
feature set of 82 rhythmic features for rhythm analysis on real audio. With this set, data-driven identification of the meter and ballroom dance style, employing support vector machines, is carried out in a first step. Next, this information is used to more robustly detect tempo. We evaluate the suggested method on a large public database containing 1.8&amp;#x2009;k titles of standard and Latin ballroom dance music. Following extensive test runs, a clear boost in performance can be reported.</description><Author>Bj&amp;#246;rn Schuller, Florian Eyben, and Gerhard Rigoll</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Frequency-Domain Adaptive Algorithm for Network Echo Cancellation in VoIP</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/156960</link><description>We propose a new low complexity, low delay, and fast converging frequency-domain adaptive algorithm for network echo cancellation in VoIP exploiting MMax and sparse partial (SP) tap-selection criteria in the frequency domain. We incorporate these tap-selection techniques into the multidelay filtering (MDF) algorithm in order to mitigate the delay inherent in frequency-domain algorithms. We illustrate two such approaches and discuss their tradeoff between convergence performance and computational complexity. Simulation results show an improvement in convergence rate for the proposed algorithm over MDF and significantly reduced complexity. The proposed algorithm achieves a convergence performance close to that of the recently proposed, but substantially more complex improved proportionate MDF (IPMDF) algorithm.</description><Author>Xiang (Shawn) Lin, Andy W. H. Khong, Milo&amp;#774;&amp;#x0073; Doroslova&amp;#774;&amp;#x0063;ki, and Patrick A. Naylor</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item><item><title>Phasor Representation for Narrowband Active Noise Control Systems</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/126859</link><description>The phasor representation is introduced to identify the characteristic of the active noise control (ANC) systems. The conventional representation, transfer function, cannot explain the fact that the performance will be degraded at some frequency for the narrowband ANC systems. This paper uses the relationship of signal phasors to illustrate geometrically the operation and the behavior of two-tap adaptive filters. In addition, the best signal basis is therefore suggested to achieve a better performance from the viewpoint of phasor synthesis. Simulation results show that the well-selected signal basis not only achieves a better convergence performance but also speeds up the convergence for narrowband ANC systems.</description><Author>Fu-Kun Chen, Ding-Horng Chen, and Yue-Dar Jou</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Estimation of Interchannel Time Difference in Frequency Subbands Based on Nonuniform Discrete Fourier Transform</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/618104</link><description>Binaural cue coding (BCC) is an efficient technique for spatial audio rendering by using the side information such as interchannel level difference (ICLD), interchannel time difference (ICTD), and interchannel correlation (ICC). Of the side information, the ICTD plays an important role to the auditory spatial image. However, inaccurate estimation of the ICTD may lead to the audio quality degradation. In this paper, we develop a novel ICTD estimation algorithm based on the nonuniform discrete Fourier transform (NDFT) and integrate it with the BCC approach to improve the decoded auditory image. 
Furthermore, a new subjective assessment method is proposed for the evaluation of auditory image widths of decoded signals. The test results demonstrate that the NDFT-based scheme can achieve much wider and more externalized auditory image than the existing BCC scheme based on the discrete Fourier transform (DFT). It is found that the present technique, regardless of the image width, does not deteriorate the sound quality at the decoder compared to the traditional scheme without ICTD estimation.</description><Author>Bo Qiu, Yong Xu, Yadong Lu, and Jun Yang</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Experiments on Automatic Recognition of Nonnative Arabic Speech</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/679831</link><description>The automatic recognition of foreign-accented Arabic speech is a challenging task since it involves a large number of nonnative accents. As well, the nonnative speech data available for training are generally insufficient. Moreover, as compared to other languages, the Arabic language has sparked a relatively small number of research efforts. In this paper, we are concerned with the problem of nonnative speech in a speaker independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes have a significant part in the recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and its native origin. The WestPoint modern standard Arabic database from the language data consortium (LDC) and the hidden Markov Model Toolkit (HTK) are used throughout all experiments. 
Our study shows that the best performance in the overall phoneme recognition is obtained when nonnative speakers are involved in both training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated in the system. At the phonetic level, the results show that female nonnative speakers perform better than nonnative male speakers, and that emphatic phonemes yield a significant decrease in performance when they are uttered by both male and female nonnative speakers.</description><Author>Yousef Ajami Alotaibi, Sid-Ahmed Selouani, and Douglas O&amp;#39;Shaughnessy</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Multiresolution Source/Filter Model for Low Bitrate Coding of Spot Microphone Signals</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/624321</link><description>A multiresolution source/filter model for coding of audio source signals (spot recordings) is proposed. Spot recordings are a subset of the multimicrophone recordings of a music performance, before the mixing process is applied for producing the final multichannel audio mix. The technique enables low bitrate coding of spot signals with good audio quality (above 3.0 perceptual grade compared to the original). It is demonstrated that this particular model separates the various microphone recordings of a multimicrophone recording into a part that mainly characterizes a specific microphone signal and a part that is common to all signals of the same recording (and can thus be omitted during transmission). Our interest in low bitrate coding of spot recordings is related to applications such as remote mixing and real-time collaboration of musicians who are geographically distributed. 
Using the proposed approach, it is shown that it is possible to encode a multimicrophone audio recording using a single audio channel only, with additional information for each spot microphone signal in the order of 5&amp;#x2009;kbps, for good-quality resynthesis. This is verified by employing both objective and subjective measures of performance.</description><Author>Athanasios Mouchtaris, Kiki Karadimou, and Panagiotis Tsakalides</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Speech Waveform Compression Using Robust Adaptive Voice Activity Detection for Nonstationary Noise</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/639839</link><description>The voice activity detection (VAD) is crucial in all kinds of speech applications. However, almost all existing VAD algorithms suffer from the nonstationarity of both speech and noise. To combat this difficulty, we propose a new voice activity detector, which is based on the Mel-energy features and an adaptive threshold related to the signal-to-noise ratio (SNR) estimates. In this paper, we first justify the robustness of the Bayes classifier using the Mel-energy features over that using the Fourier spectral features in various noise environments. Then, we design an algorithm using the dynamic Mel-energy estimator and the adaptive threshold, which depends on the SNR estimates. In addition, a realignment scheme is incorporated to correct the sparse-and-spurious noise estimates. Numerous simulations are carried out to evaluate the performance of our proposed VAD method and the comparisons are made with a couple of existing representative schemes, namely, the VAD using the likelihood ratio test with Fourier spectral energy features and that based on the enhanced time-frequency parameters. 
Three types of noises, namely, white noise (stationary), babble noise (nonstationary), and vehicular noise (nonstationary) were artificially added by the computer for our experiments. As a result, our proposed VAD algorithm significantly outperforms other existing methods as illustrated by the corresponding receiver operating characteristics (ROC) curves. Finally, we demonstrate one of the major applications, namely, speech waveform compression associated with our new robust VAD scheme and quantify the effectiveness in terms of compression efficiency.</description><Author>Waheeduddin Q. Syed and Hsiao-Chun Wu</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Robust Transmission of Speech LSFs Using Hidden Markov Model-Based Multiple Description Index Assignments</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/896021</link><description>Speech coding techniques capable of generating encoded representations which are robust against channel losses play an important role in enabling reliable voice communication over packet networks and mobile wireless systems. In this paper, we investigate the use of multiple description index assignments (MDIAs) for loss-tolerant transmission of line spectral frequency (LSF) coefficients, typically generated by state-of-the-art speech coders. We propose a simulated annealing-based approach for optimizing MDIAs for Markov-model-based
decoders which exploit inter- and intraframe correlations in LSF coefficients to reconstruct the quantized LSFs from coded bit streams corrupted by channel losses. Experimental results are presented which compare the performance of a number of novel LSF transmission
schemes. These results clearly demonstrate that Markov-model-based decoders, when used in conjunction with optimized MDIA, can yield average spectral distortion much lower than that produced by methods such as interleaving/interpolation, commonly used to combat the packet
losses.</description><Author>Paul Rondeau and Pradeepa Yahampath</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Perceptual Models for Speech, Audio, and Music Processing</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/12687</link><description /><Author>Jont B. Allen, Wai-Yip Geoffrey Chan, and Stephen Voran</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Practical Gammatone-Like Filters for Auditory Processing</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/63685</link><description>This paper deals with continuous-time filter transfer functions that resemble tuning curves at particular set of places on the basilar membrane of the biological cochlea and that are suitable for practical VLSI implementations. The resulting filters can be used in a filterbank architecture to realize cochlea implants or auditory processors of increased biorealism. To put the reader into context, the paper starts with a short review on the gammatone filter and then exposes two of its variants, namely, the differentiated all-pole gammatone filter (DAPGF) and one-zero gammatone filter (OZGF), filter responses that provide a robust foundation for modeling cochlea transfer functions. The DAPGF and OZGF responses are attractive because they exhibit certain characteristics suitable for modeling a variety of auditory data: level-dependent gain, linear tail for frequencies well below the center frequency, asymmetry, and so forth.  In addition, their form suggests their implementation by means of cascades of N identical two-pole systems which render them as excellent candidates for efficient analog or digital VLSI realizations. 
We provide results that shed light on their characteristics and attributes and which can also serve as &amp;#x201C;design curves&amp;#x201D; for fitting these responses to frequency-domain physiological data. The DAPGF and OZGF responses are essentially a &amp;#x201C;missing link&amp;#x201D; between physiological, electrical, and mechanical models for auditory filtering.</description><Author>A. G. Katsiamis, E. M. Drakakis, and R. F. Lyon</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Electrophysiological Study of Algorithmically Processed Metric/Rhythmic Variations in Language and Music</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/30194</link><description>This work is the result of an interdisciplinary collaboration between scientists from the fields of audio signal processing, phonetics and cognitive neuroscience aiming at studying the perception of modifications in meter, rhythm, semantics and harmony in language and music. A special time-stretching algorithm was developed to work with natural speech. In the language part, French sentences ending with tri-syllabic congruous or incongruous words, metrically modified or not, were made. In the music part, short melodies made of triplets, rhythmically and/or harmonically modified, were built. These stimuli were presented to a group of listeners that were asked to focus their attention either on meter/rhythm or semantics/harmony and to judge whether or not the sentences/melodies were acceptable. Language ERP analyses indicate that semantically incongruous words are processed independently of the subject&amp;#39;s attention thus arguing for automatic semantic processing. In addition, metric incongruities seem to influence semantic processing. 
Music ERP analyses show that rhythmic incongruities are processed independently of attention, revealing automatic processing of rhythm in music.</description><Author>S&amp;#248;lvi Ystad, Cyrille Magne, Snorre Farner, Gregory Pallone, Mitsuko Aramaki, Mireille Besson, and Richard Kronland-Martinet</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Multiple-Description Multistage Vector Quantization</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/67146</link><description>Multistage vector quantization (MSVQ) is a technique for low complexity implementation of high-dimensional quantizers, which has found applications within speech, audio, and image coding. In this paper, a multiple-description MSVQ (MD-MSVQ) targeted for communication over packet-loss channels is proposed and investigated. An MD-MSVQ can be viewed as a generalization of a previously reported interleaving-based transmission scheme for multistage quantizers. An algorithm for optimizing the codebooks of an MD-MSVQ for a given packet-loss probability is suggested, and  a practical example involving quantization of speech line spectral frequency (LSF) vectors is presented to demonstrate the potential advantage of MD-MSVQ over interleaving-based MSVQ as well as traditional MSVQ based on error concealment at the receiver.</description><Author>Pradeepa Yahampath</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>The Effect of Listener Accent Background on Accent Perception and Comprehension</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/76030</link><description>Variability of speaker accent is a challenge for effective human communication as well as speech technology including automatic speech recognition and accent identification. 
The motivation of this study is to contribute to a deeper understanding of accent variation across speakers from a cognitive perspective. The goal is to provide a perceptual assessment of accent variation in native and nonnative English. The main focus is to investigate how a listener&amp;#39;s accent background affects accent perception and comprehensibility. The results from perceptual experiments show that the listeners&amp;#39; accent background impacts their ability to categorize accents. Speaker accent type affects perceptual accent classification. The interaction between listener accent background and speaker accent type is significant for both accent perception and speech comprehension. In addition, the results indicate that the comprehensibility of the speech contributes to accent perception. The outcomes point to the complex nature of accent perception and provide a foundation for further investigation into the involvement of cognitive processing in accent perception. These findings contribute to a richer understanding of the
cognitive aspects of accent variation, and its application to speech technology.</description><Author>Ayako Ikeno and John H. L. Hansen</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Denoising in the Domain of Spectrotemporal Modulations</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/42357</link><description>A noise suppression algorithm is proposed based on filtering the spectrotemporal modulations of noisy signals. The modulations are estimated from a multiscale representation of the signal spectrogram generated by a model of sound processing in the auditory system. A significant advantage of this method is its ability to suppress noise that has distinctive modulation patterns, despite overlapping spectrally with the signal. The performance of the algorithm is evaluated using subjective and objective tests with
contaminated speech signals and compared to the traditional Wiener filtering method. The results demonstrate the efficacy of the spectrotemporal filtering approach in the conditions examined.</description><Author>Nima Mesgarani and Shihab Shamma</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Perceptual Continuity and Naturalness of Expressive Strength in Singing Voices Based on Speech Morphing</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/23807</link><description>This paper experimentally shows the importance of perceptual continuity of the expressive strength in vocal timbre for natural change in vocal expression. In order to synthesize various and continuous expressive strengths with vocal timbre, we investigated gradually changing expressions by applying the STRAIGHT speech morphing algorithm to singing voices. Here, a singing voice without expression is used as the base of morphing, and singing voices with three different expressions are used as the target. Through statistical analyses of perceptual evaluations, we confirmed that the proposed morphing algorithm provides perceptual continuity of vocal timbre. Our results showed the following: (i) gradual strengths in absolute evaluations, and (ii) a perceptually linear strength provided by the calculation of corrected intervals of the morph ratio by the inverse (reciprocal) function of an equation that approximates the perceptual strength. Finally, we concluded that applying continuity was highly effective for achieving perceptual naturalness, judging from the results showing that (iii) our gradual transformation method can perform well for perceived naturalness.</description><Author>Tomoko Yonezawa, Noriko Suzuki, Shinji Abe, Kenji Mase, and Kiyoshi Kogure</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item><item><title>Wideband Speech Recovery Using Psychoacoustic Criteria</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/16816</link><description>Many modern speech bandwidth extension techniques predict the high-frequency band based on features extracted from the lower band. While this method works for certain types of speech, problems arise when the correlation between the low and the high bands is not sufficient for adequate prediction. These situations require that additional high-band information is sent to the decoder. This overhead information, however, can be cleverly quantized using human auditory system models. In this paper, we propose a novel speech compression method that relies on bandwidth extension. The novelty of the technique lies in an elaborate perceptual model that determines a quantization scheme for wideband recovery and synthesis. Furthermore, a source/filter bandwidth extension algorithm based on spectral spline fitting is proposed. Results reveal that the proposed system improves the quality of narrowband speech while performing at a lower bitrate. When compared to other wideband speech coding schemes, the proposed algorithms provide comparable speech quality at a lower bitrate.</description><Author>Visar Berisha and Andreas Spanias</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Perceptual Coding of Audio Signals Using Adaptive Time-Frequency Transform</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/51563</link><description>Wide band digital audio signals have a very high data-rate associated with them due to their complex nature and demand for high-quality reproduction. 
Although recent technological advancements have significantly reduced the cost of bandwidth and miniaturized storage facilities, the rapid increase in the volume of digital audio content constantly drives the need for better compression algorithms. Over the years various perceptually lossless compression techniques have been introduced, and
transform-based compression techniques have made a significant impact in recent years. In this paper, we propose one such transform-based compression technique, where the joint time-frequency (TF) properties of the nonstationary nature of the audio signals were exploited in creating a compact energy representation of the signal in fewer coefficients. The decomposition coefficients were processed and perceptually filtered to retain only the relevant coefficients. Perceptual filtering (psychoacoustics) was applied in a novel way by analyzing and performing TF-specific psychoacoustics experiments. An added advantage of the proposed technique is that, due to its signal-adaptive nature, it does not
need predetermined segmentation of audio signals for processing. Eight stereo audio signal samples of different varieties were used in the study. Subjective (mean opinion score&amp;#8212;MOS) listening tests were performed and the subjective difference grades (SDG) were used to compare the performance of the proposed coder with MP3, AAC, and HE-AAC encoders. Compression ratios in the range of 8 to 40 were achieved by the proposed technique with subjective difference grades (SDG) ranging from &amp;#8211;0.53 to &amp;#8211;2.27.</description><Author>Karthikeyan Umapathy and Sridhar Krishnan</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Linear Prediction Using Refined Autocorrelation Function</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/45962</link><description>This paper proposes a new technique for improving the performance of linear prediction analysis by utilizing a refined version of the autocorrelation function. Problems in analyzing voiced speech using linear prediction often occur due to the harmonic structure of the excitation source, which causes the autocorrelation function to be an aliased version of that of the vocal tract impulse response. To estimate the vocal tract characteristics accurately, however, the effect of aliasing must be eliminated. In this paper, we employ a homomorphic deconvolution technique in the autocorrelation domain to eliminate the aliasing effect that occurs due to periodicity. The resulting autocorrelation function of the vocal tract impulse response is found to produce significant improvement in estimating formant frequencies. The accuracy of formant estimation is verified on synthetic vowels for a wide range of pitch frequencies typical for male and female speakers. The validity of the proposed method is also illustrated by inspecting the spectral envelopes of natural speech spoken by a high-pitched female speaker. 
The synthesis filter obtained by the current method is guaranteed to be stable, which makes the method superior to many of its alternatives.</description><Author>M. Shahidur Rahman and Tetsuya Shimamura</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/46460</link><description>Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.</description><Author>Annika H&amp;#228;m&amp;#228;l&amp;#228;inen, Lou Boves, Johan de Veth, and Louis ten Bosch</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>
Detection and Separation of Speech Events in Meeting Recordings Using a Microphone Array</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/27616</link><description>When applying automatic speech recognition (ASR) to meeting recordings including spontaneous speech, the performance of ASR is greatly reduced by the overlap of speech events. In this paper, a method of separating the overlapping speech events by using an adaptive beamforming (ABF) framework is proposed. The main feature of this method is that all the information necessary for the adaptation of ABF, including microphone calibration, is obtained from meeting recordings based on the results of speech-event detection. The performance of the separation is evaluated via ASR using real meeting recordings.</description><Author>Futoshi Asano, Kiyoshi Yamamoto, Jun Ogata, Miichi Yamada, and Masami Nakamura</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/65420</link><description>We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is also very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error by 12.5&amp;#37; at &amp;#x2212;5 dB signal-to-noise ratio (SNR) and by 6.2&amp;#37; across all SNRs (&amp;#x2212;5 dB SNR to +15 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.</description><Author>Bhiksha Raj, Lorenzo Turicchia, Bent Schmidt-Nielsen, and Rahul Sarpeshkar</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item></channel></rss>