﻿<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>EURASIP Journal on Audio, Speech, and Music Processing</title><link>http://www.hindawi.com</link><description>The latest articles from Hindawi Publishing Corporation</description><copyright>&amp;#169; 2012, Hindawi Publishing Corporation. All rights reserved.</copyright><item><title>Phoneme and Sentence-Level Ensembles for Speech Recognition</title><link>http://www.hindawi.com/journals/asmp/2011/426792/</link><description>We address the question of whether and how boosting and bagging
can be used for speech recognition. In order to do this, we compare two
different boosting schemes, one at the phoneme level and one at the
utterance level, with a phoneme-level bagging scheme. We control for
many parameters and other choices, such as the state inference scheme
used. In an unbiased experiment, we clearly show that the gain of boosting
methods compared to a single hidden Markov model is in all cases only
marginal, while bagging significantly outperforms all other methods. We
thus conclude that bagging methods, which have so far been overlooked
in favour of boosting, should be examined more closely as a potentially
useful ensemble learning technique for speech recognition.</description><Author>Christos Dimitrakakis and Samy Bengio</Author><copyright>Copyright &amp;#xa9; 2011 Christos Dimitrakakis and Samy Bengio. All rights reserved.</copyright></item><item><title>Multiple Source Localization Based on Acoustic Map De-Emphasis</title><link>http://www.hindawi.com/journals/asmp/2010/147495/</link><description>This paper describes a novel approach for localization of multiple sources overlapping in time. The proposed algorithm relies on
acoustic maps computed in multi-microphone settings, which are descriptions of the distribution of the acoustic activity in a monitored area. Through a proper processing of the acoustic maps, the positions of two or more simultaneously active acoustic sources can be estimated in a robust way. Experimental results obtained on real data collected for this specific task show the capabilities of the given method both with distributed microphone networks and with compact arrays.</description><Author>Alessio Brutti, Maurizio Omologo, and Piergiorgio Svaizer</Author><copyright>Copyright &amp;#xa9; 2010 Alessio Brutti et al. All rights reserved.</copyright></item><item><title>Pitch Ranking, Melody Contour and Instrument Recognition Tests Using Two Semitone Frequency Maps for Nucleus Cochlear Implants</title><link>http://www.hindawi.com/journals/asmp/2010/948565/</link><description>To overcome harmonic structure distortions of complex tones in the low frequency range due to the frequency-to-electrode mapping function used in Nucleus cochlear implants, two modified frequency maps based on a semitone frequency scale (Smt-MF and Smt-LF) were implemented and evaluated. The semitone maps were compared against the standard (Std) mapping in three psychoacoustic experiments: pitch ranking, melody contour identification (MCI), and instrument recognition. In the pitch ranking test, two tones were presented to normal hearing (NH) subjects. The MCI test presented different acoustic patterns to NH subjects and CI recipients, who had to identify the patterns. In the instrument recognition (IR) test, a musical piece was played by eight instruments which subjects had to identify. Pitch ranking results showed improvements with semitone mapping over Std mapping. This was reflected in the MCI results with both NH subjects and CI recipients. Smt-LF sounded unnaturally high-pitched due to frequency transposition. 
Clarinet recognition was significantly enhanced with Smt-MF but the average IR decreased.  Pitch ranking and MCI showed improvements with semitone mapping over Std mapping. However, the frequency limits of Smt-LF and Smt-MF produced difficulties when partials were filtered out due to the frequency limits. Although Smt-LF provided better pitch ranking and MCI, the perceived sounds were much higher in pitch and some CI recipients disliked it. Smt-MF maps the tones closer to their natural characteristic frequencies and probably sounded more natural than Smt-LF.</description><Author>Sherif A. Omran, Waikong Lai, and Norbert Dillier</Author><copyright>Copyright &amp;#xa9; 2010 Sherif A. Omran et al. All rights reserved.</copyright></item><item><title>Monaural Voiced Speech Segregation Based on Dynamic Harmonic Function</title><link>http://www.hindawi.com/journals/asmp/2010/252374/</link><description>Correlogram is an important representation for periodic signals. It is widely used in pitch estimation and source separation. For these applications, major problems of correlogram are its low resolution and redundant information. This paper proposes a voiced speech segregation system based on a newly introduced concept called dynamic harmonic function (DHF). In the proposed system, conventional correlograms are further processed by replacing the autocorrelation function (ACF) with DHF. The advantages of DHF are: 1) peak&amp;#39;s width is adjustable by controlling the variance of the Gaussian function and 2) the invalid peaks of ACF, not at the pitch period, tend to be suppressed. Based on DHF, pitch detection and effective source segregation algorithms are proposed. Our system is systematically evaluated and compared with the correlogram-based system. 
Both the signal-to-noise ratio results and the perceptual evaluation of speech quality scores show that the proposed system yields substantially better performance.</description><Author>Xueliang Zhang, Wenju Liu, and Bo Xu</Author><copyright>Copyright &amp;#xa9; 2010 Xueliang Zhang et al. All rights reserved.</copyright></item><item><title>A Novel MPEG Audio Degrouping Algorithm and Its Architecture Design</title><link>http://www.hindawi.com/journals/asmp/2010/737450/</link><description>Degrouping is the key component in MPEG Layer II audio decoding. It mainly contains
the arithmetic operations of division and modulo. So far, no dedicated degrouping algorithm or architecture has been well realized. In this paper, we propose a novel degrouping algorithm and a low-complexity architecture design for it. Our approach uses only addition and subtraction instead of the division and modulo arithmetic operations.
It achieves an equivalent result without any loss of accuracy. The proposed design requires no multiplier, divider, or ROM table, and thus reduces the design complexity and chip area. In addition, it does not need any programming effort on numerical analysis. The results show that the design is simple and low cost.
Furthermore, it achieves high efficiency at a fixed throughput of only one clock cycle per sample. The VLSI implementation result indicates a gate count of only 527.</description><Author>Tsung-Han Tsai</Author><copyright>Copyright &amp;#xa9; 2010 Tsung-Han Tsai. All rights reserved.</copyright></item><item><title>Instrumental Estimation of E-Model Parameters for Wideband Speech Codecs</title><link>http://www.hindawi.com/journals/asmp/2010/782731/</link><description>A method is described for quantifying the quality of wideband speech codecs. Two parameters are
derived from signal-based speech quality model estimations: (i) a wideband equipment impairment factor Ie,WB and (ii) a wideband packet-loss robustness factor Bpl,WB. The equipment impairment factor can be combined with impairment factors for other quality
degradations to form an estimate of the overall conversational quality R of a wideband communication scenario, using a wideband extension of the E-model. The packet-loss robustness factor captures the robustness of the codec against packet-loss degradations. In contrast to past work, these parameters are no longer determined on the basis of auditory test results, but from signal-based speech quality models. We applied three intrusive models to several databases and compared the derived quality estimates and impairment factors to those obtained from auditory tests. The results show that, when migrating from narrowband to wideband transmission, a quality improvement of roughly 30&amp;#x00025; can be obtained, which is very similar to the one observed in auditory tests. The estimated impairment factors show a high correlation to those derived from auditory scores. Congruences and
discrepancies to auditory test results are discussed, and an outline of work necessary to set up a wideband or even superwideband
E-model is given.</description><Author>Sebastian M&amp;#246;ller, Nicolas C&amp;#244;t&amp;#233;, Val&amp;#233;rie Gautier-Turbin, Nobuhiko Kitawaki, and Akira Takahashi</Author><copyright>Copyright &amp;#xa9; 2010 Sebastian M&amp;#xf6;ller et al. All rights reserved.</copyright></item><item><title>The Effect of a Voice Activity Detector on the Speech Enhancement Performance of the Binaural Multichannel Wiener Filter</title><link>http://www.hindawi.com/journals/asmp/2010/840294/</link><description>A multimicrophone speech enhancement algorithm for binaural hearing aids that preserves interaural time delays was proposed recently. The algorithm is based on multichannel Wiener filtering and relies on a voice activity detector (VAD) for estimation of second-order statistics. Here, the effect of a VAD on the speech enhancement of this algorithm was evaluated using an envelope-based VAD, and the performance was compared to that achieved using an ideal error-free VAD. The performance was considered for stationary directional noise and nonstationary diffuse noise interferers at input SNRs from &amp;#x02212;10 to +5&amp;#x02009;dB. Intelligibility-weighted SNR improvements of about 20&amp;#x2009;dB and 6&amp;#x2009;dB were found for the directional and diffuse noise, respectively. No large degradations (&amp;#x003C;1&amp;#x2009;dB) due to the use of envelope-based VAD were found down to an input SNR of 0&amp;#x2009;dB for the directional noise and &amp;#x02212;5&amp;#x2009;dB for the diffuse noise. At lower input SNRs, the improvement decreased gradually to 15&amp;#x2009;dB for the directional noise and 3&amp;#x2009;dB for the diffuse noise.</description><Author>Jasmina Catic, Torsten Dau, J&amp;#246;rg M. Buchholz, and Fredrik Gran</Author><copyright>Copyright &amp;#xa9; 2010 Jasmina Catic et al. 
All rights reserved.</copyright></item><item><title>Optimizing the Directivity of Multiway Loudspeaker Systems</title><link>http://www.hindawi.com/journals/asmp/2010/928439/</link><description>In multiway loudspeaker systems, digital signal processing techniques have been used to correct the frequency response, the propagation time, and the lobing errors. These solutions are mainly based on correcting the delays between the signals coming from loudspeaker system transducers, and they still show limited performance over the overlap frequency bands. In this paper, we propose an enhanced optimization of relevant directivity characteristics of a multiway
loudspeaker system such as the frequency response, the radiation pattern, and the directivity index over extended transducer frequency overlap bands. The optimization process is based on applying complex weights to the crossover filter transfer functions by using an iterative approach.</description><Author>Hmaied Shaiek and Jean Marc Boucher</Author><copyright>Copyright &amp;#xa9; 2010 Hmaied Shaiek and Jean Marc Boucher. All rights reserved.</copyright></item><item><title>On the Characterization of Slowly Varying Sinusoids</title><link>http://www.hindawi.com/journals/asmp/2010/941732/</link><description>We give a brief discussion on the amplitude and frequency variation rates of the sinusoid representation of signals. In particular, we derive three inequalities that show that these rates are upper bounded by the 2nd and 4th spectral moments, which, in a loose sense, indicates that every complex signal with narrow short-time bandwidths is a slowly varying sinusoid. Further discussions are given to show how this result helps provide extra insights into relevant signal processing techniques.</description><Author>Xue Wen and Mark Sandler</Author><copyright>Copyright &amp;#xa9; 2010 Xue Wen and Mark Sandler. All rights reserved.</copyright></item><item><title>Efficient Advertisement Discovery for Audio Podcast Content Using Candidate Segmentation</title><link>http://www.hindawi.com/journals/asmp/2010/572571/</link><description>Audio podcasting is now widely used by online sites such as newspapers, web portals, and journals to deliver audio content to users through download or subscription. Within a single podcast story of 1 to 30 minutes, multiple audio advertisements (ads), each 5 to 30 seconds long, are often inserted and repeated at different locations. Automatic detection of these attached ads is a challenging task due to the complexity of the search algorithms. 
Based on the knowledge of typical structures of podcast contents, this paper proposes a novel efficient advertisement discovery approach for large audio podcasting collections. The proposed approach offers a significant improvement in search speed with sufficient accuracy. The acceleration comes from candidate segmentation and a sampling technique, which are introduced to reduce both the search areas and the number of matching frames. The approach has been tested over a variety of podcast contents collected from MIT Technology Review, Scientific American, and Singapore Podcast websites. Experimental results show that the proposed algorithm achieves a detection rate of 97.5&amp;#37; with a significant computation saving as compared to existing state-of-the-art methods.</description><Author>M. N. Nguyen, Qi Tian, and Ping Xue</Author><copyright>Copyright &amp;#xa9; 2010 M. N. Nguyen et al. All rights reserved.</copyright></item><item><title>Correlation-Based Amplitude Estimation of Coincident Partials in Monaural Musical Signals</title><link>http://www.hindawi.com/journals/asmp/2010/523791/</link><description>This paper presents a method for estimating the amplitude of coincident partials generated by harmonic musical sources (instruments and vocals). It was developed as an alternative to the commonly
used interpolation approach, which has several limitations in terms of performance and applicability.
The strategy is based on the following observations: (a) the parameters of partials vary with time; (b)
such a variation tends to be correlated when the partials belong to the same source; (c) the presence
of an interfering coincident partial reduces the correlation; and (d) such a reduction is proportional to
the relative amplitude of the interfering partial. Besides the improved accuracy, the proposed technique
has other advantages over its predecessors: it works properly even if the sources have the same fundamental frequency, it is able to estimate the first partial (fundamental), which is not possible using the
conventional interpolation method, it can estimate the amplitude of a given partial even if its neighbors
suffer intense interference from other sources, it works properly under noisy conditions, and it is immune
to intraframe permutation errors. Experimental results show that the strategy clearly outperforms the
interpolation approach.</description><Author>Jayme Garcia Arnal Barbedo and George Tzanetakis</Author><copyright>Copyright &amp;#xa9; 2010 Jayme Garcia Arnal Barbedo and George Tzanetakis. All rights reserved.</copyright></item><item><title>Employing Second-Order Circular Suprasegmental Hidden Markov Models to Enhance Speaker Identification Performance in Shouted Talking Environments</title><link>http://www.hindawi.com/journals/asmp/2010/862138/</link><description>Speaker identification performance is almost perfect in neutral talking environments. However, the performance deteriorates significantly in shouted talking environments. This work proposes, implements, and evaluates new models called Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) to alleviate the deteriorated performance in shouted talking environments. These proposed models possess the characteristics of both Circular Suprasegmental Hidden Markov Models (CSPHMMs) and Second-Order Suprasegmental Hidden Markov Models (SPHMM2s). The results of this work show that CSPHMM2s outperform each of First-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM1s), Second-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM2s), and First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s) in shouted talking environments. In such talking environments and using our collected speech database, average speaker identification performance based on LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s, and CSPHMM2s is 74.6&amp;#37;, 78.4&amp;#37;, 78.7&amp;#37;, and 83.4&amp;#37;, respectively. Speaker identification performance obtained based on CSPHMM2s is close to that obtained based on subjective assessment by human listeners.</description><Author>Ismail Shahin</Author><copyright>Copyright &amp;#x00A9; 2010 Ismail Shahin. 
All rights reserved.</copyright></item><item><title>Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions</title><link>http://www.hindawi.com/journals/asmp/2010/651420/</link><description>When a number of speakers are simultaneously
active, for example in meetings or noisy public places, the sources
of interest need to be separated from interfering speakers and
from each other in order to be robustly recognized. Independent
component analysis (ICA) has proven a valuable tool for this
purpose. However, ICA outputs can still contain strong residual
components of the interfering speakers whenever noise or reverberation
is high. In such cases, nonlinear postprocessing can be
applied to the ICA outputs, for the purpose of reducing remaining
interferences. In order to improve robustness to the artefacts and
loss of information caused by this process, recognition can be
greatly enhanced by considering the processed speech feature
vector as a random variable with time-varying uncertainty,
rather than as deterministic. The aim of this paper is to show
the potential to improve recognition of multiple overlapping
speech signals through nonlinear postprocessing together with
uncertainty-based decoding techniques.</description><Author>Dorothea Kolossa, Ramon Fernandez Astudillo, Eugen Hoffmann, and Reinhold Orglmeister</Author><copyright>Copyright &amp;#x00A9; 2010 Dorothea Kolossa et al. All rights reserved.</copyright></item><item><title>Adaptive Long-Term Coding of LSF Parameters Trajectories for Large-Delay/Very- to Ultra-Low Bit-Rate Speech Coding</title><link>http://www.hindawi.com/journals/asmp/2010/597039/</link><description>This paper presents a model-based method for coding the LSF parameters of LPC speech coders on a &amp;#8220;long-term&amp;#8221; basis, that is, beyond the usual 20&amp;#x2013;30&amp;#x2009;ms frame duration. The objective is to provide efficient LSF quantization for a speech coder with large delay but very- to ultra-low bit-rate (i.e., below 1&amp;#x2009;kb/s). To do this, speech is first segmented into voiced/unvoiced segments. A Discrete Cosine model of the time trajectory of the LSF vectors is then applied to each segment to capture the LSF interframe correlation over the whole segment. Bi-directional transformation from the model coefficients to a reduced set of LSF vectors enables both efficient  &amp;#8220;sparse&amp;#8221; coding (using here multistage vector quantizers) and the generation of interpolated LSF vectors at the decoder. The proposed method provides up to 50&amp;#37; gain in bit-rate over frame-by-frame quantization while preserving signal quality and competes favorably with 2D-transform coding for the lower range of tested bit rates. Moreover, the implicit time-interpolation nature of the long-term coding process provides this technique a high potential for use in speech synthesis systems.</description><Author>Laurent Girin</Author><copyright>Copyright &amp;#x00A9; 2010 Laurent Girin. 
All rights reserved.</copyright></item><item><title>Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition</title><link>http://www.hindawi.com/journals/asmp/2009/304579/</link><description>Fractional Fourier transform (FrFT) has been proposed to improve the time-frequency resolution in signal analysis and processing. However, selecting the FrFT transform order for the proper analysis of multicomponent signals like speech is still debated. In this work, we investigated several order adaptation methods. Firstly, FFT- and FrFT- based spectrograms of an artificially-generated vowel are compared to demonstrate the methods. Secondly, an acoustic feature set combining MFCC and FrFT is proposed, and the transform orders for the FrFT are adaptively set according to various methods based on pitch and formants. A tonal vowel discrimination test is designed to compare the performance of these methods using the feature set. The results show that the FrFT-MFCC yields a better discriminability of tones and also of vowels, especially by using multitransform-order methods. Thirdly, speech recognition experiments were conducted on the clean intervocalic English consonants provided by the Consonant Challenge. Experimental results show that the proposed features with different order adaptation methods can obtain slightly higher recognition rates compared to the reference MFCC-based recognizer.</description><Author>Hui Yin, Climent Nadeu, and Volker Hohmann</Author><copyright>Copyright &amp;#x00A9; 2009 Hui Yin et al. All rights reserved.</copyright></item><item><title>Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition</title><link>http://www.hindawi.com/journals/asmp/2009/140575/</link><description>We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. 
The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that the misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the &amp;#8220;query relevance.&amp;#8221; Combining these two ideas, we can alleviate the adverse effect of misrecognized keywords by decreasing the number of downloaded Web documents from queries that contain misrecognized keywords. 
Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29&amp;#37;) was improved to 60.38&amp;#37;, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25&amp;#37;).</description><Author>Akinori Ito, Yasutomo Kajiura, Motoyuki Suzuki, and Shozo Makino</Author><copyright>Copyright &amp;#x00A9; 2009 Akinori Ito et al. All rights reserved.</copyright></item><item><title>Drum Sound Detection in Polyphonic Music with Hidden Markov Models</title><link>http://www.hindawi.com/journals/asmp/2009/497292/</link><description>This paper proposes a method for transcribing drums from polyphonic 
music using a network of connected hidden Markov models (HMMs). The task 
is to detect the temporal locations of unpitched percussive sounds (such 
as bass drum or hi-hat) and recognise the instruments played. Contrary 
to many earlier methods, a separate sound event segmentation is not 
done, but connected HMMs are used to perform the segmentation and 
recognition jointly. Two ways of using HMMs are studied: modelling 
combinations of the target drums and a detector-like modelling of each 
target drum. Acoustic feature parametrisation is done with mel-frequency 
cepstral coefficients and their first-order temporal derivatives. The 
effect of lowering the feature dimensionality with principal component 
analysis and linear discriminant analysis is evaluated. Unsupervised 
acoustic model parameter adaptation with maximum likelihood linear 
regression is evaluated for compensating the differences between the 
training and target signals. The performance of the proposed method is 
evaluated on a publicly available data set containing signals with and 
without accompaniment, and compared with two reference methods. The 
results suggest that the transcription is possible using connected HMMs, 
and that using detector-like models for each target drum provides a 
better performance than modelling drum combinations.</description><Author>Jouni Paulus and Anssi Klapuri</Author><copyright>Copyright &amp;#x00A9; 2009 Jouni Paulus and Anssi Klapuri. All rights reserved.</copyright></item><item><title>Compact Acoustic Models for Embedded Speech Recognition</title><link>http://www.hindawi.com/journals/asmp/2009/806186/</link><description>Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training
data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain the state-dependent probability density functions of each HMM, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques) with no need for state-dependent data. The whole approach results in a relative gain of more than 20&amp;#37; compared to a basic HMM-based system fitting the constraints.</description><Author>Christophe L&amp;#233;vy, Georges Linar&amp;#232;s, and Jean-Fran&amp;#231;ois Bonastre</Author><copyright>Copyright &amp;#x00A9; 2009 Christophe L&amp;#233;vy et al. All rights reserved.</copyright></item><item><title>An Adaptive Framework for Acoustic Monitoring of Potential Hazards</title><link>http://www.hindawi.com/journals/asmp/2009/594103/</link><description>Robust recognition of general audio events constitutes a topic of intensive research in the signal processing community. This work presents an efficient methodology for acoustic surveillance of atypical situations which can find use under different acoustic backgrounds. The primary goal is the continuous acoustic monitoring of a scene for potentially hazardous events in order to help an authorized officer to take the appropriate actions towards preventing human loss and/or property damage. A probabilistic hierarchical scheme is designed based on Gaussian mixture models and state-of-the-art sound parameters selected through extensive experimentation. 
A feature of the proposed system is its model adaptation loop that provides adaptability to different sound environments. We report extensive experimental results including installation in a real environment and operational detection rates for three days of operation on a 24-hour basis. Moreover, we adopt a reliable testing procedure that demonstrates high performance with regard to average recognition, miss probability, and false alarm rates.</description><Author>Stavros Ntalampiras, Ilyas Potamitis, and Nikos Fakotakis</Author><copyright>Copyright &amp;#x00A9; 2009 Stavros Ntalampiras et al. All rights reserved.</copyright></item><item><title>Performance Study of Objective Speech Quality Measurement for Modern Wireless-VoIP Communications</title><link>http://www.hindawi.com/journals/asmp/2009/104382/</link><description>Wireless-VoIP communications introduce perceptual degradations that are not present with traditional VoIP communications. This paper investigates the effects of such degradations on the performance of three state-of-the-art standard objective quality measurement algorithms&amp;#8212;PESQ, P.563, and an &amp;#8220;extended&amp;#8221; E-model. The comparative study suggests that measurement performance is significantly affected by acoustic background noise type and level as well as speech codec and packet loss concealment strategy. On our data, PESQ attains superior overall performance
and P.563 and E-model attain comparable performance figures.</description><Author>Tiago H. Falk and Wai-Yip Chan</Author><copyright>Copyright &amp;#x00A9; 2009 Tiago H. Falk and Wai-Yip Chan. All rights reserved.</copyright></item><item><title>Adaptive V/UV Speech Detection Based on Characterization of Background Noise</title><link>http://www.hindawi.com/journals/asmp/2009/965436/</link><description>The paper presents an adaptive system for Voiced/Unvoiced (V/UV) speech detection in the presence of background noise. Genetic algorithms were used to select the features that offer the best V/UV detection according to the output of a background Noise Classifier (NC) and a Signal-to-Noise Ratio Estimation (SNRE) system. The system was implemented, and the tests performed using the TIMIT speech corpus and its phonetic classification. The results were compared with a nonadaptive classification system and the V/UV detectors adopted by two important speech coding standards: the V/UV detection system in the ETSI ES 202 212 v1.1.2 and the speech classification in the Selectable Mode Vocoder (SMV) algorithm. In all cases the proposed adaptive V/UV classifier outperforms the traditional solutions giving an improvement of 25&amp;#37; in very noisy environments.</description><Author>F. Beritelli, S. Casale, A. Russo, and S. Serrano</Author><copyright>Copyright &amp;#x00A9; 2009 F. Beritelli et al. All rights reserved.</copyright></item><item><title>Signal Processing Implementation and Comparison of Automotive Spatial Sound Rendering Strategies</title><link>http://www.hindawi.com/journals/asmp/2009/876297/</link><description>Design and implementation strategies of spatial sound rendering are investigated in this paper for automotive scenarios. Six design methods are implemented for various rendering modes with different number of passengers. Specifically, the downmixing algorithms aimed at balancing the front and back reproductions are developed for the 5.1-channel input. 
The other five algorithms, based on inverse filtering, are implemented in two approaches. The first approach utilizes binaural Head-Related Transfer Functions (HRTFs) measured in the car interior, whereas the second approach, named the point-receiver model, targets a point receiver positioned at the center of the passenger&amp;#39;s head. The proposed processing algorithms were compared via objective and subjective experiments under various listening conditions. Test data were processed by the multivariate analysis of variance (MANOVA) method and the least significant difference (Fisher&amp;#39;s LSD) method as a post hoc test to justify the statistical significance of the experimental data. The results indicate that inverse filtering algorithms are preferred for the single passenger mode. For the multipassenger mode, however, downmixing algorithms generally outperformed the other processing techniques.</description><Author>Mingsian R. Bai and Jhih-Ren Hong</Author><copyright>Copyright &amp;#x00A9; 2009 Mingsian R. Bai and Jhih-Ren Hong. All rights reserved.</copyright></item><item><title>Tracking Intermittently Speaking Multiple Speakers Using a Particle Filter</title><link>http://www.hindawi.com/journals/asmp/2009/673202/</link><description>The problem of tracking multiple intermittently speaking speakers is difficult because several distinct problems must be addressed. The number of active speakers must be estimated, these active speakers must be identified, and the locations of all speakers including inactive speakers must be tracked. In this paper we propose a method for tracking intermittently speaking multiple speakers using a particle filter. In the proposed algorithm the number of active speakers is first estimated based on the Exponential Fitting Test (EFT), a source number estimation technique which we have proposed. 
The locations of the speakers are then tracked using a particle filtering framework within which the decomposed likelihood is used in order to decouple the observed audio signal and associate each element of the decomposed signal with an active speaker. The tracking accuracy is then further improved by the inclusion of a silence region detection step and estimation of the noise-only covariance matrix. The method was evaluated using live recordings of 3 speakers and the results show that the method produces highly accurate tracking results.</description><Author>Angela Quinlan, Mitsuru Kawamoto, Yosuke Matsusaka, Hideki Asoh, and Futoshi Asano</Author><copyright>Copyright &amp;#x00A9; 2009 Angela Quinlan et al. All rights reserved.</copyright></item><item><title>Musical Sound Separation Based on Binary Time-Frequency Masking</title><link>http://www.hindawi.com/journals/asmp/2009/130567/</link><description>The problem of overlapping harmonics is particularly acute in musical sound separation and has not been addressed adequately. We propose a monaural system based on binary time-frequency masking with an emphasis on robust decisions in time-frequency regions, where harmonics from different sources overlap. Our computational auditory scene analysis system exploits the observation that sounds from the same source tend to have similar spectral envelopes. Quantitative results show that utilizing spectral similarity helps binary decision making in overlapped time-frequency regions and significantly improves
separation performance.</description><Author>Yipeng Li and DeLiang Wang</Author><copyright>Copyright &amp;#x00A9; 2009 Yipeng Li and DeLiang Wang. All rights reserved.</copyright></item><item><title>Analysis of Salient Feature Jitter in the Cochlea for Objective Prediction of Temporally Localized Distortion in Synthesized Speech</title><link>http://www.hindawi.com/journals/asmp/2009/865723/</link><description>Temporally localized distortions account for the highest variance in subjective evaluation of coded speech signals (Sen (2001) and Hall (2001)). The ability to discern and decompose perceptually relevant temporally localized coding noise from other types of distortions is both of theoretical importance and a valuable tool for deploying and designing speech synthesis systems. The work described within uses a physiologically motivated cochlear model to provide a tractable analysis of salient feature trajectories as processed by the cochlea. Subsequent statistical analysis shows simple relationships between the jitter of these trajectories and temporal attributes of the Diagnostic Acceptability Measure (DAM).</description><Author>Wenliang Lu and D. Sen</Author><copyright>Copyright &amp;#x00A9; 2009 Wenliang Lu and D. Sen. All rights reserved.</copyright></item><item><title>A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation</title><link>http://www.hindawi.com/journals/asmp/2009/239892/</link><description>We present an efficient algorithm for segmentation of audio signals into speech or music. The central motivation for our study is consumer audio applications, where various real-time enhancements are often applied. The algorithm consists of a learning phase and a classification phase. 
In the learning phase, predefined training data is used for computing various time-domain and frequency-domain features, for speech and music signals separately, and estimating the optimal speech/music thresholds, based on the probability density functions of the features. An automatic procedure is employed to select the best features for separation. In the classification phase, initial classification is performed for each segment of the audio signal, using a three-stage sieve-like approach, applying both Bayesian and rule-based methods. To avoid erroneous rapid alternations in the classification, a smoothing technique is applied, averaging the decision on each segment with past segment decisions. Extensive evaluation of the algorithm, on a database of more than 12 hours of speech and more than 22 hours of music, showed correct identification rates of 99.4&amp;#37; and 97.8&amp;#37;, respectively, and quick adjustment to alternating speech/music sections. In addition to its accuracy and robustness, the algorithm can be easily adapted to different audio types, and is suitable for real-time operation.</description><Author>Yizhar Lavner and Dima Ruinskiy</Author><copyright>Copyright &amp;#x00A9; 2009 Yizhar Lavner and Dima Ruinskiy. All rights reserved.</copyright></item><item><title>Integrated Phoneme Subspace Method for Speech Feature Extraction</title><link>http://www.hindawi.com/journals/asmp/2009/690451/</link><description>Speech feature extraction has been a key focus in robust speech recognition research. In this work, we discuss data-driven
linear feature transformations applied to feature vectors in the logarithmic mel-frequency filter bank domain. Transformations are based on principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA). Furthermore, this paper introduces a new feature extraction technique that collects the correlation information among phoneme subspaces and reconstructs feature space for representing phonemic information efficiently. The proposed speech feature vector is generated by projecting an observed vector onto an integrated phoneme subspace (IPS) based on PCA or ICA. The performance of the new feature was evaluated for isolated word speech recognition. The proposed method provided higher recognition accuracy than conventional methods in clean and reverberant environments.</description><Author>Hyunsin Park, Tetsuya Takiguchi, and Yasuo Ariki</Author><copyright>Copyright &amp;#x00A9; 2009 Hyunsin Park et al. All rights reserved.</copyright></item><item><title>Analysis of Damped Mass-Spring Systems for Sound Synthesis</title><link>http://www.hindawi.com/journals/asmp/2009/947823/</link><description>There are many ways of synthesizing sound on a computer. The method that we consider, called a mass-spring system, synthesizes sound by simulating the vibrations of a network of interconnected masses, springs, and dampers. Numerical methods are required to approximate the differential equation of a mass-spring system. The standard 
numerical method used in implementing mass-spring systems for use in sound synthesis is the symplectic Euler 
method. Implementers and users of mass-spring systems should be aware of the limitations of the numerical methods used; in 
particular, we are interested in their stability and accuracy. We present an analysis of the symplectic 
Euler method that shows the conditions under which the method is stable and the accuracy of the decay rates and frequencies of the 
sounds produced.</description><Author>Don Morgan and Sanzheng Qiao</Author><copyright>Copyright &amp;#x00A9; 2009 Don Morgan and Sanzheng Qiao. All rights reserved.</copyright></item><item><title>An Overview of the Coding Standard MPEG-4 Audio Amendments 1 and 2: HE-AAC, SSC, and HE-AAC v2</title><link>http://www.hindawi.com/journals/asmp/2009/468971/</link><description>In 2003 and 2004, the ISO/IEC MPEG standardization committee added two amendments to their MPEG-4 audio coding standard. These amendments concern parametric coding techniques and encompass Spectral Band Replication (SBR), Sinusoidal Coding (SSC), and Parametric Stereo (PS). In this paper, we will give an overview of the basic ideas behind these techniques and references to more detailed information. Furthermore, the results of listening tests as performed during
the final stages of the MPEG-4 standardization process are presented in order to illustrate the performance of these techniques.</description><Author>A. C. den Brinker, J. Breebaart, P. Ekstrand, J. Engdeg&amp;#xE5;rd, F. Henn, K. Kj&amp;#246;rling, W. Oomen, and H. Purnhagen</Author><copyright>Copyright &amp;#x00A9; 2009 A. C. den Brinker et al. All rights reserved.</copyright></item><item><title>Recognition of Noisy Speech: A Comparative Survey of Robust Model Architecture and Feature Enhancement</title><link>http://www.hindawi.com/journals/asmp/2009/942617/</link><description>Performance of speech recognition systems strongly degrades in the presence of background noise, like the driving noise inside a car. In contrast to existing works, we aim to improve noise robustness focusing on all major levels of speech recognition: feature extraction, feature enhancement, speech modelling, and training. Thereby, we give an overview of promising auditory modelling concepts, speech enhancement techniques, training strategies, and model architecture, which are implemented in an in-car digit and spelling recognition task considering noises produced by various car types and driving conditions. We prove that joint speech and noise modelling with a Switching Linear Dynamic Model (SLDM) outperforms speech enhancement techniques like Histogram Equalisation (HEQ) with a mean relative error reduction of 52.7&amp;#37; over various noise types and levels. Embedding a Switching Linear Dynamical System (SLDS) into a Switching Autoregressive Hidden Markov Model (SAR-HMM) prevails for speech disturbed by additive white Gaussian noise.</description><Author>Bj&amp;#246;rn Schuller, Martin W&amp;#246;llmer, Tobias Moosmayr, and Gerhard Rigoll</Author><copyright>Copyright &amp;#x00A9; 2009 Bj&amp;#246;rn Schuller et al. All rights reserved.</copyright></item></channel></rss>
