Superwideband Bandwidth Extension Using Normalized MDCT Coefficients for Scalable Speech and Audio Coding

Lee, Young Han; Choi, Seung Ho

doi:https://doi.org/10.1155/2013/909124

Advances in Multimedia


Research Article | Open Access

Volume 2013 | Article ID 909124 | https://doi.org/10.1155/2013/909124


Academic Editor: Jinan Fiaidhi

Received 30 Apr 2013

Accepted 12 Jun 2013

Published 08 Jul 2013

Abstract

A bandwidth extension (BWE) algorithm from wideband to superwideband (SWB) is proposed for a scalable speech/audio codec that uses modified discrete cosine transform (MDCT) coefficients as spectral parameters. In the proposed BWE algorithm, the superwideband is first split into several subbands, each of which is represented by a gain parameter and normalized MDCT coefficients. We then estimate the normalized MDCT coefficients of the wideband to be fetched for the superwideband and quantize the fetch indices. After that, we quantize the gain parameters by using relative ratios between adjacent subbands. The proposed BWE algorithm is embedded into a standard superwideband codec, the SWB extension of G.729.1 Annex E, and its bitrate and quality are compared with those of the BWE algorithm already employed in that codec. The comparison shows that the proposed BWE algorithm reduces the bitrate by around 19% in relative terms while providing better quality than the BWE algorithm in the SWB extension of G.729.1 Annex E.

1. Introduction

In early speech communication services, narrowband codecs having a bandwidth of around 3.4 kHz were commonly used since the available network bandwidth was quite limited. These services could provide sufficient quality for comprehension, but it was generally agreed that they did not satisfy users' increasing expectations for higher sound quality. Due to the advances in network technologies, however, this transmission bandwidth has recently been increased [1–3]. Thus, a great deal of research has been focused on further extending the bandwidth of speech and/or audio signals from narrowband to wideband, superwideband, and audio band [4–6].

There are two kinds of approaches for extending the bandwidth, depending on whether side information is available, as shown in Figure 1. As depicted in Figure 1(a), bandwidth extension is usually realized by using side information that is transmitted from the encoder. On the other hand, it is also possible to extend the bandwidth only at the decoder without any side information [7], as shown in Figure 1(b). In other words, instead of using side information, artificial bandwidth extension estimates the higher band signal from the lower band signal by using a pattern recognition algorithm such as hidden Markov models (HMMs) [8], Gaussian mixture models (GMMs) [9], and others [10–15]. While artificial bandwidth extension algorithms do not require any additional bits for side information, their performance is limited by that of the underlying pattern recognition algorithm and is somewhat restricted compared with bandwidth extension using side information [16–20]. Thus, the issue for bandwidth extension with side information is how to improve the performance of the algorithm with a smaller increase in the bitrate for the side information.

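Figure 1: Two approaches to bandwidth extension: (a) using side information transmitted from the encoder; (b) artificial bandwidth extension at the decoder without side information.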

In this paper, we propose a superwideband bandwidth extension (BWE) algorithm using normalized spectral coefficients for scalable speech and audio coding, where speech and audio signals are transformed into the spectral domain via a modified discrete cosine transform (MDCT). To this end, the superwideband is split into several subbands, and the MDCT coefficients belonging to each subband are normalized by the subband gain, which is obtained from the squared sum of the MDCT coefficients in that subband. We then estimate which normalized MDCT coefficients of the wideband are appropriate for the superwideband in order to fetch them from the wideband. Moreover, to improve the coding efficiency, we incorporate a quantization scheme that deals with the relative ratio of gains between adjacent subbands, resulting in a reduced bitrate for the proposed BWE algorithm.

The remainder of this paper is organized as follows. A brief review of the existing superwideband extension algorithm is given in Section 2. The proposed superwideband extension algorithm is described in Section 3. The performance evaluation is presented in Section 4. Finally, conclusions are given in Section 5.

2. Superwideband Extension Algorithm

In this section, we briefly review the existing superwideband (SWB) extension algorithm employed in G.729.1 Annex E [17]. The conventional SWB extension algorithm operates in the MDCT domain with a transform length of 640 samples, and it comprises a generic mode and a sinusoidal mode. In the generic mode, the higher band signals are reconstructed by transposing the lower band signals with a gain adjustment. However, since tonal signals, such as those generated by an instrument like a guitar, are characterized by the magnitudes and positions of several tones, the generic mode is not suitable for modeling them. Thus, the sinusoidal mode models the tonal components directly. The mode selection is done by estimating the tonality of the input audio signal. In this paper, we are interested in how to improve the BWE algorithm realized as the generic mode in G.729.1 Annex E while keeping the same mode selection and sinusoidal mode.

Figure 2 shows the structure of the conventional BWE algorithm employed as the generic mode in G.729.1 Annex E. As shown in the figure, the conventional BWE algorithm is composed of two blocks: one is a fetch index search and the other is gain compensation and scaling factor quantization block. As a first step, in order to obtain appropriate MDCT coefficients of the higher band, the maximum cross-correlation lag, , for the jth subband is determined as where and is the total number of subbands. In addition, is the number of MDCT coefficients belonging to the jth subband. In (2), is the kth higher band MDCT coefficient of the jth subband, and is the kth lower band MDCT coefficient reconstructed by the lower band decoder. In addition, and are the low and high boundaries for the fetch index search in the jth subband, respectively. The parameters are set as , , , and [17].
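The fetch index search based on the maximum cross-correlation can be sketched as follows. This is only an illustrative sketch and not the reference implementation of [17]: the function name, the argument names, and the normalization by the energy of the lower band segment are assumptions, since the exact form of the correlation is not given here.

```python
import math

def max_xcorr_lag(high_band, low_band, lag_lo, lag_hi):
    """Return the lag in [lag_lo, lag_hi] whose lower band segment
    correlates best with the higher band subband (hypothetical helper)."""
    n = len(high_band)
    best_lag, best_score = lag_lo, -math.inf
    for lag in range(lag_lo, lag_hi + 1):
        segment = low_band[lag:lag + n]
        if len(segment) < n:
            break  # ran past the end of the lower band
        energy = sum(x * x for x in segment)
        if energy == 0.0:
            continue  # an all-zero segment carries no shape information
        # Assumed normalization: divide by the segment's root energy so that
        # loud segments are not favored merely because of their level.
        score = sum(h * l for h, l in zip(high_band, segment)) / math.sqrt(energy)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A higher band subband that is a scaled copy of low_band[3:6].
low_band = [0.3, 0.1, -0.2, 1.0, -2.0, 3.0, 0.5, 0.2]
high_band = [2.0, -4.0, 6.0]
fetch_index = max_xcorr_lag(high_band, low_band, 0, 5)
```

Because the score is insensitive to the level of the fetched segment, a separate scaling factor is still needed to match the level of the higher band, which is the role of the gain compensation block.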

Next, in order to compensate for the mismatch between and for the jth subband, we compute two different scaling factors in the linear spectral domain as well as in the log spectral domain, and , which are defined as where , , and .

3. Proposed SWB Extension Algorithm

The proposed BWE algorithm has three major differences from the SWB extension algorithm described in Section 2. First, instead of quantizing the scale factor for each subband, the proposed algorithm quantizes the gain of each subband, and the replication of the higher band is done by using the MDCT coefficients normalized by the subband gains. Second, in order to improve the coding efficiency, different numbers of subbands are assigned to the subband gains and to the fetch indices, where the subband gains and the fetch indices are related to the spectral shape and the spectral fine structure, respectively. Since the threshold in quiet for the higher band is much greater than that for the lower band [21], a detailed representation of the spectral shape is more important than that of the spectral fine structure for the higher band. Thus, it is better to assign more bits to the subband gain quantization than to the fetch index quantization. Therefore, we provide more subbands for the gain, referred to as gain subbands, and fewer for the fetch index, referred to as fetch subbands. Third, instead of using a maximum correlation criterion, we apply a minimum mean square error (MMSE) criterion for the fetch index search, which directly minimizes the spectral error.

Figure 3 shows the structure of the proposed BWE algorithm. Compared to Figure 2, the proposed BWE algorithm consists of a gain normalization/quantization block and a fetch index search block. In the proposed BWE algorithm, we first estimate the subband gain of the jth gain subband, , using the equation of where is the total number of gain subbands and is the number of MDCT coefficients in the jth gain subband. In addition, is the kth higher band MDCT coefficient in the jth gain subband.
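A minimal sketch of the subband gain estimation is given below. The introduction states that the gain is obtained from the squared sum of the MDCT coefficients; taking the square root of the mean squared coefficient is an assumption made here for illustration, as are the function name and the representation of the subband layout as a list of sizes.

```python
import math

def subband_gains(coeffs, band_sizes):
    """Estimate one gain per gain subband (hypothetical helper).

    coeffs     : higher band MDCT coefficients, concatenated over subbands
    band_sizes : number of coefficients in each gain subband
    """
    gains, start = [], 0
    for size in band_sizes:
        band = coeffs[start:start + size]
        # Assumed definition: root mean square of the coefficients.
        gains.append(math.sqrt(sum(x * x for x in band) / size))
        start += size
    return gains

gains = subband_gains([3.0, 4.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0], [4, 4])
```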

In this proposed BWE algorithm, we incorporate the concept of differential quantization to reduce the number of bits for quantizing the subband gains. In other words, we quantize the relative ratios of the subband gains, which corresponds to differential quantization applied in the log spectral domain. This is because the dynamic range of the relative ratio is smaller than that of the subband gain. That is, each subband gain can be quantized as where is an n-bit scalar quantizer for . Specifically, the gain of the first gain subband should be quantized with a higher bit quantizer than those of the rest of subband gains; that is, . Next, the MDCT coefficients are normalized by using the corresponding quantized gain, such that where is the kth higher band normalized MDCT coefficient belonging to the jth gain subband.
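The differential gain quantization and the subsequent normalization can be sketched as follows. The sketch assumes uniform scalar quantizers in the log2 domain; the bit widths (5 bits for the first gain, 3 bits for each ratio) and the quantizer ranges are hypothetical values chosen only to make the example concrete, not the settings used in the experiments.

```python
import math

def quantize_uniform(value, lo, hi, bits):
    """Uniform scalar quantizer: nearest of 2**bits levels spanning [lo, hi]."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    index = round((clipped - lo) / (hi - lo) * levels)
    return lo + index * (hi - lo) / levels

def quantize_gains(gains, first_bits=5, diff_bits=3,
                   log_range=(0.0, 15.5), diff_range=(-4.0, 3.0)):
    """Quantize the first gain directly and the others as log-domain ratios."""
    q_logs = [quantize_uniform(math.log2(max(gains[0], 1e-9)),
                               log_range[0], log_range[1], first_bits)]
    for gain in gains[1:]:
        # The ratio to the previous *quantized* gain becomes a difference in
        # the log domain, so quantization errors do not accumulate.
        delta = math.log2(max(gain, 1e-9)) - q_logs[-1]
        q_logs.append(q_logs[-1] + quantize_uniform(
            delta, diff_range[0], diff_range[1], diff_bits))
    return [2.0 ** q for q in q_logs]

def normalize_by_gain(coeffs, band_sizes, quantized_gains):
    """Divide the coefficients of each gain subband by its quantized gain."""
    normalized, start = [], 0
    for size, gain in zip(band_sizes, quantized_gains):
        normalized.extend(x / gain for x in coeffs[start:start + size])
        start += size
    return normalized

q_gains = quantize_gains([4.0, 8.0, 4.0])
normalized = normalize_by_gain([4.0, -8.0, 16.0, 8.0], [2, 2], [4.0, 8.0])
```

Normalizing with the quantized gain rather than the unquantized one keeps the encoder consistent with the decoder, which only has access to the quantized values.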

The fetch indices in the proposed BWE algorithm are then obtained based on the MMSE criterion. In other words, a fetch index, , for the lth fetch subband is found by using the equation of where and . In addition, is the total number of fetch subbands, is the number of MDCT coefficients belonging to the lth fetch subband, and and are the low and high boundaries of the lth fetch subband, respectively.
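The MMSE-based fetch index search can be sketched as follows. The function and argument names are hypothetical; the sketch assumes that both the higher band target and the lower band source have already been normalized by their gains, so that the squared error compares spectral fine structure rather than level.

```python
def mmse_fetch_index(target, source, lag_lo, lag_hi):
    """Return the lag in [lag_lo, lag_hi] that minimizes the squared error
    between the normalized higher band subband and a lower band segment."""
    n = len(target)
    best_lag, best_err = lag_lo, float("inf")
    for lag in range(lag_lo, lag_hi + 1):
        segment = source[lag:lag + n]
        if len(segment) < n:
            break  # ran past the end of the lower band
        err = sum((t - s) ** 2 for t, s in zip(target, segment))
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag

# The target is a slightly perturbed copy of source[2:5].
source = [0.5, -1.0, 0.2, 0.9, -0.4, 0.1]
target = [0.25, 0.85, -0.35]
fetch_index = mmse_fetch_index(target, source, 0, 3)
```

Unlike a correlation criterion, the squared error penalizes level differences as well as shape differences, which is one reason the search is carried out on gain-normalized coefficients.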

In the proposed BWE algorithm, several combinations are available by varying the settings for , , , and , whereas parameters such as the sinusoidal coding and envelope shaping parameters are used for the same operation as in [17].

4. Performance Evaluation

In this section, we compared the performance of the proposed BWE algorithm with that of the conventional BWE algorithm employed in G.729.1 Annex E [17] in terms of bitrate, spectrogram, and quality. Note here that we only replaced the generic mode of the BWE algorithm in the codec.

4.1. Bitrate

The number of bits/frame for the conventional BWE algorithm was 79 bits/frame, and 60 bits/frame out of them were for , , and (for ) in (1) and (3). Even if the BWE algorithm operated as the generic mode, a single sinusoidal component was always modeled, and its location, sign, and amplitude were quantized with 10 bits/frame. The remaining 9 bits/frame out of the 79 bits/frame were assigned for the mode indicator and the envelope shaping that was applied to reduce the pre- and postecho artifacts. For a more detailed explanation on the bit assignment for the conventional BWE algorithm, refer to the literature [17].

For the proposed BWE algorithm, there were a number of possible combinations for the bit assignment. We therefore performed exhaustive experiments by changing the design parameters of the proposed BWE algorithm. Eventually, it was found that by setting , , , and , the proposed BWE algorithm could provide better quality with a lower bitrate than the conventional BWE algorithm. Table 1 also shows that the proposed BWE algorithm achieved a relative bitrate reduction of around 19%, compared to the conventional BWE algorithm.
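The bit budget described above can be checked with simple arithmetic. In the sketch below, the breakdown of the conventional algorithm (60 + 10 + 9 bits/frame) is taken from the text, whereas the total of 64 bits/frame for the proposed algorithm is an assumption inferred from the relative reduction of 18.99% quoted in the conclusion; the dictionary keys are descriptive labels introduced here.

```python
def relative_reduction_percent(reference_bits, proposed_bits):
    """Relative bitrate reduction of the proposed scheme, in percent."""
    return 100.0 * (reference_bits - proposed_bits) / reference_bits

# Bits/frame of the conventional BWE algorithm, as stated in the text.
conventional = {
    "generic_mode_parameters": 60,
    "sinusoidal_component": 10,
    "mode_indicator_and_envelope_shaping": 9,
}
conventional_total = sum(conventional.values())

# Assumption: inferred from the quoted 18.99% reduction, not read from Table 1.
proposed_total = 64

reduction = relative_reduction_percent(conventional_total, proposed_total)
print(f"{conventional_total} -> {proposed_total} bits/frame: {reduction:.2f}% reduction")
```

Under this assumption, the saving amounts to 15 of the 79 bits/frame.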

4.2. Spectrogram Comparison

Figures 4(b) and 4(c) show the spectrograms for the original audio signal (Figure 4(a)) when the conventional and proposed BWE algorithms were applied, respectively. Note here that the original signal taken from the sound quality assessment material (SQAM) [22] was sampled at a rate of 32 kHz, but the bandwidth of the decoded signal by G.729.1 Annex E was limited up to 14 kHz. In this paper, the spectrograms were obtained by applying a Blackman window whose length was 1,024 samples with an overlap of 512 samples to the decoded signals.
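The spectrogram analysis described above can be sketched as follows, assuming a log-magnitude display. The sketch uses a direct DFT from the standard library so that it is self-contained; in practice an FFT routine would be used instead, and the function names are hypothetical.

```python
import cmath
import math

def blackman(n):
    """Symmetric Blackman window of length n."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]

def spectrogram_db(signal, win_len=1024, hop=512):
    """Log-magnitude spectrogram: one list of win_len // 2 + 1 bins per frame."""
    window = blackman(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(win_len)]
        spectrum = []
        for k in range(win_len // 2 + 1):
            # Direct DFT of the windowed frame (slow, for illustration only).
            acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / win_len)
                      for i in range(win_len))
            spectrum.append(20.0 * math.log10(abs(acc) + 1e-12))
        frames.append(spectrum)
    return frames

# Small example: a tone centered on bin 8 of a 64-point analysis.
tone = [math.cos(2 * math.pi * 8 * n / 64) for n in range(128)]
frames = spectrogram_db(tone, win_len=64, hop=32)
peak_bin = max(range(len(frames[0])), key=frames[0].__getitem__)
```

With a window of 1,024 samples and an overlap of 512 samples at a sampling rate of 32 kHz, each frame spans 32 ms and successive frames are 16 ms apart.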

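Figure 4: Spectrograms of (a) the original audio signal, (b) the signal decoded with the conventional BWE algorithm, and (c) the signal decoded with the proposed BWE algorithm.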

As shown in Figure 4(b), the spectrogram of the signal decoded by the conventional BWE algorithm had horizontal lines at around 9.6 kHz and 11.6 kHz, as denoted by a dotted box, which were caused by the gain mismatch. This mismatch, however, was mitigated by the proposed BWE algorithm; thus, there were no such horizontal lines in Figure 4(c). These results suggest that the proposed BWE algorithm could provide better quality than the conventional BWE algorithm.

4.3. Quality Preference Test

As a subjective measure, we conducted an AB preference test between the conventional BWE algorithm and the proposed one. We chose three different signal types for the preference test: speech, music, and noisy speech. The speech signals were taken from the database in [6] and the others from SQAM [22]. We prepared five files for each signal type, where each file was preprocessed according to the G.729.1/G.718 SWB processing plan [23]. All files were then presented to nine listeners having no auditory disorders. Table 2 shows the preference test results. The proposed BWE algorithm received higher scores than the conventional one for all the signal types. In particular, the music signals, which have strong periodicity, yielded the best preference score. It could be concluded from the table that the listeners preferred the signals decoded by the proposed BWE algorithm to those decoded by the conventional BWE algorithm for all signal types.

5. Conclusion

In this paper, we proposed a superwideband bandwidth extension algorithm using normalized MDCT coefficients for scalable speech and audio coding. In the proposed algorithm, the gain of each subband is quantized, and the replication of the higher band is done by using the MDCT coefficients normalized by the subband gains. Also, different numbers of subbands are assigned to the subband gains and to the fetch indices. Moreover, we apply a minimum mean square error (MMSE) criterion for the fetch index search, which directly minimizes the spectral error. The spectrogram comparison and the AB preference test showed that the proposed BWE algorithm provided better quality, especially for music signals, than the BWE algorithm currently employed in G.729.1 Annex E, while achieving a relative bitrate reduction of 18.99%.

Acknowledgment

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2013R1A1A2007971).

References

[1] C. Lamblin, “Recent audio/speech coding developments in ITU-T and future trends,” in Proceedings of the European Signal Processing Conference (EUSIPCO '08), Plenary Lecture, Lausanne, Switzerland, 2008.
[2] J. A. Kang and H. K. Kim, “An adaptive packet loss recovery method based on real-time speech quality assessment and redundant speech transmission,” International Journal of Innovative Computing, Information and Control, vol. 7, no. 12, pp. 6773–6783, 2011.
[3] J. A. Kang and H. K. Kim, “Adaptive redundant speech transmission over wireless multimedia sensor networks based on estimation of perceived speech quality,” Sensors, vol. 11, no. 9, pp. 8469–8484, 2011.
[4] “ITU-T Temporal Document 298 R1,” Report of Q23/16 Rapporteur’s Meeting, 2008.
[5] N. I. Park and H. K. Kim, “Artificial bandwidth extension of narrowband speech applied to CELP-type speech coding,” Information-International Interdisciplinary Journal, vol. 16, no. 3(B), pp. 3153–3164, 2013.
[6] Y. R. Oh, Y. G. Kim, M. Kim, H. K. Kim, M. S. Lee, and H. J. Bae, “Phonetically balanced text corpus design using a similarity measure for a stereo super-wideband speech database,” IEICE Transactions on Information and Systems, vol. E94-D, no. 7, pp. 1459–1466, 2011.
[7] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, Wiley, Chichester, UK, 2006.
[8] P. Jax and P. Vary, “On artificial bandwidth extension of telephone speech,” Signal Processing, vol. 83, no. 8, pp. 1707–1719, 2003.
[9] G.-B. Song and P. Martynovich, “A study of HMM-based bandwidth extension of speech signals,” Signal Processing, vol. 89, no. 10, pp. 2036–2044, 2009.
[10] U. Kornagel, “Techniques for artificial bandwidth extension of telephone speech,” Signal Processing, vol. 86, no. 6, pp. 1296–1306, 2006.
[11] H. Pulakka, L. Laaksonen, M. Vainio, J. Pohjalainen, and P. Alku, “Evaluation of an artificial speech bandwidth extension method in three languages,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 6, pp. 1124–1137, 2008.
[12] K.-T. Kim, M.-K. Lee, and H.-G. Kang, “Speech bandwidth extension using temporal envelope modeling,” IEEE Signal Processing Letters, vol. 15, pp. 429–432, 2008.
[13] J. H. Park, H. K. Kim, M. B. Kim, and S. R. Kim, “A user voice reduction algorithm based on binaural signal separation for portable digital imaging devices,” IEEE Transactions on Consumer Electronics, vol. 58, no. 2, pp. 679–684, 2012.
[14] J. A. Kang, C. J. Chun, H. K. Kim, M. B. Kim, and S. R. Kim, “A smart background music mixing algorithm for portable digital imaging devices,” IEEE Transactions on Consumer Electronics, vol. 57, no. 3, pp. 1258–1263, 2011.
[15] Y. R. Oh, J. S. Yoon, H. K. Kim, M. B. Kim, and S. R. Kim, “A voice-driven scene-mode recommendation service for portable digital imaging devices,” IEEE Transactions on Consumer Electronics, vol. 55, no. 4, pp. 1739–1747, 2009.
[16] J. Herre and M. Dietz, “MPEG-4 high-efficiency AAC coding,” IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 137–142, 2008.
[17] M. Tammi, L. Laaksonen, A. Rämö, and H. Toukomaa, “Scalable superwideband extension for wideband coding,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), pp. 161–164, Taiwan, April 2009.
[18] B. Geiser, P. Jax, P. Vary et al., “Bandwidth extension for hierarchical speech and audio coding in ITU-T Rec. G.729.1,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2496–2509, 2007.
[19] H. Ehara, T. Morii, and K. Yoshida, “Predictive vector quantization of wideband LSF using narrowband LSF for bandwidth scalable coders,” Speech Communication, vol. 49, no. 6, pp. 490–500, 2007.
[20] Y. H. Lee, H. K. Kim, M. S. Lee, and D. Y. Kim, “Bandwidth extension of a narrowband speech coder for music delivery over IP,” Lecture Notes in Artificial Intelligence, vol. 4413, pp. 198–208, 2007.
[21] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proceedings of the IEEE, vol. 88, no. 4, pp. 451–515, 2000.
[22] EBU Tech Document 3253, Sound Quality Assessment Material (SQAM), 1988.
[23] ITU-T WP3/16, Processing Test Plan for the ITU-T Joint (G.718/G.729.1) SWB/Stereo Extension Optimisation/Characterization Phase, 2008.

Copyright

Copyright © 2013 Young Han Lee and Seung Ho Choi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
