The Scientific World Journal
Volume 2014, Article ID 627189, 12 pages
http://dx.doi.org/10.1155/2014/627189
Research Article

Voice Quality Modelling for Expressive Speech Synthesis

1Computer Science, Multimedia and Telecommunication Studies, Universitat Oberta de Catalunya (UOC), Rambla del Poblenou 156, 08018 Barcelona, Spain
2Grup de Recerca en Tecnologies Mèdia (GTM), Universitat Ramon Llull, La Salle, Quatre Camins 2, 08022 Barcelona, Spain

Received 25 August 2013; Accepted 13 November 2013; Published 22 January 2014

Academic Editors: L. Angrisani and J. A. Apolinario Jr.

Copyright © 2014 Carlos Monzo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

References

  1. R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis et al., “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
  2. S. Planet, I. Iriondo, J.-C. Socoró, C. Monzo, and J. Adell, “GTM-URL contribution to the INTERSPEECH 2009 Emotion Challenge,” in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 316–319, Brighton, UK, September 2009.
  3. C. Drioli, G. Tisato, P. Cosi, and F. Tesser, “Emotions and voice quality: experiments with sinusoidal modeling,” in Proceedings of the ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis and Synthesis (VOQUAL '03), pp. 127–132, Geneva, Switzerland, 2003.
  4. O. Turk, M. Schröder, B. Bozkurt, and L. M. Arslan, “Voice quality interpolation for emotional text-to-speech synthesis,” in Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), pp. 797–800, Lisbon, Portugal, September 2005.
  5. D. Erro, Intra-lingual and cross-lingual voice conversion using harmonic plus stochastic models [Ph.D. thesis], Universitat Politècnica de Catalunya, Barcelona, Spain, 2008.
  6. M. Schröder, “Emotional speech synthesis: a review,” in Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech '01), pp. 561–564, 2001.
  7. C. Gobl, E. Bennett, and A. Ní Chasaide, “Expressive synthesis: how crucial is voice quality?” in Proceedings of the IEEE Workshop on Speech Synthesis, pp. 91–94, 2002.
  8. J. P. Cabral and L. C. Oliveira, “Pitch-synchronous time-scaling for prosodic and voice quality transformations,” in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 1137–1140, Lisbon, Portugal, September 2005.
  9. I. Iriondo, S. Planet, J.-C. Socoró, E. Martínez, F. Alías, and C. Monzo, “Automatic refinement of an expressive speech corpus assembling subjective perception and automatic classification,” Speech Communication, vol. 51, no. 9, pp. 744–758, 2009.
  10. C. Monzo, A. Calzada, I. Iriondo, and J. C. Socoró, “Expressive speech style transformation: voice quality and prosody modification using a harmonic plus noise model,” in Proceedings of Speech Prosody, Chicago, Ill, USA, 2010.
  11. T. Bänziger and K. R. Scherer, “A study of perceived vocal features in emotional speech,” in Proceedings of the ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis and Synthesis (VOQUAL '03), pp. 169–172, Geneva, Switzerland, 2003.
  12. C. Gobl and A. Ní Chasaide, “The role of voice quality in communicating emotion, mood and attitude,” Speech Communication, vol. 40, no. 1-2, pp. 189–212, 2003.
  13. C. Monzo, F. Alías, I. Iriondo, X. Gonzalvo, and S. Planet, “Discriminating expressive speech styles by voice quality parameterization,” in Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS '07), pp. 2081–2084, Saarbrücken, Germany, 2007.
  14. J. Laroche, Y. Stylianou, and E. Moulines, “HNS: speech modification based on a harmonic+noise model,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '93), pp. 550–553, Minneapolis, Minn, USA, April 1993.
  15. Y. Stylianou, J. Laroche, and E. Moulines, “High-quality speech modification based on a harmonic + noise model,” in Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '95), pp. 451–454, 1995.
  16. J. E. Cahn, Generating expression in synthesized speech [M.S. thesis], Massachusetts Institute of Technology, 1989.
  17. I. R. Murray and J. L. Arnott, “Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion,” Journal of the Acoustical Society of America, vol. 93, no. 2, pp. 1097–1108, 1993.
  18. I. R. Murray and J. L. Arnott, “Implementation and testing of a system for producing emotion-by-rule in synthetic speech,” Speech Communication, vol. 16, no. 4, pp. 369–390, 1995.
  19. J. Yamagishi, T. Masuko, and T. Kobayashi, “HMM-based expressive speech synthesis—towards TTS with arbitrary speaking styles and emotions,” in Proceedings of the Special Workshop in Maui (SWIM), Lectures by Masters in Speech Processing, 2004.
  20. I. Esquerra, “Síntesis de habla emocional por selección de unidades,” in Proceedings of the IV Jornadas en Tecnología del Habla, pp. 161–165, Zaragoza, Spain, 2006.
  21. I. Iriondo, J. C. Socoró, and F. Alías, “Prosody modelling of Spanish for expressive speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), pp. 821–824, Honolulu, Hawaii, USA, April 2007.
  22. M. Schröder, Speech and emotion research: an overview of research frameworks and a dimensional approach to emotional speech synthesis [Ph.D. thesis], PHONUS 7, Research Report of the Institute of Phonetics, Saarland University, Saarbrücken, Germany, 2004.
  23. N. Montoya, “El papel de la voz en la publicidad audiovisual dirigida a los niños,” Zer. Revista de Estudios de Comunicación, no. 4, pp. 161–177, 1998.
  24. N. Campbell, “Databases of emotional speech,” in Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, pp. 34–38, Newcastle, Northern Ireland, UK, 2000.
  25. H. François and O. Boeffard, “The greedy algorithm and its application to the construction of a continuous speech database,” in Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC '02), vol. 5, Las Palmas, Spain, 2002.
  26. H. E. Pérez, “Frecuencia de fonemas,” Revista Electrónica de Tecnología del Habla (E-RTH), no. 1, 2003.
  27. D. Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis, chapter 14, pp. 495–518, Elsevier Science, Amsterdam, The Netherlands, 1995.
  28. F. Alías, C. Monzo, and J. C. Socoró, “A pitch marks filtering algorithm based on restricted dynamic programming,” in Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH '06), pp. 1698–1701, Pittsburgh, Pa, USA, September 2006.
  29. R. Kohavi and F. Provost, “Glossary of terms: special issue on applications of machine learning and the knowledge discovery process,” Machine Learning, vol. 30, no. 2-3, pp. 271–274, 1998.
  30. E. Navas, I. Hernáez, and I. Luengo, “An objective and subjective study of the role of semantics and prosodic features in building corpora for emotional TTS,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1117–1127, 2006.
  31. A. Calzada, Expressive Synthesis based on Harmonic plus Stochastic model [Diploma of Advanced Studies], Universitat Ramon Llull, 2010.
  32. P. Depalle and T. Hélie, “Extraction of spectral peak parameters using a short-time Fourier transform modeling and no sidelobe windows,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '97), p. 4, October 1997.
  33. J. Wells, “SAMPA computer readable phonetic alphabet,” in Handbook of Standards and Resources for Spoken, D. Gibbon, R. Moore, and R. Winski, Eds., Part IV, section B, Mouton de Gruyter, Berlin, Germany, 1997.
  34. C. Monzo, I. Iriondo, and E. Martínez, “Procedimiento para la medida y la modificación del jitter y del shimmer aplicado a la síntesis del habla expresiva,” in Proceedings of the V Jornadas en Tecnología del Habla, pp. 58–61, Bilbao, Spain, 2008.
  35. P. Boersma, “Praat, a system for doing phonetics by computer,” Glot International, vol. 5, no. 9-10, pp. 341–345, 2001.
  36. E. Keller, “The analysis of voice quality in speech processing,” in Nonlinear Speech Modeling and Applications, vol. 3445 of Lecture Notes in Computer Science (LNCS), pp. 54–73, 2005.
  37. C. Monzo, Modelado de la cualidad de la voz para la síntesis del habla expresiva [Ph.D. thesis], Universitat Ramon Llull, 2010.
  38. ITU-T Recommendation P.800, “Methods for subjective determination of transmission quality,” International Telecommunication Union (ITU), 1996.
  39. M. Frigge, D. C. Hoaglin, and B. Iglewicz, “Some implementations of the boxplot,” American Statistician, vol. 43, no. 1, pp. 50–54, 1989.
  40. F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.