Advances in Multimedia
Volume 2013, Article ID 175745, 21 pages
http://dx.doi.org/10.1155/2013/175745
Research Article

Real-Time Audio-Visual Analysis for Multiperson Videoconferencing

1Idiap Research Institute, 1920 Martigny, Switzerland
2Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, 69621 Lyon, France
3Fraunhofer IIS, 91058 Erlangen, Germany

Received 28 February 2013; Accepted 21 June 2013

Academic Editor: Alexander Loui

Copyright © 2013 Petr Motlicek et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. M. Falelakis, R. Kaiser, W. Weiss, and M. F. Ursu, “Reasoning for video-mediated group communication,” in Proceedings of the 12th IEEE International Conference on Multimedia and Expo (ICME '11), Barcelona, Spain, July 2011.
  2. J. Engdegård, B. Resch, C. Falch et al., “Spatial audio object coding (SAOC)—the upcoming MPEG standard on parametric object based audio coding,” in Proceedings of the 124th AES Convention, Amsterdam, The Netherlands, 2008.
  3. J. Carletta, S. Ashby, S. Bourban et al., “The AMI meeting corpus,” in Proceedings of the Machine Learning for Multimodal Interaction (MLMI '05), Edinburgh, UK, 2005.
  4. Z. Khan, T. Balch, and F. Dellaert, “MCMC-based particle filtering for tracking a variable number of interacting targets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1805–1819, 2005.
  5. J. Ajmera, Robust audio segmentation [Ph.D. thesis], Ecole Polytechnique Federale de Lausanne (EPFL), 2004.
  6. J. Pardo, X. Anguera, and C. Wooters, “Speaker diarization for multi-microphone meetings using only between-channel differences,” in Proceedings of the Machine Learning for Multimodal Interaction (MLMI '06), Bethesda, Md, USA, 2006.
  7. D. Korchagin, “Audio spatio-temporal fingerprints for cloudless real-time hands-free diarization on mobile devices,” in Proceedings of the 3rd Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA '11), pp. 25–30, Edinburgh, UK, June 2011.
  8. C. Sanderson and K. K. Paliwal, “Information fusion and person verification using speech and face information,” Idiap Research Report IDIAP-RR 02-33, 2002.
  9. M. Slaney and M. Covell, “Facesync: a linear operator for measuring synchronization of video facial images and audio tracks,” in Proceedings of the Neural Information Processing Systems, pp. 814–820, 2000.
  10. D. Korchagin, P. Motlicek, S. Duffner, and H. Bourlard, “Just-in-time multimodal association and fusion from home entertainment,” in Proceedings of the 12th IEEE International Conference on Multimedia and Expo (ICME '11), Barcelona, Spain, July 2011.
  11. J. Hershey and J. Movellan, “Audio vision: using audio-visual synchrony to locate sounds,” in Proceedings of the Neural Information Processing Systems, pp. 813–819, 1999.
  12. H. Nock, G. Iyengar, and C. Neti, “Speaker localisation using audio-visual synchrony: an empirical study,” in Proceedings of the 2nd International Conference on Image and Video Retrieval (CIVR '03), Urbana-Champaign, Ill, USA, 2003.
  13. M. Gurban and J. Thiran, “Multimodal speaker localization in a probabilistic framework,” in Proceedings of the European Signal Processing Conference (EUSIPCO '06), Florence, Italy, 2006.
  14. K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato, “A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization,” in Proceedings of the 10th International Conference on Multimodal Interfaces (ICMI '08), pp. 257–264, Chania, Greece, October 2008.
  15. V. Pulkki, “Spatial sound reproduction with directional audio coding,” Journal of the Audio Engineering Society, vol. 55, no. 6, pp. 503–516, 2007.
  16. S. Rickard and Ö. Yilmaz, “On the approximate W-disjoint orthogonality of speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), pp. I/529–I/532, May 2002.
  17. O. Thiergart, G. Del Galdo, M. Prus, and F. Kuech, “Three-dimensional sound field analysis with directional audio coding based on signal adaptive parameter estimators,” in Proceedings of the AES 40th International Conference on Spatial Audio: Sense the Sound of Space, Tokyo, Japan, October 2010.
  18. O. Thiergart, R. Schultz-Amling, G. Del Galdo, D. Mahne, and F. Kuech, “Localization of sound sources in reverberant environments based on directional audio coding parameters,” in Proceedings of the 127th AES Convention, New York, NY, USA, 2009.
  19. M. Kallinger, H. Ochsenfeld, G. Del Galdo et al., “A spatial filtering approach for directional audio coding,” in Proceedings of the 126th AES Convention, 2009.
  20. J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart, “Interactive teleconferencing combining spatial audio object coding and DirAC technology,” in Proceedings of the 128th AES Convention, London, UK, 2010.
  21. S. Duffner and J.-M. Odobez, “A track creation and deletion framework for long-term online multi-face tracking,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 272–285, 2013.
  22. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 511–518, December 2001.
  23. C. Scheffler and J. M. Odobez, “Joint adaptive colour modelling and skin, hair and clothing segmentation using coherent probabilistic index maps,” in Proceedings of the British Machine Vision Conference, 2011.
  24. E. Ricci and J.-M. Odobez, “Learning large margin likelihoods for realtime head pose tracking,” in Proceedings of the IEEE International Conference on Image Processing (ICIP '09), pp. 2593–2596, November 2009.
  25. S. O. Ba and J.-M. Odobez, “Recognizing visual focus of attention from head pose in natural meetings,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 39, no. 1, pp. 16–33, 2009.
  26. D. Sodoyer, B. Rivet, L. Girin, J.-L. Schwartz, and C. Jutten, “An analysis of visual speech information applied to voice activity detection,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), pp. I601–I604, May 2006.
  27. S. Siatras, N. Nikolaidis, M. Krinidis, and I. Pitas, “Visual lip activity detection and speaker detection using mouth region intensities,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 1, pp. 133–137, 2009.
  28. H. Hung and S. O. Ba, “Speech/non-speech detection in meetings from automatically extracted low resolution visual features,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), Dallas, Tex, USA, 2010.
  29. C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 320–327, 1976.
  30. J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays, M. Brandstein and D. Ward, Eds., chapter 8, Springer, 2001.
  31. P. N. Garner, “Silence models in weighted finite-state transducers,” in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1817–1820, Brisbane, Australia, September 2008.
  32. GSM 06.94, Digital cellular telecommunications system (Phase 2+), “Voice activity detector (VAD) for adaptive multi rate (AMR) speech traffic channels,” 1999.
  33. J. Dines, J. Vepa, and T. Hain, “The segmentation of multi-channel meeting recordings for automatic speech recognition,” in Proceedings of the INTERSPEECH and 9th International Conference on Spoken Language Processing (INTERSPEECH ICSLP '06), pp. 1213–1216, September 2006.
  34. P. N. Garner, J. Dines, T. Hain et al., “Real-time ASR from meetings,” in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 2119–2122, Brighton, UK, September 2009.
  35. F. Kuech, M. Kallinger, M. Schmidt, C. Faller, and A. Favrot, “Acoustic echo suppression based on separation of stationary and non-stationary echo components,” in Proceedings of the Acoustic Echo and Noise Control, Seattle, Wash, USA, 2008.
  36. S. Duffner, P. Motlicek, and D. Korchagin, “The TA2 database: a multimodal database from home entertainment,” in Proceedings of the Signal Acquisition and Processing, Singapore, 2011.
  37. G. Lathoud and I. A. McCowan, “A sector-based approach for localization of multiple speakers with microphone arrays,” in Proceedings of the Workshop on Statistical and Perceptual Audio Processing (SAPA '04), Jeju, Republic of Korea, 2004.
  38. D. Vijayasenan, F. Valente, and H. Bourlard, “An information theoretic approach to speaker diarization of meeting data,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 7, pp. 1382–1393, 2009.
  39. A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '97), vol. 4, pp. 1895–1898, Rhodes, Greece, 1997.
  40. EBU Technical Recommendation, “MUSHRA-EBU method for subjective listening tests of intermediate audio quality,” Doc. B/AIM022, 1999.