Feature Fusion Based Audio-Visual Speaker Identification Using Hidden Markov Model under Different Lighting Variations
The aim of the paper is to propose a feature fusion based Audio-Visual Speaker Identification (AVSI) system with varied conditions of illumination environments. Among the different fusion strategies, feature level fusion has been used for the proposed AVSI system where Hidden Markov Model (HMM) is used for learning and classification. Since the feature set contains richer information about the raw biometric data than any other levels, integration at feature level is expected to provide better authentication results. In this paper, both Mel Frequency Cepstral Coefficients (MFCCs) and Linear Prediction Cepstral Coefficients (LPCCs) are combined to get the audio feature vectors and Active Shape Model (ASM) based appearance and shape facial features are concatenated to take the visual feature vectors. These combined audio and visual features are used for the feature-fusion. To reduce the dimension of the audio and visual feature vectors, Principal Component Analysis (PCA) method is used. The VALID audio-visual database is used to measure the performance of the proposed system where four different illumination levels of lighting conditions are considered. Experimental results focus on the significance of the proposed audio-visual speaker identification system with various combinations of audio and visual features.
Human speaker identification is bimodal in nature [1, 2]. In a face-to-face conversation, we listen to what others say and at the same time observe their lip movements, facial expressions, and gestures. In particular, when environmental noise makes listening difficult, the visual information plays an important role in speech understanding. Even in a clean environment, speech recognition performance is improved when the talking face is visible. In general, an audio-only speaker identification system is not adequate to meet the variety of user requirements for person identification. The AVSI system promises to alleviate some of the drawbacks encountered by audio-only identification. Visual speech information can play an important role in the improvement of natural and robust human-computer interaction [5, 6]. Indeed, various important human-computer components, such as speaker identification, verification, localization, speech event detection, speech signal separation, coding, video indexing and retrieval, and text-to-speech, have been shown to benefit from the visual channel.
An audio-visual identification system can significantly improve the performance of a biometric system besides improving population coverage, deterring spoof attacks, increasing the degrees of freedom, and reducing the failure-to-enroll rate. Although the storage requirements, processing time, and computational demands of an audio-visual system are much higher than those of an audio-only system for speaker identification, effective integration of audio and visual features can remove or reduce most of the mentioned problems. Fusion of audio and visual features is an important fusion strategy which can improve the performance of the AVSI system. Feature level fusion retains more information than match score, rank, and decision level fusion. As a result, feature level fusion, which has been incorporated in the proposed AVSI system, can be expected to achieve greater performance than the other fusion strategies.
Most published works in the areas of AVSI system focus on decision level fusion strategy for noiseless environments [16–19] and very few research works introduce noisy environmental conditions [20, 21]. The aim of this work is to use feature fusion for the AVSI system under different illumination levels of lighting conditions. Different audio and visual features are combined in multiple levels in the proposed system. The subsequent sections of the paper focus on the proposed block diagram, feature extraction of the speech and facial features, fusion of multimodal audio and visual feature vectors, dimensionality reduction of multiple features, classification by using HMM, and performance analysis of the proposed AVSI system.
2. Paradigm of the Proposed Audio-Visual Speaker Identification System
The architecture of the proposed audio-visual feature fusion based speaker identification system is shown in Figure 1. The MFCC and LPCC based audio features and Appearance and Shape based facial features are concatenated separately, and finally the audio and visual based features are fused to feed the HMM classifier. Normalization technique has been applied at different times to normalize the features. PCA has been used to reduce the dimension of the speech and facial feature vector in such a way that the principal components of the original features retain their attributes sufficiently.
3. Audio Feature Extraction and Fusion
To capture the speech signal, sampling frequency of 11025 Hz, sampling resolution of 16 bits, mono recording channel, and recorded file format have been considered. Wiener filter has been used to remove the background noise from the original speech utterances [22, 23]. Speech end points detection and silence part removal algorithm have been used to detect the presence of speech and to remove pulse and silences in a background noise [24, 25]. To detect word boundary, the frame energy is computed using the short-term log energy equation: $E = \log \sum_{n=1}^{N} s^2(n)$, where $s(n)$ is the $n$th sample of a frame of $N$ samples.
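The energy-based boundary detection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame length, hop size, the `margin` threshold rule, and the function names are all assumptions introduced here, and the test signal is synthetic.

```python
import numpy as np

def short_term_log_energy(signal, frame_len, hop, eps=1e-12):
    """Log energy of each frame: log(sum of squared samples + eps)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for m in range(n_frames):
        frame = signal[m * hop : m * hop + frame_len]
        energies[m] = np.log(np.sum(frame ** 2) + eps)
    return energies

def speech_frames(energies, margin=3.0):
    """Hypothetical rule: a frame is speech if its log energy exceeds
    the quietest frame by more than `margin`."""
    return energies > energies.min() + margin

# Synthetic utterance at 11025 Hz: silence, a tone burst, silence.
fs = 11025
t = np.arange(fs) / fs
x = np.zeros(fs)
x[4000:7000] = 0.5 * np.sin(2 * np.pi * 440 * t[4000:7000])
x += 1e-3 * np.random.default_rng(0).standard_normal(fs)

E = short_term_log_energy(x, frame_len=220, hop=110)
mask = speech_frames(E)  # True for frames inside the tone burst
```

A fixed margin above the quietest frame is only one possible decision rule; the end point detection algorithms cited in the text are more robust in real background noise.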
Preemphasis has been used to balance the spectrum of voiced sounds that have a steep roll-off in the high frequency region. The transfer function of the FIR filter in the $z$-domain is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the preemphasis parameter.
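In the time domain, a first-order FIR preemphasis filter computes y[n] = x[n] - alpha * x[n-1]. The sketch below illustrates this; the value alpha = 0.95 is an assumption chosen for illustration, since the text does not state the parameter value used.

```python
import numpy as np

def preemphasize(signal, alpha=0.95):
    """First-order FIR preemphasis: y[n] = x[n] - alpha * x[n-1].
    alpha=0.95 is an assumed, commonly used value."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty_like(signal)
    out[0] = signal[0]                       # no previous sample for n = 0
    out[1:] = signal[1:] - alpha * signal[:-1]
    return out

# A constant (zero-frequency) input is strongly attenuated after the first sample.
y = preemphasize([1.0, 1.0, 1.0, 1.0], alpha=0.95)
```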
Frame blocking has been performed with an overlapping of 25% to 75% of the frame size. Typically a frame length of 10–30 milliseconds has been used. The purpose of the overlapping analysis is that each speech sound of the input sequence would be approximately centered at some frames .
From different types of windowing techniques, Hamming window has been used for this system. The purpose of using windowing is to reduce the effect of the spectral artifacts that result from the framing process. The Hamming window can be defined as follows: $w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$, where $N$ is the number of samples in a frame.
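Frame blocking and Hamming windowing can be sketched together as below. The 20 ms frame length and 50% overlap are assumptions picked from within the ranges given in the text (10 to 30 ms, 25% to 75% overlap), and the function names are illustrative.

```python
import numpy as np

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def frame_and_window(signal, frame_len, overlap=0.5):
    """Split a signal into overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    n_frames = 1 + (len(signal) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    # Row m of idx holds the sample indices of frame m.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * hamming(frame_len)

fs = 11025
frame_len = int(0.020 * fs)            # 20 ms frame -> 220 samples (assumed)
x = np.random.default_rng(0).standard_normal(fs)
frames = frame_and_window(x, frame_len, overlap=0.5)
```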
To extract the features from the speech utterances, various types of standard speech feature extraction techniques such as RCC, MFCC, ΔMFCC, ΔΔMFCC, LPC, and LPCC have been applied. After computation of the speech features, 12 frames are extracted where each frame contains 36 features. To normalize the feature value of each frame, feature mean normalization method has been applied. In this way, 432 (12 × 36) dimensional features are extracted from each speech utterance. Finally, 432 dimensional MFCC and 432 dimensional LPCC based speech features are extracted from the speech utterances. PCA based dimension reduction has been performed where 432 dimensional feature vectors have been converted into 120 dimensional ones. Lastly, MFCC and LPCC based feature vectors are concatenated to produce 240 dimensional MFCC-LPCC based audio features.
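The step from 12 frames of 36 coefficients to one 432 dimensional utterance vector can be illustrated as below. The text does not specify how the mean normalization is computed; this sketch assumes the mean of each coefficient over the frames is subtracted (as in cepstral mean normalization), and random numbers stand in for real MFCC or LPCC values.

```python
import numpy as np

def mean_normalize(frames):
    """Subtract the per-coefficient mean taken over all frames (assumed variant)."""
    frames = np.asarray(frames, dtype=float)
    return frames - frames.mean(axis=0, keepdims=True)

def utterance_vector(frames):
    """Normalize the frames and flatten them into one feature vector."""
    return mean_normalize(frames).reshape(-1)

rng = np.random.default_rng(0)
mfcc_frames = rng.normal(size=(12, 36))   # placeholder for real features
vec = utterance_vector(mfcc_frames)       # 12 * 36 = 432 values
```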
4. Visual Feature Extraction and Fusion
A high-quality digital camera has been used to capture the face images. After acquisition of the face image, the Stasm Active Shape Model (ASM) has been used to detect the facial features. Then the binary image has been taken. The region of interest (ROI) has been chosen according to the ROI selection algorithm [30, 31]. Lastly, the background noise has been eliminated and the appearance based facial features have been found. The procedure of the facial image preprocessing parts is shown in Figure 2. To reduce the dimensionality of the facial feature vector, PCA has been used.
5000 dimensional appearance based and 176 dimensional shape based features have been computed by the above mentioned process. Raw facial features, that is, 5000 dimensional appearance feature vectors and 176 dimensional shape feature vectors, have been converted into 192 and 14 dimensional ones, respectively. Before applying PCA, min-max normalization technique has been used using the following equation: $x' = \frac{x - \min(F_x)}{\max(F_x) - \min(F_x)}$, where $x$ and $x'$ are the feature values before and after normalization and $F_x$ is the function which generates $x$.
The min-max technique is effective when the minimum and the maximum values of the component feature values are known previously. In cases where such information is not available, an estimate of these parameters has to be obtained from the available sample training data. Finally, 192 dimensional appearance and 14 dimensional shape features are concatenated, which is referred to as visual feature fusion.
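Min-max normalization with bounds estimated from training data can be sketched as follows. The function names and the small example matrix are illustrative assumptions; the guard for constant features is an addition not discussed in the text.

```python
import numpy as np

def fit_min_max(train):
    """Estimate per-feature minimum and maximum from training samples."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def min_max_normalize(x, lo, hi):
    """Map each feature to [0, 1] relative to the estimated bounds."""
    x = np.asarray(x, dtype=float)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid dividing by zero
    return (x - lo) / span

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
lo, hi = fit_min_max(train)
scaled = min_max_normalize(train, lo, hi)
# A test sample outside the training range falls outside [0, 1].
outside = min_max_normalize([20.0, 10.0], lo, hi)
```

Because the bounds are estimated from training data only, unseen samples may map outside [0, 1], which is the sensitivity the paragraph above points out.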
5. Audio-Visual Feature Fusion and HMM Classification
240 dimensional audio feature vectors and 206 dimensional visual feature vectors are fused to produce 446 dimensional audio-visual feature vectors. Since the dimension of audio-visual feature vector is large enough, PCA has been used to reduce the dimension to 220. The reduced audio-visual features are finally fed to HMM learning and classification model.
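The dimension flow of the fusion step (240 + 206 = 446, then 446 to 220) can be sketched with a small SVD-based PCA. This is an illustration under assumptions: random data stands in for real features, the sample count of 300 is arbitrary, and the helper names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean and the top principal directions (as rows)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

rng = np.random.default_rng(0)
n_samples = 300                                # arbitrary, must exceed 220
audio = rng.normal(size=(n_samples, 240))      # placeholder MFCC-LPCC features
visual = rng.normal(size=(n_samples, 206))     # placeholder appearance-shape features

fused = np.hstack([audio, visual])             # 446 dimensional fused vectors
mean, comps = pca_fit(fused, 220)
reduced = pca_transform(fused, mean, comps)    # 220 dimensional vectors
```

In practice the two modalities would be normalized to comparable ranges before concatenation so that neither dominates the principal components.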
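In the HMM training phase, for each face $k$, an ergodic discrete HMM (DHMM), $\lambda_k = (A, B, \pi)$, has been built. The model parameters $(A, B, \pi)$ have been estimated to optimize the likelihood of the training set observation vectors for the $k$th face by using the Baum-Welch algorithm. The Baum-Welch reestimation formulas have been considered as follows [34, 35]:
$$\bar{\pi}_i = \gamma_1(i), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \bar{b}_j(m) = \frac{\sum_{t=1,\; o_t = v_m}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)},$$
where
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda_k)}, \qquad \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda_k)},$$
and $\alpha_t(i)$ and $\beta_t(i)$ are the forward and backward variables, $a_{ij}$ is the state transition probability, $b_j(m)$ is the probability of observing symbol $v_m$ in state $j$, and $\pi_i$ is the initial state probability.
In the DHMM testing phase, for each unknown face to be recognized, the procedure includes:
(i) measurement of the observation sequence, $O = \{o_1, o_2, \ldots, o_T\}$, via a feature analysis of the speech corresponding to a face,
(ii) transformation of the continuous values of $O$ into integer values,
(iii) calculation of model likelihoods for all possible models, $P(O \mid \lambda_k)$, $1 \le k \le K$,
(iv) declaration of the face as person $k^*$ whose model likelihood is highest; that is, $k^* = \arg\max_{1 \le k \le K} P(O \mid \lambda_k)$.
In this proposed work, the probability computation step has been performed using Baum's Forward-Backward algorithm.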
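The likelihood computation and the highest-likelihood decision of the testing phase can be sketched with a scaled forward pass over a discrete HMM. This is a toy illustration, not the authors' implementation: the two 2-state models, their parameter values, and the speaker names are invented, the observation sequence is assumed to be already quantized to integer symbols, and all emission probabilities are assumed to be strictly positive.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | model) for a discrete HMM.
    pi: initial probabilities (N,), A: transitions (N, N), B: emissions (N, M)."""
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()          # rescale to avoid numerical underflow
        log_lik += np.log(scale)
        alpha = alpha / scale
    return log_lik

def identify(obs, models):
    """Return the model name with the highest log-likelihood, plus all scores."""
    scores = {name: forward_log_likelihood(obs, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
models = {
    "speaker_a": (pi, A, np.array([[0.9, 0.1], [0.8, 0.2]])),  # favors symbol 0
    "speaker_b": (pi, A, np.array([[0.1, 0.9], [0.2, 0.8]])),  # favors symbol 1
}
best, scores = identify([0, 0, 1, 0], models)
```

Working in the log domain with per-step scaling keeps the computation stable for long observation sequences, where the raw forward probabilities would underflow.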
6. Performance Analysis of the Proposed System
The VALID audio-visual database has been used to measure the performance of the proposed speaker identification system. Artificial white Gaussian noise has been added to the original clean speech utterances to simulate various SNR levels. The models have been trained on clean speech utterances and tested with utterances at SNR levels ranging from 0 to 30 dB at an interval of 5 dB. The VALID database contains one neutral and four different office environmental noisy speeches of each person. Out of the five speech utterances, one clean speech is used for learning and the four others are used for testing, where noise is artificially added from 0 to 30 dB at an interval of 5 dB.
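Adding white Gaussian noise at a target SNR can be sketched as follows. The sine wave stands in for a clean utterance and the function names are illustrative assumptions; the noise power is derived from the average power of the whole signal, which is one common convention.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add zero-mean white Gaussian noise so that the result has the given SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

fs = 11025
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t)            # placeholder for a clean utterance
rng = np.random.default_rng(0)
# One noisy copy per test condition: 0, 5, ..., 30 dB.
noisy_by_snr = {snr: add_white_noise(clean, snr, rng) for snr in range(0, 31, 5)}
```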
Experimental results of the audio-visual feature fusion based identification have been captured in various dimensions. Since MFCC and LPCC based audio features and appearance and shape based visual features are considered for the audio-visual feature fusion based speaker identification system, experimental results are examined according to various combinations of MFCC, LPCC, and MFCC-LPCC based features of audio modality and appearance, shape, and appearance-shape based features of visual modality.
6.1. Optimum Value Selection on the Number of Hidden States of DHMM
Since the number of hidden states of the DHMM affects the performance of the audio-visual feature fusion based speaker identification system, results are taken for different combinations of the audio and visual features, which are shown in Figure 3. In Figure 3(a), the optimum number of hidden states is shown for appearance based facial features combined with MFCC, LPCC, and both MFCC and LPCC based audio features separately. The highest speaker identification rate of 94% has been found with 15 hidden states when using appearance, MFCC, and LPCC based features. The identification rates of shape based facial features with MFCC, LPCC, and combined MFCC and LPCC based audio features are shown in Figure 3(b) as a function of the number of hidden states. When the number of hidden states is 13, the highest identification rate of 95% has been found. Figure 3(c) shows the results for different numbers of hidden states when combined appearance and shape based features are used with MFCC and LPCC.
6.2. Performance Analysis of Different Variations of Audio-Visual Feature Fusion
To measure the performance of the audio-visual feature fusion based speaker identification system, different experiments have been performed, and we measured the performance from different angles which are shown in the following subsections.
6.2.1. Experiment of Individual Feature Based Performance
Individual performances of each of the feature extraction techniques are examined, which are shown in Table 1. Results are shown for different audio SNR levels, where the visual features are not affected by the audio noise. The table shows that the combined appearance and shape based facial features give a greater identification rate than the individual visual features, and that the shape based features give better face identification than the appearance based features. On the other hand, the audio identification rate varies with the SNR: it increases as the SNR increases, being lowest at 0 dB and highest at 30 dB. The results also show that the MFCC based features perform better than the LPCC based features, and that the combined MFCC and LPCC based features give much better performance than MFCC only and LPCC only features.
6.2.2. Experiment on Different Audio Features with Appearance Based Facial Features
Various audio features such as MFCC, LPCC, and combined MFCC and LPCC features are concatenated with the appearance based facial feature. HMM learning and classification techniques are used to measure the performance of the system. The results are examined for SNRs ranging from 0 to 30 dB, which are shown in Figure 4. The highest speaker identification rate of 92% has been found for appearance based features with combined MFCC and LPCC based audio features at an SNR of 30 dB.
6.2.3. Experiment on Different Audio Features with Shape Based Facial Features
Performance measurements between different audio features with shape based facial features are shown in Figure 5. It shows that the combination of MFCC and LPCC based audio features can give best result with shape based visual feature. The highest identification rate was found to be 93.67% at signal-to-noise ratio of 30 dB.
6.2.4. Experiment on Audio Features with Combined Appearance and Shape Based Facial Feature
Finally, the appearance and shape based facial features are combined with the MFCC and LPCC based audio features and the corresponding measurements of the performances are shown in Figure 6. In this experiment, we observe that the combined appearance and shape based facial features alone give a constant identification rate of 93%, since they are not affected by the audio noise. The identification rates of the other combinations of audio and visual features vary with the SNR. In all of the cases, the audio-visual feature fusion based identification rate increases as the audio SNR increases. At an SNR of 30 dB, the identification rate was 93.67%, 92%, and 95% when applying MFCC, LPCC, and MFCC-LPCC with combined appearance and shape based facial features, respectively.
7. Conclusions and Observations
Experimental results and performance analysis reveal that the proposed audio-visual feature fusion based speaker identification system can perform better compared to any other single feature based approach. Four different varied light environments are considered for facial image. Artificial white Gaussian audio noises of 0 to 30 dB at an interval of 5 dB are added to the clean speech signals. The proposed system is capable enough to work at different variations of lighting environments. But the system performance may degrade at other noisy environment, which may be covered in a future research paper of the authors. In that case, the proposed feature-fusion approach may be applied before the decision-fusion approach which may further increase the efficiency of the audio-visual speaker identification rate.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
D. G. Stork and M. E. Hennecke, Eds., Speechreading by Humans and Machines, Springer, Berlin, Germany, 1996.
R. Campbell, B. Dodd, and D. Burnham, Eds., Hearing by Eye II, Psychology Press, Hove, UK, 1998.
G. Potamianos, J. Luettin, and C. Neti, “Hierarchical discriminant features for audio-visual LVCSR,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 165–168, May 2001.
P. de Cuetos, C. Neti, and A. W. Senior, “Audio-visual intent-to-speak detection for human-computer interaction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2373–2376, Istanbul, Turkey, June 2000.
D. Sodoyer, J.-L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten, “Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli,” EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1165–1173, 2002.
E. Foucher, L. Girin, and G. Feng, “Audiovisual speech coder: using vector quantization to exploit the audio/video correlation,” in Proceedings of the Conference on Audio-Visual Speech Processing, pp. 67–71, Terrigal, Australia, December 1998.
J. Huang, Z. Liu, Y. Wang, Y. Chen, and E. Wong, “Integration of multimodal features for video scene classification based on HMM,” in Proceedings of the IEEE 3rd Workshop on Multimedia Signal Processing, pp. 53–58, Copenhagen, Denmark, September 1999.
G. Potamianos, C. Neti, and S. Deligne, “Joint audio-visual speech processing for recognition and enhancement,” in Proceedings of the Auditory-Visual Speech Processing Tutorial and Research Workshop (AVSP '03), pp. 95–104, Saint-Jorioz, France, September 2003.
A. Rogozan and P. Deléglise, “Adaptive fusion of acoustic and visual sources for automatic speech recognition,” Speech Communication, vol. 26, no. 1-2, pp. 149–161, 1998.
J.-S. Lee and C. H. Park, Adaptive Decision Fusion for Audio-Visual Speech Recognition. Speech Recognition, 2008.
Md. Rabiul Islam and Md. Fayzur Rahman, “Likelihood ratio based score fusion for audio-visual speaker identification in challenging environment,” International Journal of Computer Applications, vol. 6, no. 7, pp. 6–11, 2010.
L. Girin, G. Feng, and J. L. Schwartz, “Fusion of auditory and visual information for noisy speech enhancement: a preliminary study of vowel transitions,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), pp. 1005–1008, 1998.
N. Wiener and R. E. A. C. Paley, Fourier Transforms in the Complex Domains, American Mathematical Society, Providence, RI, USA, 1934.
K. Kitayama, M. Goto, K. Itou, and T. Kobayashi, “Speech starter: noise-robust endpoint detection by using filled pauses,” in Proceedings of the Eurospeech, pp. 1237–1240, Geneva, Switzerland, 2003.
S. E. Bou-Ghazale and K. Assaleh, “A robust endpoint detection of speech for noisy environments with application to automatic speech recognition,” in Proceedings of the IEEE International Conference on Acoustic, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 3808–3811, May 2002.
L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A real-time text-independent speaker identification system,” in Proceedings of the 12th International Conference on Image Analysis and Processing, pp. 632–637, IEEE Computer Society Press, Mantova, Italy, September 2003.
S. Milborrow, Locating facial features with active shape models [dissertation], Faculty of Engineering, University of Cape Town, Cape Town, South Africa, 2007.
R. Herpers, G. Verghese, K. Derpains, and R. McCready, “Detection and tracking of face in real environments,” in Proceedings of the IEEE International Workshop on Recognition, Analysis and Tracking of Face and Gesture in Real-Time Systems, pp. 96–104, Corfu, Greece, 1999.
J. Daugman, “Face detection: a survey,” Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, 2001.
R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison-Wesley, 2002.
J.-S. Lee and C. H. Park, Speech Recognition, Technologies and Applications, I-Tech, Vienna, Austria, 2008.
P. A. Devijver, “Baum's forward-backward algorithm revisited,” Pattern Recognition Letters, vol. 3, no. 6, pp. 369–373, 1985.
N. A. Fox, B. A. O'Mullane, and R. B. Reilly, “VALID: a new practical audio-visual database, and comparative results,” in Audio- and Video-Based Biometric Person Authentication, Lecture Notes in Computer Science, pp. 777–786, 2005.