Abstract

This paper presents an improved likelihood ratio score fusion technique, based on a Back-propagation learning neural network, for audio-visual speaker identification in various noisy environments. Different signal preprocessing and noise removal techniques are used to process the speech utterances, and the LPC, LPCC, RCC, MFCC, ΔMFCC, and ΔΔMFCC methods are applied to extract features from the audio signal. An Active Shape Model is used to extract the appearance and shape based facial features. To enhance the performance of the proposed system, the appearance and shape based facial features are concatenated, and Principal Component Analysis is used to reduce the dimension of the facial feature vector. The audio and visual feature vectors are then fed separately to Hidden Markov Models to obtain the log-likelihood of each modality. The reliability of each modality is calculated using a reliability measurement method. Finally, the integrated likelihood ratios are fed to the Back-propagation learning neural network algorithm to produce the final speaker identification result. To measure the performance of the proposed system, three databases are used: the NOIZEUS speech database for audio-only, the ORL face database for visual-only, and the VALID audio-visual multimodal database for audio-visual speaker identification. To compare the accuracy of the proposed system with existing techniques in noisy environments, different types of artificial noise are added at various rates to the audio and visual signals, and performance is compared across different variations of audio and visual features.

1. Introduction

Biometric authentication [1] has grown in popularity as a way to provide personal identification. Identifying a person is crucial in many applications, and the rise in credit card fraud and identity theft in recent years indicates that this is an issue of major concern in wider society. Individual passwords, PIN identification, and even token based arrangements all have deficiencies that restrict their applicability in a widely networked society. Biometrics establishes the identity of an input sample by comparing it with a stored template, and so identifies specific people by certain characteristics. No single biometric is expected to satisfy the needs of all identification applications effectively. A number of biometrics have been proposed, researched, and evaluated for authentication applications. Each biometric has its strengths and limitations and, accordingly, appeals to a particular identification application [2]. Biometric characteristics can be divided into physiological and behavioral classes [3]. Physiological characteristics are related to the shape of the body and thus vary from person to person. Fingerprints, face recognition, hand geometry, and iris recognition are some examples of this type of biometric. Behavioral characteristics are related to the behavior of a person. Some examples are signature, keystroke dynamics, and voice.

The audio-visual speaker identification system combines speech and face biometrics and therefore mixes physiological and behavioral characteristics. Audio and visual information can be combined at different levels. Preclassification and postclassification are the two broad categories of information fusion in biometric systems [4]. In preclassification, multimodal information is fused before it reaches the classifier. In postclassification, information is combined after the decisions of multiple classifiers. In the proposed system, feature level fusion is performed under the preclassification approach: appearance and shape based facial features are combined to obtain the visual identification result. Decision level fusion is also applied, where the audio and visual speaker identification decisions are combined to obtain the final identification result. Thus, both feature level and decision level fusion are performed in the proposed system.

The rest of the paper is organized as follows. Section 2 describes the literature review and the proposed system architecture. Section 3 focuses on audio-only identification performance, and visual-only identification is elaborated in Section 4. Audio and visual reliability measurement techniques are described in Section 5, and the Back-propagation learning neural network based likelihood ratio score fusion technique is presented in Section 6. The NOIZEUS speech database is used to measure the performance of audio-only speaker identification, the ORL database is used for visual-only performance, and the overall system performance, that is, the performance of Back-propagation learning neural network based score fusion, is measured on the VALID audio-visual multimodal database.

2. Literature Review and Proposed System Architecture

Since human speech is bimodal in nature [5, 6], visual speech information can play a vital role in improving natural and robust human-computer interaction [7–11]. Most published work on speech recognition and speaker recognition focuses on speech in noiseless environments, and few published works focus on speech under noisy conditions [12–15]. Indeed, various important human-computer components, such as speaker identification, verification [16, 17], localization [18], speech event detection [19], speech signal separation [20], coding [21], video indexing and retrieval [22], and text-to-speech [23, 24], have been shown to benefit from the visual channel [25]. Adaptive weighting in decision fusion, which determines the recognized utterance class from the acoustic and visual features of a given audio-visual speech datum, has been proposed [26]. The reliability of each audio and visual modality can be measured in various ways, such as the average absolute difference of log-likelihood [27], the variance of log-likelihood [28], the average difference of log-likelihood from the maximum [29], and the inverse entropy of posterior probability [30]. Decision level information integration techniques have been developed in which each biometric matcher individually decides on the best match based on the input presented to it. Methods such as majority voting [31], behavior knowledge space [32], weighted voting based on the Dempster-Shafer theory of evidence [33], and the AND rule and OR rule [34] are some of the decision level fusion techniques proposed by different researchers.

The proposed architecture of the audio-visual speaker identification system is shown in Figure 1. Signal preprocessing and noise removal techniques are applied after acquisition of the speech utterances. Features are then extracted using various standard speech feature extraction methods such as LPC, LPCC, RCC, MFCC, ΔMFCC, and ΔΔMFCC. Principal Component Analysis (PCA) is used to reduce the dimensionality of the extracted feature vector. The reduced feature vector is then fed to a Discrete Hidden Markov Model (DHMM) to obtain the log-likelihood of the speech modality. A reliability measurement method is used to measure the reliability of the audio signal. For visual identification, captured faces are preprocessed using different noise removal techniques, and an Active Shape Model (ASM) is used to extract the appearance and shape based features. These two types of features are fused after feature normalization and PCA based dimensionality reduction. Concatenating these features is important because, when the appearance based feature is captured with noise (i.e., light variations), the shape based features can keep the performance at a satisfactory level. The same holds when the shape based feature is heavily corrupted by noise. By combining the two, the proposed system performs well, especially under varying lighting conditions. Finally, the log-likelihood of the visual modality is evaluated using DHMM classification, and its reliability is measured using the same reliability measurement technique as for the audio modality. The integrated weights of the audio and visual reliability measurements are fed to the Back-propagation learning neural network algorithm to calculate the final speaker identification result.

Rogozan and Deléglise [26] developed a technique for combining different likelihoods of multilevel biometric identification. In the proposed system, the Back-propagation learning neural network (BPN) algorithm is used to combine the likelihoods of the audio and visual modalities to enhance the performance of audio-visual speaker identification. This is the main contribution of the proposed system. Experimental results show that the BPN based approach outperforms the Rogozan and Deléglise [26] method for audio-visual speaker identification.

3. Audio-Only Speaker Identification

3.1. Speech Signal Preprocessing and Feature Extraction

Speech signal preprocessing plays an important role in the efficiency of speaker identification. After the speech utterances are captured, a Wiener filter is used to remove the background noise from the original speech utterances [35]. The Wiener filter is a noise removal filter based on Fourier iteration. Its main advantage is the short computational time it takes to find a solution [36].

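Let $u(t)$ be the smeared signal and let $r(t)$ be the known response that causes the convolution. Then $s(t)$ is related to $u(t)$ by
$$s(t) = \int_{-\infty}^{\infty} r(t-\tau)\,u(\tau)\,d\tau \quad\text{or}\quad S(f) = R(f)\,U(f),$$
where $S(f)$, $R(f)$, and $U(f)$ are the Fourier transforms of $s(t)$, $r(t)$, and $u(t)$.

The second source of signal corruption is the unknown background noise $n(t)$. Therefore the measured signal $c(t)$ is the sum of $s(t)$ and $n(t)$:
$$c(t) = s(t) + n(t).$$

To deconvolve $s(t)$ to find $u(t)$ in the absence of noise $n(t)$, simply divide $S(f)$ by $R(f)$, that is, $U(f) = S(f)/R(f)$. To deconvolve $c(t)$ when $n(t)$ is present, one needs to find an optimum filter function $\phi(t)$, or $\Phi(f)$, which filters out the noise and gives a signal $\tilde{u}(t)$ by
$$\tilde{U}(f) = \frac{C(f)\,\Phi(f)}{R(f)},$$
where $\tilde{u}(t)$ is as close to the original signal $u(t)$ as possible.

For $\tilde{u}(t)$ to be similar to $u(t)$, their squared difference must be as close to zero as possible; that is,
$$\int_{-\infty}^{\infty} \left|\tilde{u}(t) - u(t)\right|^{2} dt \quad\text{or}\quad \int_{-\infty}^{\infty} \left|\tilde{U}(f) - U(f)\right|^{2} df$$
is minimized.

Substituting the above three equations, the Fourier version becomes
$$\int_{-\infty}^{\infty} \left|\frac{[S(f) + N(f)]\,\Phi(f)}{R(f)} - \frac{S(f)}{R(f)}\right|^{2} df = \int_{-\infty}^{\infty} |R(f)|^{-2}\left\{|S(f)|^{2}\,|1 - \Phi(f)|^{2} + |N(f)|^{2}\,|\Phi(f)|^{2}\right\} df$$
after rearranging. The best filter is the one for which the integrand is a minimum at every value of $f$. This is when
$$\Phi(f) = \frac{|S(f)|^{2}}{|S(f)|^{2} + |N(f)|^{2}}.$$
Now, $|S(f)|^{2} + |N(f)|^{2} \approx |C(f)|^{2}$, where $|C(f)|^{2}$, $|S(f)|^{2}$, and $|N(f)|^{2}$ are the power spectra of $c(t)$, $s(t)$, and $n(t)$. Therefore,
$$\Phi(f) \approx \frac{|S(f)|^{2}}{|C(f)|^{2}}.$$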

Figure 2(a) shows a sample signal with background noise and Figure 2(b) shows the signal after applying the Wiener filter.
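As an illustration of the frequency-domain filter described above, the following Python sketch applies a per-frequency gain of the form signal power over signal-plus-noise power. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the response r(t) is taken to be the identity (no deconvolution), the noise power spectrum is assumed to be known, and the clean-signal power is approximated by subtracting the noise power from the measured power. The names `wiener_gain` and `wiener_denoise` and the synthetic test tone are hypothetical.

```python
import numpy as np


def wiener_gain(signal_power, noise_power):
    """Per-frequency Wiener gain |S|^2 / (|S|^2 + |N|^2)."""
    signal_power = np.asarray(signal_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # The small constant only guards against division by zero.
    return signal_power / (signal_power + noise_power + 1e-12)


def wiener_denoise(noisy, noise_power):
    """Filter a noisy 1-D signal given an estimate of the noise power spectrum.

    Assumption of this sketch: the clean-signal power is approximated as
    max(|C|^2 - |N|^2, 0), because only the measured spectrum C is available.
    """
    spectrum = np.fft.rfft(noisy)
    measured_power = np.abs(spectrum) ** 2
    signal_power = np.maximum(measured_power - noise_power, 0.0)
    gain = wiener_gain(signal_power, noise_power)
    return np.fft.irfft(gain * spectrum, n=len(noisy))


# Synthetic example: a 440 Hz tone sampled at 8 kHz with white Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(1024) / 8000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.5 * rng.standard_normal(t.size)

# For white noise of variance sigma^2, the expected power per DFT bin is N * sigma^2.
noise_power = np.full(t.size // 2 + 1, t.size * 0.5 ** 2)
denoised = wiener_denoise(noisy, noise_power)
```

In practice the noise power spectrum would have to be estimated, for example from frames that contain no speech; here it is set from the known noise variance only to keep the example short.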

A speech end point detection and silence removal algorithm is used to detect the presence of speech and to remove pulses and silences from the speech utterance [37, 38], as shown in Figure 3.

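To detect word boundaries, the frame energy is computed using the short-term log energy equation [39]:
$$E_i = 10\log_{10}\sum_{t=n_i}^{n_i+N-1} s^{2}(t),$$
where $s(t)$ is the speech sample at index $t$, $n_i$ is the first sample of the $i$th frame, and $N$ is the frame length.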
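The frame energy computation can be sketched in Python as follows. This is only an illustrative sketch: the decision rule in `speech_frames` (keep frames within a fixed margin of the loudest frame) and the 30 dB margin are assumptions made for the example, not the end point detection algorithm of [37, 38], and the function names are hypothetical.

```python
import math


def short_term_log_energy(samples, frame_len, hop):
    """Return the log energy (in dB) of each frame of `samples`."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame)
        # A small floor avoids log(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def speech_frames(energies, margin_db=30.0):
    """Mark frames within `margin_db` of the loudest frame as speech (assumed rule)."""
    threshold = max(energies) - margin_db
    return [e >= threshold for e in energies]


# Silence, then a 200 Hz tone, then silence (8 kHz sampling, 20 ms frames).
fs = 8000
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(1600)]
signal = [0.0] * 800 + tone + [0.0] * 800

energies = short_term_log_energy(signal, frame_len=160, hop=160)
flags = speech_frames(energies)
```

The first and last `True` entries of `flags` give a rough estimate of the word boundaries; frames marked `False` would be discarded as silence.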

Preemphasis is used to balance the spectrum of voiced sounds, which have a steep roll-off in the high frequency region [38]. The transfer function of the FIR filter in the $z$-domain is [40]
$$H(z) = 1 - \alpha z^{-1},$$
where $\alpha$ is the preemphasis parameter.

Frame blocking is performed with an overlap of 25% to 75% of the frame size. Typically a frame length of 10–30 milliseconds is used. The purpose of the overlapping analysis is that each speech sound of the input sequence is approximately centered in some frame [41].

Among the different windowing techniques, a Hamming window is used in this system. The purpose of windowing is to reduce the spectral artifacts that result from the framing process [42]. The Hamming window is defined as follows [43]:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$
where $N$ is the number of samples in a frame.
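The preemphasis, frame blocking, and windowing steps can be combined as in the following Python sketch. The values used here (preemphasis parameter 0.95, 20 ms frames, 50% overlap) are assumptions chosen for illustration within the ranges mentioned above; the paper does not state the exact settings, and the function names are hypothetical.

```python
import math


def pre_emphasize(samples, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1], a first-order FIR preemphasis filter."""
    return [samples[0]] + [
        samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))
    ]


def hamming(length):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)
    ]


def frame_and_window(samples, frame_len, overlap=0.5):
    """Split `samples` into overlapping frames and apply a Hamming window to each."""
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


fs = 8000                          # sampling rate in Hz
frame_len = int(round(0.020 * fs))  # 20 ms frame -> 160 samples
speech = [math.sin(2 * math.pi * 300 * n / fs) for n in range(800)]  # 100 ms test tone

frames = frame_and_window(pre_emphasize(speech), frame_len, overlap=0.5)
```

Each entry of `frames` is then ready for spectral analysis by the feature extraction stage.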

To extract the audio features, several standard speech feature extraction techniques, namely RCC, LPCC, MFCC, ΔMFCC, and ΔΔMFCC [44, 45], are used, because the quality of the system depends on properly extracted feature values.

3.2. Experimental Results according to NOIZEUS Speech Database

NOIZEUS speech corpus [46, 47] has been used to calculate the accuracy of the audio-only speaker identification system which contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight different real-world noises at different SNRs. The noise was taken from the AURORA database and includes suburban train noise, babble, car, exhibition hall, restaurant, street, airport and train-station noise. The sentences were originally sampled at 25 kHz and downsampled to 8 kHz [48].

To measure the performance of the system on the NOIZEUS speech database, one clean speech utterance is used for learning, and four different noisy speech utterances, with SNRs ranging from 0 dB to 20 dB at 5 dB intervals, are used for testing. Tables 1–9 show the audio-only speaker identification rates in different noisy environments at different SNRs.

Table 9 shows the overall average speaker identification rate for the NOIZEUS speech corpus. The table allows the MFCC, ΔMFCC, ΔΔMFCC, RCC, and LPCC methods to be compared for the DHMM based audio-only speaker identification system. ΔMFCC achieves the highest performance, 48.85%, compared with MFCC, ΔΔMFCC, RCC, and LPCC. The results also show that the ΔMFCC feature performs better than any other feature extraction method in all eight environmental conditions.

4. Visual-Only Speaker Identification

4.1. Facial Feature Extraction and Dimensionality Reduction

After acquisition of a face image, the Stams [49] Active Shape Model (ASM) is used to detect the facial features. The image is then converted to a binary image. The Region Of Interest (ROI) is chosen according to the ROI selection algorithm [50, 51]. Lastly, the background noise is eliminated [52], which yields the appearance based facial feature. The facial image preprocessing procedure is shown in Figure 4, where Figures 4(d) and 4(e) show the shape based and appearance based facial features, respectively.

To improve the performance of the face recognition system, and because we want to compare the proposed technique with the appearance and shape based feature fusion method, the appearance and shape based features are combined. The concatenation procedure for the two features is shown in Figure 5. Initially, raw 5000-dimensional appearance based features and 176-dimensional shape based features are extracted. Principal Component Analysis [53, 54] is used to reduce the dimensions of the appearance and shape based features to 192 and 14, respectively. The two reduced features are concatenated to produce a 206-dimensional feature vector. Finally, PCA is applied again to reduce the 206-dimensional vector to a 130-dimensional appearance-shape based facial feature vector.
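The dimensionality bookkeeping of this two-stage reduction (5000 to 192, 176 to 14, concatenation to 206, then 206 to 130) can be sketched in Python with NumPy as follows. This is a hedged sketch: random data stand in for the real facial features, the number of images (400) is taken from the size of the ORL database, and z-score normalization before PCA is an assumption, since the paper does not specify the normalization used. The helper names `pca_reduce` and `zscore` are hypothetical.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row-vector samples onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def zscore(features):
    """Normalize each feature dimension to zero mean and unit variance (assumed)."""
    std = features.std(axis=0)
    return (features - features.mean(axis=0)) / np.where(std > 0, std, 1.0)


# Random placeholders for the extracted features of 400 face images.
rng = np.random.default_rng(0)
n_images = 400
appearance = rng.standard_normal((n_images, 5000))  # raw appearance features
shape = rng.standard_normal((n_images, 176))        # raw shape features

appearance_192 = pca_reduce(zscore(appearance), 192)
shape_14 = pca_reduce(zscore(shape), 14)
fused_206 = np.hstack([appearance_192, shape_14])   # feature level fusion
fused_130 = pca_reduce(fused_206, 130)              # final facial feature vectors
```

Note that PCA can return at most as many meaningful components as there are training images, so the 192-dimensional projection presupposes a training set of at least that size.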

4.2. Experimental Results according to ORL Facial Database

The Olivetti Research Laboratory (ORL) face database [55], produced by AT&T Laboratories, has been used for measuring the performance of the proposed system. The database contains 10 different images of each of 40 distinct subjects. For some of the subjects, the images were taken at different times, with slight variations in lighting, facial expressions (open/closed eyes, smiling/nonsmiling), and facial details (glasses/no-glasses). All the images are taken against a dark homogeneous background and the subjects are in an upright, frontal position (with tolerance for some side movement). Each face image is 92 × 112 pixels with 8-bit grey levels. Experimental results are evaluated along several dimensions, such as selection of the optimum number of hidden states of the DHMM, the response of the system to noisy facial images, and the system accuracy for appearance based, shape based, and combined appearance and shape based facial features.

4.2.1. System Response for Noisy Facial Images

The facial identification performance has been tested under different noise variations. Filtering is used to modify or enhance an image, for example to emphasize certain features or remove others. Filtering is a neighbourhood operation in which the value of any given pixel in the output image is determined by applying some algorithm to the values of the pixels in the neighbourhood of the corresponding input pixel. A pixel’s neighbourhood is a set of pixels defined by their locations relative to that pixel. To remove or reduce white Gaussian noise from the facial images, the Wiener filtering technique is used. The Wiener filter adapts to the local image variance: where the variance is large, it performs little smoothing, and where the variance is small, it performs more smoothing. The Wiener filter is therefore more selective than other filters, preserving edges and other high-frequency parts of an image.

To measure the accuracy of the face system, noise is added at various rates for the appearance based, shape based, and appearance-shape based feature fusion techniques with PCA based dimensionality reduction, where Euclidean distance is used as the classifier. Table 10 shows the response after applying the Wiener filtering technique.

4.2.2. Performance Measurements between Single and Multiple Feature Fusion Based Techniques

Facial identification performance has been measured for the individual feature based techniques, that is, the appearance based feature and the shape based feature, and for the appearance-shape based feature fusion technique. A Receiver Operating Characteristics (ROC) curve is generated for these techniques, where a tradeoff is made between security and user-friendliness. The performance graph is shown in Figure 6. The graph shows that the appearance-shape based feature fusion achieves the highest accuracy compared with the individual appearance based and shape based techniques. For example, at FRR = 30%, the FAR values of the appearance based, shape based, and appearance-shape feature fusion techniques are 42%, 30%, and 28%, respectively.
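For readers unfamiliar with the two error rates, the following Python sketch shows how FAR and FRR are obtained from match scores at a given decision threshold, and how sweeping the threshold traces out the points of an ROC curve. The scores below are invented for illustration and are not taken from the experiments; the function names are hypothetical, and the convention that a higher score means a better match is an assumption.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when scores greater than or equal to `threshold` are accepted."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return false_accepts / len(impostor_scores), false_rejects / len(genuine_scores)


def roc_points(genuine_scores, impostor_scores):
    """Evaluate (threshold, FAR, FRR) at every distinct observed score."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    return [(t,) + far_frr(genuine_scores, impostor_scores, t) for t in thresholds]


# Hypothetical match scores (higher means a better match).
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]

points = roc_points(genuine, impostor)
far, frr = far_frr(genuine, impostor, threshold=0.65)
```

Raising the threshold lowers FAR (more security) at the cost of a higher FRR (less user-friendliness), which is the tradeoff the ROC curve visualizes.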

5. Audio and Visual Reliability Measurements

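Since DHMM learning and testing models have been adopted for the audio and visual systems, an ergodic discrete HMM (DHMM), $\lambda_i$ [56], is built in the DHMM training phase for each face $i$. The model parameters $A$, $B$, and $\pi$ are estimated to optimize the likelihood of the training set observation vector for the $i$th face by using the Baum-Welch algorithm. The Baum-Welch reestimation formulas are as follows [57, 58]:
$$\bar{\pi}_j = \gamma_1(j), \qquad \bar{a}_{jk} = \frac{\sum_{t=1}^{T-1}\xi_t(j,k)}{\sum_{t=1}^{T-1}\gamma_t(j)}, \qquad \bar{b}_j(v) = \frac{\sum_{t=1,\;O_t = v}^{T}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)},$$
where $\gamma_t(j)$ is the probability of being in state $j$ at time $t$, and $\xi_t(j,k)$ is the probability of being in state $j$ at time $t$ and in state $k$ at time $t+1$, given the observation sequence and the model.

In the DHMM testing phase, each unknown face to be recognized is processed as follows:
(i) measurement of the observation sequence, $O = \{O_1, O_2, \ldots, O_T\}$, via a feature analysis of the corresponding face,
(ii) transformation of the continuous values of $O$ into integer values,
(iii) calculation of the model likelihood $P(O \mid \lambda_i)$ for all possible models, $1 \le i \le N$,
(iv) declaration of the face as the person $i^{*}$ whose model likelihood is highest, that is,
$$i^{*} = \arg\max_{1 \le i \le N} P(O \mid \lambda_i).$$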
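The likelihood calculation and the final declaration step of the testing phase can be sketched in Python as follows. The sketch evaluates the log-likelihood of a quantized observation sequence with the scaled forward algorithm and picks the model with the highest value. The two toy models, their parameter values, and the function names are hypothetical and serve only to make the example runnable; trained models would come from the Baum-Welch step.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: return log P(obs | model) for a discrete HMM."""
    n_states = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the forward variables one step and weight by the emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        # Rescaling keeps alpha in range; the log of the scales sums to log P.
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def identify(obs, models):
    """Return the name of the most likely model and all log-likelihood scores."""
    scores = {name: log_likelihood(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical two-state, two-symbol models: (start, transition, emission).
models = {
    "person_a": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
    "person_b": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.1, 0.9]]),
}
observation = [0, 0, 1, 0, 0]  # quantized (integer) observation symbols

best, scores = identify(observation, models)
```

The per-class values in `scores` are the log-likelihoods that the later reliability measurement and score fusion stages operate on.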

In this work, the probability computation step is performed using Baum’s Forward-Backward algorithm [58, 59]. By applying the HMM in the learning phase, the log-likelihoods of the appearance and shape based features of each person’s face are obtained. After the log-likelihood of each modality is obtained separately, the outputs are combined by a weighted sum rule to produce the final decision. In this work, match score level fusion is used to combine the appearance and shape based outputs. For a given appearance-shape test datum $O_A$ and $O_S$, the final recognition is given by [60]
$$C^{*} = \arg\max_{i}\left\{\gamma_A \log P(O_A \mid \lambda_A^{i}) + \gamma_S \log P(O_S \mid \lambda_S^{i})\right\},$$
where $\lambda_A^{i}$ and $\lambda_S^{i}$ are the appearance and the shape HMMs for the $i$th utterance class, respectively, $\log P(O_A \mid \lambda_A^{i})$ and $\log P(O_S \mid \lambda_S^{i})$ are their log-likelihoods against the $i$th class, and $\gamma_A$ and $\gamma_S$ are the integration weights of the two modalities.

Among the various types of score fusion techniques, baseline reliability ratio-based integration is used to combine the appearance and shape recognition results. The reliability of each modality can be measured from the outputs of the corresponding HMMs. It is calculated by the method reported to perform best [61]:
$$S_m = \frac{1}{N-1}\sum_{i=1}^{N}\left\{\max_{j}\log P(O_m \mid \lambda_m^{j}) - \log P(O_m \mid \lambda_m^{i})\right\},$$
which is the average difference between the maximum log-likelihood and the other ones, where $N$ is the number of classes being considered to measure the reliability of each modality, $m \in \{A, S\}$.

Then the integrated weight of the appearance based reliability measure is calculated by [62]
$$\gamma_A = \frac{S_A}{S_A + S_S},$$
where $S_A$ and $S_S$ are the reliability measures of the outputs of the appearance and shape HMMs, respectively.

The integrated weight of the shape modality is
$$\gamma_S = 1 - \gamma_A.$$
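The reliability measurement and the weighted sum rule can be illustrated with the following Python sketch. It computes, for each modality, the average difference between the maximum log-likelihood and the others, turns the two reliabilities into complementary weights, and fuses the per-class log-likelihoods. The log-likelihood values are invented for the example, and the function names are hypothetical.

```python
def reliability(log_likelihoods):
    """Average gap between the best log-likelihood and the other classes."""
    best = max(log_likelihoods)
    n_classes = len(log_likelihoods)
    # The best class contributes a gap of zero, hence the division by N - 1.
    return sum(best - value for value in log_likelihoods) / (n_classes - 1)


def fuse(appearance_ll, shape_ll):
    """Weighted-sum fusion of two modalities; returns (class index, w_a, w_s)."""
    s_a = reliability(appearance_ll)
    s_s = reliability(shape_ll)
    w_a = s_a / (s_a + s_s)  # undefined if both reliabilities are zero
    w_s = 1.0 - w_a
    fused = [w_a * a + w_s * s for a, s in zip(appearance_ll, shape_ll)]
    best_class = max(range(len(fused)), key=fused.__getitem__)
    return best_class, w_a, w_s


# Hypothetical per-class log-likelihoods for three enrolled persons.
appearance_ll = [-10.0, -30.0, -32.0]  # confident: class 0 stands out clearly
shape_ll = [-21.0, -20.0, -22.0]       # ambiguous: all classes score alike

best_class, w_a, w_s = fuse(appearance_ll, shape_ll)
```

In this example the shape modality on its own would pick class 1, but its likelihoods are nearly flat, so it receives a small weight and the fused decision follows the more reliable appearance modality.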

6. BPN Based Likelihood Ratio Score Fusion

A Back-propagation learning feed-forward neural network [63], shown in Figure 7, is used with tan-sigmoid transfer functions in both the hidden layer and the output layer. A three-layer Back-propagation learning neural network algorithm has been used to classify the visual speech features [64].

If the input vector is $x$, then the output of the hidden layer is calculated as follows:
$$a = \operatorname{tansig}\left(w^{T}x + b\right),$$
where $w$ is the weight vector and $b$ is the bias input. The error is calculated as the difference between the target output and the actual network output. The goal is to minimize the average of the sum of the squares of these errors:
$$\mathrm{mse} = \frac{1}{Q}\sum_{k=1}^{Q}\left(t(k) - a(k)\right)^{2}.$$

Here, mse is the mean square error, $t(k)$ represents the target output, $a(k)$ represents the network output, and $Q$ is the number of training samples. The weights and bias values are updated based on the goal average error value.

In the proposed audio-visual system, the final weights and bias values are calculated in the training stage. In the test phase, the output of the network is calculated for the new input and compared with the target outputs to select the class of the input. The numbers of nodes in the input, hidden, and output layers are 2, 100, and 8, respectively. The overall procedure for the proposed system, with likelihood ratio based score fusion using the Back-propagation learning neural network, is shown in Figure 8.
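A forward pass through a network of this size (2 input nodes, 100 hidden nodes, 8 output nodes) with tan-sigmoid activations can be sketched in Python with NumPy as follows. This is a sketch under stated assumptions: the weights are random rather than trained, the input values are placeholders for the two integrated audio and visual inputs, and the target encoding (+1 for the correct class, -1 elsewhere) is an assumption that the paper does not state. The function names are hypothetical.

```python
import numpy as np


def init_network(n_in=2, n_hidden=100, n_out=8, seed=0):
    """Create small random weights and zero biases for a three-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.1, (n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(0.0, 0.1, (n_out, n_hidden)),
        "b2": np.zeros(n_out),
    }


def forward(net, x):
    """Tan-sigmoid (tanh) activation in both the hidden and the output layer."""
    hidden = np.tanh(net["w1"] @ x + net["b1"])
    return np.tanh(net["w2"] @ hidden + net["b2"])


def mse(target, output):
    """Mean square error between target and network output."""
    return float(np.mean((target - output) ** 2))


net = init_network()
x = np.array([0.93, 0.07])  # placeholder audio and visual input values
output = forward(net, x)

target = -np.ones(8)        # assumed encoding: +1 for the true class, -1 elsewhere
target[3] = 1.0
error = mse(target, output)
predicted_class = int(np.argmax(output))
```

With trained weights, `predicted_class` would be the identified speaker; with the random weights used here it is arbitrary.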

The major drawbacks of the Back-propagation learning neural network algorithm are its training time and local minima. The convergence time of the Back-propagation algorithm is inversely proportional to the error tolerance rate, so effective use of the error rate during learning can decrease the convergence time. First, the final error rate is selected. Then the weights are converged so that all the patterns reach an intermediate error rate (which must be higher than the final error rate). Finally, the system is converged to the next lower error rate, and so on, until the final target error is reached. For example, if the final error rate of the system is 0.001, the patterns are first converged to an error rate of 0.009, then 0.005, 0.003, and finally 0.001. This process is known as SET-BPL [65].

In total, 100 speech utterances are used to train the Back-propagation learning neural network, and the effects of applying SET-BPL to the proposed system are shown in Figure 9. The local minima problem sometimes occurs in the Back-propagation learning neural network algorithm. As a precaution, the addition of internal nodes and lowering of the gain term have been considered when setting the learning parameters. However, adding internal nodes and lowering the gain term can increase the convergence time. To overcome these learning difficulties, a momentum term is used to speed up the convergence process in the proposed speaker identification system.
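The idea of tightening the error goal in stages while using a momentum term can be sketched in Python with NumPy as follows. This is a loose, simplified interpretation for illustration, not the SET-BPL algorithm of [65]: the network is a small tanh network trained by full-batch gradient descent, a stage ends when the worst per-pattern squared error falls below the stage goal (or an epoch cap is hit), and the toy data, hidden layer size, learning rate, and momentum value are all assumptions.

```python
import numpy as np


def train_staged(x, t, stage_errors=(0.009, 0.005, 0.003, 0.001),
                 hidden=8, lr=0.05, momentum=0.8, max_epochs=2000, seed=0):
    """Train a tanh network, tightening the per-pattern error goal stage by stage."""
    rng = np.random.default_rng(seed)
    params = [rng.normal(0.0, 0.5, (x.shape[1], hidden)), np.zeros(hidden),
              rng.normal(0.0, 0.5, (hidden, t.shape[1])), np.zeros(t.shape[1])]
    velocity = [np.zeros_like(p) for p in params]

    def evaluate():
        h = np.tanh(x @ params[0] + params[1])
        y = np.tanh(h @ params[2] + params[3])
        worst = float(np.max(np.mean((t - y) ** 2, axis=1)))
        return h, y, worst

    h, y, worst = evaluate()
    initial_error = worst
    history = []
    for goal in stage_errors:
        epochs = 0
        while worst > goal and epochs < max_epochs:
            # Back-propagate the gradient of the mean square error.
            d_out = 2.0 * (y - t) * (1.0 - y ** 2) / t.size
            d_hid = (d_out @ params[2].T) * (1.0 - h ** 2)
            grads = [x.T @ d_hid, d_hid.sum(axis=0), h.T @ d_out, d_out.sum(axis=0)]
            for p, v, g in zip(params, velocity, grads):
                v *= momentum  # momentum term: keep part of the previous step
                v -= lr * g
                p += v
            epochs += 1
            h, y, worst = evaluate()
        history.append((goal, epochs, worst))
    return initial_error, history


# Toy two-class problem with two inputs per pattern (hypothetical data).
x = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
t = np.array([[0.9], [0.9], [-0.9], [-0.9]])

initial_error, history = train_staged(x, t)
```

Each entry of `history` records the stage goal, the number of epochs spent in that stage, and the worst per-pattern error at the end of the stage, which makes it easy to see how much training each tightening of the goal costs.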

6.1. Experimental Evaluation according to VALID Audio-Video Database

The VALID audio-visual multimodal database [66] has been used to measure and compare the accuracy of the proposed and existing systems. For the visual features, the database contains 106 subjects, each recorded in four office lighting conditions, gathered periodically over one month to give some temporal variation, and in one studio session with controlled lighting. The 576 × 720 stills were extracted from the video segments. Three sets of these are offered: the 1st, 10th, and 50th frames for each of the 106 × 5 sessions. The five sessions were recorded over a period of one month, allowing for variation in the voice, clothing, facial hair, hairstyle, and cosmetic appearance of the subjects, and also for variation in the visual background, illumination, and acoustic noise. The first session was recorded under controlled acoustic and illumination conditions, that is, in a controlled environment. The database is designed to be realistic and challenging; hence the other four sessions were recorded in noisy real-world scenarios with no control over illumination or acoustic noise, that is, in an uncontrolled environment. Some processed facial images from the VALID database are shown in Figure 10.

For the speech wave, the database contains 106 subjects with one studio recording and four office recordings for each person, corresponding to the facial images, where the following two speech utterances are found:
Uttered sentence 1: “Joe Took Father’s Green Shoe bench Out,”
Uttered sentence 2: “5 0 6 9 2 8 1 3 7 4.”

Of the above two sentences, the second has been used for learning and testing. Out of the five facial images and speech utterances, the neutral face and the corresponding speech utterance have been used for learning, and the other four noisy office images and utterances have been used for testing.

6.2. Performance Analysis between Existing and Proposed Method

To evaluate the performance of the BPN based likelihood ratio score fusion technique, different variations of audio and visual features are combined, and results are obtained at various SNRs of the audio signal, as shown in the following subsections.

6.2.1. Experiment of Appearance Based Facial Feature with Audio Feature

Appearance based facial features are combined with the MFCC based audio feature to evaluate the performance of the proposed score fusion based speaker identification. Results are shown in Figure 11, where the highest speaker identification rate is 95% at an SNR of 30 dB for the proposed BPN score fusion approach, compared with 91.33% for the existing Rogozan and Deléglise method.

6.2.2. Experiment of Shape Based Facial Feature with Audio Feature

Figure 12 shows the results for the shape based facial feature with the MFCC based audio feature. At an SNR of 30 dB, the speaker identification rates of the Rogozan and Deléglise method and the proposed BPN score fusion approach are 93.33% and 96.33%, respectively.

6.2.3. Experiment of Combined Appearance-Shape Based Facial Feature with Audio Feature

Results for the combined appearance and shape based facial features with the MFCC based audio feature are shown in Figure 13. The highest speaker identification rate of 98.67% is found at an SNR of 30 dB with the proposed BPN score fusion approach, whereas the existing Rogozan and Deléglise method achieves 95% at the same SNR.

From the above experimental results, the Back-propagation learning network based score fusion approach gives better performance than the existing Rogozan and Deléglise score fusion method for every combination of audio and visual features. The results also show that the combined appearance and shape based facial feature achieves higher accuracy than either individual facial feature, as shown in Table 11.

7. Conclusions and Observations

In this work, the performance of the proposed system has been evaluated at various levels and along various dimensions. Two different types of facial features are combined with the audio feature at various artificial noise addition rates. The NOIZEUS speech database has been used to evaluate the performance of the audio-only speaker identification system, whereas the ORL facial database has been used for the visual-only identification system. Finally, the overall performance, that is, audio-visual speaker identification, has been measured on the VALID audio-visual database. Noise removal techniques have been used to reduce or eliminate noise from the speech utterances and facial images. The experimental results and performance analysis show the advantage of the proposed BPN score fusion approach over the existing Rogozan and Deléglise method for audio-visual speaker identification, which can be used effectively for various real-life access control and authentication purposes.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.