Abstract

Describing dynamic textures has attracted growing attention in the field of computer vision and pattern recognition. In this paper, a novel approach for recognizing dynamic textures, namely, high order volumetric directional pattern (HOVDP), is proposed. It is an extension of the volumetric directional pattern (VDP), which extracts and fuses the temporal information (dynamic features) from three consecutive frames. HOVDP combines the movement and appearance features by considering the nth order volumetric directional variation patterns of all neighboring pixels from three consecutive frames. In experiments with two challenging video face databases, YouTube Celebrities and Honda/UCSD, HOVDP clearly outperformed a set of state-of-the-art approaches.

1. Introduction

The texture of objects in digital images can generally be categorized into two main types: static texture and dynamic texture, the latter being an extension of texture to the temporal domain. Local feature detection and description have gained much attention in recent years, since photometric descriptors computed for regions of interest have proven to be very successful in many computer vision applications. In the context of texture (feature) analysis methods, there are two common types of techniques: the structural approaches, where the image texture is considered as a repetition of some primitives with a specific rule of placement, and the statistical methods. In the latter, the stochastic properties of the spatial distribution of gray levels in an image are characterized by the gray tone co-occurrence matrix. A set of textural features derived from the co-occurrence matrix is widely used to extract textural information from digital images [1].
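As a small illustration of the statistical approach, the sketch below builds a gray tone co-occurrence matrix for one pixel offset and derives a single textural feature (contrast) from it. The function names, the default offset, and the choice of the contrast feature are assumptions made for this example; they are not taken from [1].

```python
def cooccurrence_matrix(image, levels, dy=0, dx=1):
    """Count how often gray level a has gray level b at offset (dy, dx)."""
    glcm = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            # Only count pairs whose second pixel lies inside the image.
            if 0 <= ny < rows and 0 <= nx < cols:
                glcm[image[y][x]][image[ny][nx]] += 1
    return glcm


def contrast(glcm):
    """Contrast feature: squared gray-level difference weighted by pair frequency."""
    total = sum(sum(row) for row in glcm)
    weighted = 0
    for a, row in enumerate(glcm):
        for b, count in enumerate(row):
            weighted += (a - b) ** 2 * count
    return weighted / total
```

A flat region gives a contrast of zero, while an image whose horizontal neighbors differ often gives a larger value.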

Face recognition (FR) is a technology that has spread to many applications, such as biometric systems, access control and information security systems, surveillance systems, content-based video retrieval systems, credit-card verification systems, and more generally image understanding. FR is a biometric approach that employs automated methods to verify or recognize the identity of a living person based on their physiological characteristics. The key component of every face recognition system is its feature extractor, which should be distinctive and stable under different conditions. FR systems can generally be categorized into one of two main scenarios based on the characteristics of the images to be matched: still-image-based (still-to-still) FR [2–4] or video-based (video-to-video) FR. A third possibility is a video-to-still-image-based face recognition system [5].

Dynamic texture (DT), or temporal texture, is a texture with motion; it covers the class of video sequences that exhibit some stationary properties in time. Recently, researchers have started to investigate the video domain, where the problem of face recognition becomes more challenging due to pose variations, different facial expressions, illumination changes, occlusions, and so on. However, DT provides many samples of the same person, thus providing the opportunity to convert many weak examples into a strong prediction of the identity. Zhao and Pietikäinen introduced an extended version of LBP named volume local binary patterns (VLBP) for video-based facial expression recognition [6]. They claim that the features extracted in a small local neighborhood of the volume can be boosted by combining the motion and appearance. These features are insensitive to translation and rotation. However, the method has some limitations under illumination changes, since it deals with the small local neighborhood of a pixel and uses the image intensity directly. Other methods for still-image and video-based face recognition do not rely on image features; instead, they depend on a sparse representation based categorization strategy, which considers the local sparsity identification from sparse coding coefficients [7, 8]. Zheng et al. [9] recently introduced a full system for unconstrained video-based face recognition, which is composed of face/fiducial detection, face association, and face recognition.

Nowadays, manifold features (linear subspaces), if the features lie in Euclidean spaces, have proven a powerful representation for video-based face recognition. Huang et al. [10] recently introduced a new method called projection metric learning on Grassmann manifold (PML), which is combined with Grassmannian graph-embedding discriminant analysis (GGDA) [11]. In this technique, each video sequence can be treated as a set of face images without considering the temporal information. It serves as both a metric learning and a dimensionality reduction method for the Grassmann manifold to map the manifold to a reproducing kernel Hilbert space (RKHS). Although kernel-based methods have been successfully used in many computer vision applications, poor choice of kernels can often result in degraded classification performance [12], especially when the data lies in non-Euclidean spaces. Päivärinta et al. [13] introduced a blur-insensitive descriptor for dynamic textures, named volume local phase quantization (VLPQ). It is based on binary encoding of the phase information of the local Fourier transform. In this technique, each video sequence is processed to provide one feature vector.

In this paper, we introduce a new video-based facial feature extractor, named high order volumetric directional pattern (HOVDP). HOVDP is a histogram bin-based code assigned to each pixel of an input frame, which is calculated by fusing twenty-four edge responses from three consecutive frames. These gradient values are detected using Kirsch masks in eight different directions. Any change (dynamic change) in a relative gradient response between the corresponding frames, in any direction, is detected and added to the feature vector. Unlike the conventional VDP operator, which encodes the volumetric directional information in the small local neighborhood of a pixel in each of three consecutive frames, HOVDP extracts nth order volumetric information by encoding various distinctive spatial relationships from each neighborhood layer of a pixel in a pyramidal multistructure way and then concatenating the feature vector of each neighborhood layer to form the final HOVDP-feature vector.

The remainder of the paper is organized as follows. Section 2 introduces the observation model and related work. The proposed HOVDP algorithm is presented in Section 3. Experimental results and analysis are presented in Section 4. Finally, we offer conclusions in Section 5.

2. Observation Model and Related Work

In this section, we present the observation model of capturing any small changes of the face textures and merging the movement and appearance features. To that end, we explain in detail the basis of the proposed technique, which is our previous work, named volumetric directional pattern (VDP) [14, 15]. The main goal of the VDP is to extract and fuse the temporal information (dynamic features) from three consecutive frames, which are distinct under multiple pose and facial expression variations. Given a video as input and a gallery of videos, we perform the face recognition process throughout the whole video clip. First, we detect faces using the Viola-Jones face detector [16]. Then, for each frame, we extract and combine the dynamic features of its two neighboring frames, and a histogram is built for each frame. These histograms are concatenated to form the final VDP-feature vector, which is computed in the same way for the gallery videos.

2.1. Volumetric Directional Pattern

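Volumetric directional pattern (VDP) is a gray-scale pattern that characterizes and fuses the temporal structure (dynamic information) of three consecutive frames [14, 15]. VDP has been developed to merge the movement and appearance features. It is a twenty-four-bit binary code assigned to each pixel of an input frame. The code is calculated by comparing the relative edge response values of a particular pixel across three consecutive frames, using Kirsch masks in eight different orientations centered on the pixel's own position in one frame and on the corresponding positions in the other two frames. The Kirsch mask is a first-derivative filter that is used to detect edges in all eight directions of a compass, considering all eight neighbors [17]. Specifically, it takes a single mask, denoted as $M_i$ for $i = 0, 1, \ldots, 7$, and rotates it in $45^\circ$ increments through all 8 compass directions as follows:

$$
\begin{aligned}
M_0 &= \begin{bmatrix} -3 & -3 & 5 \\ -3 & 0 & 5 \\ -3 & -3 & 5 \end{bmatrix}, &
M_1 &= \begin{bmatrix} -3 & 5 & 5 \\ -3 & 0 & 5 \\ -3 & -3 & -3 \end{bmatrix}, &
M_2 &= \begin{bmatrix} 5 & 5 & 5 \\ -3 & 0 & -3 \\ -3 & -3 & -3 \end{bmatrix}, &
M_3 &= \begin{bmatrix} 5 & 5 & -3 \\ 5 & 0 & -3 \\ -3 & -3 & -3 \end{bmatrix}, \\
M_4 &= \begin{bmatrix} 5 & -3 & -3 \\ 5 & 0 & -3 \\ 5 & -3 & -3 \end{bmatrix}, &
M_5 &= \begin{bmatrix} -3 & -3 & -3 \\ 5 & 0 & -3 \\ 5 & 5 & -3 \end{bmatrix}, &
M_6 &= \begin{bmatrix} -3 & -3 & -3 \\ -3 & 0 & -3 \\ 5 & 5 & 5 \end{bmatrix}, &
M_7 &= \begin{bmatrix} -3 & -3 & -3 \\ -3 & 0 & 5 \\ -3 & 5 & 5 \end{bmatrix}.
\end{aligned}
\tag{1}
$$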

Given a central pixel in the middle (center) frame of three consecutive frames, the eight directional edge response values $m^{C}_{i}$ for $i = 0, 1, \ldots, 7$ are used to create an eight-bit binary number that describes the edge response pattern of each pixel in the center frame (frame of interest). Meanwhile, the eight edge response values $m^{P}_{i}$ and $m^{N}_{i}$ for $i = 0, 1, \ldots, 7$ are each used to create an eight-bit binary number, which describe the edge response pattern of each pixel in the previous frame and next frame, respectively. Figure 1 shows the twenty-four edge responses and their corresponding bit positions, as well as the fusing strategy of this 24-bit code. The twenty-four directional edge response values for each pixel location can be computed by

$$m^{P}_{i} = I^{P} \odot M_i, \tag{2}$$

$$m^{C}_{i} = I^{C} \odot M_i, \tag{3}$$

and

$$m^{N}_{i} = I^{N} \odot M_i, \tag{4}$$

where $\odot$ represents the dot product operation, $M_i$ is the mask for direction $i$, and $I^{P}$, $I^{C}$, and $I^{N}$ are the neighbors of each pixel in the previous, center, and next frames, respectively. $m^{P}_{i}$, $m^{C}_{i}$, and $m^{N}_{i}$ are the spatiotemporal directional response values of the first layer for the previous, center, and next frames, respectively.
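The computation of the twenty-four edge responses can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the masks are ordered counter-clockwise starting from the east-facing Kirsch mask, frames are plain nested lists of gray levels, and image borders are not handled.

```python
# East-facing Kirsch mask; its eight outer entries form a ring around the center.
BASE_MASK = [[-3, -3, 5],
             [-3,  0, 5],
             [-3, -3, 5]]

# Ring positions (row, col), counter-clockwise, starting at the east cell.
RING = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]


def kirsch_masks():
    """Return eight 3x3 masks, each a 45-degree rotation of BASE_MASK."""
    ring_values = [BASE_MASK[r][c] for r, c in RING]
    masks = []
    for shift in range(8):
        mask = [[0, 0, 0] for _ in range(3)]
        for pos, (r, c) in enumerate(RING):
            # Shifting the ring values by one position rotates the mask by 45 degrees.
            mask[r][c] = ring_values[(pos - shift) % 8]
        masks.append(mask)
    return masks


def edge_responses(frame, y, x, masks):
    """Dot product of the 3x3 neighborhood centered at (y, x) with each mask."""
    responses = []
    for mask in masks:
        total = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                total += frame[y + dy][x + dx] * mask[dy + 1][dx + 1]
        responses.append(total)
    return responses


def volumetric_responses(prev_frame, center_frame, next_frame, y, x):
    """Return the 3 x 8 = 24 responses for the previous, center, and next frames."""
    masks = kirsch_masks()
    return [edge_responses(frame, y, x, masks)
            for frame in (prev_frame, center_frame, next_frame)]
```

Because every Kirsch mask sums to zero, a flat neighborhood produces zero response in all eight directions, so only local intensity changes contribute to the code.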

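In order to generate the VDP-feature vector, we need to know the $k$ most prominent directional bits for each of the three consecutive frames. These bits are set to 1, and the rest of the 8-bit VDP pattern of each frame are set to 0. A binary code is thus formed for each pixel of each frame, and it is mapped to its own bin to build a histogram. Finally, we concatenate the three histograms of the three consecutive frames to obtain the final VDP-feature vector, which is the descriptor for each center frame (frame of interest) that we use to recognize the face image with the help of a classifier. The final VDP code can be derived by

$$\mathrm{VDP} = H^{P} \,\|\, H^{C} \,\|\, H^{N} \tag{5}$$

and

$$c^{F} = \sum_{i=0}^{7} f\left(m^{F}_{i} - m^{F}_{k}\right) 2^{i}, \qquad f(a) = \begin{cases} 1, & a \ge 0, \\ 0, & a < 0, \end{cases} \tag{6}$$

where the vertical bars represent the concatenation process of the histograms; $H^{P}$, $H^{C}$, and $H^{N}$ are the histograms of the codes $c^{F}$ for the previous, center, and next frames ($F \in \{P, C, N\}$); $m^{F}_{i}$ is the edge response of frame $F$ in direction $i$; $m^{F}_{k}$ is the $k$th most significant of these responses; and $f$ is the thresholding function.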
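The coding and histogram steps described above can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements from the text: responses are compared by signed value against the k-th largest response, and ties at the threshold may set more than k bits.

```python
def directional_code(responses, k):
    """Return an 8-bit code whose bit i is 1 when responses[i] is among the k largest."""
    kth_largest = sorted(responses, reverse=True)[k - 1]
    code = 0
    for i, response in enumerate(responses):
        # Thresholding: the bit is set when the difference is non-negative.
        if response - kth_largest >= 0:
            code |= 1 << i
    return code


def vdp_feature(codes_prev, codes_center, codes_next):
    """Build one 256-bin histogram per frame and concatenate them (768 bins)."""
    feature = []
    for codes in (codes_prev, codes_center, codes_next):
        histogram = [0] * 256
        for code in codes:
            histogram[code] += 1
        feature.extend(histogram)
    return feature
```

Each argument of `vdp_feature` is the list of per-pixel codes of one frame, so the three histograms keep the previous, center, and next frames in separate 256-bin blocks.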

3. Method

Derived from a general definition of texture in a local neighborhood, the conventional VDP, or first order volumetric directional pattern, encodes the directional information in the small local neighborhood of a pixel in each of three consecutive frames, which may fail to extract detailed information, especially when the data collection environment changes. Therefore, we propose an improved version of VDP that tackles this problem by calculating the nth order volumetric directional variation patterns, namely, high order volumetric directional pattern (HOVDP). The proposed HOVDP can capture more detailed discriminative information than the traditional VDP. Unlike the VDP operator, the proposed HOVDP technique extracts nth order volumetric information by encoding various distinctive spatial relationships from each neighborhood layer of a pixel in a pyramidal multistructure way and then concatenating the feature vector of each neighborhood layer to form the final HOVDP-feature vector. Several observations can be made for HOVDP:
(i) Under the proposed framework, the original VDP is a special case of HOVDP, which simply calculates the first order volumetric directional information in the local neighborhood of a pixel.
(ii) The relation between the neighbor layers and the pixel under consideration can easily be weighted in HOVDP based on the distance between each layer and the central pixel. As a result, the pixels within the layer closest to the central pixel have more weight than the others.
(iii) Because HOVDP of different orders shares the same format and feature length, the different orders can be readily fused, and the accuracy of the face recognition can be significantly improved after the fusion.

3.1. High Order Volumetric Directional Pattern

The proposed high order volumetric directional pattern (HOVDP) technique is an oriented and multiscale volumetric directional descriptor that is able to extract and fuse the information of multiple frames, temporal (dynamic) information, and multiple poses and expressions of faces in the input video to produce strong feature vectors. Given a central pixel in the middle (center) frame of three consecutive frames, to calculate the second order volumetric directional pattern we first compute the first order pattern, which is exactly the same as the original VDP obtained using (2)–(6). The first order pattern is expressed in terms of three eight-bit binary numbers that describe the edge response pattern of each pixel of the first layer in the previous, center, and next frames, respectively.

To simplify the computation of the high order edge values, we consider the magnitude values of each layer separately after convolving the input image with the Kirsch masks in eight different directions, as shown in Figure 2. The eight directional edge response values of the first and second neighborhood layers are then obtained from these magnitude values. Each magnitude value is indexed by the number of surrounding pixels in each direction and by the neighborhood layer (here the second layer, i.e., second order), and a modulo operation is used to maintain the circular neighborhood configuration.
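The circular indexing of a neighborhood layer can be illustrated with the sketch below. It covers only the geometry: layer n is taken to be the square ring of 8n pixels at distance n from the central pixel, split into eight groups of n pixels, one per compass direction. The function names, the counter-clockwise ordering starting from the east, and the way each group is centered on its direction are assumptions made for this example; how the magnitude values inside a group are combined is not reproduced here.

```python
def ring_offsets(n):
    """Offsets (dy, dx) of the 8*n pixels of layer n, counter-clockwise from east."""
    offsets = []
    dy, dx = 0, n
    # Walk the square ring: up the right side, left along the top, down the
    # left side, right along the bottom, then back up to the starting pixel.
    walk = [(-1, 0, n), (0, -1, 2 * n), (1, 0, 2 * n), (0, 1, 2 * n), (-1, 0, n)]
    for step_dy, step_dx, count in walk:
        for _ in range(count):
            offsets.append((dy, dx))
            dy, dx = dy + step_dy, dx + step_dx
    return offsets


def direction_groups(n):
    """Split the ring of layer n into eight groups of n ring positions each."""
    size = 8 * n
    groups = []
    for direction in range(8):
        start = direction * n - n // 2
        # The modulo keeps every index on the ring, so the east group can
        # wrap around from the last ring position to the first.
        groups.append([(start + j) % size for j in range(n)])
    return groups
```

For the second layer, `ring_offsets(2)` lists the 16 border pixels of a 5 × 5 window and `direction_groups(2)` assigns two of them to each of the eight directions.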

Based on the observation that every corner or edge has high response values in particular directions, we need to know the most prominent directional bits for all three consecutive frames in order to generate the feature vector of each neighborhood layer. These bits are set to 1, and the rest of the 8-bit pattern in each layer of each frame are set to 0. A binary code is thus formed for each pixel in each layer of each frame, and it is mapped to its own bin to build a histogram for that particular layer of each frame. The volumetric directional pattern of each pixel position in the second neighborhood layer is formed in this way, with the thresholding function defined as in (6).

After identifying the volumetric directional pattern of each pixel in each neighborhood layer of each frame (for both the first layer and the second layer), a histogram of 256 bins is built to represent all the distinguishing features of each neighborhood layer of each frame separately, which means that each layer provides a feature vector of 768 bins (3 × 256). Then, to obtain the final second order VDP-feature vector, which is the descriptor for each center frame (frame of interest), we concatenate these histograms layer by layer, starting from the same layer order, as shown in Figure 3. The second order VDP-feature vector therefore has 1536 bins (2 × 768), and, in general, the feature vector size for the nth order is n × 768 bins.
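The size bookkeeping above can be checked with a short sketch. The function and constant names are hypothetical, and the per-frame histograms are assumed to be already built as plain lists of 256 counts.

```python
BINS_PER_HISTOGRAM = 256   # one bin per possible 8-bit code
FRAMES_PER_LAYER = 3       # previous, center, and next frame


def concatenate_layers(layers):
    """Concatenate per-layer histograms into one feature vector.

    layers[j] holds the three 256-bin histograms of neighborhood layer j + 1,
    so the result has len(layers) * 768 bins.
    """
    feature = []
    for histograms in layers:
        if len(histograms) != FRAMES_PER_LAYER:
            raise ValueError("each layer needs three histograms")
        for histogram in histograms:
            if len(histogram) != BINS_PER_HISTOGRAM:
                raise ValueError("each histogram needs 256 bins")
            feature.extend(histogram)
    return feature


def feature_length(order):
    """Feature vector size for a given order: order * 3 * 256 bins."""
    return order * FRAMES_PER_LAYER * BINS_PER_HISTOGRAM
```

Keeping the layers in a fixed order means that bins at the same index always refer to the same layer, frame, and code across videos.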

In a general formulation, the nth order volumetric directional pattern of each pixel position in each neighborhood layer of each frame is defined in the same manner. Here, n is the volumetric directional pattern order (the number of neighborhood layers used in the calculation). The code of each neighboring layer is obtained from the eight directional edge response values of that layer, its most significant directional responses, and the thresholding function defined as in (6), for the next frame, center frame, and previous frame, respectively. The edge response values are computed layer by layer from the magnitude values, and the same procedure is applied to find the eight directional edge response values of the next and previous frames.

4. Results and Discussion

To evaluate the robustness of the introduced method to illumination, pose, and expression variations, we tested it on two publicly available datasets, namely, the YouTube Celebrities dataset [18] and the Honda/UCSD database [19, 20]. All the face images in this work were detected using the Viola-Jones face detector [16]. After manually removing the false detections, all the detected face images were resized to a fixed size, and then the spatiotemporal information was extracted using the proposed high order VDP technique. For the face recognition process, we represent the face using a HOVDP-feature histogram. The objective is to compare the encoded feature vector from one frame with all other candidate feature vectors using two well-known classifiers. The first one is a support vector machine (SVM) classifier (we used LibSVM) [21], and the second one is a k-nearest neighbors (KNN) classifier. The face whose HOVDP-feature vector gives the lowest measured value is taken as the match.
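The lowest-value matching rule can be sketched as a nearest neighbor search over gallery histograms. The text does not state which measure is used, so the chi-square distance below is an assumption chosen because it is common for histogram features; the function names and the dictionary-based gallery are also hypothetical.

```python
def chi_square_distance(hist_a, hist_b):
    """Chi-square distance between two histograms of equal length."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total


def nearest_match(query, gallery):
    """Return the gallery label whose histogram is closest to the query.

    gallery maps each identity label to its feature histogram.
    """
    return min(gallery, key=lambda label: chi_square_distance(query, gallery[label]))
```

With `k = 1`, a KNN classifier reduces to this rule; larger values of `k` would instead vote among the `k` closest gallery entries.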

Two different experiments are conducted for each database to verify the effectiveness and efficiency of the proposed HOVDP framework. The first one explores the effectiveness of different volumetric directional orders (different neighborhood layers for each pixel of the image) while changing the number of the most prominent response values. The second one evaluates the effectiveness of the proposed HOVDP by comparing it with four popular video-based face recognition techniques as well as with our conventional VDP. To avoid any bias, we randomly selected the data for training and testing.

Considering computational efficiency, we observe that HOVDP requires a longer execution time than the VLPQ method. For example, when processing one video clip that consists of 248 frames, HOVDP takes around 2 minutes, while VLPQ requires around 2 seconds. The reason is that HOVDP computes features from every three adjacent frames for all pixels, and each added neighborhood layer increases the feature dimension. In contrast, the competing VLPQ method computes the features based on Fourier transform estimation, which is performed locally using a short-term Fourier transform implemented with 1D convolutions; the convolutions are computed using only valid areas, i.e., areas that can be computed without zero-padding, and convolutions that occur multiple times in the process are calculated only once [13]. The computing platform is an Intel Core i5 2.27-GHz machine with 4 GB of RAM, and all implementations are in MATLAB R2016a.

4.1. YouTube Celebrities Dataset

The YouTube Celebrities database is a large-scale video dataset that contains video sequences of different celebrities (actors and politicians) collected from YouTube. The dataset is considered one of the most challenging video databases due to the large illumination, pose, and expression variations as well as low resolution and motion blur. In this part, we evaluated the proposed HOVDP on all celebrities, while some of the compared state-of-the-art methods were evaluated on only some of the subjects (e.g., Yang et al. [23] use only part of the celebrities). Following the prior works [10, 22], for each subject three video sequences are randomly selected as the training data, with the other six video clips randomly selected for testing. When using the publicly available code of the VLPQ technique [13], for each subject six video sequences are randomly selected as the training data, with the other three video clips randomly selected for testing. For this publicly available code, we used the default settings of all its parameters, which yield the optimal performance. We conduct one experiment with a random selection of training/testing data. The clips contain different numbers of frames (from 8 to 400), are mostly of low resolution, and are highly compressed. Figure 4 shows some examples of cropped faces in this dataset.

To show the effectiveness of the proposed HOVDP technique, Table 1 summarizes the recognition rates obtained by changing the neighborhood layer size (the order of the proposed approach), where 1 means a 3 × 3 window, 2 means a 5 × 5 window, etc., and by varying the threshold (the number of most prominent edge response values), together with a comparison with VDP (the special case of HOVDP). Table 1 also identifies the threshold that yields optimal performance for this dataset. Additionally, it is clear that the second order VDP improves on the face recognition accuracy of the first order VDP in all test cases, while the third order and fourth order decrease the accuracy rates. The reason is that increasing the scale (the number of neighboring pixels) causes the descriptor to extract and fuse information from different poses and different locations of the face components, which produces confused feature vectors. In addition, increasing the descriptor order (the neighborhood layer size) increases the feature vector length, which slows down feature extraction.

Table 2 presents the performance on this dataset of well-known face recognition algorithms, namely regularized nearest points (RNP) [23], sparse approximated nearest points between image sets (SANP) and its kernel extension (KSANP) [22], and projection metric learning on Grassmann manifold (PML) [10] combined with Grassmannian graph-embedding discriminant analysis (GGDA) [11] (denoted as PML-GGDA), together with the proposed HOVDP, the original VDP, and the dynamic texture descriptor VLPQ [13]. Note that the results of the compared methods are taken from their original references, which are cited in the table. For the VLPQ technique, the reported results are based on its default settings, which yield the optimal performance. Note also that only a part of this dataset was used in RNP [23], where three video sequences were randomly selected as the training data and three other sequences were randomly selected as the testing data.

4.2. Honda/UCSD Dataset

The Honda/UCSD database consists of 59 video sequences of 20 different subjects. There are pose, illumination, and expression variations across the sequences of each subject. Each video consists of about frames. Figure 5 shows some examples [20]; each row corresponds to an image set of a subject. In our experiment, we use the standard training/testing configuration provided in [19]: 20 sequences are used for training (one video for each subject), and the remaining 39 sequences are used for testing. We report results using all frames as well as a limited number of frames. Specifically, following the prior works [22, 23], we conduct three sets of experiments: using only the first 50 frames of each video clip, using only the first 100 frames of each video clip, and using all video frames. If a set contains fewer frames than the selected number, all of its frames are used for recognition. Tables 3, 4, and 5 present the performance of the proposed HOVDP technique for set lengths of 50 frames/clip, 100 frames/clip, and all frames/clip, respectively, while changing the number of neighborhood layers (the order of the proposed approach) and the threshold (the number of most prominent edge response values).
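The frame-limiting protocol can be sketched in a few lines. The function name and the representation of a clip as a list of frames are assumptions made for this example.

```python
def limit_frames(clip, max_frames=None):
    """Return the frames used for recognition under a set-length limit.

    clip is a list of frames. With max_frames=None every frame is kept;
    a clip shorter than max_frames is returned in full.
    """
    if max_frames is None:
        return list(clip)
    # Slicing never reads past the end, so short clips keep all their frames.
    return list(clip[:max_frames])
```

Calling it with limits of 50, 100, and `None` on the same clip reproduces the three set lengths of the experiment.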

Table 6 presents the performance on this dataset of well-known video-based algorithms, namely SANP, KSANP, RNP, VLPQ, and the original VDP (the special case of the proposed technique), together with the proposed HOVDP. Note that the results of the compared methods are taken from their original references, which are cited in the table. For the VLPQ technique, the reported results are based on its default settings, which provide the optimal performance. For our proposed methods, we select the threshold value that yields optimal performance for the comparison.

5. Conclusion

In this paper, we introduced a new feature descriptor, namely, HOVDP. Throughout the performance evaluation in terms of face recognition accuracy, we found that HOVDP is robust for video-based face recognition applications. With a video as input and a gallery of videos, we performed the face recognition process throughout all the video clip frames. The evaluation results show that the proposed HOVDP algorithm improves the accuracy rates compared to the original VDP in all test cases and exceeds a set of state-of-the-art methods in most test cases. For the Honda/UCSD database, our proposed technique provides better recognition rates when smaller sets are used, although the other compared methods outperform ours when all frames are used. Smaller sets often occur in real-world applications; for example, the tracking of a face may fail in a long video sequence, so that only the first part of the video sequence is available for classification.

Data Availability

The video sequences used in this paper can be found at the following links: the YouTube Celebrities dataset at http://seqamlab.com/software-and-data/ and the Honda/UCSD dataset at http://vision.ucsd.edu/~iskwak/HondaUCSDVideoDatabase/HondaUCSD.html.

Disclosure

This manuscript was prepared based on the research that has been conducted as a part of the doctoral dissertation work of Almabrok Essa at the University of Dayton, Dayton, Ohio. The original dissertation document is available at https://etd.ohiolink.edu/!etd.send_file?accession=dayton1500901918995427&disposition=inline.

Conflicts of Interest

The authors declare that they have no conflicts of interest.