Abstract

Based on a special type of denoising autoencoder (DAE) and image reconstruction, we present a novel supervised deep learning framework for face recognition (FR). Unlike existing deep autoencoders, which are unsupervised face recognition methods, the proposed method takes the class label information of the training samples into account in the deep learning procedure and can automatically discover the underlying nonlinear manifold structures. Specifically, we define an Adaptive Deep Supervised Network Template (ADSNT) built from supervised autoencoders, which is trained to extract characteristic features from corrupted/clean facial images and to reconstruct the corresponding clean facial images. The reconstruction is realized by a so-called “bottleneck” neural network that learns to map face images into a low-dimensional vector and to reconstruct the respective face images from the mapped vectors. Having trained the ADSNT, a new face image can then be recognized by comparing its reconstruction with the individual gallery images. Extensive experiments on three databases, namely, AR, PubFig, and Extended Yale B, demonstrate that the proposed method can significantly improve the accuracy of face recognition under large variations in illumination and pose and with partial occlusion.

1. Introduction

Over the last couple of decades, face recognition has gained a great deal of attention in the academic and industrial communities on account of its challenging nature and its widespread applications. The study of face recognition has great theoretical value, involving image processing, artificial intelligence, machine learning, computer vision, and so on, and it is also closely related to other biometrics such as fingerprints, speech recognition, and iris scans. As a classic problem in pattern recognition, face recognition mainly covers two issues, feature extraction and classifier design. Currently, most existing works focus on these two aspects to improve the performance of face recognition systems.

In most real-world applications, face recognition is actually a multiclass classification problem. Many classification methods have been proposed by researchers. Among them, the nearest neighbor classifier (NNC) and its variants such as nearest subspace [1] are the most popular methods in pattern classification [2]. In [3], the face recognition problem was transformed into a binary classification problem by constructing intra- and interpersonal face image spaces. The intraspace represents the differences between images of the same person and the interspace denotes the differences between different people. Many binary classifiers, such as the Support Vector Machine (SVM) [4], Bayesian classifiers, and Adaboost [5], can then be used.

Besides classifier design, the other important issue is feature representation. In the real world, face images are usually affected by variations such as illumination, pose, occlusion, and expression. Moreover, the differences among images of the same person can be much larger than those between different people. Therefore, it is crucial to obtain efficient and discriminant features that make the intraspace compact while expanding the margin among different people. Until now, various feature extraction methods have been explored, including classical subspace-based dimension reduction approaches such as principal component analysis (PCA), Fisher linear discriminant analysis (FLDA), and independent component analysis (ICA) [6]. In addition, there are local appearance feature extraction methods such as the Gabor wavelet transform, local binary patterns (LBP), and their variants [7], which are stable under local facial variations such as expressions, occlusions, and poses. Recently, deep learning, including deep neural networks, has shown great success in image representation [8, 9]; the basic idea is to train a nonlinear feature extractor in each layer [10, 11]. After greedy layer-wise training of a deep network architecture, the output of the network is used as the image feature for the subsequent classification task. Among deep network architectures, the denoising autoencoder (DAE) [12] is a representative building block that learns features robust to noise through a nonlinear deterministic mapping. Image features derived from DAE have demonstrated good performance in many tasks such as object detection and digit recognition. Inspired by the great success of DAE-based deep network architectures, a supervised autoencoder (SAE) [9] was proposed as a building block; it treats facial images exhibiting variations such as illumination, expression, and pose as images corrupted by noise. Through an SAE, a face image free of these variations can be recovered; meanwhile, robust features for image representation are also extracted.

Motivated by the great success of DAE- and SAE-based deep learning and by the challenges of face recognition in complex environments, in this article we present a novel deep learning method based on the SAE for face recognition. Unlike the existing deep stacked autoencoder (AE), which is an unsupervised feature learning approach, our proposed method takes full advantage of the class label information of the training samples in the deep learning procedure and tries to discover the underlying nonlinear manifold structures in the data.

The rest of this paper is organized as follows. In Section 2, we give a brief review of DAE and the state-of-the-art face recognition based on deep learning. In Section 3, we focus on the proposed face recognition approach. The experimental results conducted on three public databases are given in Section 4. Finally, we draw a conclusion in Section 5.

2. Related Work

In this section, we briefly review work related to DAE and deep learning based face recognition systems.

2.1. Work Related to DAE

DAE is a one-layer neural network and a recent variant of the conventional autoencoder (AE). It tries to recover the clean input data sample from its corrupted version. The architecture of DAE is illustrated in Figure 1(a). Let there be a total of $N$ training samples and let $x_i \in \mathbb{R}^{d}$ denote the original input data. In DAE, the input data are first contaminated with some predefined noise, such as Gaussian white noise or Poisson noise, to obtain a corrupted version $\tilde{x}_i$, which is fed into an encoder $h_i = f(W\tilde{x}_i + b)$. The output $h_i$ of the encoder is then used as the input of a decoder $\hat{x}_i = g(W'h_i + b')$. Here $f$ and $g$ are predefined activation functions, such as the sigmoid function, hyperbolic tangent function, or rectifier function [13], of the encoder and decoder, respectively. $W \in \mathbb{R}^{d' \times d}$ and $W' \in \mathbb{R}^{d \times d'}$ are the network parameters denoting the weights of the encoder and decoder, respectively, and $b$ and $b'$ refer to the bias terms. $d$ and $d'$ denote the dimensionality of the original data and the number of hidden neurons, respectively. On the basis of the above definitions, a DAE learns by solving a regularized optimization problem as follows:
$$\min_{W, W', b, b'} \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - \hat{x}_i\right\|_2^2 + \lambda\left(\|W\|_F^2 + \|W'\|_F^2\right). \tag{1}$$
Here $\sum_{i=1}^{N}\|x_i - \hat{x}_i\|_2^2$ is the reconstruction error, $\|\cdot\|_F$ denotes the Frobenius norm, and $\lambda$ is a parameter that balances the reconstruction loss and the weight penalty terms. By reconstructing the clean input data from a corrupted version of it, a DAE can explore more robust features than a conventional AE, which may simply learn the identity mapping.
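For illustration, the following is a minimal NumPy sketch of a one-layer DAE trained with the squared reconstruction loss and Frobenius weight penalty of (1). The layer sizes, noise level, learning rate, and random data are assumptions made only for this sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: d input dimensions, d_h hidden neurons, N samples.
d, d_h, N = 780, 1024, 256
X = rng.random((N, d))                         # stand-in for clean training data

# Encoder parameters (W, b) and decoder parameters (W', b').
W  = rng.normal(0, 0.01, (d, d_h));  b  = np.zeros(d_h)
Wp = rng.normal(0, 0.01, (d_h, d));  bp = np.zeros(d)

lam, lr = 1e-4, 0.1                            # weight penalty and learning rate (assumed)
for epoch in range(50):
    X_tilde = X + rng.normal(0, 0.1, X.shape)  # corrupt the input with Gaussian noise
    H = sigmoid(X_tilde @ W + b)               # encoder: h = f(W x~ + b)
    X_hat = sigmoid(H @ Wp + bp)               # decoder: x^ = g(W' h + b')
    err = X_hat - X                            # reconstruct the *clean* data
    loss = np.mean(np.sum(err**2, axis=1)) + lam * (np.sum(W**2) + np.sum(Wp**2))

    # Backpropagation for the squared loss with sigmoid units.
    dXhat = 2.0 * err / N
    dZ2 = dXhat * X_hat * (1 - X_hat)
    dWp = H.T @ dZ2 + 2 * lam * Wp; dbp = dZ2.sum(0)
    dH  = dZ2 @ Wp.T
    dZ1 = dH * H * (1 - H)
    dW  = X_tilde.T @ dZ1 + 2 * lam * W; db = dZ1.sum(0)

    for p, g in ((W, dW), (b, db), (Wp, dWp), (bp, dbp)):
        p -= lr * g                            # plain gradient descent step
    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss {loss:.4f}")
```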

To further promote learning meaningful features, sparsity constraints [14] are imposed on the hidden neurons when the number of hidden neurons is large. The constraint is defined in terms of the Kullback-Leibler (KL) divergence as
$$\sum_{j=1}^{d'} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \sum_{j=1}^{d'}\left[\rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right], \tag{2}$$
where $d'$ is the number of neurons in the hidden layer, $\hat{\rho}_j$ is the average activation of hidden unit $j$ over the training set, and $\rho$ is a sparsity parameter (typically a small value).
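For reference, a small helper computing this KL sparsity penalty over a matrix of hidden activations; the value $\rho = 0.05$ matches the setting used later in the paper, while the function name and clipping constant are illustrative.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, eps=1e-8):
    """Sum of KL(rho || rho_hat_j) over hidden units, as in (2).

    H: (N, d_h) matrix of hidden activations in [0, 1] (e.g., sigmoid outputs);
    rho_hat_j is the mean activation of hidden unit j over the training set.
    """
    rho_hat = np.clip(H.mean(axis=0), eps, 1 - eps)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```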

After $W$ and $b$ have been learned, the output $h_i$ of the encoder is fed into the next layer. By training such DAEs layer by layer, stacked denoising autoencoders (SDAE) are built. Their structure is illustrated in Figure 1(b).
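A schematic of this greedy layer-wise stacking is sketched below; it assumes a hypothetical `train_dae` helper (such as the one-layer sketch above) that returns the learned encoder, and the layer sizes are only examples.

```python
def stack_denoising_autoencoders(X, layer_sizes, train_dae):
    """Greedy layer-wise pretraining: train one DAE per layer, then feed its
    hidden activations to the next DAE as 'data'. Returns the encoder stack."""
    encoders, data = [], X
    for d_h in layer_sizes:             # e.g., [1024, 500, 120]
        encode = train_dae(data, d_h)   # hypothetical one-layer DAE trainer
        encoders.append(encode)
        data = encode(data)             # activations become the next layer's input
    return encoders
```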

In real-world applications such as face recognition, faces are usually affected by all kinds of variations such as expression, illumination, pose, and occlusion. To overcome the effect of these variations, Gao et al. [9] proposed the supervised autoencoder based on the principle of DAE. They treated the training sample (gallery image) of each person with frontal/uniform illumination, neutral expression, and no occlusion as clean data, and the test faces (probe images) accompanied by variations (expression, illumination, occlusion, etc.) as corrupted data. A mapping capturing the discriminant structure of the facial images of different people is learned while remaining robust to the variations in these faces. Robust features are then extracted for image representation, and the performance of face recognition is greatly enhanced.

2.2. Deep Learning Based Face Recognition System

In early face recognition, there were various face representation methods based on hand-crafted or “shallow” learning features [6, 7]. In recent years, with the development of big data and computer hardware, feature learning based on deep structures has been greatly successful in the field of image representation [8, 12, 15, 16]. By means of deep structure learning, the representational ability of the model is greatly enhanced, and complicated (nonlinear) information can be learned from the original data effectively. In [16], a deep Fisher network was designed by stacking Fisher vectors, which significantly outperformed the conventional Fisher vector representation. Chen et al. [17] proposed the marginalized SDAE to learn an optimal closed-form solution, which reduced the computational complexity and improved the scalability to high-dimensional descriptive features. Taigman et al. [18] presented a face verification system based on Convolutional Neural Networks (CNNs), which obtained high verification accuracy on the LFW dataset. Zhu et al. [19] designed a network structure composed of a facial identity-preserving layer and an image reconstruction layer, which can reduce intrapersonal variance while preserving discriminant information. In [20], Hayat et al. proposed a deep learning framework based on AE with applications to image set classification and face recognition, which obtained the best performance compared with existing state-of-the-art methods. Gao et al. [9] further proposed an SAE that can be used to build a deep architecture and can extract facial features that are robust to variations. Sun et al. [21] learned multiple convolutional networks (ConvNets) by classifying 10,000 subjects, which generalized well to the face verification task. Furthermore, they improved the ConvNets by incorporating identification and verification tasks and enhanced the recognition performance [22]. Cai et al. [23] stacked several sparse independent subspace analyses (sISA) to construct a deep network structure to learn identity representations.

3. Proposed Method

This section presents our proposed approach, whose block diagram is illustrated in Figure 2. Firstly, inspired by stacked DAE and SAE [9], we define the Adaptive Deep Supervised Network Template (ADSNT), which can learn an underlying nonlinear manifold structure from the facial images. The basic architecture of ADSNT is illustrated in Figure 3(c) and the corresponding details are described in Section 3.1. To make the deep network perform well, similar to [20], we need to initialize its weights. The preinitialized ADSNT is then trained to reconstruct invariant faces that are insensitive to illumination, pose, and occlusion. Finally, having trained the ADSNT, we use the nearest neighbor classifier to recognize a new face image by comparing its reconstruction with the individual gallery images.

3.1. Adaptive Deep Supervised Network Template (ADSNT)

As presented in Figure 3(c), our ADSNT is a deep supervised autoencoder (DSAE) that consists of two parts: an encoder (EC) and a decoder (DC). Each of them has three hidden layers and they share the third, that is, the central hidden layer. The features learned in the central hidden layer and the reconstructed clean face are obtained by using the “corrupted” data to train the SSAE. In the pretraining process, we learn a stack of SAEs, each having only one hidden layer of feature detectors. The learned activation features of one SAE are then used as the “data” for training the next SAE in the stack. Such training is repeated until we obtain the desired number of layers. Although we use the basic SAE structure shown in Figure 3(a) [9] to construct the stacked supervised autoencoder (SSAE), Gao et al.’s stacked supervised autoencoder used only two hidden layers and one reconstruction layer. In this paper, we use three hidden layers to compose the encoder and the decoder, respectively, whose structures are shown in Figures 3(b) and 3(c). The encoder part seeks a compact, low-dimensional, meaningful representation of the clean/“corrupted” data. Following the work [20], the encoder can be formulated as a combination of several layers connected through a nonlinear activation function $f$, for which a sigmoid function or a rectified linear unit can be used, mapping the clean/“corrupted” data $\tilde{x}$ to a representation $h$ as follows:
$$h^{(l)} = f\left(W^{(l)}h^{(l-1)} + b^{(l)}\right), \quad l = 1, 2, 3, \quad h^{(0)} = \tilde{x}, \tag{3}$$
where $W^{(l)}$ is the weight matrix of the encoder for the $l$th layer with $d_l$ neurons and $b^{(l)}$ is the bias vector. The encoder parameters are learned by jointly training the encoder-decoder structure to reconstruct the clean data from the “corrupted” data by minimizing a cost function (see Section 3.2). The decoder, in turn, can be defined as a combination of several layers integrating a nonlinear activation function $g$, which reconstructs the face from the encoder output $h = h^{(3)}$. The reconstructed output of the decoder is given by
$$z^{(l)} = g\left(W'^{(l)}z^{(l-1)} + b'^{(l)}\right), \quad l = 1, 2, 3, \quad z^{(0)} = h^{(3)}, \quad \hat{x} = z^{(3)}. \tag{4}$$
So, we can describe the complete ADSNT by its parameter set $\theta = \{W^{(l)}, b^{(l)}, W'^{(l)}, b'^{(l)}\}$, where $l = 1, 2, 3$.
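The following is a minimal sketch of the ADSNT forward pass of (3) and (4) with three encoder layers, tied-weight decoder layers, and the 1024 → 500 → 120 sizes adopted later in the experiments; the random initialization and the choice of sigmoid activation here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: 1.0 / (1.0 + np.exp(-z))        # nonlinear activation (sigmoid assumed here)

d, sizes = 780, [1024, 500, 120]              # input dimension and encoder layer widths
dims = [d] + sizes
W  = [rng.normal(0, 0.01, (dims[l], dims[l + 1])) for l in range(3)]   # encoder weights
b  = [np.zeros(dims[l + 1]) for l in range(3)]                         # encoder biases
bp = [np.zeros(dims[2 - l]) for l in range(3)]                         # decoder biases

def adsnt_forward(x_tilde):
    """Encode a (possibly corrupted) face vector and decode it back.
    Decoder weights are tied to the encoder: W'^(l) = W^(4-l).T (see Figure 3(c))."""
    h = x_tilde
    for l in range(3):                        # encoder: h^(l) = f(W^(l) h^(l-1) + b^(l))
        h = f(h @ W[l] + b[l])
    z = h
    for l in range(3):                        # decoder: z^(l) = f(W^(3-l).T z^(l-1) + b'^(l))
        z = f(z @ W[2 - l].T + bp[l])
    return h, z                               # bottleneck feature and reconstruction
```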

3.2. Formulation of Image Reconstruction Based on ADSNT

Now, we are ready to describe image reconstruction based on ADSNT. The details are presented as follows.

Given a set of training images from $C$ classes that includes gallery images $\{x_i\}_{i=1}^{N}$ (called clean data) and probe images $\{\tilde{x}_i\}_{i=1}^{N}$ (called “corrupted” data), together with their corresponding class labels $\{y_i\}_{i=1}^{N}$, the dataset is used to train ADSNT for feature learning. Let $\tilde{x}_i$ denote a probe image and $x_i$ the gallery image of the same person. It is desirable that the hidden representations $h(x_i)$ and $h(\tilde{x}_i)$ should be similar. Therefore, following the work [9, 22], we obtain the following formulation:
$$\min_{\theta} J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(\left\|x_i - \hat{x}_i\right\|_2^2 + \alpha\left\|h(x_i) - h(\tilde{x}_i)\right\|_2^2\right) + \lambda\sum_{l=1}^{3}\left(\left\|W^{(l)}\right\|_F^2 + \left\|W'^{(l)}\right\|_F^2\right), \tag{5}$$
where $\theta$ (see Section 3.1) denotes the parameters of ADSNT, which are fine-tuned by learning. In this paper, we only explore tied weights; that is, $W'^{(1)} = W^{(3)T}$, $W'^{(2)} = W^{(2)T}$, and $W'^{(3)} = W^{(1)T}$ (see Figure 3(c)). $\hat{x}_i$ is the reconstruction of the corrupted image $\tilde{x}_i$. As a regularization parameter, $\alpha$ balances the similarity preservation term, keeping $h(x_i)$ and $h(\tilde{x}_i)$ as similar as possible. $h(\cdot)$ is the encoder mapping obtained with the nonlinear activation function $f$. $\lambda$ is a parameter that balances the weight penalty terms and the reconstruction loss. $\|\cdot\|_F$ denotes the Frobenius norm and ensures small weight values for all the hidden neurons. Furthermore, following the work [9, 14], we impose a sparsity constraint on the hidden layer to promote learning meaningful features. We can then further modify the cost function and obtain the following objective formulation:
$$\min_{\theta} J_{sp}(\theta) = J(\theta) + \beta\sum_{j}\left[\mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right) + \mathrm{KL}\left(\rho \,\middle\|\, \tilde{\rho}_j\right)\right], \tag{6}$$
where the KL divergence is computed, as in Section 2.1, between the sparsity target $\rho$ and the mean activations $\hat{\rho}_j$ or $\tilde{\rho}_j$. The sparsity target $\rho$ is usually a constant taking a small value (following the work [9, 24], it is set to 0.05 in our experiments), whereas $\hat{\rho}_j$ and $\tilde{\rho}_j$ are the mean activation values of hidden unit $j$ computed from the clean data and the corrupted data, respectively.
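A sketch of how the full cost of (5)-(6) might be assembled is given below. It reuses `kl_sparsity_penalty` from the Section 2.1 sketch, and the `forward` callable stands in for a batched version of the `adsnt_forward` sketch; the symbols `alpha`, `lam`, and `beta` mirror the balancing parameters in the text, and the function itself is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

def adsnt_cost(forward, weights, X_clean, X_corrupt, alpha, lam, beta, rho=0.05):
    """Sketch of the Section 3.2 objective: reconstruct clean faces from corrupted
    ones while keeping the bottleneck codes of each clean/corrupted pair similar.

    forward : callable mapping a batch to (bottleneck codes, reconstructions),
              e.g. a batched version of adsnt_forward from the earlier sketch
    weights : list of weight matrices entering the Frobenius penalty
    """
    H_clean, _ = forward(X_clean)
    H_corr, X_hat = forward(X_corrupt)
    n = X_clean.shape[0]

    recon   = np.sum((X_clean - X_hat) ** 2) / n                # reconstruction loss
    similar = alpha * np.sum((H_clean - H_corr) ** 2) / n       # similarity preservation
    decay   = lam * sum(np.sum(w ** 2) for w in weights)        # Frobenius weight penalty
    sparse  = beta * (kl_sparsity_penalty(H_clean, rho)         # sparsity on clean and
                      + kl_sparsity_penalty(H_corr, rho))       # corrupted codes (Section 2.1)
    return recon + similar + decay + sparse
```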

3.3. Optimization of ADSNT

To obtain the optimal parameters $\theta$, it is important to initialize the weights properly and to select a suitable optimization algorithm. The training will fail if the initial weights are inappropriate. That is to say, if we give the network too large initial weights, the ADSNT will be trapped in a local minimum; if the initial weights are too small, the ADSNT will encounter the vanishing gradient problem during backpropagation. Therefore, following the work [20, 24], Gaussian Restricted Boltzmann Machines (GRBMs), which have already been widely applied, are adopted to initialize the weight parameters by pretraining. For more details, we refer the reader to the original paper [24]. After obtaining the initial weights, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is used to learn the parameters, as it has better performance and faster convergence than stochastic gradient descent (SGD) and conjugate gradient descent (CGD) [25]. Algorithm 1 depicts the optimization procedure of ADSNT.

Algorithm 1. (learning adaptive deep supervised network template)

Input. Training images $X$ from $C$ classes; each class is composed of a face with neutral expression, frontal pose, and normal illumination (clean data) and a random number of variant faces (corrupted data). Number of network layers $L$. Iteration number $T$, balancing parameters $\alpha$, $\lambda$, and $\beta$, and convergence error $\varepsilon$.

Output. Weight parameters $\theta = \{W^{(l)}, b^{(l)}, W'^{(l)}, b'^{(l)}\}$, $l = 1, 2, 3$.
(1) Preprocess all images, namely, perform histogram equalization.
(2) Randomly select a small subset $X_s$ from $X$ for each individual.
(3) Initialize: train GRBMs using $X_s$ to initialize $\theta$.
(4) (Optimization by L-BFGS) For $t = 1, \dots, T$ do: calculate $J_{sp}(\theta_t)$ using (6); if $|J_{sp}(\theta_t) - J_{sp}(\theta_{t-1})| < \varepsilon$ and $t \le T$, go to Return.

Return. $W^{(l)}, b^{(l)}$ and $W'^{(l)}, b'^{(l)}$, $l = 1, 2, 3$.
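A hedged sketch of the L-BFGS fine-tuning step of Algorithm 1 using SciPy is shown below; the `cost_and_grad` helper (returning the objective of (6) and its gradient for a flat parameter vector) and the GRBM-produced `theta0` are assumed to exist and are only named here.

```python
from scipy.optimize import minimize

def train_adsnt(cost_and_grad, theta0, max_iter=200):
    """Fine-tune ADSNT with L-BFGS after GRBM pretraining has produced theta0.

    cost_and_grad : callable theta -> (J_sp(theta), dJ_sp/dtheta); packing all
                    network parameters into one flat vector is assumed
    theta0        : initial flat parameter vector from GRBM pretraining
    """
    result = minimize(cost_and_grad, theta0, jac=True, method="L-BFGS-B",
                      options={"maxiter": max_iter})
    return result.x                    # optimized flat parameter vector
```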

Since training the ADSNT model aims to reconstruct clean data, namely, gallery images, from corrupted data, it can learn an underlying structure from the corrupted data and produce a very useful representation. Furthermore, we can learn an overcomplete sparse representation from the corrupted data by mapping it into a high-dimensional feature space, since the first hidden layer has more neurons than the dimensionality of the original data. The high-dimensional representation is then followed by a so-called “bottleneck”; that is, the data is further mapped to an abstract, compact, low-dimensional representation in the subsequent layers of the encoder. Through such a mapping, redundant information such as illumination, pose, and partial occlusion in the corrupted faces is removed and only the useful information is kept. In addition, we know that if we use an AE with only one hidden layer and a linear activation function, the learned weights are analogous to a PCA subspace [20]. However, AE is an unsupervised algorithm. In our work, we make use of the class label information to train the SAE, so if we also used only one hidden layer with a linear activation function, the weights learned by the SAE could be regarded as similar to an “LDA” subspace. In our structure, however, we apply nonlinear activation functions and stack several hidden layers together, so the ADSNT can adapt to very complicated nonlinear manifold structures. Some reconstructed images based on ADSNT from the AR database are shown in Figure 4(b). One can see that ADSNT can remove the illumination variation. For face images with partial occlusion, ADSNT can also imitate the clean faces. These results are not surprising because human beings have the capability of inferring unknown faces from known face images via experience (for a deep network structure, the learned experience derives from the generic set) [9].

3.4. Face Classification Based on ADSNT Image Reconstruction

To train ADSNT well, all images need to be preprocessed, which is a very important step for object recognition, including face recognition. Common approaches include histogram equalization, geometry normalization, and image smoothing. In this paper, for the sake of simplicity, we only perform histogram equalization on all the facial images to minimize illumination variations. That is, we utilize histogram equalization to normalize the histograms of the facial images and make them more compact. For details about histogram equalization, we refer the reader to [26].
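In code, this preprocessing amounts to per-image histogram equalization; a minimal OpenCV sketch follows, with an illustrative file path.

```python
import cv2

def preprocess_face(gray_face):
    """Histogram-equalize an 8-bit grayscale face crop to reduce illumination variation."""
    return cv2.equalizeHist(gray_face)

# Example usage (path is illustrative):
# img = cv2.imread("face_crop.png", cv2.IMREAD_GRAYSCALE)
# img_eq = preprocess_face(img)
```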

After the ADSNT has been fully trained on a certain number of individuals, we can apply it to unseen face images to recognize them.

Given a test facial image $x_t$, which is also preprocessed with histogram equalization in the same way as the training images and presented to the ADSNT network, we reconstruct (using (3) and (4)) the image $\hat{x}_t$ from ADSNT, which is similar to a clean face. For the sake of simplicity, the nearest neighbor classifier based on the Euclidean distance between the reconstruction and all the gallery images identifies the class. The classification formula is defined as
$$c^{*} = \arg\min_{c}\left\|\hat{x}_t - x_c\right\|_2, \tag{7}$$
where $c^{*}$ is the resulting identity and $x_c$ is the clean facial image in the gallery of individual $c$.
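The classification rule of (7) in code form is sketched below: reconstruct the probe with ADSNT and assign the identity of the nearest gallery image. The `reconstruct` callable stands in for the decoder output of the earlier `adsnt_forward` sketch.

```python
import numpy as np

def classify_face(reconstruct, x_test, gallery, labels):
    """Nearest-neighbor identification on the ADSNT reconstruction (Eq. (7)).

    reconstruct : callable returning the ADSNT reconstruction of a probe vector
    x_test      : preprocessed probe image flattened to a 780-dim vector
    gallery     : (C, 780) array, one clean gallery image per individual
    labels      : identities corresponding to the gallery rows
    """
    x_hat = reconstruct(x_test)
    dists = np.linalg.norm(gallery - x_hat, axis=1)   # Euclidean distance to each gallery face
    return labels[int(np.argmin(dists))]
```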

4. Experimental Results and Discussion

In this section, extensive experiments are conducted to evaluate the proposed approach and compare its performance with that of other methods. The experiments are carried out on three widely used face databases, that is, AR [27], Extended Yale B [28], and PubFig [29]. The details of these three databases and the performance evaluation of the different approaches are presented as follows.

4.1. Dataset Description

The AR database contains over 4000 color face images of 126 people (56 women and 70 men). The images were taken in two sessions separated by two weeks, and each session contained 13 pictures per person. These images contain frontal-view faces with different facial expressions, illuminations, and occlusions (sunglasses and scarf). Some sample face images from AR are illustrated in Figure 5(a). In our experiments, for each person, we choose the facial images with neutral expression, frontal pose, and normal illumination as gallery images and randomly select half of the remaining images of each person as probe images. The remaining images compose the testing set.

The Extended Yale B database consists of 16128 images of 38 people under 64 illumination conditions and 9 poses. Some sample face images from Extended Yale B are illustrated in Figure 5(b). For each person, we select the faces with normal lighting and frontal pose as gallery images and randomly choose face images covering 6 poses and 16 illumination conditions to compose the probe images. The remaining images compose the testing set.

The PubFig database is composed of 58,797 images of 200 subjects taken from the internet. The images were taken in completely uncontrolled conditions with noncooperative subjects, so they exhibit a very large degree of variability in facial expression, pose, illumination, and so forth. Some sample images from PubFig are illustrated in Figure 5(c). In our experiments, for each individual, we select the faces with neutral expression, frontal or near-frontal pose, and normal illumination as gallery images and randomly choose half of the remaining images of each person as probes. The remaining images compose the testing set.

4.2. Experimental Settings

In all the experiments, the facial images from the AR, PubFig, and Extended Yale B databases are automatically detected using the OpenCV face detector [30]. After that, we normalize the detected facial images (in orientation and scale) such that the two eyes are aligned at the same location. The face areas are then cropped and converted to 256-level grayscale images. The size of each cropped image is 26 × 30 pixels, so the dimensionality of the input vector is 780. Figure 6 presents an example from the AR database and the corresponding cropped image. Each cropped facial image is further preprocessed with histogram equalization to minimize illumination variations. We train our ADSNT model with 3 hidden layers, where the number of hidden nodes for these layers is empirically set as 1024 → 500 → 120, because our experiments show that three hidden layers achieve sufficiently good performance (see Section 4.3.3).
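A sketch of this detection/crop/vectorize pipeline is given below. The Haar cascade file is OpenCV's stock frontal-face model, the detector parameters are illustrative, and the eye-based alignment described above is omitted for brevity.

```python
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_to_vector(gray_img):
    """Detect, crop, resize to 26x30, equalize, and flatten to a 780-dim vector.
    (Eye-based alignment from the paper is omitted in this sketch.)"""
    faces = detector.detectMultiScale(gray_img, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    crop = cv2.resize(gray_img[y:y + h, x:x + w], (26, 30))   # width=26, height=30
    crop = cv2.equalizeHist(crop)
    return crop.flatten().astype(np.float32) / 255.0          # 780 values in [0, 1]
```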

To illustrate the parameter-setting process, we initially use the hyperbolic tangent function as the nonlinear activation function and run ADSNT on AR. We again choose the face images with neutral expression, frontal pose, and normal illumination as gallery images and randomly select half of the remaining images of each person as probe images. The remaining images compose the testing set. The mean identification rates are recorded.

Firstly, we empirically set the sparsity target $\rho$ and fix the parameters $\lambda$ and $\beta$ in ADSNT to check the effect of the similarity preservation parameter $\alpha$ on the identification rate; the value of $\alpha$ for which ADSNT achieves the best performance can be read from Figure 6(a). Then, with $\alpha$ and $\lambda$ fixed according to Figure 6(a), we check the influence of $\beta$; as shown in Figure 6(b), our method achieves the best recognition rate at the value marked there. Finally, we fix $\alpha$ and $\beta$, and the recognition rates obtained with different values of $\lambda$ are illustrated in Figure 6(c), from which the best value of $\lambda$ is chosen. From the plots in Figure 6, one can observe that the parameters $\alpha$, $\beta$, and $\lambda$ can be neither too large nor too small. If $\alpha$ is too large, the ADSNT becomes less discriminative between different subjects because the similarity preservation term is enforced too strongly; but if $\alpha$ is too small, the significance of the similarity preservation term is lost and recognition performance degrades. Similarly, $\beta$ cannot be too large, or the hidden neurons will hardly be activated for a given input and a low recognition rate will result; if $\beta$ is too small, we also obtain poor performance. For the weight decay $\lambda$, if it is too small, the values of the weights for all hidden units will change only slightly; on the contrary, if it is too large, the weight values will change greatly.

Through the above experiments, we obtain the optimal values of $\alpha$, $\beta$, and $\lambda$ used in ADSNT on the AR database. Similar experiments have also been performed on the Extended Yale B and PubFig databases to determine the corresponding parameter settings for those two databases.

In the experiments, we use two measures, namely, the mean identification accuracy with standard deviation and the receiver operating characteristic (ROC) curves, to validate the effectiveness of our method as well as the other methods.

4.3. Experimental Results and Analysis
4.3.1. Comparison with Different Methods

In the following experiments on the three databases, we compare the proposed approach with several recently proposed methods, including DAE with 10% random mask noise [12], marginalized DAE (MDAE) [17], Contractive Autoencoders (CAE) [15], Deep Lambertian Networks (DLN) [31], the stacked supervised autoencoder (SSAE) [9], ICA with reconstruction cost (RICA) [32], and the Template Deep Reconstruction Model (TDRM) [20]. We use the implementations of these algorithms provided by the respective authors. For all the compared approaches, we use the default parameters recommended in the corresponding papers.

The mean identification accuracies with standard deviations of the different approaches on the three databases are shown in Table 1, and the corresponding ROC curves are illustrated in Figure 7. The results show that our approach significantly outperforms the other methods and achieves the best mean recognition rates for the same training and testing setup. Compared with unsupervised deep learning methods such as DAE, MDAE, CAE, DLN, and TDRM, the improvement of our method is over 30% on the Extended Yale B and AR databases, where there is little pose variation. On the PubFig database, our approach also achieves the highest mean identification rate and outperforms all compared methods. The reason is that the learned deep network extracts information that is discriminative and robust to variations (expression, illumination, pose, etc.). Compared with a supervised method such as RICA, the proposed method improves the accuracy by over 16%, 19%, and 23% on the AR, PubFig, and Extended Yale B databases, respectively. Our method is a deep learning method that addresses the nonlinear classification problem by learning a nonlinear mapping, so more nonlinear, discriminant information can be exploited to enhance the identification performance. Compared with the SSAE method, which is designed to remove variations such as illumination, pose, and partial occlusion, our method is still better by over 6% because of the weight penalty terms, the GRBM-based weight initialization, and the three-layer similarity preservation term.

4.3.2. Convergence Analysis

In this subsection, we evaluate the convergence of our ADSNT with respect to the number of iterations. Figure 8(a) illustrates the value of the objective function of ADSNT versus the number of iterations on the AR, PubFig, and Extended Yale B databases. From Figure 8(a), one can observe that ADSNT converges in about 55, 28, and 70 iterations on the three databases, respectively.

We also measure the identification accuracy of ADSNT versus the number of iterations on the AR, PubFig, and Extended Yale B databases. Figure 8(b) plots the mean identification rate of ADSNT. From Figure 8(b), one can also observe that ADSNT achieves stable performance after about 55, 70, and 28 iterations on the AR, PubFig, and Extended Yale B databases, respectively.

4.3.3. The Effect of Network Depth

In this subsection, we conduct experiments on the three face datasets with different numbers of hidden layers in the proposed ADSNT network. Figure 9 illustrates the performance of ADSNT with different numbers of layers, including the three-hidden-layer network (1024 → 500 → 120), on the AR, Extended Yale B, and PubFig datasets. One can observe that the three-hidden-layer network outperforms the 2-layer network, and the result of the 3-layer ADSNT network is very close to that of the 4-layer network on the AR and Extended Yale B databases. We also observe that the performance of the 4-layer network is slightly lower than that of the 3-layer network on the PubFig database. In addition, the deeper the ADSNT network is, the higher its computational complexity becomes. Therefore, a 3-layer network depth is a good trade-off between performance and computational complexity.

4.3.4. Activation Function

Following the work in [9], we also evaluate the performance of ADSNT with different activation functions, namely, the sigmoid, the hyperbolic tangent, and the rectified linear unit (ReLU) [33], which is defined as $f(x) = \max(0, x)$. When the sigmoid $\sigma(x) = 1/(1 + e^{-x})$ is used as the activation function, the objective function in (6) is instantiated with $f = \sigma$.

If ReLU is adopted as the activation function, (6) is adapted accordingly. Table 2 shows the performance of the proposed ADSNT with the different activation functions on the three databases. From Table 2, one can see that ReLU achieves the best performance. A key reason is that the weight decay term is used in optimizing the objective function.
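For reference, a small sketch showing how the activation choice plugs into the forward-pass sketch of Section 3.1; the factory function and its names are illustrative, and which sparsity form accompanies each activation is left as discussed in the text.

```python
import numpy as np

ACTIVATIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh":    np.tanh,
    "relu":    lambda z: np.maximum(z, 0.0),   # rectified linear unit, f(x) = max(0, x)
}

def make_layer(activation="relu"):
    """Return an encoder-layer function using the chosen activation,
    to be dropped into the ADSNT forward pass sketched in Section 3.1."""
    f = ACTIVATIONS[activation]
    return lambda h, W_l, b_l: f(h @ W_l + b_l)
```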

4.3.5. Timing Consumption Analysis

In this subsection, we use an HP Z620 workstation with an Intel Xeon E5-2609 2.4 GHz CPU and 8 GB RAM and conduct a series of experiments on the AR database to compare the time consumption of the different methods, which is tabulated in Table 3. The training time (seconds) is shown in Table 3(a), while the time (seconds) needed to recognize a face from the testing set is shown in Table 3(b). From Table 3, one can see that the proposed method requires comparatively more time for training because of the initialization of ADSNT and the image reconstruction. However, the training procedure is performed offline. When identifying an image from the testing set, our method requires less time than the other methods.

5. Conclusions

In this article, we present an adaptive deep supervised autoencoder based image reconstruction method for face recognition. Unlike conventional deep autoencoder based face recognition methods, our method considers the class label information of the training samples in the deep learning procedure and can automatically discover the underlying nonlinear manifold structures. Specifically, a multilayer supervised adaptive network structure is presented, which is trained to extract characteristic features from corrupted/clean facial images and to reconstruct the corresponding clean facial images. The reconstruction is realized by a so-called “bottleneck” neural network that learns to map face images into a low-dimensional vector and to reconstruct the respective face images from the mapped vectors. Having trained the ADSNT, a new face image can then be recognized by comparing its reconstruction with the individual gallery images during testing. The proposed method has been evaluated on the widely used AR, PubFig, and Extended Yale B databases, and the experimental results have shown its effectiveness. For future work, we will focus on applying the proposed method to other application fields, such as pattern classification based on image sets and action recognition based on video, to further demonstrate its validity.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This paper is partially supported by the research grant for the Natural Science Foundation from Sichuan Provincial Department of Education (Grant no. 13ZB0336) and the National Natural Science Foundation of China (Grant no. 61502059).