Abstract

Due to variations in viewpoint, pose, and illumination, a given individual may appear considerably different across camera views. Tracking individuals across camera networks with non-overlapping fields of view remains a challenging problem. Previous works mainly focus on feature representation and metric learning individually, which tends to yield suboptimal solutions. To address this issue, we propose a novel framework that performs feature representation learning and metric learning jointly. Different from previous works, we represent each pair of pedestrian images as a single resized input and use a linear Support Vector Machine in place of the softmax activation function for similarity learning. Dropout and data augmentation techniques are also employed in this model to prevent the network from overfitting. Extensive experiments on two publicly available datasets, VIPeR and CUHK01, demonstrate the effectiveness of our proposed approach.

1. Introduction

With the advances in computer vision [1–4], machine learning [5–8], and deep neural networks [9, 10], we have entered an era in which it is possible to build real-world identification systems. Person reidentification (Re-ID) aims to recognize individuals across cameras at different locations and times in a distributed multicamera surveillance system covering large public spaces [11]. Given a probe image captured from one camera, a person reidentification system attempts to identify the person in a gallery of candidate images taken from a different camera. The same person can appear quite different in cross-view cameras (see Figure 1), so it is difficult to find a feature that is both reliable and distinctive and that adapts directly to the changes and misalignment of the cross-view condition. Because of these challenging issues, research in person reidentification still mainly focuses on appearance features, under the acceptable assumption that people do not change their clothing during the monitoring period.

Existing methods on this research topic have primarily focused on two aspects. The first is to extract robust and discriminative feature descriptors to identify persons. Three important cues for person reidentification are color information, texture descriptors, and interest points; some of these features are learned from datasets and others are designed by hand. Low-level features such as biologically inspired features (BIF) [12], color histograms and variants [13–17], local binary patterns (LBP) [13, 14, 17, 18], Gabor features [14], and interest points (color SIFT [19, 20] and SURF [21]) have been proposed to represent the appearance of different people across nonoverlapping cameras. Other works have investigated combinations of multiple visual features, including [13, 14, 16]. The second aspect is to develop metric learning methods to learn discriminative models. The idea of metric learning is to design classifiers that enforce features from the same person to be closer than those from different individuals. Commonly used metric learning methods such as Large Margin Nearest Neighbour (LMNN) [16], Logistic Discriminant Metric Learning (LDML) [22], KISSME [18], and Marginal Fisher Analysis (MFA) [16] have performed well on these challenging issues. These approaches typically extract handcrafted features and subsequently learn the metrics. However, they optimize feature extraction and metric learning separately or sequentially, which easily leads to suboptimal solutions.

In recent years, with the wide use of convolutional neural networks (CNN) in tasks such as object recognition, tracking [23], classification [24], and face recognition [25], CNNs have been shown to possess a strong automatic learning ability. However, CNNs have made comparatively little progress in person reidentification. In this paper, inspired by the outstanding performance on person reidentification and facial expression recognition in [26, 27], we introduce a deep learning architecture with joint representation learning and a linear SVM top layer to measure the similarity of the compared image pairs. We randomly select two pedestrian images and join them horizontally as a single resized input image. The joint representation learning method, which follows [26], reduces the complexity of the network compared with the two input branches used in a Siamese network. We replace the standard softmax layer with an L2-SVM to measure the distance between pedestrians in different cameras and estimate whether the two input pedestrians are the same person or not. Compared with the softmax function, which predicts class labels, the linear SVM measures the distance to the decision boundary, which is more suitable for person reidentification treated as a ranking-like comparison problem. Since the L1-SVM is not differentiable, we adopt the L2-SVM, which is differentiable during optimization and more stable in numerical computation. Pretraining and dropout techniques are also used in the model to prevent overfitting and boost the performance of person reidentification. The major contributions of this paper are twofold: (i) we present a deep learning network that combines joint representation learning with a linear SVM to increase the discriminative power of the CNN; (ii) we conduct extensive experiments on two benchmark datasets to validate the effectiveness of our architecture, achieving the best results.

2. Related Work

The typical workflow of an existing person reidentification system is shown in Figure 2. It indicates that most systems focus on two main components: feature representation and metric learning. The aim of feature representation is to develop discriminative and robust descriptions of the appearance of the same pedestrian across different camera views.

Global features are divided into two categories: color-based and texture-based features. HSV [28] and LAB [29] color histograms are common color-based features, while the LBP histogram [30] and Gabor filters [14] are used to describe image textures. Recently, building on these traditional color and texture features, more distinctive and reliable feature representations for pedestrians have been proposed. Bazzani et al. [31] proposed to represent a person by a global mean color histogram and recurrent local patterns obtained through local epitomic analysis, called the histogram plus epitome (HPE). Farenzena et al. [28] proposed the well-known symmetry-driven accumulation of local features (SDALF) approach, combining weighted HSV histograms of two separated body parts with salient texture and stable color regions. Yang et al. [32] developed the semantic Salient Color Names based Color Descriptor (SCNCD) employing color naming. Local maximal occurrence (LOMO) features [33] with Scale Invariant Local Ternary Pattern (SILTP) histograms analyse the horizontal occurrence of local features and maximize the occurrence to describe the mean information of pixel features. However, it is difficult for handcrafted features to balance discriminative power and robustness, and they are highly susceptible to cross-view variations caused by illumination, occlusions, background clutter, and changes in view orientation.

Besides feature representation, metric learning is also widely applied in person reidentification. Metric learning is formulated to learn an optimal similarity measure from the features of training images that yields strong interclass differences and intraclass similarities. Xiong et al. [34] proposed regularized PCCA (rPCCA), kernel LFDA (kLFDA), and Marginal Fisher Analysis (MFA) for the case where the data space is undersampled. Chopra et al. [35] proposed an algorithm to learn a similarity metric from data. Zheng et al. [36] introduced the Probabilistic Relative Distance Comparison (PRDC) model, which maximizes the probability that a right-match pair has a smaller distance than a wrong-match pair and optimizes the relative distance comparison. Prosser et al. [37] reformulated person reidentification as a ranking problem and proposed the Ensemble RankSVM model, which learns a subspace where the potential true match is given the highest ranking rather than using any direct distance measure.

Recently, deep learning has become one of the state-of-the-art families of recognition algorithms, and CNNs in particular have shown great potential in computer vision tasks. Li et al. [38] proposed a filter pairing neural network (FPNN) that jointly optimizes feature learning, misalignment, occlusions, photometric and geometric transforms, and classification by learning filter pairs that encode photometric transforms. Different from FPNN, which learns a joint representation of two images, Yi et al. [39] proposed a Deep Metric Learning (DML) model inspired by the Siamese neural network that combines separate modules, learning the color features, texture features, and metric in a unified framework. Matsukawa and Suzuki [40] fine-tuned CNN features on a pedestrian attribute dataset to bridge the gap between ImageNet classification and person reidentification and proposed a loss function for classifying combinations of attributes to increase the discriminative power of CNN features. Ahmed et al. [41] presented a deep convolutional architecture with a cross-input neighbourhood differences layer, which captures local relationships between the two input images based on mid-level features from each input, and a subsequent layer that summarizes these differences.

3. Algorithm

In the person reidentification task, one usually needs to measure the similarity between a gallery set and a probe set. CNNs have proven to excel at classification problems rather than comparison problems, so directly applying a standard CNN to person reidentification is unsuitable and makes it hard to leverage its power. In this section, we describe the proposed CNN architecture in detail. The layers and the strategies used in network training are introduced in the following subsections.

3.1. Joint Representation Learning

The standard pipeline of person reidentification consists of extracting features from input images and learning a metric over those features across images. As mentioned above, optimizing feature representation and metric learning separately or sequentially easily leads to suboptimal solutions. Different from this ordinary framework of learning a metric over handcrafted features, we apply joint representation learning to the input images in our network, similar to the deep ranking CNN proposed by Chen et al. [26].

This design is motivated by how humans assess whether two images belong to the same person: by directly comparing the depicted appearances. For instance, suppose pictures A, B, and C show three quite similar but different pedestrians. Taking picture C as the probe image, the discriminative region between A and C is a handbag that appears in C; compared with B, pedestrian A wears a dress while B wears pants. When we compare different pedestrian images separately, some valuable information is ignored or hidden because the appearance features are extracted independently. In our proposed model, jointly representing the two input pedestrian images and generating discriminative information from them replaces the separate two-branch input of previous designs.

3.2. Architecture

Our deep learning network (see Figure 3) is composed of five convolutional layers (C1, C2, C3, C4, and C5) to extract features, three subsampling layers (S1, S2, and S5), and two fully connected layers (F6 and F7). A single branch is used as the input of the network instead of the two branches used in [27]. Different from the architecture proposed in [26], the top layer of our network (L8) is a linear SVM instead of a ranking layer, which is more discriminative for different pedestrians, and we also address the gradient backpropagation learning problem of the linear SVM. Given two randomly selected pedestrian images I and J observed from two cross-view cameras, each with three color channels (RGB), we join them horizontally. Since pedestrian images are not square-shaped and are quite small, both images are resized in the experiment so that the joint image is square, and a random crop of the joint image is presented as the input to the whole network in order to focus on the center areas of the images. Processed in this way, the aspect ratio of the images remains nearly unchanged, and the large number of parameters contained in a Siamese network is avoided.

The first convolutional layer (C1) convolves the input with 96 different filters (see Table 1 for the filter sizes) with a stride of 4 in both the horizontal and vertical directions. The 96 resulting feature maps are then passed through a ReLU layer and a subsampling (max-pooling) layer (S1) that reduces the size of the maps. A Batch Normalization (BN) layer is employed before each ReLU layer, which allows the network to use much higher learning rates and to be less sensitive to the initialization of weights and biases; the resulting feature maps are more robust to illumination variations. If we use $K$ filters, each of size $m \times m \times C$, applied with stride $s$ to an input of size $H \times W \times C$, the output consists of $K$ channels of height $(H - m)/s + 1$ and width $(W - m)/s + 1$. The convolution operation is expressed as

$$f_i^{(l)} = \sigma\!\left(\sum_{j} f_j^{(l-1)} * k_{ij}^{(l)} + b_i^{(l)}\right),$$

where $f_i^{(l)}$ and $f_j^{(l-1)}$ represent the $i$th output channel at the $l$th layer and the $j$th input channel at the $(l-1)$th layer, $k_{ij}^{(l)}$ denotes the convolutional kernel between the $i$th and $j$th feature maps, and $b_i^{(l)}$ is the bias. The function $\sigma$ is the ReLU activation of the network, represented as $\sigma(x) = \max(0, x)$. The max-pooling operation is formulated as

$$y_{i,k} = \max_{p \in R_k} x_{i,p},$$

where $R_k$ represents the pooling region with index $k$.
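As a quick check of the output-size rule above, the following sketch computes the C1 and S1 output dimensions. The concrete numbers (a 227 × 227 crop, 11 × 11 filters with stride 4, 3 × 3 pooling with stride 2) are illustrative AlexNet-style assumptions, not the values from Table 1.

```python
# Worked example of the output-size rule (H - m) / s + 1 (no padding).
# The concrete sizes below are ASSUMED AlexNet-style values.
def conv_output_size(H, m, s):
    return (H - m) // s + 1

c1 = conv_output_size(227, 11, 4)  # -> 55: C1 maps are 55 x 55
s1 = conv_output_size(c1, 3, 2)    # -> 27: after 3x3 max pooling, stride 2
print(c1, s1)
```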

The second convolutional layer (C2) takes the outputs of S1 as input and produces 256 different feature maps (see Table 1 for the filter sizes). The third and fourth convolutional layers (C3 and C4) each produce 384 different feature maps. With the same filter size as C3 and C4, the fifth convolutional layer (C5) produces 256 different feature maps. The two subsampling layers (S2 and S5) repeat the same pooling options as S1. The sixth and seventh fully connected layers (F6 and F7) connect to the neurons of the S5 layer, reduce them to 4096 nodes, and form compact and robust features. A fully connected layer is expressed as

$$f^{(l)} = \sigma\!\left(W^{(l)} f^{(l-1)} + b^{(l)}\right),$$

where $W^{(l)}$ and $b^{(l)}$ are the weights and bias of the $l$th layer. Instead of the traditional softmax layer used in multiclass classification, we use an L2-SVM objective in the top layer (L8) of the whole network for learning the lower-level parameters, finding the max margin between true matches and false matches over the training sample pairs.
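For concreteness, a minimal sketch of the architecture described above is given below in PyTorch (the paper's own implementation uses Theano). The filter sizes marked in the comments are AlexNet-style assumptions standing in for the exact values in Table 1, and the 227 × 227 crop size is likewise assumed; the single output unit is the linear SVM score of Section 3.3.

```python
# Hedged sketch of the network in Section 3.2; filter/crop sizes are ASSUMPTIONS.
import torch
import torch.nn as nn

class JointReIDNet(nn.Module):
    def __init__(self):
        super().__init__()
        def conv_bn_relu(c_in, c_out, k, s=1, p=0):
            # BN is applied before each ReLU, as described above.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, stride=s, padding=p),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv_bn_relu(3, 96, 11, s=4),    # C1 (11x11 assumed), stride 4
            nn.MaxPool2d(3, stride=2),       # S1
            conv_bn_relu(96, 256, 5, p=2),   # C2 (5x5 assumed)
            nn.MaxPool2d(3, stride=2),       # S2
            conv_bn_relu(256, 384, 3, p=1),  # C3
            conv_bn_relu(384, 384, 3, p=1),  # C4
            conv_bn_relu(384, 256, 3, p=1),  # C5
            nn.MaxPool2d(3, stride=2))       # S5
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(0.5),    # F6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # F7
            nn.Linear(4096, 1))              # L8: linear SVM score w^T h

    def forward(self, x):
        # x: two pedestrian images joined horizontally, then randomly cropped
        return self.classifier(self.features(x))

net = JointReIDNet()
print(net(torch.zeros(2, 3, 227, 227)).shape)  # torch.Size([2, 1])
```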

3.3. Linear SVM versus Softmax
3.3.1. Softmax

Softmax is commonly used at the top layer of deep networks; it generalizes logistic regression to the multiclass case. The class labels are formulated as $y \in \{1, \dots, K\}$, where $K$ is the number of classes. Let $h$ be the activation of the penultimate layer and let $W$ be the weight matrix connecting the penultimate layer to the softmax layer. The input to the softmax is

$$a_k = h^{\top} W_k.$$

The probability of class $k$ is defined as

$$p_k = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)},$$

so the predicted class label would be

$$\hat{y} = \arg\max_{k} p_k = \arg\max_{k} a_k.$$
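The decision rule above can be made concrete in a few lines; the dimensions below (a 4096-dimensional penultimate activation and two classes for same/different person) are illustrative assumptions.

```python
# Minimal sketch of the softmax prediction rule; sizes are illustrative.
import torch

h = torch.randn(4096)        # penultimate activation (F7 output)
W = torch.randn(4096, 2)     # weights to a 2-class softmax layer
a = h @ W                    # inputs a_k = h^T W_k to the softmax
p = torch.softmax(a, dim=0)  # p_k = exp(a_k) / sum_j exp(a_j)
pred = torch.argmax(p)       # predicted class label
```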

3.3.2. Linear Support Vector Machine

Softmax is usually used as an activation function focused on classification and is less suitable for the ranking-like comparison issue of person reidentification. So in this paper, we propose to train the CNN with an L2-SVM objective instead of a softmax layer. In linear Support Vector Machines (SVM), the data and labels are represented as $(x_n, t_n)$, $n = 1, \dots, N$, with $t_n \in \{-1, +1\}$, and the linear SVM is defined by the following constrained optimization:

$$\min_{w} \ \frac{1}{2} w^{\top} w + C \sum_{n=1}^{N} \max\!\left(1 - w^{\top} x_n t_n,\, 0\right).$$

This is the typical L1-SVM; a differentiable variant, known as the L2-SVM, is given as follows:

$$\min_{w} \ \frac{1}{2} w^{\top} w + C \sum_{n=1}^{N} \max\!\left(1 - w^{\top} x_n t_n,\, 0\right)^{2}.$$

The L2-SVM is differentiable during optimization and imposes a larger loss on points that violate the margin. The predicted class label of a probe sample $x$ is

$$\hat{t} = \arg\max_{t} \left(w^{\top} x\right) t.$$

We use the L2-SVM as the objective function of our deep network and backpropagate the gradients from the linear SVM layer to learn the parameters of the network. Denoting the L2-SVM objective by $\ell(w)$, the partial derivative with respect to the weight $w$ is formulated as

$$\frac{\partial \ell(w)}{\partial w} = w - 2C \sum_{n=1}^{N} t_n h_n \max\!\left(1 - w^{\top} h_n t_n,\, 0\right),$$

and the gradient with respect to the penultimate activation $h_n$ is given as

$$\frac{\partial \ell(w)}{\partial h_n} = -2C\, t_n\, w\, \max\!\left(1 - w^{\top} h_n t_n,\, 0\right).$$

In this way, a joint-representation-based L2-SVM neural network is obtained; the following section shows its performance on two public datasets.
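A minimal sketch of the squared hinge (L2-SVM) objective as a training loss is shown below, assuming binary labels in {-1, +1} for same/different person; an autograd framework then provides the backpropagated gradients derived above.

```python
# Minimal sketch of the L2-SVM (squared hinge) objective at the top layer,
# assuming binary labels t_n in {-1, +1} (same / different person).
import torch

def l2_svm_loss(scores, targets, w, C=1.0):
    # scores:  (N,) raw outputs w^T h_n of the linear SVM layer
    # targets: (N,) labels in {-1, +1}
    # w:       weight vector of the SVM layer (L2-regularized)
    margins = torch.clamp(1.0 - targets * scores, min=0.0)
    return 0.5 * (w ** 2).sum() + C * (margins ** 2).sum()
```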

3.4. Training Strategies Used in CNN

Dropout. During training, randomly dropping units, along with their connections, from the neural network is an efficient technique to prevent overfitting; it approximately combines exponentially many different network architectures efficiently. Dropout is usually performed during supervised training, and the network is thereby forced to learn an averaging model. In this paper, we apply dropout to the two fully connected layers (F6 and F7), randomly dropping 50% of the neurons in these two layers.

Data Augmentation and Data Balancing. Data augmentation is a widely used trick in deep learning. Neural networks need a huge number of training images to achieve satisfactory performance, but the public datasets used in person reidentification usually contain limited numbers of images. Moreover, in the training set, the positive pairs (matched sample pairs) are generally far fewer than the negative pairs (nonmatched sample pairs). Data augmentation therefore boosts performance when training the deep network. In the training set, we randomly crop patches from the input images and flip them horizontally around the y-axis; the augmented data are used as new inputs to our network, as sketched below. To achieve data balancing, at the very beginning of training we sample online the same number of positive and negative pairs, a 1 : 1 positive-negative ratio, in each minibatch of 32 images. Once the whole network reaches a reasonably good configuration after this initial training, the positive-negative ratio is gradually increased to 1 : 5 to alleviate overfitting.
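The sketch below illustrates the augmentation and balanced pair sampling described above, using torchvision transforms; the crop size and the helper names are illustrative assumptions, since the paper does not give these implementation details.

```python
# Hedged sketch of augmentation and pair balancing; crop size and helper
# names are ASSUMPTIONS for illustration.
import random
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop((224, 224)),       # random patch crop (size assumed)
    T.RandomHorizontalFlip(p=0.5),  # flip around the vertical (y-) axis
    T.ToTensor(),
])

def sample_minibatch(pos_pairs, neg_pairs, batch_size=32, neg_ratio=1):
    # 1:1 positive-negative ratio early in training; pass neg_ratio=5
    # later to move toward the 1:5 ratio used to alleviate overfitting.
    n_neg = batch_size * neg_ratio // (1 + neg_ratio)
    n_pos = batch_size - n_neg
    return random.sample(pos_pairs, n_pos) + random.sample(neg_pairs, n_neg)
```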

Stochastic Gradient Descent. Our model is trained using minibatch stochastic gradient descent (SGD) for faster backpropagation and smoother convergence. In each iteration of the training phase, a minibatch of 32 images is the input to the network. We use SGD with a momentum of 0.9 and a weight decay of 0.0005; the learning rate is decreased by a constant factor every 10000 iterations.
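As a sketch, the optimizer configuration might look as follows in PyTorch; the base learning rate of 0.01 and the decay factor of 0.1 are assumptions, since those two values are not preserved in the source text.

```python
# Sketch of the SGD settings above (momentum 0.9, weight decay 0.0005,
# minibatch size 32). The base learning rate (0.01) and the per-10000-
# iteration decay factor (0.1) are ASSUMPTIONS.
import torch

model = JointReIDNet()               # architecture sketch from Section 3.2
model(torch.zeros(1, 3, 227, 227))   # dummy forward initializes lazy layers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# Stepping this scheduler once per iteration decays the rate every 10000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
```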

Pretraining and Fine-Tuning. The network proposed in this paper is quite deep, so a large number of labeled images are needed to train it. Before training on the VIPeR and CUHK01 datasets, we use the CUHK02 dataset to learn a pretrained model. When testing on the different datasets, we fine-tune the top few layers of the pretrained model with a small learning rate.
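A hedged sketch of this fine-tuning strategy is given below: freeze the pretrained convolutional layers and retrain only the top layers with a small learning rate. The layer names follow the earlier architecture sketch and are assumptions, not the paper's code.

```python
# Hedged sketch of fine-tuning: freeze pretrained conv layers, retrain the
# top (fully connected and SVM) layers with a small learning rate.
import torch

model = JointReIDNet()               # pretrained on CUHK02 in the paper
model(torch.zeros(1, 3, 227, 227))   # dummy forward (assumed crop size)
for p in model.features.parameters():
    p.requires_grad = False          # keep pretrained conv layers fixed
optimizer = torch.optim.SGD(model.classifier.parameters(),
                            lr=1e-4, momentum=0.9, weight_decay=0.0005)
```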

4. Experiments

Our proposed network is implemented in the Theano deep learning framework and trained on an NVIDIA TITAN X GPU. We evaluate the proposed method on widely used person reidentification datasets and compare it with state-of-the-art approaches. The results are shown as Cumulative Matching Characteristic (CMC) curves, and the cumulative matching scores are listed in Tables 2–9.

4.1. Datasets and Evaluation Protocol

Datasets. We evaluate our method on two public datasets: the VIPeR dataset and the CUHK01 dataset; the deep model is pretrained on the CUHK02 dataset. VIPeR is a relatively small and quite challenging person reidentification dataset. It contains 632 pedestrian pairs captured by two camera views in an outdoor environment. Each pair contains two images of the same person seen from different viewpoints, one from Cam A and one from Cam B; images in Cam A are mainly captured from 0 to 90 degrees, while images in Cam B are from 90 to 180 degrees. All images are normalized to 128 × 48 pixels.

The CUHK01 dataset is larger than VIPeR, containing 971 persons captured by two cross-views with 3884 images in a campus environment. Camera views A and B each include two images of the same person; view A captures the frontal or back view of the individuals, while view B captures the profile view. All images are scaled to 160 × 60 pixels. The CUHK02 dataset contains five pairs of views (P1–P5); images from P2–P5 were used to learn the pretrained model.

Evaluation Protocol. In each experiment, we randomly divide the dataset into a gallery set and a probe set. The image pairs used by the network are of two kinds: positive pairs, created from images of the same person in different camera views, and negative pairs, created from two different people. Specifically, for the VIPeR dataset, we split the individuals into gallery/probe sets of 316/316. For the CUHK01 dataset, we use 485 pedestrians for training and 486 for testing. We compare our method with state-of-the-art methods on the VIPeR and CUHK01 datasets. The whole procedure is repeated ten times, and the average Cumulative Matching Characteristic (CMC) curve is used to evaluate the performance of the different approaches.
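For reference, a minimal sketch of computing a CMC curve from a probe-gallery similarity matrix is given below; the array names are illustrative, and the similarity scores stand in for the network's SVM outputs on joint probe-gallery image pairs.

```python
# Minimal sketch of single-shot CMC evaluation: rank the gallery for each
# probe by similarity and record where the true match appears.
import numpy as np

def cmc_curve(scores, probe_ids, gallery_ids, max_rank=25):
    # scores[i, j]: similarity of probe i to gallery item j (higher = better)
    hits = np.zeros(max_rank)
    for i in range(scores.shape[0]):
        order = np.argsort(-scores[i])  # best match first
        rank = int(np.where(gallery_ids[order] == probe_ids[i])[0][0])
        if rank < max_rank:
            hits[rank] += 1
    return np.cumsum(hits) / scores.shape[0]  # rank-k matching rates
```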

4.2. Comparison with Feature Representation
4.2.1. Experiments on VIPeR Dataset

In this experiment, we pretrained the network model on the CUHK02 dataset and randomly divided the 632 image pairs of VIPeR into one half for training and the other half for testing. We compare our proposed approach with the following three available and typical person reidentification features: Ensemble of Local Features (ELF) [42], gBiCov [12], and the HSV with Lab and LBP feature proposed in [18]. In the experiment, we used the ELF6 implementation from [42].

We compared our proposed method with these three kinds of features; the resulting CMC curves and top-ranked matching rates are shown in Figure 4(a) and Table 2. From Figure 4(a), it can be observed that our approach gives the best result: compared with the three baseline methods, our approach gains over 20% at rank-1, and the gap grows as the rank number increases. As shown in Table 2, our proposed method achieves a 34.15% rank-1 matching rate, outperforming ELF6 (8.73%), gBiCov (9.87%), and HSV_Lab_LBP (12.47%). In our method, feature learning is performed directly on the input images, avoiding the loss of critical information that occurs when extracting handcrafted features. This confirms that using a deep convolutional neural network for joint feature representation learning and similarity measurement is an effective solution for person reidentification tasks.

4.2.2. Experiments on CUHK01 Dataset

Using the same CUHK02 pretraining strategy as on the VIPeR dataset, we chose the following approaches as baselines: ELF18 [42], gBiCov [12], and the Local Maximal Occurrence (LOMO) representation [33]. The ELF18 feature is computed in the same way as ELF6 but from eighteen equally divided horizontal stripe histograms rather than six.

The comparison results are shown in Figure 5(a) and Table 3. Our method outperforms the three feature representation methods by a large margin, over 40% at all ranks, which again validates its effectiveness. Notably, our method achieves a 50.01% rank-1 matching rate, outperforming gBiCov (7.25% rank-1) by a more sizeable margin than on VIPeR. The main reason for the superior performance on CUHK01 is that VIPeR contains fewer positive pairs; even with our data augmentation strategy, it lacks sufficient training data to train a robust network. Compared with VIPeR, CUHK01 is larger in scale and provides more training data to feed the deep network and learn a data-driven optimal framework.

4.3. Comparison with Metric Learning Algorithms
4.3.1. Experiments on VIPeR Dataset

We evaluated the proposed algorithm against several metric learning algorithms, including ITML [43], Euclidean [38], LMNN [16], KISSME [18], and RDC [44]. The resulting Cumulative Matching Characteristic (CMC) curves are shown in Figure 4(b); our proposed method outperforms the compared metric learning algorithms. To present the quantitative comparison more clearly, we summarize the performance at several top ranks in Table 4. Our approach achieves a 34.15% rank-1 matching rate, outperforming KISSME by nearly 10% at all ranks. The main reason for this superior performance is that our framework jointly optimizes representation learning and the SVM objective rather than requiring two separate optimization steps.

4.3.2. Experiments on CUHK01 Dataset

We compare our proposed method with the same methods that were evaluated on the VIPeR dataset. Figure 5(b) plots the CMC curves, and Table 5 shows the ranking results of all methods on CUHK01. Our method outperforms the state-of-the-art methods with a rank-1 recognition rate of 50.01% (versus 29.40% for the next best method). The second best method on this dataset is KISSME: our method performs best at ranks 1, 5, and 10, whereas KISSME is better at rank-20 and rank-25. Even at those ranks, our proposed method still performs competitively.

4.4. Comparison with Other State-of-the-Art Algorithms
4.4.1. Experiments on VIPeR Dataset

We compare the performance of our algorithm with the following approaches: KLFDA [34], PCCA [45], rPCCA [34], SVMML [46], MFA [16], SSCDL [47], eSDC [29], RankSVM [37], aPRDC [48], L1-norm [49], and L2-norm. Figure 4(c) and Table 6 show the CMC curves and matching rates comparing our method with these state-of-the-art methods. Our method gives the best result among these algorithms, achieving a 34.15% rank-1 matching rate and outperforming KLFDA at 32.33%. The next best performing method on the VIPeR dataset is MFA, which achieved a 32.24% rank-1 matching rate. Our method performs best at ranks 1, 5, and 10, whereas KLFDA and MFA perform better at ranks 15, 20, and 25. These results suggest that even though our model suffers from a severe lack of training data, it still achieves state-of-the-art performance on the highly challenging VIPeR dataset.

4.4.2. Experiments on CUHK01 Dataset

We compare our method with several state-of-the-art approaches on the CUHK01 dataset, including KLFDA [34], SVMML [46], MFA [16], SDALF [29], L1-norm [49], and L2-norm. As shown in Figure 5(c) and Table 7, our method outperforms KLFDA and MFA by a larger margin at all ranks on CUHK01 than on VIPeR. This suggests that a larger training dataset improves the learning ability of the CNN.

The experimental results on both the VIPeR and CUHK01 datasets clearly indicate that our proposed CNN method outperforms these feature representation and metric learning algorithms, particularly when sufficient training data are provided. In our method, feature learning is performed directly on the input images. The joint input branch in the lower layers of the framework gradually transforms the input images into a higher-level representation with more refined features, without drastic feature reduction, and the linear SVM classifier layer effectively measures the similarity of the resulting representations of people's appearances.

4.5. Comparison with CNN-Based Algorithms

In this section, we compare our method with five deep learning based person reidentification algorithms: FPNN [38], ImageNet + XQDA [40], FFN + XQDA [40], Deep_CNN [50], and DML [39]. The ImageNet + XQDA algorithm combines ImageNet features with XQDA metric learning; we compare our method with it on both the VIPeR and CUHK01 datasets. The FPNN and FFN + XQDA network models were trained on the large-scale CUHK dataset because the other existing datasets are too small to train deep networks; therefore, we compare our method with these two networks on CUHK01, and with DML on VIPeR. Note that FPNN was evaluated on CUHK01 under a different setting, with 871 pedestrians chosen for training and 100 for testing. Figures 4(d) and 5(d) and Tables 8 and 9 show the results of these experiments: our method still achieves the best performance among the CNN-based approaches. The rank-1 matching rate of our method exceeds that of ImageNet + XQDA by more than 10% on both VIPeR and CUHK01, far surpassing FPNN and Deep_CNN, which achieved only 27.87% and 12.5%, respectively.

4.6. Superiority of Joint Representation Learning

Many previous deep learning works on person reidentification share a common input framework in which features are extracted from the two images separately. As mentioned above, joint representation learning avoids losing information that is ignored or hidden when features are extracted independently. To validate the effectiveness of our proposed framework, we compare it with a two-branch variant on the VIPeR and CUHK01 datasets. The CMC curves in Figure 6 show that the joint representation learning method consistently surpasses the two-branch method, demonstrating that the good performance of our method depends on joint representation learning.

4.7. Superiority of Linear SVM Layer

In this paper, we introduce a linear SVM to replace the traditional softmax activation function for measuring the similarity of the compared pair. We also perform experiments to evaluate the contribution of the linear SVM layer: we replace the last linear SVM layer with a softmax layer, leaving the other layers unchanged, so that the deep network assesses whether two input images belong to the same person. The experiments are conducted on the CUHK01 dataset. The results in Figure 7 show that the linear SVM layer is more suitable for the person reidentification problem than the softmax layer.

5. Conclusion

In this paper, we presented an effective linear Support Vector Machine network based on joint representation for person reidentification. The proposed model introduces an L2-SVM in place of the traditional softmax layer to handle the ranking-like comparison problem. Instead of using a Siamese network to train on a pair of input images, we use a joint representation learning strategy that avoids designing a new network architecture with two entrances. Extensive experiments on two challenging person reidentification datasets (VIPeR and CUHK01) demonstrate the effectiveness of our proposed approach. In the future, we intend to adapt our method to video sequence data and to improve the efficiency of reidentification.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Unmanned Equipment Intelligent Control Support Software System (Grant no. 2015ZX01041101).