Abstract

This paper presents a multi-information flow convolutional neural network (MiF-CNN) model for person reidentification (re-id). It contains several specific multilayer convolutional structures, where the input and output of a convolutional layer are concatenated together on channel dimension. With this idea, layers of model can go deeper and feature maps can be reused by each subsequent layer. Inspired by an image caption, a person attribute recognition network is proposed based on long-short-term memory network and attention mechanism. By fusing identification results of MiF-CNN and attribute recognition, this paper introduces the attribute-aided reranking algorithm to improve the accuracy of person re-id further. Experiments on VIPeR, CUHK01, and Market1501 datasets verify the proposed MiF-CNN can be trained sufficiently with small-scale datasets and obtain outstanding accuracy of person re-id. Contrast experiments also confirm the availability of the attribute-assisted reranking algorithm.

1. Introduction

Person reidentification (re-id) refers to matching and recognizing the identities of pedestrians captured by multicameras with nonoverlapping views, which is significant to improve the efficiency of the security system. Owing to the low resolution of cameras, it is hard to obtain discriminative face features, so the current person re-id methods are mainly based on visual features of pedestrians, such as color and texture [1]. In practice, changes in viewpoint, pose, and illumination among different camera views, as well as partial occlusions and background clutters, pose a great challenge to person re-id [2].

Two principal person re-id methods are feature representation and metric learning. Feature representation seeks to find features with stronger discrimination and better robustness to represent pedestrians. Many kinds of features have been utilized for this, in which appearance features are the simplest and the most popular ones. Color, texture, and shape are the features that can be extracted for human appearance [3]in feature representation, such as HSV color histogram, LBP texture, and Gabor features, and then used for reidentifying people with similarity among pedestrian features. Attribute features are also widely used in person re-id. Common attributes include gender, length of hair, and clothing. These attributes are highly intuitive and understandable descriptors which have proved to be successful in several tasks, such as face recognition and activity recognition [4]. Although attribute features are complicated in terms of extraction and expression, they contain rich semantic information and are more robust to illumination and viewpoint changes. Therefore, the combination of attribute features and low-level features can effectively improve the accuracy of person re-id [5]. The metric learning methods employ the machine learning algorithm to learn a good similarity metric, which makes the feature similarity of the same pedestrian greater than that of different pedestrians.

In recent years, deep learning has shown great success in a variety of tasks in image classification and frequency domain [6], where CNN is particularly outstanding. Compared with the traditional methods, CNN has stronger feature learning ability, and the learned features are more intrinsically representative to the original data, so it has better performance in extracting image features. Two types of CNN models are commonly employed in the community. The first type is the classification model as used in image classification and the second is the Siamese model using image pairs or triplets as input [7]. Most of the existing public datasets of person re-id only contain thousands of pedestrian image samples; a small number of training samples can easily lead to overfitting, which limits the performance of person re-id model. In addition, the deep neural networks for person re-id are similar in structure; that is, the feature maps extracted by convolutional layer are directly fed into the next convolutional layer [812]. Such structure usually ignores the correlation among features of each layer, thus reducing the mobility of feature information to some extent. In the process of back propagation, as the number of layers in neural network deepens, the gradient update information may attenuate in exponential form and cause vanishing gradient problem.

This work proposes to develop a modified deep neural network model for person re-id that could reduce overfitting caused by the lack of training samples. Moreover, this work aims to improve the identify accuracy of person re-id network with assistance of pedestrian attribute recognition. To this end, contribution of this paper is three-fold: first, this paper designs a multi-information flow convolutional neural network (MiF-CNN) to solve the person re-id problem. The network contains a series of multi-information flow convolution structures which connect the input and output of each convolutional layer together, realizes the reuse of features, and enhances the feature information flow and gradient back propagation of the entire network. Second, this paper designs a person attribute recognition network (PARN) based on long-short-term memory (LSTM) network and attention mechanism. The PARN decodes pedestrian visual features extracted by MiF-CNN into attribute features and outputs the attribute words of each person. Third, this paper presents an attribute-aided reranking algorithm which rematches attribute features among samples to aid more positive samples rank higher in rank list so as to improve the identify accuracy further.

The rest of this paper is organized as follows. Section 2 reviews the state of the art for person re-id. Section 3 introduces the details of MiF-CNN. Section 4 shows the principle of the PARN. The proposed attribute-aided reranking algorithm is detailed in Section 5. The experimental results and analysis are given in Section 6. Finally, conclusion and future works are discussed in Section 7.

The early person re-id methods extract the manually designed features to represent pedestrians. Farenzena et al. divided pedestrian images into multiple areas and extracted three complementary kinds of features, weighted color histograms, maximally stable color regions, and recurrent high-structured patches. Then, match these features and measure the similarity between pedestrian pair [13]. Yang et al. proposed a novel salient color names based color descriptor (SCNCD), which was utilized to guarantee that a higher probability will be assigned to the color name near to the color. Based on SCNCD, color distributions in different color spaces were fused into feature representation for person [14]. Bazzani et al. proposed asymmetry-based HPE descriptor, which accumulated HSV histogram of multiple pedestrian images as a global appearance feature and detected patches portraying highly informative recurrent ingredient in local regions as local feature [15]. Wu et al. designed a novel gradient self-similarity (GSS) feature based on HOG to capture the patterns of pairwise similarities of local gradient patches. The combination of HOG and GSS achieved improvement in person re-id accuracy [16].

Apart from manually designed low-level features, attribute features that represent mid-level semantic information apply to person re-id as well. Compared with low-level descriptors, attributes are more robust to image translations [7]. Layne et al. labeled 15 binary attributes for the VIPeR dataset and trained SVM to detect attributes. They also learned a weighted L2-norm distance metric to fix each attribute and fused them with low-level visual features [17]. Wang et al. predicted complete attribute vector by exploiting both visual feature and marked attributes and obtained the overall ranking list by fusing the rank result from visual features and attribute vectors separately [18]. Chen et al. learned attribute of person by part-specific CNN and merged them with another identification CNN embedding in a triplet structure for person re-id task [19]. Wang et al. proposed a deep neural network that contains an auto-encoder model to learn hidden attributes of person from visual feature in an unsupervised manner, which alleviated the requirement of massive annotation [20].

Deep learning has become popular for solving person re-id problems in recent years. Ahmed et al. present a deep CNN architecture for person re-id. The architecture computed differences in feature values across the two views around a neighborhood of each feature location to add robustness to positional differences in corresponding features of the two input images [12]. Cheng et al. proposed a novel multichannel CNN. After the first layer of CNN, features were divided into four equal parts that aimed to learn features for the respective body part. The proposed CNN was trained with improved triplet loss function [21]. Lin et al. used ResNet [22] as the base network to learn low-level features and attributes jointly, and trained network with combining the person re-ID loss and attribute prediction loss [23]. Yan et al. proposed an attention block which learned par-level attention on different local regions, and integrated the proposed block into existing CNN structures for training with the identify loss [24]. Inspired by above works, this paper proposes a multi-information flow convolutional neural network to extract discriminative pedestrian features. In addition, this paper designs a person attribute recognition network based on LSTM and attention mechanism for improving person re-id results with assistance of attributes recognition.

3. Multi-Information Flow Convolutional Neural Network

The proposed MiF-CNN solves the person re-id problem with classification thought. The overall network structure is shown in Figure 1. The structure includes 2 shallow convolutional layers, 3 multi-information flow convolutional structures with novel connection pattern, fully connected layers, max pooling layers, and classification output layers. Low-level features of pedestrian images are extracted first by 2 shallow convolutional layers. After deeper multi-information flow convolutional structures, MiF-CNN extracted higher level features. The final discriminative pedestrian feature vectors are obtained after reducing dimensions by pooling layers and integrating by fully connected layers.

3.1. Features Extraction

In MiF-CNN, all convolutional filters are 3 × 3 with stride 1. Batch normalization and ReLU activation function are applied after each convolutional layer. The operation process of convolutional layer can be formulated aswhere is the feature map from the convolutional layer, is the convolutional output of , is the feature map from the convolutional layer, is the filter on the feature map in the convolutional layer, and represents the convolutional operation. The process of convolutional layers extracting features is that neuron on the j-th feature map in the convolutional layer sum each feature map after connecting and convolution by filter , and map the extracted features on feature map in the convolutional layer. is the ReLU activation function, which is formulated as . Because of batch normalization, bias is ignored.

3.2. Multi-Information Flow Convolutional Structure

In this structure, both output and input of the current convolutional layer are concatenated together and fed into the next convolutional layer; i.e., the input of each layer is the connection combination of outputs from all previous layers. The detail of multi-information flow convolutional structures is shown in Figure 2.

This connection pattern makes feature maps of each layers be reused by all subsequent layers in forward propagation process, which makes the whole CNN model learn more feature information of pedestrian images. It can be considered as a special “Data Augmentation” in feature maps so as to enhance the information mobility of the network. In back propagation process, gradient of input in each layer contains derivative of loss function with respect to input, which makes propagation of gradient more effective and network easier to be trained. In multi-information flow convolutional structure, the number of feature maps that each layer outputs is a constant value , so the number of feature maps that layers outputs is , where is the number of feature maps in the initial layer. Supposing the feature map of channel in the initial layer is , where , then the feature map of channel that initial layer outputs can be expressed aswhere is the weight of the initial layer. The output after activation function iswhere . The feature map that the layer outputs can be expressed aswhere , . The output of the layer after concatenate operation iswhere , represents the concatenate operation on channel dimension.

In the process of back propagation, supposing is the derivative of loss function with respect to . Due to containing and , it produces two parts of gradient as shown below:where is the gradient of output from the layer after activation function. is the gradient of output from the layer. The gradient of weight in the layer iswhere is the gradient of the convolution result in the layer, is the derivative of the activation function with respect to , and is the gradient of weights in the layer. The network utilizes to update weights of each layer, which is formulated aswhere is the learning rate. Gradient keeps back propagation to the layer. The gradient of is

As shown in equations (6) and (9), the loss function produces two flows of gradient with respect to outputs of each convolutional layer, which makes error information propagating more effective in network and restrains the vanishing gradient to a certain extent.

The proposed MiF-CNN includes three multi-information flow convolutional structures. Between every two multi-information flow convolutional structures, a middle pooling layer is applied to compares the redundancy features. Hyperparameter is set with a small value. When it comes to a new multi-information flow convolutional structure, is doubled. Such a design makes each convolution layer learn a small quantity of features and reduce the redundant features so as to optimize the efficiency of network. With deeper layers, the network can learn more high-level and complex pedestrian features and improve the final identification accuracy.

3.3. Loss Function

Current deep learning algorithms usually use cross-entropy loss as cost function, which is formulated aswhere is the ground truth of pedestrian categories in training set, is the parameter of the last fully connected layer, is the feature vector of training samples, k is the number of pedestrian categories in the training set, and M is the batch size.

However, in practice, when using cross-entropy loss merely, if the quality of extracted features is not good enough, it will lead to intraclass distance being greater than interclass distance. Aiming at this problem, Wen et al. proposed center loss in 2016 [25]. Combination of cross-entropy loss and center loss can enhance the discrimination and generalization ability of the network. Center loss is defined as follows:where is the center of the pedestrian feature and is the feature vector of pedestrian. Center loss minimizes the distance between feature and its center in order to reduce the intraclass distance. Center is updated with equation (12):where is the update rate, is 1 if prediction equals to ground truth, otherwise is 0. That is to say, center is updated only when network predicts correctly.

4. Person Attributes Recognition Network

The rank list of MiF-CNN is shown in Figure 3. In the incorrect identification results of rank 1 (pedestrian B and pedestrian C), there is a big difference in attribute features between the top-ranking negative samples and the query image, including gender, clothing, whether carrying handbag or not, and so on. Hence, this paper recognizes person attributes for improving accuracy of person re-id.

Based on Encoder-Decoder idea, Xu et al. [26] proposed a neural network model that can learn and generate the content of images. The model utilized CNN as encoder to extract features of images which were then fed into a recurrent neural network (RNN) for decoding into language captions of images. Inspired by that, this paper presents a person attribute recognition network (PARN) with LSTM and attention mechanism. The proposed PARN takes pedestrian features that are extracted by MiF-CNN as input and outputs the attributes information of pedestrian images. The architecture of PARN is demonstrated in Figure 4.

4.1. Input of PARN

In PARN, the input is the feature maps that are before the last fully connected layer in the MiF-CNN. The input feature is split into n feature vectors, each of which corresponds to a part of the image. Each feature vector is a DX-dimensional vector which is represented as

Referring to the natural language processing method, PARN transforms words in person attribute labels into word embedding. As a part of the input of PARN, each word embedding is a DY-dimensional vector:where m is the number of attribute words in each pedestrian image and is the word embedding corresponding to each attribute word.

For attribute words, common one-hot encoding considers each word as an individual, which ignores the correlation among words. However, word embedding represents each word as a continuous dense vector, which makes those correlative words closer in space.

4.2. Attention Mechanism

Attention mechanism has been widely used in natural language processing and computer vision. By measuring the correlation between the output and different parts of the input, attention mechanism gives different weights to different parts of the input, enabling the network to use more important feature information for prediction and reduce the dimension of input data [27].

In practice, a certain attribute of pedestrian is only corresponding to a certain part of the image. For example, when recognizing whether a pedestrian is wearing a hat, people only pay attention to the area above the head of the pedestrian usually, instead of other areas irrelevant to the attribute. Therefore, before feeding the image features into the LSTM network, attention mechanism is introduced to calculate the correlation between different positions of image features and the hidden state of LSTM at the previous time. The schematic diagram of attention mechanism is shown in Figure 5.

At time t, the fully connected layer and tanh function are used to integrate the information of the input feature vector and the hidden state of LSTM at the previous time, which is formulated aswhere and represent the fully connected layer weights of and , respectively. The attention weights is obtained by supplying softmax function on score vector :where represents the fully connected layer weights of . The output is the weighted sums of all :

Feature highlights the local features that are helpful to predict and suppress the other local features that make small contribution to prediction, which reduces the dimension of features to some extent and makes LSTM network focus on the part of input features that have greater correlation with prediction while recognizing person attributes so that improves the efficiency and prediction accuracy of the network.

4.3. LSTM

Normal RNN updates network parameters with back propagation, which easily suffers from the vanishing gradient when there is a long gap between relevant information and the current position to predict. LSTM network solves this problem well because of its own structure advantage. The architecture of LSTM unit is shown in Figure 6.

Here, c is the cell state for storing and transferring information, f is the forget gate which decides what should be abandoned in the cell state, i is the input gate which decides what should be stored in the cell state, represents a vector of new candidate values that should be added to the cell state, o is the output gate which decides what parts of the cell state should be output to the next time, h is the hidden state of LSTM, x is the input of LSTM, and t denotes the current time.

In PARN, at any time t, the input of LSTM consists of two parts: word embedding at the previous time and image features at the current time. The operation processing of LSTM in PARN is formulated aswhere W and b represent the weights and biases of each gate, respectively. Sigmoid function outputs a number between 0 and 1 to decide the quantity scale of values that go through these gates. Tanh function is the activation function of input. This design makes LSTM give different weights for information at different times so that it can choose what part of information to remember or forget in a long-term sequence.

The time step of LSTM in PARN is set as the number of attributes of each pedestrian image m. The output hidden state is calculated by a fully connected layer and softmax function at each time step to predict pedestrian attribute words.

It is worth noting that, at each time step, the input word embedding of LSTM is the one that LSTM learned at the previous time step, which takes advantage of the LSTM when solving the long-term sequence problem. Because of the correlation among attributes of pedestrian, like the pedestrian owning attribute “male” generally own attribute “short hair” and attribute “pants,” LSTM can utilize the previous attribute result to predict the current attribute more accurately.

4.4. Network Optimization

The loss function of PARN includes two parts, one is the cross-entropy between prediction and ground truth of network, which is expressed aswhere is the prediction value of the network, is the ground truth of pedestrian labels, m is the time step of LSTM, and is the total number of attribute words in each dataset. As shown in equation (19), the loss of a single image in single epoch is the cumulated loss value after m iterations. The other one is the loss function of attention mechanism as described in [40]:where represents the attention weights of the image feature and n is the number of parts in the image feature. The object function of PARN to optimize iswhere M is the batch size and is the rate of in the object function.

5. Attribute-Aided Reranking Algorithm

Given a probe image and gallery image set including N pedestrian images, the initial similarity distance between and is computed with the Euclidean distance aswhere and represent the features of and , respectively, that extracted by MiF-CNN. The initial rank list is obtained by sorting similarity distance in ascending order, where . If the top-1 gallery is the positive sample, it means rank 1 is correct. If the positive sample is not the top-1 gallery but the top-5 gallery, it means rank 5 is correct.

The attribute-aided reranking algorithm reranks the rank list according to the attribute feature similarity between and so that more positive gallery gets higher ranking in , which could improve the performance of person re-id. To be specific, when rank 5 is correct but rank 1 is not, the proposed algorithm distributes the initial feature score for top-5 gallery according to their ranking in , the higher the ranking, the higher the . Then, the proposed algorithm distributes attribute score for according to the number of their attributes that are the same as , the more same attributes, the higher . The total score of each gallery is , where is the score weight. The reranking rank list is obtained by reranking in descending order of their total score . For M query images, the whole processing of the attribute-aided reranking algorithm is shown as Algorithm 1.

Input: Initial rank list R0
Output: The reranking rank list
(1)for i = 1, 2, …, M do
(2)if rank 1 correct
(3)
(4)end if
(5)elif rank 5 correct
(6)Distributing initial feature score for top-5 gallery according to their ranking in
(7)Distributing attribute score for according to the number of their attributes that are the same as
(8)
(9)Reranking in descending order of their total score and obtain
(10)end if
(11)elif rank 10 correct
(12)Changing to , repeat 6–10
(13)end if
(14)else
(15)Changing to , repeat 6–10
(16)end if
(17)end for

6. Experiments

This paper evaluates the attribute recognition accuracy of PARN on person re-id public datasets first and then compares our results with other person attribute recognition methods. Then, this paper evaluates the performance of MiF-CNN on three people re-id datasets and the availability of the proposed attribute-aided reranking algorithm for improving the accuracy of person re-id. The analysis of experiment results is also given after comparing our results with the state-of-the-art person re-id methods.

6.1. Datasets and Evaluation Protocol

This paper evaluates the proposed methods on three challenging person re-id public datasets including VIPeR [28], CUHK01 [29], and Market1501 [30].VIPeR Dataset. It contains 1264 images of 632 persons. Because of the lack of image samples, low-resolution, a great change in pose, view, and illumination, it is one of the hardest difficult person re-id datasets. In experiment, the VIPeR dataset is randomly split into two parts: images of 316 persons for training and the images of remaining 316 persons for testing.CUHK01 Dataset. It consists of 3884 images of 971 persons which are captured by two cameras with different views. 485 pedestrians are randomly chosen for training and other 486 pedestrians for testing.Market1501 Dataset. It is one of the largest person re-id datasets that include 32268 images of 1501 persons. Each person has a number of images that are captured by six cameras with different views. The dataset is divided into two parts: 12936 images of 751 persons for training and 19732 images of 750 persons for testing.Evaluation Protocol. For the single-shot datasets VIPeR and CUHK01, cumulative match characteristic (CMC) is used to record the ranks of correct identify [31]. For the multishot dataset Market1501, apart from CMC, mean Average Precision (mAP) is utilized to evaluate the performance of the proposed methods.

6.2. Experimental Results on Attribute Recognition

The attributes in PARN refer to pedestrian identity-level attributes. For the VIPeR dataset, attributes include “gender,” “length of hair,” “lower clothing,” “upper clothing,” “backpack or not,” and “carrying anything or not.” For the CUHK01 dataset, attributes include “gender,” “length of hair,” “backpack or not,” “handbag or not,” “color of upper,” and “color of lower.” For the Market1501 dataset, attributes contain “gender,” “length of hair,” “wearing hat or not,” “lower clothing,” “backpack or not,” “handbag or not,” “length of sleeve,” and “length of lower clothing.” Each person in each dataset owns an attribute word list as shown in Figure 7.

This paper evaluates the attribute recognition accuracy of PARN on the Market1501 dataset and compares our results with outstanding person attribute recognition method APR [22]. The experiment results are reported in Table 1, where “L.slv” represents the “length of sleeve” and “L.low” represents the “length of lower clothing.”

It can be observed from Table 1 that PARN obtains superior recognition accuracy of 8 attributes, especially “length of hair,” “handbag or not,” and mean accuracy are higher than APR, whereas accuracy of other attributes is closer with APR. Comparison with APR demonstrates the outstanding attribute recognition performance of PARN which plays an important role in improving the accuracy of person re-id.

6.3. Experimental Results on Person re-id

This paper evaluates the performance of the proposed methods on three datasets. The experimental results of the proposed methods and other state-of-the-art methods are presented in Table 2.

As shown in Table 2, the proposed MiF-CNN model obtains a great performance among state-of-the-art methods, and on the contrary, comparison between MiF and MiF + PARN demonstrates that the proposed attribute-aided reranking algorithm is helpful to increase the person re-id accuracy. The improvement effect is especially obvious on VIPeR and CUHK01 datasets, where the identification accuracy of rank 1, rank 5, and rank 10 improves by 6.02%, 6.33%, and 2.22% and 4.94%, 4.94%, and 2.26%, respectively. The proposed MiF-CNN model with the attribute-aided reranking algorithm gets the best rank 5 accuracy on the VIPeR dataset, the best rank 1 and rank 5 accuracy on the CUHK01 dataset, and the best mAP on the Market1501 dataset among various methods.

6.4. Analysis of Experimental Results

Because of the small quantity of training samples in VIPeR and CUHK01 datasets, a deep CNN model is hard to be trained and easily suffers from overfitting. To handle this, the FT-JSTL + DGD method based on deep CNN learned deep features from multiple domains jointly by merging all the datasets together and fine-tuned the pretrained model on VIPeR and CUHK01 separately. The structure and hyperparameters of the deep CNN model in the M3TCP method need to be adjusted manually for adapting the scale of different datasets so as to overcome the training problem of the deep neural network.

In contrast, the proposed MiF-CNN has an immobile structure and can be trained on smaller datasets directly without any fine tuning. In spite of the deep structure, the model can converge quickly and well. MiF-CNN without the attribute-aided reranking algorithm outperforms FT-JSTL + DGD and M3TCP by 0.27% and 13.17%, respectively, at rank 1 accuracy on the CUHK01 dataset. It also outperforms FT-JSTL + DGD by 2.22% at rank 1 accuracy on the VIPeR dataset, which indicates that the proposed MiF-CNN model has an excellent ability of reducing overfitting and generalization and an outstanding performance in person re-id.

Figure 8 demonstrates the rank list of MiF and MiF + PARN. It can be seen that the MiF-CNN with the attribute-aided reranking method enables the lower ranked positive sample in rank list of MiF-CNN obtain a higher ranking so as to improve the accuracy of person re-id, which verifies the effectiveness of the attribute-aided reranking algorithm for improving performance of person re-id model.

7. Conclusion

This paper studies and discusses the training problem of deep neural network in person re-id task, and the using of pedestrian attributes for further improving the accuracy of person re-id. The proposed MiF-CNN model realizes the reuse of feature maps and gradient information, which enhances the feature mobility of network and improves the efficiency of gradient propagation. The designed person attribute recognition network uses an attention mechanism to measure the correlation between input feature maps and the hidden state of LSTM at previous time for reducing the dimension of features. It also employs LSTM to decode image features into pedestrian attributes. On the basis of the attribute recognition results, the attribute-aided reranking algorithm is presented, which rematches attribute features among samples to aid more positive samples rank higher in rank list so as to improve the identify accuracy further.

The experimental results on three public person re-id datasets indicate the outstanding performance of MiF-CNN model in person re-id. The attribute-aided reranking algorithm makes a major contribution to improve the accuracy of person re-id. In the future, more improvement and optimization will be done on pedestrian attribute recognition network. Moreover, the caption of person images or videos can be obtained by the LSTM model with natural language process ideas which make person re-id methods more significant.

Data Availability

The (attributes recognition accuracy on the Market1501 dataset and person re-id accuracy on VIPeR, CUHK01, and Market1501 datasets) data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61773105), the Natural Science Foundation of Liaoning Province (no. 20170540675), and the Scientific Research Project of Liaoning Educational Department (no. LQGD2017023).