Abstract

Person reidentification (re-id) has made significant progress and attracted great interest in computer vision. However, due to the effects of weak illumination and poor alignment, person re-id remains a challenging task. Many previous works focus on either illumination enhancement or pose estimation. However, such methods are difficult to apply in real-world scenarios, which usually contain various interference factors. To improve re-id performance, we propose an Illumination-Invariant and Pose-Aligned Network (IIPA-Net). Illumination change is handled by a retinex decomposition network, and the pose variation problem is solved by a local feature matching method. Based on the multimodal nature of a person, we propose a part attention module to optimize the global feature. Finally, a data-driven training strategy is proposed to train the proposed architecture effectively. Experiments show that the proposed framework outperforms other state-of-the-art approaches on both normal- and low-light datasets.

1. Introduction

Person reidentification (re-id) aims to identify a specific person (the probe or query image) from a gallery of candidate images captured by multiple cameras with overlapping or nonoverlapping fields of view. The increasing need for safety and security, combined with the growing availability of surveillance cameras, makes person reidentification an increasingly explored area [1]. However, it is very challenging, since the images of a person of interest captured by surveillance cameras usually exhibit significant variations in viewpoint, illumination, human pose, and so on [2]. Low resolution, partial occlusion, and blurring further increase the difficulty of person re-id [3].

Since person images are captured by different cameras under unknown lighting conditions, the appearance of the same person varies considerably, making the re-id task extremely difficult. In order to eliminate the effect of illumination, many methods rely on the statistics of the color distribution and project images into a color-constant space [4]. However, prior information about lighting is unpredictable in real-world scenarios. An alternative solution is to simulate real-world illumination with data augmentation techniques, which is expensive and requires a large amount of labeled data [5]. Pose misalignment, caused by changing viewpoints or inaccurate detection boxes, is another source of interference for person re-id frameworks [6]. A straightforward solution to this pose variation is to apply human pose estimation, which parses a person image into different semantic parts. However, pose estimation requires massive labeled data to train the model [7]. Moreover, re-id accuracy degrades substantially when the estimation is inaccurate. Figure 1 shows some examples of illumination change and pose misalignment.

Convolutional neural networks (CNNs), which have powerful representation and invariant embedding capabilities, have boosted the performance of person re-id [8]. CNN-based person re-id methods can be divided into two categories: discriminative feature representation learning and deep metric learning [9]. In the first category, the majority of methods concentrate on extracting discriminative features and then formulate person re-id as a classification problem [10]. In the second category, a robust metric between positive (same-person) and negative (different-person) pairs is learned to deal with the matching problem [11]. In this paper, we focus on extracting discriminative feature representations. To achieve this aim, we propose a joint CNN framework that couples global and local feature learning to suppress interference, especially illumination and pose variations. Firstly, motivated by deep retinex illumination decomposition [12], we adopt a lightweight estimation to eliminate the effect of illumination and enhance the global person feature. Secondly, inspired by AlignedReID++ [13], which aligns local information to learn more discriminative features, we introduce local feature matching to align different parts of the person image, which solves the pose variation problem. We find that the illumination-invariant feature can guide the local feature matching to align different person image parts. Thirdly, since a detected person has two significant modes [14], we concatenate the low-level CNN feature and a two-peak Gaussian map to design an attention mechanism. Consequently, the proposed IIPA-Net can boost re-id performance on both normal- and low-light datasets. In summary, the contributions of this paper are threefold:
(i) We build a novel network framework, which contains a retinex decomposition net and a weight-shared ResNet-50 backbone CNN and achieves illumination-invariant and pose-aligned re-id.
(ii) We propose a part attention module to reweight the CNN output and extract the most informative parts of a person.
(iii) A data-driven training strategy is introduced to train the network effectively and speed up the training process.

2. Related Work

The main challenges of reidentification are changes in illumination, viewpoint, and pose across cameras. Many works focus on extracting the most discriminative visual features of a person, including color [14], texture [15], and shape [16]. Kviatkovsky et al. [14] use shape context descriptors as a color-based signature to represent a person, which is divided into two significant modes. However, they assume that the silhouette of a person can always be obtained, which is not the case in real-world applications. Deep learning has revolutionized the techniques for person reidentification [17]. Li et al. [18] successfully apply deep learning to extract features for person reidentification. Xiao et al. [19] propose a new deep learning framework that jointly handles person detection and reidentification in a single convolutional neural network. Wu et al. [20] improve the discriminative feature representation of CNNs by exploiting unlabeled tracklets. The major limitation of these frameworks is that they either rely on handcrafted features or employ single-scene images, making them less robust to varying lighting conditions and changing human pose. Retinex theory is widely used for illumination estimation [21], and many retinex-based re-id algorithms have achieved competitive performance [22, 23]. Specifically, Liao et al. utilize the retinex transform and a scale-invariant texture operator to handle illumination variations [23]. Huang et al. propose a retinex decomposition network to address the illumination variation problem and achieve competitive re-id performance in low-light conditions [22].

In [24], a new synthetic dataset containing hundreds of illumination conditions is introduced to simulate real-world lighting. The above methods reduce the adverse effects of illumination variation. However, they ignore local feature matching and fail to learn aligned information, which would effectively eliminate the influence of pose variation.

To reduce the negative impact of pose variation, some works apply human pose estimation to extract pixel-level body regions [8, 25]. Zheng et al. adopt the pose estimation confidence of the input image to build a pose-invariant embedding (PIE) descriptor [8]. In [25], Zhao et al. represent a person with a discriminative feature learned from different semantic regions of the person. On the other hand, some works utilize horizontal stripes or grids to extract pose-invariant features [13, 26]. Sun et al. design a Part-based Convolutional Baseline (PCB) network to learn discriminative part-level features [26]. Using dynamic programming to match horizontal stripes of person images, Luo et al. propose a deep model to address the misalignment issue [13]. Additionally, Miao et al. propose an occluded person re-id framework by incorporating pose information [27]. In spite of the great progress in re-id performance, the above methods could still be improved by integrating the advantages of different architectures.

Different from existing frameworks, we focus on addressing the issues of illumination and pose change simultaneously. To this end, we propose a novel framework that learns illumination invariance and pose alignment in a multitask manner.

3. Methodology

In this section, we first describe the retinex decomposition net and the part attention module. Then, the details of the proposed structure and training strategy are introduced.

3.1. Retinex Decomposition Net

To simulate human color perception, retinex theory decomposes the observed image into two components: reflectance and illumination [21]. Mathematically, the source image $S$ can be written as

$S = R \circ I, \quad (1)$

where $R$ and $I$ represent the reflectance and illumination components, respectively, and $\circ$ represents element-wise multiplication. The reflectance map describes the intrinsic property of the person and is invariant to light change.

Thus, it is effective to extract illumination-invariant discriminative features from the reflectance map. The illumination map, which represents various lighting environments, is harmful to re-id performance and is therefore ignored in this paper.

Unlike the deep retinex net [12], which performs both reflectance and illumination decomposition to enhance low-light images, we only use the retinex decomposition net to extract the consistent feature of a person. As shown in Figure 2, the retinex decomposition net includes 8 layers. The first layer is a convolutional layer, which extracts convolutional features from the input image. The second to sixth layers are convolutional layers with a ReLU activation function. The seventh layer is a convolutional layer that maps $R$ and $I$ from the feature space. The last layer is a sigmoid function that normalizes $R$ and $I$ to $[0, 1]$.
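For concreteness, the following PyTorch sketch stacks these eight layers; the 3x3 kernels, the 64-channel width, and the 4-channel output split (3 channels for $R$, 1 for $I$) are assumptions, since the paper only specifies the layer types and their order.

```python
import torch
import torch.nn as nn

class RetinexDecomNet(nn.Module):
    """Sketch of the 8-layer retinex decomposition net described in Section 3.1."""

    def __init__(self, channels=64):
        super().__init__()
        # Layer 1: plain convolution that lifts the RGB input into feature space.
        self.head = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Layers 2-6: five convolutions, each followed by a ReLU activation.
        body = []
        for _ in range(5):
            body += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                     nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*body)
        # Layer 7: convolution mapping features to 3 reflectance + 1 illumination channels.
        self.tail = nn.Conv2d(channels, 4, kernel_size=3, padding=1)
        # Layer 8: sigmoid normalizes both outputs to [0, 1].
        self.act = nn.Sigmoid()

    def forward(self, x):
        feat = self.body(self.head(x))
        out = self.act(self.tail(feat))
        reflectance, illumination = out[:, :3], out[:, 3:]
        return reflectance, illumination
```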

To extract a consistent $R$ from images of different lightness, the decomposition network is fed paired normal-light/low-light images at each step. During the training stage, the paired images, rather than any ground-truth reflectance, are used to train the retinex decomposition net. In the test stage, however, the network can predict $R$ and $I$ from a single input image.

The loss for the retinex decomposition net consists of a reconstruction loss $L_{recon}$ and an invariant reflectance loss $L_{ir}$:

$L_{decom} = L_{recon} + \lambda L_{ir}, \quad (2)$

where $\lambda$ is used to balance the consistency of reflectance. The reconstruction loss is defined as

$L_{recon} = \sum_{i \in \{low, nor\}} \sum_{j \in \{low, nor\}} \lambda_{ij} \left\| R_i \circ I_j - S_j \right\|_1, \quad (3)$

where $S_{low}$ and $S_{nor}$ denote the input low-light and normal-light images, respectively; $R_{low}$ and $I_{low}$ denote the reflectance and illumination of $S_{low}$, and $R_{nor}$ and $I_{nor}$ those of $S_{nor}$. The invariant reflectance loss is defined as

$L_{ir} = \left\| R_{low} - R_{nor} \right\|_1. \quad (4)$
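Under these definitions, a minimal PyTorch sketch of the decomposition objective could look as follows; the mean-absolute-error reduction and the function name are illustrative, and the weights follow the values reported in Section 4.2.

```python
import torch

def decomposition_loss(r_low, i_low, r_nor, i_nor, s_low, s_nor,
                       lam=0.001, lam_cross=0.001):
    """Sketch of Equations (2)-(4): reconstruction + invariant reflectance loss."""
    l1 = lambda a, b: torch.mean(torch.abs(a - b))
    # Reconstruction loss: each reflectance/illumination pairing should reproduce
    # the corresponding source image; cross pairings are down-weighted by lam_cross.
    recon = (l1(r_low * i_low, s_low) + l1(r_nor * i_nor, s_nor)
             + lam_cross * (l1(r_low * i_nor, s_nor) + l1(r_nor * i_low, s_low)))
    # Invariant reflectance loss: the reflectance of the paired images should match.
    ir = l1(r_low, r_nor)
    return recon + lam * ir
```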

3.2. Part Attention Module

In order to extract discriminative features, many re-id methods introduce an attention mechanism to highlight the informative parts of person images while suppressing the cluttered background [9, 28]. The goal of the attention mechanism is to produce a saliency map to reweight the CNN output. Given a 3-D feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ indicate the number of pixels in the channel, height, and width dimensions, respectively, the reweighting process can be formulated as

$\tilde{X} = A \circ X, \quad (5)$

where $\tilde{X}$ is the reweighted map and $A$ is the output of the attention module. Since person images are produced by a state-of-the-art detector, it is reasonable to assume that the detected person lies in the middle of the image. In real-world scenarios, a person usually wears different clothing on the upper and lower body. Based on this multimodal nature, we introduce a two-peak Gaussian map $G$, defined in Equation (6), to deal with the intradistribution of person appearance:

$G(p) = \exp\!\left(-\frac{\left\| p - \mu_1 \right\|^2}{2\sigma^2}\right) + \exp\!\left(-\frac{\left\| p - \mu_2 \right\|^2}{2\sigma^2}\right), \quad (6)$

where $p$ is a pixel location and $\mu_1$ and $\mu_2$ represent the peak centers of the Gaussian map.
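A map of this form can be generated directly; in the sketch below the peak centers (roughly 1/4 and 3/4 of the image height on the vertical center line) and the spread sigma are hypothetical choices, since the paper does not list the exact values.

```python
import torch

def two_peak_gaussian_map(height, width, sigma=0.15):
    """Two-peak Gaussian prior of Equation (6) on a normalized [0, 1] x [0, 1] grid."""
    ys = torch.linspace(0.0, 1.0, height).view(-1, 1).expand(height, width)
    xs = torch.linspace(0.0, 1.0, width).view(1, -1).expand(height, width)
    mu1, mu2 = (0.25, 0.5), (0.75, 0.5)  # (y, x) peak centers, hypothetical values
    g1 = torch.exp(-((ys - mu1[0]) ** 2 + (xs - mu1[1]) ** 2) / (2 * sigma ** 2))
    g2 = torch.exp(-((ys - mu2[0]) ** 2 + (xs - mu2[1]) ** 2) / (2 * sigma ** 2))
    return torch.clamp(g1 + g2, max=1.0)  # H x W saliency prior
```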

As shown in Figure 3, we concatenate $G$ with the output of the 4th layer of ResNet-50. Subsequently, six convolution layers are added to extract the discriminative feature. Finally, a softmax classifier is implemented with a Fully Connected (FC) layer.
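The sketch below wires this module together under stated assumptions: the stage-4 channel count, the intermediate width, the kernel sizes, and the use of a single-channel sigmoid attention map for the reweighting of Equation (5) are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PartAttention(nn.Module):
    """Sketch of the part attention branch (Figure 3)."""

    def __init__(self, in_channels=1024, num_classes=751, width=256):
        super().__init__()
        convs = [nn.Conv2d(in_channels + 1, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(4):
            convs += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        convs += [nn.Conv2d(width, 1, 1)]              # six convolutions in total
        self.convs = nn.Sequential(*convs)
        self.fc = nn.Linear(in_channels, num_classes)  # softmax classifier (FC layer)

    def forward(self, feat, gauss_map):
        # gauss_map must be generated at the feature-map resolution
        # (e.g., with two_peak_gaussian_map above) and is broadcast over the batch.
        g = gauss_map.expand(feat.size(0), 1, *feat.shape[-2:])
        attn = torch.sigmoid(self.convs(torch.cat([feat, g], dim=1)))
        reweighted = feat * attn                        # Equation (5)
        logits = self.fc(reweighted.mean(dim=(2, 3)))   # GAP + FC for classification
        return reweighted, logits
```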

3.3. IIPA-Net Architecture

As shown in Figure 4, the proposed IIPA-Net can be divided into two parts: global branch and local branch.

For the first branch, the most discriminative image parts of a person are extracted by the part attention module. In the second branch, the person images are enhanced by preserving the reflectance map of the retinex decomposition net. Both branches are fed into weight-shared ResNet-50 backbone CNNs, which makes the proposed model more flexible and easier to train. The output of ResNet-50 is a $C \times H \times W$ feature map, where $C$ represents the number of feature channels and $H \times W$ is the spatial size. We extract a global discriminative feature vector using Global Average Pooling (GAP). Then, the global feature distance can be calculated as

$d_g(A, B) = \left\| f_A - f_B \right\|_2, \quad (7)$

where $f_A$ and $f_B$ denote the global features of images $A$ and $B$. The global feature is able to learn holistic information from the person image. However, it fails to address the pose-misalignment issue, because the local representation is still unexploited. To learn the pose-aligned local feature, the output feature map of ResNet-50 is transformed into $H$ part features of dimension $C$ using horizontal average pooling. Let $\{l_A^i\}_{i=1}^{H}$ and $\{l_B^j\}_{j=1}^{H}$ denote the local features of images $A$ and $B$. The distance between the $i$th part of $A$ and the $j$th part of $B$ is then

$d_{i,j} = \frac{e^{\left\| l_A^i - l_B^j \right\|_2} - 1}{e^{\left\| l_A^i - l_B^j \right\|_2} + 1}. \quad (8)$

We further obtain the distance matrix $D \in \mathbb{R}^{H \times H}$, whose elements are $d_{i,j}$. As described in [13], the local pose-aligned feature distance $d_l(A, B)$ can be derived by dynamically matching local information (DMLI), which dynamically aligns the different part features. Finally, we obtain the total distance of $A$ and $B$ as

$d(A, B) = d_g(A, B) + d_l(A, B). \quad (9)$
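A compact PyTorch sketch of this distance computation is given below; the normalized stripe distance and the right/down shortest-path alignment follow the AlignedReID++ formulation [13], and the helper names are illustrative.

```python
import torch

def local_aligned_distance(la, lb):
    """DMLI-style local distance between two stripe feature sets (Section 3.3).

    la, lb: tensors of shape (H, C), one C-dim feature per horizontal stripe.
    """
    # Normalized pairwise distances d_ij of Equation (8).
    diff = torch.cdist(la.unsqueeze(0), lb.unsqueeze(0)).squeeze(0)  # (H, H) L2 distances
    d = (torch.exp(diff) - 1) / (torch.exp(diff) + 1)
    # Dynamic programming: shortest path from d[0, 0] to d[H-1, H-1],
    # moving only right or down, which aligns the stripes in order.
    H = d.size(0)
    dp = torch.zeros_like(d)
    dp[0, 0] = d[0, 0]
    for i in range(1, H):
        dp[i, 0] = dp[i - 1, 0] + d[i, 0]
        dp[0, i] = dp[0, i - 1] + d[0, i]
    for i in range(1, H):
        for j in range(1, H):
            dp[i, j] = torch.minimum(dp[i - 1, j], dp[i, j - 1]) + d[i, j]
    return dp[-1, -1]

def total_distance(ga, gb, la, lb):
    """Equation (9): global Euclidean distance plus the aligned local distance."""
    return torch.norm(ga - gb, p=2) + local_aligned_distance(la, lb)
```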

The total loss function of the framework is

$L = L_{softmax} + L_{triplet} + L_{circle}, \quad (10)$

where $L_{softmax}$ and $L_{triplet}$ denote the softmax loss and triplet loss [29] on the global feature and $L_{circle}$ denotes the circle loss [30] on the local pose-aligned feature. The performance of different loss functions is reported in Section 4.3.
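The composition of Equation (10) can be sketched as follows; `circle_loss` is a placeholder for the circle loss of [30] (not part of torch), and the triplet margin value is an assumption.

```python
import torch.nn as nn

softmax_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)  # margin value is an assumption

def total_loss(logits, labels, anchor, positive, negative, circle_loss, local_pairs):
    l_softmax = softmax_loss(logits, labels)               # global feature, ID classification
    l_triplet = triplet_loss(anchor, positive, negative)   # global feature, metric learning
    l_circle = circle_loss(*local_pairs)                   # local pose-aligned feature
    return l_softmax + l_triplet + l_circle
```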

3.4. Training the Network

Since there is no explicit ground truth for training the part attention module and the retinex network, it is difficult to optimize the network for various scenes. Therefore, we train the network in a data-driven way. The whole network is trained in four stages, as illustrated in Algorithm 1 (a minimal sketch of this schedule follows Algorithm 1).
(i) First, the backbone network, ResNet-50, is initialized with the ImageNet [31] pretrained model and trained to convergence under the supervision of the triplet loss.
(ii) Second, synthetic low-light image sets based on PASCAL VOC, together with their original images, are fed to the retinex decomposition network, as described in Section 3.1. This training step finishes after 200 epochs.
(iii) Third, all layers in ResNet-50 are fixed and only the part attention module is trainable. The IIPA-Net is then retrained with the softmax and triplet losses on the training set. The learning rate is decayed for 40 epochs.
(iv) Finally, we set all layers trainable and fine-tune the IIPA-Net to convergence again.

  1. The weight-shared ResNet-50 is trained to convergence with the triplet loss.
  2. All synthetic images, together with their original images, are fed into the retinex decomposition network.
  3. The part attention module is trained using the training image set.
  4. The whole network is fine-tuned with Equation (10).
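As referenced above, the following sketch shows how the trainable parameters could be switched between the four stages; the attribute names (`backbone`, `retinex`, `attention`) and the learning rate are hypothetical placeholders, as the paper only fixes the order of the stages.

```python
import torch

def configure_stage(model, stage):
    """Return an optimizer over the parameters trained at each stage of Algorithm 1."""
    if stage == 1:        # Stage 1: backbone only, supervised by the triplet loss
        params = model.backbone.parameters()
    elif stage == 2:      # Stage 2: retinex decomposition net on synthetic pairs
        params = model.retinex.parameters()
    elif stage == 3:      # Stage 3: freeze the backbone, train the part attention module
        for p in model.backbone.parameters():
            p.requires_grad = False
        params = model.attention.parameters()
    else:                 # Stage 4: unfreeze everything and fine-tune with Equation (10)
        for p in model.parameters():
            p.requires_grad = True
        params = model.parameters()
    return torch.optim.Adam(params, lr=3e-4)
```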

4. Experiments

4.1. Datasets and Evaluation Measures

Our experiments are based on two real-world and popular person re-id datasets: Market1501 [32] and DukeMTMC-reID [33]. To better present the advantages of the proposed illumination-invariant feature, we also adopt two manually constructed low-light re-id datasets named low-light Market and low-light Duke. Market1501 includes 32,668 labeled person images captured by six cameras; specifically, there are 12,936 images of 751 identities in the training set and 19,732 images of 750 identities in the testing set. DukeMTMC-reID contains 36,411 images extracted from the DukeMTMC dataset [34] and captured by eight cameras, with 16,522 images of 702 identities in the training set and 19,889 images of 1,110 identities in the testing set. The low-light Market and low-light Duke are built from Market1501 and DukeMTMC-reID, respectively. Following [22], we use gamma correction to simulate low-light conditions: each image in the datasets is processed with a gamma value randomly picked from a fixed range. Figure 5 shows examples of synthetic low-light images.

To evaluate the performance of different algorithms, we use Cumulative Matching Characteristic (CMC) curves and mean Average Precision (mAP) [32] as the evaluation criteria. CMC is defined as a function of Rank-$k$ [35]:

$\mathrm{CMC}(k) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\!\left(\mathrm{rank}(q_i) \le k\right), \quad (11)$

where the gallery $\mathcal{G} = \{g_1, \ldots, g_N\}$ contains $N$ person images in total, $\mathrm{rank}(q_i)$ is the rank of the first correct match of query $q_i$ in $\mathcal{G}$, and the query set is defined as

$\mathcal{Q} = \{q_1, q_2, \ldots, q_M\}. \quad (12)$

mAP is calculated from the Average Precision (AP) and is defined as

$\mathrm{mAP} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{AP}_i, \quad (13)$

where $\mathrm{AP}_i$ represents the area under the precision-recall curve of the $i$th query and $M$ represents the size of the query set.
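For reference, a minimal evaluation sketch is given below; it omits the camera-ID filtering of the standard Market1501 protocol and assumes every query has at least one correct match in the gallery.

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, max_rank=10):
    """Sketch of the CMC and mAP computation of Equations (11)-(13).

    dist: (num_query, num_gallery) distance matrix; ids are numpy integer arrays.
    """
    num_q = dist.shape[0]
    cmc = np.zeros(max_rank)
    aps = []
    for i in range(num_q):
        order = np.argsort(dist[i])                       # gallery sorted by distance
        matches = (gallery_ids[order] == query_ids[i])    # boolean hit vector
        # CMC: query i counts for rank k if its first hit occurs within the top k.
        first_hit = np.where(matches)[0][0]
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        # AP: area under the precision-recall curve of this query.
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append(np.sum(precision * matches) / matches.sum())
    return cmc / num_q, float(np.mean(aps))
```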

4.2. Experimental Setup

We implement all experiments on an Intel Xeon E5-2630 v3 2.4 GHz machine with 32 GB RAM and one NVIDIA GTX Titan 12 GB GPU. The training batch size is set to 32, and $\lambda$ in Equation (2) is set to 0.001. $\lambda_{ij}$ in Equation (3) is set to 1 when $i = j$; otherwise, $\lambda_{ij}$ is 0.001. Each input image is resized to a fixed resolution. Random horizontal flipping and cropping are performed to augment the data. We use the Adam optimizer with a fixed initial learning rate.

4.3. Experimental Results

In this subsection, we first evaluate the part attention module and show that the two-peak Gaussian map better captures the main body information of a person. Then, the effect of low light is analyzed, showing that low-light conditions have a negative impact on pose alignment. Finally, we compare the performance of our proposed IIPA-Net with other state-of-the-art re-id methods.

4.3.1. Evaluation of Part Attention

To better illustrate the effect of the proposed part attention module, we visualize the attention maps of the model with normal and two-peak Gaussian maps. In Figure 6, we can observe that the two-peak Gaussian map attends to both the upper and lower parts of a person, while the normal one attends only to either the upper (Figure 6(a)) or the lower (Figure 6(b)) part. The introduction of the two-peak Gaussian map makes part attention work more effectively with the multimodal nature of a person. The third column of Figure 6 shows that the proposed part attention produces similar predicted attention under different lighting conditions.

4.3.2. Effect of Low Light

As shown in Figure 7(a), using AlignedReID++ [13] as the baseline model, the fifth block of the left image is aligned to the fourth and sixth blocks of the right image, and the distance between the two images is 0.7333, which is greater than that of the negative pair (0.5557). However, after decomposing the illumination, our proposed method is able to align the head, chest, feet, etc., of the positive pair, and the distance is reduced to 0.4195, which is less than that of the negative pair (0.5775), as illustrated in Figure 7(b). The wrong connections of the baseline can be attributed to the negative impact of low illumination. This indicates that the proposed approach eliminates the effect of weak illumination and learns illumination-invariant features.

4.3.3. Performance of Different Loss Functions

We train four models with the softmax+triplet loss, the softmax+instance loss [36], the softmax+circle loss, and the proposed loss. Their performance on Market1501 is presented in Table 1, where the two loss components are applied to the global and local features, respectively. We observe that softmax+instance and softmax+circle achieve similar Rank-1 accuracy. Compared with softmax+triplet, the proposed loss improves Rank-1 and mAP by approximately 0.3 and 0.2 points, respectively. We believe that the circle loss works better on some hard local features.

4.3.4. Comparison with State-of-the-Art

To evaluate the performance of the proposed IIPA-Net, we compare it against several state-of-the-art methods. Our baseline is AlignedReID++ [13], which focuses on solving the pose change problem. To demonstrate the advantage of the proposed framework, we also report the results of the baseline combined with a low-light enhancement method: both the training and testing image sets are enhanced with MSRCP [37] and then fed into the baseline.

As shown in Table 2, our proposed framework outperforms most state-of-the-art methods on all four datasets. Specifically, the proposed framework achieves 96.2% Rank-1 on Market1501 and 90.8% Rank-1 on DukeMTMC-reID, outperforming other attention-based methods, i.e., MHN-6 [9] and DSA [38]. Although FlipReID [39] and st-ReID [40] achieve the best performance, they utilize extra data, for instance, spatial and temporal information, to train the network. On the low-light Market and low-light Duke datasets, the Rank-1 accuracy of the proposed method is increased by 10.1% and 11.2%, and the mAP is increased by 9.5% and 6.0%, respectively. This demonstrates that our joint framework not only eliminates the impact of low light but also explores pose-invariant local features for person re-id. Figure 8 depicts five example queries together with the top 10 retrieved results of the baseline and IIPA-Net on the low-light Market dataset. As we can see, IIPA-Net outperforms the baseline and accurately retrieves the target in spite of illumination and pose variations.

4.3.5. Ablation Study

To verify the contribution of each component, we perform an ablation study on the normal- and low-light Market datasets. Table 3 shows the results for each component of IIPA-Net. We note that the attention component achieves better results on the Market1501 dataset, whereas the retinex component is better in low-light conditions. The combination of retinex and attention achieves the best performance on both datasets, because IIPA-Net is able to learn both illumination- and pose-invariant features.

5. Conclusions

In this paper, we proposed a joint illumination-invariant and pose-aligned learning framework for person re-id. Motivated by retinex theory, we introduced a retinex decomposition net to eliminate the impact of different lighting and extract an illumination-invariant feature. To tackle the pose-misalignment problem, dynamically matching local information is utilized to align the local features derived from the deep feature map. Based on the multimodal nature of a person, we proposed a part attention mechanism to extract the most discriminative global feature. The joint framework is trained in a four-stage fashion. Experiments demonstrate that the proposed framework achieves better performance on both normal- and low-light datasets. In the future, we will focus on long-term re-id scenarios, which present more complex scene variations.

Data Availability

All data included in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under Grant 52105268, Natural Science Foundation of Guangdong Province under Grant 2022A1515011409, Key Platforms and Major Scientific Research Projects of Universities in Guangdong under Grants 2019KTSCX161 and 2019KTSCX165, Key Projects of Natural Science Research Projects of Shaoguan University under Grants SZ2020KJ02 and SZ2021KJ05, Project of Guangdong Provincial Key Laboratory of Technique and Equipment for Macromolecular Advanced Manufacturing under Grant 2020kfkt07, and the Science and Technology Program of Shaoguan City of China under Grants 2019sn056, 200811094530423, 200811094530805, and 200811094530811.