Abstract

Person reidentification (re-id) is challenged by significant inconsistencies across the camera network, including camera positions, viewpoints, and brands. In this paper, we propose a deep camera-aware metric learning (DCAML) model, in which images in the identity-level spaces are further projected into different camera-level subspaces, allowing the inherent relationship between identity and camera to be explored. Furthermore, we exploit a dynamic training strategy to jointly optimize multiple metrics for identity-camera relationship learning, thus considerably elevating retrieval accuracy. Extensive experiments on three public datasets demonstrate that our method achieves competitive results compared to state-of-the-art person re-id methods.

1. Introduction

Person reidentification (re-id) attracts increasing research interest due to its significance in video surveillance. Although noticeable improvements have been obtained in recent years, existing person re-id approaches still suffer from several challenging issues: (1) dramatic variations of visual appearance, (2) inconsistency in the camera network, and (3) confusion between two similar pedestrians.

To address these problems, classical approaches generally focus on discriminative embedding learning or on searching for an effective similarity measurement. For example, semantic feature learning is studied via multistage ROI pooling in the work [1]. In the work [2], relations among individual body parts are explored through a GCP network. Additionally, metric learning aims to map semantically similar persons from some manifold onto metrically close person points in another space. In the work [3], an enhanced triplet loss is proposed to learn a distance metric between two pedestrian images. However, these methods are unable to discover the inherent relationship between identity and camera. Although Das et al. [4, 5] proposed a camera network reidentification approach that exploits the camera label information, the information was only exploited in the matching stage and not utilized during training. Lin et al. [6] exploited intracamera and intercamera consistency-aware information in both the training and testing stages. However, they ignore the inherent relationship between the camera and the pedestrian's features.

Notably, we find that the learned features of one person contain no organized camera-level information. As shown in Figure 1, the images of some pedestrians captured from several cameras are visualized in the same space via t-SNE, which shows a disordered distribution of camera-level information. Different cameras have different geographical locations, viewpoints, and brands. Thus, a specific camera may provide camera-level discriminative information for personal identities, which is usually ignored by existing methods. Therefore, pedestrian discrimination might benefit from the joint information of camera and identity.

In this paper, we propose a novel metric approach called the deep camera-aware metric learning (DCAML) model, where person features are projected into a unified identity space, and each identity space is modelled according to the camera-level distribution. In this circumstance, the essential relationship between identity and camera can be discovered. Meanwhile, multiple metrics are exploited to formulate the learning of the identity space and the camera subspaces. Instead of treating them as separate processes, a dynamic training strategy is developed to integrate them into one optimization objective. In addition, we further consider hard samples for both intracamera and cross-camera pairs to improve training quality. Extensive experiments conducted on public datasets show the effectiveness of our method compared with state-of-the-art approaches.

In summary, the main contributions of this paper include the following: (1) We propose a deep camera-aware metric learning (DCAML) model to discover the relationship between camera and identity, where camera-level and identity-level information jointly contribute to the retrieval accuracy. (2) We develop a dynamic training strategy to integrate multiple metrics into a unified optimization objective. (3) We introduce online hard example mining into DCAML to further improve the model performance.

2. Related Work

Person re-id is often viewed as a subproblem of image retrieval [7, 8]. Recently, with the use of deep learning in person re-id, performance has improved to an unprecedented level. Mainstream deep learning methods are divided into supervised and unsupervised learning; the supervised learning approach is adopted in this paper. There are two main classes of methods in the field of person re-id, feature learning and metric learning, which will be introduced in this section.

2.1. Feature Learning for Re-Id

Recent developments in person re-id adopt some form of localized representation learning to achieve improved performance on challenging datasets. For example, Zhao et al. [9] decomposed pedestrian images into different parts, extracted representations of the parts, and then aggregated them into an overall representation. Li et al. [10] proposed to localize parts and learn part features through spatial transformer networks, then combined local and global features for classification. Su et al. [11] exploited pose information as a supervisory signal to learn normalized human part features. Meanwhile, the attention mechanism has been used in person re-id to tackle the localization problem. For example, Liu et al. [12] proposed a HydraPlus-Net network to extract low- and semantic-level features for discriminative representations. Li et al. [13] proposed to simultaneously learn region-level and pixel-level attentive features for a multigranular representation. Li et al. [14] trained a predefined attention model for each specific body part and then aggregated them by employing a temporal attention model. Additionally, to describe pedestrians with detailed information, patch-based models [15, 16] slice person images into horizontal grids for better representations. To leverage human parts, pose-based models [9, 13], aided by a pose estimator, extract pose maps to obtain part-level features. Another way is to compute attention maps for discriminative regions.

However, these methods do not consider the essential relationship between identity and camera, which wastes the annotations of the camera index. In contrast, we consider person representation from the perspective of identity-level and camera-level distributions.

2.2. Metric Learning for Re-Id

Inspired by the great success of deep learning in computer vision tasks [17-20], many studies integrate feature and metric learning jointly in a unified deep framework, where the learning is supervised by a distance metric loss. For example, Ding et al. [21] presented a scalable distance-driven framework to introduce the triplet loss into person re-id. Based on the triplet loss, Hermans et al. [22] designed a variant of the triplet loss for end-to-end person re-id. Besides, compared to the triplet loss, Chen et al. [23] proposed a quadruplet loss to produce outputs with a larger interclass variation and a smaller intraclass variation. Inspired by hard sample mining, Xiao et al. [24] proposed a new metric learning loss called margin sample mining loss. However, none of the above methods takes advantage of the intrinsic connection between pedestrian images and cameras when designing the loss.

3. The Proposed Method

3.1. Problem Formulation and Overview

Given a probe image, the objective of person re-id is to obtain a ranked list of matched images from a gallery across different cameras. We denote an image as $x_i$, where $c_i$ is its camera label, $y_i$ is its identity label, and $f_i = \phi(x_i)$ is the feature extracted by a re-id model.

Figure 2 shows the proposed backbone for feature extraction. We employ the pretrained ResNet50 model as the basic extractor, where the last layer is removed and two extra fully connected (FC) layers are appended. The first FC layer maps the embedding to 2048 dimensions, followed by batch normalization. The second FC layer reduces the dimension of the feature tensor to the number of classes as the final outputs. Furthermore, our optimization objective includes an ID loss and a camera-aware (CA) loss: the ID loss models the identity-level distribution, while the CA loss models the camera-level distribution. These two losses are integrated into a unified framework via a dynamic training strategy. Figure 3 illustrates the overall framework for the ID loss and the CA loss.
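
As a reference point, the following is a minimal PyTorch sketch of this backbone, assuming torchvision's pretrained ResNet50; the class name DCAMLBackbone and the num_classes argument are illustrative choices, not the authors' released code.

```python
import torch.nn as nn
from torchvision import models

class DCAMLBackbone(nn.Module):
    def __init__(self, num_classes, embed_dim=2048):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Keep everything up to (and including) global average pooling.
        self.base = nn.Sequential(*list(resnet.children())[:-1])
        self.embed = nn.Linear(2048, embed_dim)  # first extra FC layer
        self.bn = nn.BatchNorm1d(embed_dim)      # batch normalization
        self.classifier = nn.Linear(embed_dim, num_classes)  # second FC layer

    def forward(self, x):
        f = self.base(x).flatten(1)   # pooled 2048-dim ResNet50 feature
        f = self.bn(self.embed(f))    # embedding used by the metric losses
        logits = self.classifier(f)   # identity logits for the ID loss
        return f, logits
```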

While modelling the camera-level distribution, features under the same camera of the same identity are pulled closer, while features from different cameras of the same identity are pushed away to an appropriate distance. To this end, a minibatch consists of a series of quadruplets, denoted as $Q_i = \{x_{i,1}^{a}, x_{i,2}^{a}, x_{i,1}^{b}, x_{i,2}^{b}\}$. We introduce this quadruplet form in Section 3.3.

3.2. Camera-Aware Person Re-Id

Metric learning is widely studied for person re-id. The goal is to explore an effective mapping function $\phi(\cdot; \theta)$ that maps semantically similar person points from the manifold in the image space $\mathcal{X}$ into metrically close person points in the embedding space $\mathcal{F}$. Here, $\theta$ is the parameter of the mapping function and can represent anything from a linear transform to the nonlinear transform of a convolutional neural network.

We define $D(x_i, x_j; \theta) = \lVert \phi(x_i; \theta) - \phi(x_j; \theta) \rVert_2$ as a distance metric function in the embedding space. For convenience, we use the simple form $D(x_i, x_j)$ while ignoring the parameter $\theta$.

Ding et al. [21] investigated the distance relation between intraclass and interclass points, aiming to decrease the intraclass variation while increasing the interclass variation. They formulated it as a metric learning function named "triplet loss" to optimize $\phi$:

$$\mathcal{L}_{trp}(x_a, x_p, x_n) = \max\{0,\; D(x_a, x_p) - D(x_a, x_n) + \mu\},$$

where $\mu$ is a tradeoff parameter between positive (intraclass) and negative (interclass) pairs. For an explicit definition, pulling person points of the same identity is defined as

$$\mathcal{L}_{pull} = D(x_a, x_p),$$

while pushing the person points of different identities is defined as

$$\mathcal{L}_{push} = \max\{0,\; \mu - D(x_a, x_n)\}.$$

The whole optimization objective can be written as

$$\min_{\theta} \sum_{(x_a, x_p, x_n)} \mathcal{L}_{trp}(x_a, x_p, x_n).$$
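
As a concrete illustration, the following is a minimal PyTorch sketch of this triplet objective; the Euclidean distance and the margin value are assumptions, and PyTorch's built-in nn.TripletMarginLoss implements the same idea.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.1):
    # anchor/positive share an identity; anchor/negative do not. All (B, D).
    d_ap = F.pairwise_distance(anchor, positive)   # intraclass (pull) term
    d_an = F.pairwise_distance(anchor, negative)   # interclass (push) term
    return F.relu(d_ap - d_an + margin).mean()     # hinge at the margin
```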

We observe that the triplet loss can effectively set identity margins for different identities. Inspired by the form of the triplet loss, we design a novel loss function called the "camera-aware loss" (CA loss). In detail, the motivation of camera-aware re-id is to construct a bridge between identity and camera. Thus, a similar form of the triplet loss can be used to learn appropriate margins for modelling the camera-level distribution. The camera-aware loss can be written as

$$\mathcal{L}_{CA} = \max\{0,\; D(x_a, x_p) - D(x_a, x_n) + m\}, \qquad y_a = y_p = y_n,\; c_p = c_a,\; c_n \neq c_a,$$

where $x_a$ denotes the probe image, $x_p$ denotes the image that constitutes a positive pair with the probe image, $x_n$ denotes the image that constitutes a negative pair with the probe image, and $c_i$ denotes the camera label of image $x_i$.

To alleviate the ambiguous representation caused by cross-camera variations, the underlying relation between camera and identity should be well mined. Given an anchor point under camera $u$, the projection of a point under camera $u$ is closer to the anchor's projection than those under another camera $v$, where $v \neq u$, by at least a margin $m$ for the camera-level distribution. By optimizing this over the whole dataset for long enough, all possible cross-camera variation pairs will be searched. In this case, the camera-level distribution can be well modelled and cross-camera information can be well learned during training.

3.3. Batch Hard Pair Selection

A main difficulty of the CA loss is that as the dataset or the number of cameras grows, the number of cross-camera pairs also grows quickly, rendering training difficult. Due to redundancy, most cross-camera pairs are uninformative and trivial. Therefore, for fast convergence of the model, it is crucial to select the hard pairs that are most similar. Intuitively, we consider that intracamera images of the same identity should be closer and more similar, so the outliers may be the hard ones. Meanwhile, cross-camera pairs of the same identity should be slightly dissimilar, so the most similar cross-camera pairs may be the hard ones.

Mathematically, given a person image $x_i^{u}$ of identity $i$ under camera $u$, we hope to select the hard positive $x_p^{u}$ under camera $u$ that satisfies

$$x_p^{u} = \arg\max_{x:\; y_x = y_i,\; c_x = u} D(x_i^{u}, x),$$

and select the hard negative $x_n^{v}$ under camera $v$, where $v \neq u$, that satisfies

$$x_n^{v} = \arg\min_{x:\; y_x = y_i,\; c_x = v} D(x_i^{u}, x).$$

However, it is time-consuming to compute $x_p^{u}$ and $x_n^{v}$ over the whole training set. Besides, it may lead to a worse training process, as the hardest images are usually noisy, e.g., wrongly labelled or wrongly detected. To address this problem, we perform online hard example mining within a minibatch to compute $x_p^{u}$ and $x_n^{v}$.

To achieve this goal, we define the intracamera pair as the positive pair and the cross-camera pair as the negative pair. For an effective representation of the CA loss, it must be ensured that positive and negative pairs of one identity are present together in each minibatch. Therefore, instead of random sampling for a minibatch, we construct a quadruplet for each identity $i$ within the minibatch:

$$Q_i = \{x_{i,1}^{a},\, x_{i,2}^{a},\, x_{i,1}^{b},\, x_{i,2}^{b}\},$$

where $x_{i,1}^{a}, x_{i,2}^{a}$ are two images of person $i$ under camera $a$, and $x_{i,1}^{b}, x_{i,2}^{b}$ are two images of person $i$ under camera $b$.

To form a minibatch, $P$ identities are randomly sampled, resulting in a minibatch of $4P$ images. In each quadruplet, the positive pairs are

$$\{(x_{i,1}^{a}, x_{i,2}^{a}),\; (x_{i,1}^{b}, x_{i,2}^{b})\},$$

while the negative pairs are

$$\{(x_{i,j}^{a}, x_{i,k}^{b}) \mid j, k \in \{1, 2\}\},$$

where $x_{i,j}^{a}$ is the $j$-th image of person $i$ under camera $a$. As training progresses, we notice that the negative-pair distances grow larger than the positive-pair distances, due to the limited images of one identity under one camera. Thus, hard positive pairs are not necessary. Therefore, we mine only the hardest negative pairs within the minibatch when computing the camera-aware loss, which we call the batch hard negative pair (BHNP). The loss is written as

$$\mathcal{L}_{CA} = \frac{1}{P} \sum_{i=1}^{P} \max\Big\{0,\; D\big(x_{i,1}^{a}, x_{i,2}^{a}\big) - \min_{j,k \in \{1,2\}} D\big(x_{i,j}^{a}, x_{i,k}^{b}\big) + m\Big\},$$

where $P$ is the number of identities in a minibatch and $x_{i,j}^{a}$ corresponds to the $j$-th image of person $i$ under camera $a$.
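
For concreteness, the following is a hedged PyTorch sketch of BHNP mining under the quadruplet layout above; the tensor layout (two images per camera per identity) follows the text, while the function and variable names are our own illustrative choices.

```python
import torch.nn.functional as F

def bhnp_ca_loss(feats_a, feats_b, margin=0.1):
    # feats_a, feats_b: (P, 2, D) embeddings of P identities, holding the
    # two images per identity captured under camera a and b, respectively.
    P = feats_a.size(0)
    # Positive term: the intracamera pair under camera a.
    d_pos = F.pairwise_distance(feats_a[:, 0], feats_a[:, 1])    # (P,)
    # All 2x2 cross-camera distances per identity; keep the hardest
    # (i.e., smallest) one as the batch hard negative pair.
    diff = feats_a.unsqueeze(2) - feats_b.unsqueeze(1)           # (P, 2, 2, D)
    d_neg = diff.norm(dim=-1).view(P, -1).min(dim=1).values     # (P,)
    return F.relu(d_pos - d_neg + margin).mean()
```

Here only the smallest of the four cross-camera distances per identity enters the hinge, matching the batch-hard-negative design; hard positives are not mined, in line with the observation above.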

3.4. Multiloss Dynamic Training

Person re-id aims to identify pedestrians across nonoverlapping cameras. The core task is identification, so the identification loss is the key component of the optimization objective. The camera-aware loss is used to alleviate the influence of cross-camera variations via a bridge between identity and camera. These two optimization objectives can benefit from each other through their explicit potential connections and achieve a better generalization performance. (1) Identification loss: as a point-wise classification loss, the identification loss adopts the cross-entropy form for identity prediction, which is defined as

$$\mathcal{L}_{ID} = -\sum_{i} \log \frac{\exp(W_{y_i}^{\top} f_i)}{\sum_{k=1}^{C} \exp(W_{k}^{\top} f_i)},$$

where $y_i$ denotes the true identity of the input image $x_i$, $C$ is the number of persons, the softmax fraction produces the identity probabilities, and $W_k$ is the weight vector of the last FC layer for the $k$-th identity. (2) Dynamic weighting: how to integrate the identification loss and the camera-aware loss is a crucial problem due to their different effects. Most multitask studies weigh the different tasks with balancing parameters and formulate some tasks as regularization terms in the loss function. However, in our framework, (1) it is difficult to choose an appropriate parameter to fairly weigh the two tasks, and (2) an inappropriate parameter setting may have negative effects on person re-id.

On the one hand, for most identities, the intracamera variations are slighter than the cross-camera variations. Thus, the camera-aware loss only provides a small loss value for updating parameters, so it contributes little to the learning if its weight is too small. On the other hand, in the early learning procedure, the re-id model needs to treat identification as the main task; otherwise, the camera-aware loss may interfere with the learning of discriminative information. Besides, from the essence of person re-id, these two tasks conflict when the weight of the camera-aware loss is too large, leading to larger intraclass variance. Therefore, as training progresses, the camera-aware loss should play a progressively larger role.

In this work, we propose a progressive balance strategy to ensure the best combination of the two losses. The identification loss requires no extra weight as the main part. For the camera-aware loss, we define a gradient ascent method to estimate the growth of its weight. Suppose $M$ is the order of magnitude of the ID loss at the initial time. To approach this order for a balance, we increase the weight of the camera-aware loss linearly to avoid loss oscillation. We calculate the gradient of the weight according to

$$\Delta w = \frac{M}{N},$$

where $N$ is the total number of minibatches. Based on the minibatch index $t$, the weight can be written as

$$w_t = \Delta w \cdot t.$$

As the weight increases, the ratio of the ID loss to the weighted CA loss decreases gradually. Obviously, at the initial time, $w_0$ is equal to zero and has no effect on the identification task. As learning progresses, the larger $w_t$ ensures that the re-id model can learn from the different viewpoints of one person. Finally, the overall objective can be rewritten as

$$\mathcal{L} = \mathcal{L}_{ID} + \frac{w_t}{P} \sum_{i=1}^{P} \max\Big\{0,\; D\big(x_{i,1}^{a}, x_{i,2}^{a}\big) - \min_{j,k \in \{1,2\}} D\big(x_{i,j}^{a}, x_{i,k}^{b}\big) + m\Big\},$$

where $x_{i,j}^{a}$ corresponds to the $j$-th image of person $i$ under camera $a$.
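
The schedule can be summarized in a few lines of Python; this is a sketch under the stated assumptions, with M and n_iters standing in for the $M$ and $N$ denoted above.

```python
def ca_weight(t, M, n_iters):
    # Linear ramp: w_0 = 0, growing by M / n_iters per minibatch so the
    # CA loss gradually approaches the order of magnitude of the ID loss.
    return (M / n_iters) * t

def total_loss(id_loss, ca_loss, t, M, n_iters):
    # The ID loss keeps unit weight as the main task; the CA loss is ramped in.
    return id_loss + ca_weight(t, M, n_iters) * ca_loss
```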

4. Experiments

4.1. Experimental Settings

(1) Datasets: three public person re-id datasets are used for evaluation, including Market1501 [25], DukeMTMC-reID [26, 27], and MSMT17 [28].

The Market1501 dataset is collected on a campus with 6 cameras. It includes 1,501 person identities with 12,936 training images and 19,732 testing images. We follow the standard evaluation protocol to ensure fair comparisons, which is defined as follows: (1) a fixed set of 751 identities is used as the training set to train the re-id model, and (2) 3,368 probe images are matched against the fixed gallery covering the remaining 750 identities.

The MSMT17 dataset includes 15 cameras. 4,101 identities are captured with 126,441 labelled person boxes. It has 1,041 training identities and 3,060 testing identities. Besides, the person boxes are cropped from the video by a Faster R-CNN detector. We adopt the standard evaluation protocol proposed in [28], which is defined as follows: (1) the dataset is randomly split into two parts, and (2) the training and testing sets are split at a ratio of 1:3.

The DukeMTMC-reID dataset is a subset of the DukeMTMC dataset. It has 16,522 images of 702 identities for training, and 2,228 probe images and 17,661 gallery images of another 702 identities for testing. We follow the protocol proposed in [26], defined as follows: (1) 702 identities are randomly selected as the training set, and the remaining 702 identities form the testing set; (2) in the testing set, one image of each identity is randomly selected as the query under each camera, and the remaining images form the gallery. (2) Parameters: our framework is implemented with PyTorch. The dimension of the embedding for matching is set to 512, the batch size is 32 for all datasets, and the dropout rate is set to 0.5. Random cropping and resizing are exploited for data augmentation. SGD is adopted as the optimizer with momentum 0.9, and the initial learning rate is set to 0.05. Moreover, the margin $m$ is set to 0.1. (3) Evaluation metrics: rank-1 accuracy and mAP (mean average precision) are used to assess the model performance.
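
For reference, a sketch of the reported training configuration in PyTorch follows; the resize resolution is a placeholder since the paper elides the exact value, and DCAMLBackbone refers to the illustrative backbone sketch in Section 3.1.

```python
import torch
from torchvision import transforms

# Augmentation per the text (random cropping and resizing). The target
# resolution is a placeholder; the paper elides the exact resize size.
train_transform = transforms.Compose([
    transforms.Resize((256, 128)),                 # placeholder resolution
    transforms.RandomCrop((256, 128), padding=8),  # random crop with padding
    transforms.ToTensor(),
])

# Reported optimizer settings: SGD, momentum 0.9, initial learning rate 0.05.
model = DCAMLBackbone(num_classes=751)  # e.g., Market1501 training identities
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
```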

4.2. Comparison with the State-of-the-Arts

In this section, we compared the proposed framework with more than 30 state-of-the-art methods proposed in recent years on the three datasets. For each dataset, we provide detailed results as follows. (1) Market1501: for this dataset, we compared two kinds of approaches: local-feature and global-feature approaches. As illustrated in Table 1, our method achieves the best accuracy on both rank-1 and mAP compared with the global-feature methods. Our method exceeds DML by 9.9% on mAP and 5% on rank-1.

The results indicate that local-feature-based methods generally perform better than methods based only on global features. Compared with the local-feature-based methods in Table 2, our method obtains near-best performance using only a pretrained ResNet50. Although MGN, PCB+RPP, and HPM perform slightly better than ours, MGN and HPM explore both global and local information for person re-id, while PCB+RPP takes multiple parts; both have more complex architectures with multiple branches. Besides, our method outperforms PCB without RPP by 0.4% on rank-1 and 1.3% on mAP. (2) DukeMTMC-reID: as illustrated in Table 3, our method exceeds most SOTA methods on this dataset on both mAP and rank-1. The HPM method outperforms ours by 3.0% on rank-1 and 6.5% on mAP but uses both global and local information, while our method uses only global information. The part-aligned method outperforms ours by a small gap of 0.7% on rank-1 and 1.5% on mAP but uses part-aligned information, while our method only needs to extract a simple global feature without other operations. MLFN, a method that extracts multilevel features, performs worse than ours on rank-1 (81.2% vs. 83.6%). Besides, our model exceeds PCB by 0.3% on rank-1. In addition, our method also performs better than some attention-based methods such as HA-CNN. (3) MSMT17: MSMT17 is a recently released dataset that contains complex illumination, scenes, and backgrounds. Due to the limited works on MSMT17, we merely compared with the works presented in [28], which released the MSMT17 dataset. As illustrated in Table 4, our method can elevate the identification accuracy without extra information. In detail, compared with the state-of-the-art methods, our method exceeds GoogLeNet by 2.9% on rank-1 and 10.1% on mAP, and exceeds PDC by 0.4% on mAP with an approximately equal rank-1.

4.3. Ablation Study

To further discuss every component in our framework, we conducted a series of comprehensive ablation studies on the different submodules. The performance results on two metrics (mAP and rank-1) are shown in Table 5. Each result is obtained with only one submodule changed, while the remaining submodules stay the same as the original.

We first used only the fine-tuned ResNet50 to extract features for person re-id, and then added the batch hard negative pair (BHNP) on top of ResNet50 to test the performance. From Table 5, we can conclude the following: (1) the BHNP sampling method is more useful than random sampling, which indicates that the cross-camera quadruplets are effective; besides, via BHNP, overfitting is effectively avoided during training. (2) With BHNP and CA loss, the performance further exceeds that of ResNet50, which indicates that the CA loss is effective; the margins among camera embedding subspaces successfully reduce confusing wrong matching pairs. (3) Dynamic weighting is important for the CA loss to produce a performance improvement, indicating that the dynamic weighting strategy can improve the function of the CA loss; without dynamic weighting, the CA loss may have a negative influence on the performance.

4.4. Visualization Results

To directly demonstrate the effectiveness of our method, we visualized the same images shown in Figure 1 via t-SNE and PCA. As shown in Figures 1 and 4, (1) different shapes denote different identities and (2) different colors denote different cameras. The different identities are well separated by our method, while obvious camera-level structure exists within the same identity category. In fact, camera-level information can also aid identification, because the visual representations captured from cameras contain camera characteristics such as camera style, viewpoint, and scale.
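
A minimal sketch of this visualization with scikit-learn's t-SNE follows; variable names such as feats, pids, and cam_ids are illustrative, and camera labels are assumed to be integers for coloring.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(feats, pids, cam_ids):
    # feats: (N, D) array of embeddings; pids, cam_ids: (N,) integer labels.
    pts = TSNE(n_components=2, init="pca").fit_transform(feats)
    markers = "os^Dv<>p"
    for k, pid in enumerate(np.unique(pids)):
        m = pids == pid
        plt.scatter(pts[m, 0], pts[m, 1],
                    marker=markers[k % len(markers)],  # shape encodes identity
                    c=cam_ids[m], cmap="tab10")        # color encodes camera
    plt.title("t-SNE of embeddings (shape: identity, color: camera)")
    plt.show()
```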

5. Conclusion

In this paper, we propose a deep camera-aware metric learning (DCAML) model for person reidentification, where images in the identity-level spaces are further projected into different camera-level subspaces. In this way, we explore the inherent relationship between identity and camera. Furthermore, multiple loss functions are utilized to supervise the learning of the identity-level spaces and the camera-level subspaces. In addition, we jointly optimize multiple metrics for identity-camera relationship learning via a designed dynamic training strategy. Extensive experiments on three public datasets demonstrate that our method achieves competitive results compared to state-of-the-art person re-id methods.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of China (U1803262, 61602349, and 61440016).