Abstract

In person reidentification, distance metric learning faces a serious challenge from impostor persons. Typically, distance metrics are learned by maximizing the similarity of a positive pair against impostors that lie on different transform modals. In addition, these impostors are obtained from the Gallery view for the query sample only, while the Gallery sample is ignored entirely. In the real world, a given query and Gallery pair experiences different changes in pose, viewpoint, and lighting; thus, impostors drawn only from the Gallery view cannot optimally maximize the pair's similarity. To resolve these issues we propose an impostor resilient multimodal metric (IRM3). IRM3 is learned for each modal transform in the image space and uses impostors from both Probe and Gallery views to effectively suppress a large number of impostors. The learned IRM3 is evaluated on three benchmark datasets, VIPeR, CUHK01, and CUHK03, and shows significant improvement in performance over many previous approaches.

1. Introduction

Person reidentification (Re-ID) matches a given person across a large network of nonoverlapping cameras [1] and is a fundamental component of person tracking in camera networks. Despite years of research, reidentification remains challenging because the data space in Re-ID is multimodal (a modal in our work is the space formed by the joint combination of the different changes that a given pair of images of the same person undergoes in different camera views), and the observed images in different views undergo various changes in pose [2], viewpoint [3], lighting [4], and background clutter, and also experience occlusion.

Most approaches in Re-ID fall into two categories: robust feature extraction [5–13] for representation and globally learned distance metrics for matching [14, 15]. These global metrics [16–19] project features into a low-dimensional subspace where they maximize the discrimination among different persons; however, they still face a serious challenge from impostor samples [20, 21] (an impostor is a sample that belongs to another person and yet possesses higher similarity with the given query than the correct Gallery sample). Although some attempts have been made in the past to eliminate impostors [14, 20–22], none of them gives due consideration to the different transform modals on which the reidentification images lie [23]. This situation is illustrated in Figure 1, where we show three transform modals $M_1$, $M_2$, and $M_3$ in the image space. $M_1$ contains a positive pair (query and Gallery), enclosed in green rectangles, for which a metric is learned, while two more pairs lie in modals $M_2$ and $M_3$, respectively. The view-b images (enclosed in red rectangles) in $M_2$ and $M_3$ are similar to the query in $M_1$ and are therefore impostors for the query sample. In conventional approaches [14, 20–22], the metric between the query and Gallery samples in $M_1$ is learned using the impostor sample from $M_2$ (Metric 1) or from $M_3$ (Metric 2) as a constraint. When the similarity for the positive pair is learned under the constraint of an impostor lying on a transform modal different from that of the positive pair, the learned similarity metric is not the optimal matching function, as evidenced by the poor retrieval results in Ranklist 1 and Ranklist 2 in Figure 1.

Furthermore, as shown in Figure 1, previous approaches [14, 20–22] use impostor samples only for the query sample and only from the Gallery view, while the Gallery sample is ignored entirely. To resolve these shortcomings of [14, 20–22], we propose an impostor resilient multimodal metric, referred to as IRM3, which largely eliminates impostors and attains an optimal matching for the positive pair. The objective of IRM3 is to maximize the matching of a positive pair against both the negative gallery samples (NGS) (samples which are not impostors and belong to different persons) and the impostors, by taking into account the modal on which a given pair, its negative gallery samples, and its impostors reside. Further, in contrast to [14, 20–22], it also considers impostor samples for both the query and its respective Gallery sample. These pairs of impostors, referred to as Cross views impostors (CVI), are obtained for the query and Gallery samples from their opposite views and help to further maximize the similarity between the given query and Gallery samples. The contributions of our impostor resilient multimodal metric IRM3 are as follows: (i) improving impostor resistance by jointly exploiting the transform modals [23] as well as impostor samples from both Probe and Gallery views; (ii) with our IRM3 approach, a significant gain in performance is obtained when Multikernel Local Fisher Discriminant Analysis (MK-LFDA) [44] is used as the base learner.

2. Methodology

Figure 2 shows the framework of our IRM3. First, color and texture features are extracted from each training sample; then, the different modals are discovered in the image space using sum-of-squares clustering, which is explained in Section 2.2. Then, for each modal the Cross views impostors (CVI) (explained in Section 2.3) and negative gallery samples (NGS) (explained in Section 2.4) are generated to train the modal metric $\mathbf{W}_c$ for each transform modal $M_c$. In our work, the modal metric is learned using MK-LFDA [44], and the learning procedure is explained in Section 2.6. Finally, Section 2.7 explains how matching is performed between a test query and the Gallery.

2.1. Feature Extraction

RGB, HSV, LAB, YCbCr, and SCNCD histograms are extracted with 32 bins per channel according to the settings in [45] and [12], respectively, and all five color features are concatenated together. Similarly, DenseSIFT, SILTP, and HOG are extracted according to the settings in [46], [11], and [47], respectively, and are concatenated together. The dimension of the concatenated color and texture features becomes large, and since Re-ID data is multiview we use CCA [48] to reduce it. However, to keep the local discriminative information of each type of feature, CCA is applied to the color and texture features individually. By cross validation on VIPeR and CUHK03 we obtain an optimal dimension of 900 for the color feature and 700 for the texture feature. Finally, the reduced color and texture features are concatenated to form a feature vector of size 1600.
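As an illustration, the following sketch (an assumption, not the authors' released code) shows how the per-feature-type CCA reduction and final concatenation could be implemented with scikit-learn; the function name and array shapes are hypothetical, and the component counts must not exceed the number of training samples or feature dimensions.

import numpy as np
from sklearn.cross_decomposition import CCA

def reduce_and_concat(color_probe, color_gallery, tex_probe, tex_gallery):
    # color_*: (n_persons, d_color) histograms; tex_*: (n_persons, d_texture).
    # CCA is fitted across the two camera views so that the reduced color
    # (texture) subspaces are maximally correlated between Probe and Gallery.
    cca_color = CCA(n_components=900, max_iter=500)
    cca_texture = CCA(n_components=700, max_iter=500)
    cp, cg = cca_color.fit_transform(color_probe, color_gallery)
    tp, tg = cca_texture.fit_transform(tex_probe, tex_gallery)
    # Final 1600-dimensional descriptor per view: reduced color + reduced texture.
    return np.hstack([cp, tp]), np.hstack([cg, tg])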

2.2. Partition Image Space

Let $\mathcal{X}$ be the image space of a camera view; then
\[
\mathcal{X} = \{x_i\}_{i=1}^{N}, \qquad (1)
\]
where $x_i$ is the feature representation of person $i$ and $N$ is the number of persons in $\mathcal{X}$. Since the images in $\mathcal{X}$ lie on different transform modals, there exist distinct clusters of different modals in $\mathcal{X}$. Each of these modal clusters has its own unique transformation and visual patterns; thus, all the persons belonging to a modal can be obtained using sum-of-squares clustering with
\[
S_W = \sum_{c=1}^{C}\sum_{i=1}^{N} z_{ic}\,(x_i - \mu_c)(x_i - \mu_c)^{\top}, \qquad (2)
\]
where $C$ is the number of modals in $\mathcal{X}$, $S_W$ is the within-transform-modal scatter matrix, $z_{ic}$ is the association of $x_i$ with transform modal $M_c$, and $\mu_c$ is the center of the $c$th transform modal.

In (2), each modal center $\mu_c$ is critical for discovering distinct, stable, and nonempty modals in $\mathcal{X}$. Thus, when choosing any sample as the center of a given modal $M_c$, it is necessary to make sure it is the right choice. To ensure a chosen modal center is right, it has to fulfill two conditions: (i) if the chosen sample is the center of modal cluster $M_c$, then all the persons in modal $M_c$ are its neighbors, and it has the highest number of nearest neighbors; (ii) the center and all its nearest neighbors lie on the same modal, and therefore these neighbors share similar patterns with the center in both Probe and Gallery views.

We now compute the number of nearest neighbors for each person in the training set under the above two conditions. For this purpose, we use both the Probe and Gallery samples of each person to obtain four lists of neighbors, computed against both camera views. To acquire the most reliable neighbors we keep only the top@40 neighbors from each list (top@20 for VIPeR) and then perform an intersection over the four lists to obtain the cardinality value, as well as the IDs of the neighbors that are common to both the Probe and Gallery views of the given person. (The reason for choosing top@40 neighbors is to maintain maximum reliability with minimum time and memory cost on large datasets. For instance, with $C = 16$ modals in CUHK03 each modal contains at least 78 training persons. For a sample $x_i$ to be the center of a modal it must have at least 51% of its neighbors in that modal; we therefore take the top@40 neighbors, which is about 52% of the training persons in a modal, to decide whether $x_i$ is a center or not.) This cardinality value and the IDs of the obtained neighbors are then stored in a matrix. The procedure is repeated for all remaining persons in the training set, and their cardinality values and neighbor IDs are stored in the same matrix.
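A minimal sketch of this neighbor-intersection scoring is given below; it is an illustration of our reading of the procedure rather than the authors' code, and the function name, the Euclidean neighbor ranking, and the array layout are assumptions.

import numpy as np

def candidate_center_scores(probe, gallery, k=40):
    # probe, gallery: (n, d) features of the same n training persons.
    n = probe.shape[0]
    views = [probe, gallery]
    scores = np.zeros(n, dtype=int)
    neighbor_sets = []
    for i in range(n):
        lists = []
        # Four neighbor lists: each of the two samples of person i is ranked
        # against both camera views.
        for query in (probe[i], gallery[i]):
            for ref in views:
                order = np.argsort(np.linalg.norm(ref - query, axis=1))
                lists.append(set(order[order != i][:k].tolist()))
        common = set.intersection(*lists)   # neighbors shared by all four lists
        scores[i] = len(common)             # cardinality value for person i
        neighbor_sets.append(common)
    return scores, neighbor_sets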

Using this matrix we now obtain the initial centers for the modal transforms. These centers are chosen as the top $C$ persons with the highest numbers of neighbors. However, two or more persons can have the same cardinality value and share the same nearest-neighbor IDs. In that case simply choosing the top $C$ persons is not the best solution; instead, we choose only those top persons that have no person IDs in common in their neighbor lists. In addition, when more than two persons have the same cardinality and share the same neighbor IDs, we randomly choose one of them to represent that modal center. Finally, given the modal centers, the optimal partitioning of the image space is obtained by minimizing the trace of the within-transform-modal scatter matrix as
\[
\min_{z_{ic}} \operatorname{tr}(S_W). \qquad (3)
\]
Although the image space is now partitioned into modals, to ensure the obtained modals are distinct and stable (in our work a stable modal is one that contains at least 15% of the training persons) we update the modal centers and repartition the space for $T$ further iterations. The modal centers are updated as
\[
\mu_c = \frac{1}{N_c}\sum_{i=1}^{N} z_{ic}\,x_i, \qquad (4)
\]
where $N_c$ is the number of persons in modal $M_c$ and is given as
\[
N_c = \sum_{i=1}^{N} z_{ic}. \qquad (5)
\]
Computing the initial modal centers is the computationally tedious part of our work; nevertheless, the overall procedure has only a moderate computational burden. For a training set of $N$ persons the partitioning complexity is about $\mathcal{O}(TCN)$, where $T$ is the number of iterations and $C$ is the number of modals.
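The sketch below illustrates the repartitioning of equations (3)-(5) as a Lloyd-style iteration; it is an interpretation under stated assumptions (Euclidean assignments, in-place center updates), not the authors' implementation.

import numpy as np

def partition_image_space(X, centers, n_iters=10):
    # X: (n, d) training features; centers: (C, d) initial modal centers.
    centers = centers.copy()
    for _ in range(n_iters):
        # Assign every person to its nearest modal center (association z_ic).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update each modal center as the mean of its members (equation (4));
        # empty modals keep their previous center.
        for c in range(centers.shape[0]):
            members = X[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    # Final assignment with the updated centers.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), centers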

2.3. Cross Views Impostors (CVI)

After obtaining the distinct modals in the image space $\mathcal{X}$, we can obtain the set of CVI for each positive pair $(x_i^p, x_i^g)$ lying in modal $M_c$ from both its Probe and Gallery views. We believe that in a real-world (open-set) situation, where a positive pair always has limited or few samples, these CVI can deliver subtle, differentiating information to metric learning that distinguishes a given pair more efficiently against a large number of diverse real-world impostors, as well as against negative gallery samples. These impostors are obtained by comparing the similarity value of a given person pair against the other persons in the Gallery and Probe views. First, the similarity values of a Probe sample $x_i^p$ with the whole Gallery view are computed using a global metric $\mathbf{M}_0$ and the CCA-reduced features as
\[
s_{ij}^{p} = -\,(x_i^p - x_j^g)^{\top}\mathbf{M}_0\,(x_i^p - x_j^g), \qquad (6)
\]
where $x_i^p$ and $x_j^g$ are the CCA-reduced features of persons $i$ and $j$, and $\mathbf{M}_0$ is a metric learned globally on these features with K-LFDA [45]; we use a linear kernel to save memory and computational time. Similarly, the similarity values of a Gallery person $x_i^g$ with the whole Probe view are obtained as
\[
s_{ij}^{g} = -\,(x_i^g - x_j^p)^{\top}\mathbf{M}_0\,(x_i^g - x_j^p). \qquad (7)
\]
The obtained values $s_{ij}^{p}$ and $s_{ij}^{g}$ for person $i$ in modal $M_c$ are then stored in two sets as
\[
S_i^{p} = \{s_{ij}^{p}\}_{j=1}^{N_c}, \qquad S_i^{g} = \{s_{ij}^{g}\}_{j=1}^{N_c}, \qquad (8)
\]
where $N_c$ refers to the number of persons in modal $M_c$. Now, each similarity value in these sets is compared with the reference similarity value $s_{ii}$ of the given pair to obtain its CVI set as
\[
\mathcal{I}_i = \{\, j \neq i : s_{ij}^{p} > s_{ii} \ \text{or}\ s_{ij}^{g} > s_{ii} \,\}, \qquad (9)
\]
where $j$ is the index of an impostor person, and $s_{ii}$ is computed as
\[
s_{ii} = -\,(x_i^p - x_i^g)^{\top}\mathbf{M}_0\,(x_i^p - x_i^g). \qquad (10)
\]
Using (6)-(10), the CVI sets for all the persons in modal $M_c$ are computed. The computational cost of generating cross views impostors for a modal is about $\mathcal{O}(N_c^2)$, where $N_c \le N$.
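The following sketch illustrates how the CVI set of one positive pair could be collected under the comparisons of (6)-(10); the function and variable names are hypothetical, and M0 stands for any symmetric positive semidefinite matrix playing the role of the globally learned metric.

def cross_view_impostors(i, probe, gallery, M0):
    # Return indices of persons that look more similar to person i than
    # person i's own cross-view match does, checked in both directions.
    def sim(a, b):
        diff = a - b
        return -diff @ M0 @ diff              # higher value = more similar
    ref = sim(probe[i], gallery[i])           # reference similarity s_ii
    cvi = set()
    for j in range(len(probe)):
        if j == i:
            continue
        if sim(probe[i], gallery[j]) > ref:   # Gallery-view impostor for the query
            cvi.add(j)
        if sim(gallery[i], probe[j]) > ref:   # Probe-view impostor for the Gallery sample
            cvi.add(j)
    return cvi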

2.4. Negative Gallery Samples (NGS)

We also use negative gallery samples (NGS) to learn the metric $\mathbf{W}_c$. The set of NGS for a person pair $(x_i^p, x_i^g)$, denoted $\mathcal{N}_i$, is obtained from the Gallery view only as
\[
\mathcal{N}_i = \{\, x_m^g : m \neq i,\ s_{im}^{p} \le s_{ii} \,\}, \qquad (11)
\]
where $m$ is the index of an NGS. The sets of NGS for all persons in modal $M_c$ are then obtained using (11).
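Continuing the sketch above, the NGS of a pair could be derived by excluding its impostors among the Gallery persons; this is an illustrative assumption rather than the paper's exact rule, and it reuses cross_view_impostors from the previous sketch.

def negative_gallery_samples(i, probe, gallery, M0):
    # Gallery persons other than i that are not impostors of the pair (i, i).
    impostors = cross_view_impostors(i, probe, gallery, M0)
    return [j for j in range(len(gallery)) if j != i and j not in impostors]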

2.5. Triplet Formation

Having the sets of CVI and NGS for all persons in modal $M_c$, we now generate triplet samples to learn the metric $\mathbf{W}_c$. Since the positive samples of each person are scarce compared to the number of negative samples, we follow the data-augmentation protocol of [49] and augment each person pair five times. Similarly, following the protocol in [39], we generate 20 triplets for each positive pair. The triplet samples for person $i$ built from an impostor $x_k$ and from a negative gallery sample $x_m^g$ are given as
\[
T_i^{\mathrm{cvi}} = (x_i^p,\, x_i^g,\, x_k), \qquad T_i^{\mathrm{ngs}} = (x_i^p,\, x_i^g,\, x_m^g), \qquad (12)
\]
where $x_k$ and $x_m^g$ are taken from the respective sets $\mathcal{I}_i$ and $\mathcal{N}_i$ of person $i$.
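The sketch below shows one way to assemble the 20 triplets of a pair with $t$ impostor-based negatives and the remainder drawn from NGS; names, the random sampling, and the simple replication used in place of the image-level augmentation of [49] are assumptions for illustration.

import random

def build_triplets(i, probe, gallery, cvi, ngs, n_triplets=20, t=15):
    # cvi: set of impostor indices; ngs: list of negative-gallery indices; t <= n_triplets.
    anchors = [(probe[i], gallery[i])] * 5                    # pair "augmented" five times
    hard = random.sample(sorted(cvi), min(t, len(cvi)))       # impostor negatives
    easy = random.sample(ngs, min(n_triplets - len(hard), len(ngs)))
    negatives = [gallery[j] for j in hard + easy]
    triplets = []
    for k, neg in enumerate(negatives[:n_triplets]):
        a, p = anchors[k % len(anchors)]
        triplets.append((a, p, neg))                          # (anchor, positive, negative)
    return triplets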

2.6. Impostor Resilient Multimodal Metric (IRM3)

Taking triplets from $T_i^{\mathrm{cvi}}$ and $T_i^{\mathrm{ngs}}$, the IRM3 metric $\mathbf{W}_c$ for modal $M_c$ is learned using MK-LFDA [44]; however, to save both computational time and memory we adapt [44] and use three RBF kernels plus one further kernel. The weights of these kernels are learned globally, once per dataset, using the method in [44]. The reason for learning the weights globally is to save time and computation, and doing so has only a minor effect on the kernel weights, because the global space comprises all the existing modals and thus all modals contribute to the global weights. For learning the kernel weights all the extracted features are used individually, and the dimension of each feature is first reduced to 450 by CCA. In this way a fixed set of weights for the three RBF kernels and the fourth kernel is obtained for VIPeR, and another fixed set for CUHK01 and CUHK03. The bandwidths of the three RBF kernels are set to the mean similarity value of modal $M_c$ and to that mean value plus and minus a fixed offset, so as to model all the different variations in modal $M_c$; the bandwidth of the fourth kernel is also set to the mean value of modal $M_c$. The mean value in our work is the similarity value between the Probe and Gallery samples of the center $\mu_c$. Finally, the metric is learned as
\[
\mathbf{W}_c = \arg\max_{\mathbf{W}} \operatorname{tr}\!\left(\left(\mathbf{W}^{\top}\tilde{S}^{w}_{c}\mathbf{W}\right)^{-1}\mathbf{W}^{\top}\tilde{S}^{b}_{c}\mathbf{W}\right), \qquad (13)
\]
where the matrices $\tilde{S}^{b}_{c}$ and $\tilde{S}^{w}_{c}$ (the multikernel between-class and within-class scatter matrices) are obtained with the method in [44]. Equation (13) is then solved as the generalized eigenvalue problem in (14) to obtain the first $d = 300$ eigenvectors corresponding to the eigenvalues with the largest magnitudes:
\[
\tilde{S}^{b}_{c}\,\mathbf{w} = \lambda\, \tilde{S}^{w}_{c}\,\mathbf{w}. \qquad (14)
\]
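Once the scatter matrices of (13) are available, the projection of (14) can be obtained with a standard generalized eigensolver, as the minimal sketch below shows; the construction of $\tilde{S}^{b}_{c}$ and $\tilde{S}^{w}_{c}$ follows [44] and is not shown, and the function name and regularization constant are assumptions.

import numpy as np
from scipy.linalg import eigh

def solve_modal_metric(Sb, Sw, n_dims=300, reg=1e-6):
    # Sb, Sw: symmetric between-class / within-class scatter matrices.
    Sw_reg = Sw + reg * np.eye(Sw.shape[0])      # regularize for invertibility
    eigvals, eigvecs = eigh(Sb, Sw_reg)          # generalized problem Sb w = lambda Sw w
    order = np.argsort(eigvals)[::-1][:n_dims]   # keep the largest eigenvalues
    return eigvecs[:, order]                     # columns form the modal metric W_c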

2.7. Reidentification

As shown in Figure 2, reidentification for a test pair is performed by first determining the transform modal the test pair belongs to using a K-NN classifier. The parameter K is set to the number of modals in the image space; for example, in VIPeR K is set to the number of modals, $C = 7$. The features of the test pair are then projected into the weighted multikernel space of that modal, followed by the respective modal metric $\mathbf{W}_c$, to perform matching as
\[
d(x_q, x_g) = \left\| \mathbf{W}_c^{\top}\phi_c(x_q) - \mathbf{W}_c^{\top}\phi_c(x_g) \right\|_2^2, \qquad (15)
\]
where $\phi_c(\cdot)$ denotes the weighted multikernel mapping of modal $M_c$.
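A test-time sketch of this step follows; kernel_map stands for the weighted multikernel mapping of [44], W is a per-modal collection of projections, and all names are hypothetical.

import numpy as np
from collections import Counter

def reidentify(x_query, x_gallery, train_X, train_modal, W, kernel_map, K):
    # Modal assignment by majority vote among the K nearest training samples.
    d = np.linalg.norm(train_X - x_query, axis=1)
    votes = train_modal[np.argsort(d)[:K]]
    c = Counter(votes).most_common(1)[0][0]
    # Distance in the projected multikernel space of modal c (equation (15)).
    zq = W[c].T @ kernel_map(c, x_query)
    zg = W[c].T @ kernel_map(c, x_gallery)
    return np.sum((zq - zg) ** 2)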

3. Experiments

Our IRM3 metric is evaluated on three benchmark datasets: VIPeR, CUHK01, and CUHK03. We follow the evaluation protocol of [33] for the test/train splits of VIPeR, CUHK01, and CUHK03. In our work CUHK01 is tested under a single test-identity setting only, while CUHK03 is tested under both the Labelled and Detected settings. All experiments are conducted in single-shot mode, and all reported Cumulative Matching Characteristic (CMC) curves are obtained by averaging the results over 20 trials.
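For reference, the sketch below shows how a single-shot CMC curve averaged over repeated trials is typically computed; dist_fn and trial_splits are hypothetical inputs, not part of the authors' code.

import numpy as np

def mean_cmc(trial_splits, dist_fn, max_rank=20):
    # trial_splits yields (queries, gallery, gt_index) per trial;
    # dist_fn(q, gallery) returns distances of query q to every gallery sample.
    curves = []
    for queries, gallery, gt_index in trial_splits:
        hits = np.zeros(max_rank)
        for q, gt in zip(queries, gt_index):
            order = np.argsort(dist_fn(q, gallery))
            rank = int(np.where(order == gt)[0][0])
            if rank < max_rank:
                hits[rank:] += 1                 # cumulative hit from this rank onward
        curves.append(hits / len(queries))
    return np.mean(curves, axis=0)               # CMC averaged over all trials (e.g., 20)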

3.1. Experiment Protocols

To thoroughly analyze the performance of IRM3 we devise three evaluation strategies. These strategies evaluate IRM3 with different numbers of discovered modals in $\mathcal{X}$, with Gallery view impostors (GVI) (impostors from the Gallery view only, obtained in the same way as in previous conventional metrics [14, 20–22]), and with Cross views impostors (CVI):
(i) IRM3 only: the basic multimodal metric, learned with only Negative Gallery Samples (NGS).
(ii) IRM3 + GVI ($t$): IRM3 is learned with impostors from the Gallery view (GVI) as well as with NGS. Here $t$ refers to the number of impostors taken from the Gallery view to form triplet samples and takes the values 5, 10, and 15, while the remaining triplets are formed using NGS.
(iii) IRM3 + CVI ($t$): IRM3 is learned with CVI as well as with NGS. Here $t$ refers to the number of CVI samples used to form triplets and takes the values 5, 10, and 15, while the remaining triplets are formed using NGS.

All the samples in NGS, GVI, and CVI are the most difficult instances for a person and are randomly sampled off-line, before training the metric. In all three strategies above, the image space is partitioned into $C$ = 3, 5, and 7 modals for VIPeR, into $C$ = 5, 7, and 10 partitions for CUHK01, and into $C$ = 12, 14, and 16 partitions for CUHK03, respectively.

3.2. Results on VIPeR

Comparison with State-of-the-Art Features. The results of the IRM3 metric are compared with three state-of-the-art features, LOMO [11], GoG [25], and the feature of [24], in Table 1. All the results in Table 1 are obtained for $C$ = 7 modals, and our IRM3 + CVI ($t$ = 15) attains a rank@1 accuracy of 52.81%, outperforming all three reidentification features. This provides evidence that if a metric can address multimodal transform variations well and also has strong resistance against impostors, then matching accuracy can be improved. Our learned IRM3 + CVI ($t$ = 15) considers optimizing all rank orders simultaneously and thus also shows large improvements at rank@5 and rank@10.

Comparison with Metric Learning. We also compare IRM3 with 7 metrics. From Table 1, IRM3 + CVI ($t$ = 15) outperforms both the multimodal metric LAFT [23] and the impostor-resistance metric LISTEN [21]. The prime difference between IRM3 and [21, 23] is its capability of addressing the person modal transforms while further maximizing the matching under the joint constraint of cross views impostors; both factors are major causes of the difficulty in matching pedestrians. In Table 1 only SS-SVM [16] attempts to model the transform modal of each individual person; however, it pays no attention to acquiring resistance against impostors and thus has 19.21% lower rank@1 accuracy than IRM3 + CVI ($t$ = 15). Though IRM3 attains strong results, it still has 1.36% lower rank@1 accuracy than SCSP [38]. VIPeR has severe pose, misalignment, and body-part displacement issues which are not specifically addressed in our work; handling them would be necessary to improve the matching results further.

Comparison with Deep Methods. Although deep features (DF) and deep matching networks (DMN) are not directly comparable with conventional metric learning methods, the results in Table 1 clearly show that if two major issues of reidentification (i.e., multimodal transforms and strong rejection capability against impostors) are handled well simultaneously, then comparable or even higher performance than deep methods can be attained. Our IRM3 + CVI ($t$ = 15) has 7.1% and 4.94% higher rank@1 accuracy than Quadruplet-Net [33] and JLML [34], respectively. These results demonstrate that for a smaller dataset like VIPeR, deep matching networks have insufficient training samples to learn a discriminative network.

Finally, Figure 3 compares the retrieval results of two queries from the VIPeR dataset for XQDA [11] and our IRM3 + CVI ($t$ = 15) when $C$ = 7 modals are used. For Query 1, XQDA finds the correct match at rank 4, enclosed in a green rectangle (b), while IRM3 finds the match at rank 2, enclosed in a green rectangle (e). Similarly, for Query 2 our IRM3 finds the match at rank 1, enclosed in a green rectangle (j); in contrast, XQDA finds the correct match at rank 3, enclosed in a green rectangle (h). Thus, our IRM3 approach improves matching, and consequently the rank of the correct match gets higher.

3.3. Results on CUHK01

Comparison with State-of-the-Art Features. Table 2 summarizes the results of IRM3 for $C$ = 10 modals and compares them with LOMO [11], GoG [25], and the feature of [24]. Though the three features are discriminative, our IRM3 approach is better at solving the two big challenges of Re-ID, that is, multimodal pedestrian matching and impostor resistance. Since CUHK01 has a larger training set than VIPeR, the modal transforms can be learned well, and therefore IRM3 + CVI ($t$ = 15) attains larger discrimination than [24]. Our IRM3 + CVI ($t$ = 15) has 15.15% higher rank@1 accuracy than [24] owing to its inherent ability to handle different modals and person-specific variations while rejecting a large number of impostors, all simultaneously.

Comparison with Metric Learning. In Table 2, three recently proposed metrics, CVAML [40], WARCA [36], and L-1 Graph [37], are compared with our IRM3 approach. All three metrics assume a unimodal intercamera transform rather than a multimodal image space. Though WARCA [36] employs hard negative samples as a learning constraint, ignoring the other negative samples from the Gallery view and not taking a person's modal into consideration during learning prevent it from attaining higher accuracy. In contrast, IRM3 + CVI ($t$ = 15) is able to deal with all these challenges and thus attains 76.14% rank@1 accuracy.

Comparison with Deep Methods. In Table 2, several deep matching networks (DMN) perform much better than conventional metrics on CUHK01; only K-LFDA trained with the feature of [24] attains performance comparable to the DMNs. However, designed to resolve the real-world challenges of reidentification (i.e., a multimodal image space and diverse impostors), our IRM3 + CVI ($t$ = 15) attains much better results than MCP-CNN [39], E2E-CAN [31], Quadruplet-Net [33], and JLML [34], and has 1.49% higher rank@1 accuracy than DLPA [32]. DLPA extracts deep features by semantically aligning body parts as well as rectifying pose variations. We believe that if semantic body-part alignment and rectification of pose variations were included in our IRM3, the results could be improved further.

3.4. Results on CUHK03

Comparison with State-of-the-Art Features. Table 3 compares the LOMO [11] and GoG [25] features with our IRM3 metric in both the Labelled and Detected settings. All the results in Table 3 are obtained for $C$ = 16 modals, and the obtained results are much higher than those of the two features. The primary reason for the gain of IRM3 over the features [11, 25] lies in the difference between the approaches: in [11, 25] a universal feature representation is proposed for all the different persons, which may not be optimal for all persons residing on different modals at the same time; in contrast, our approach discovers distinct modals in the image space and then addresses each modal specifically, empowered with the rejection of a large number of impostors. As a result, our IRM3 + CVI ($t$ = 15) (in the Labelled setting) reaches a rank@1 accuracy of about 86.17%.

Comparison with Metric Learning. In Table 3, the recently proposed WARCA [36] and SSM [43] are compared with our IRM3 approach. WARCA [36] differs from IRM3 in that it only addresses hard negative samples, while SSM [43] has no mechanism to account for different modal transforms and no resistance against impostors. Our IRM3 + CVI ($t$ = 15) (in the Labelled setting) surpasses [36] and [43] with gains of 9.04% and 11.1% in rank@1 accuracy, respectively.

Comparison with Deep Methods. Interestingly, in Table 3 all the deep methods attain very high performance on CUHK03 in both the Labelled and Detected settings. These results reflect the fact that CUHK03 is the largest of the three datasets and thus supports learning a more discriminative DMN. Even though both JLML [34] and DLPA [32] learn deep body features with global and local body-part alignment as well as pose alignment, our IRM3 approach, benefiting from transform-specific metrics empowered with impostor rejection, still manages to attain better results. Our IRM3 considers optimizing all rank orders simultaneously and thus has large gains at rank@5 and rank@10 in the Labelled setting.

3.5. Analysis

In Table 4, we analyze the effect of the number of modals on VIPeR at test time. Initially, we partition the image space into $C$ = 5 modals and test without using any impostor samples ($C$ = 5, $t$ = 0), obtaining a rank@1 accuracy of about 45.27%. As more modals are discovered in the image space, such as $C$ = 7, the results improve even without any impostor samples ($C$ = 7, $t$ = 0), and rank@1 becomes 45.92%. The main reason for this increment is that more test samples are now matched correctly using their actual modal transforms, which were lost when fewer modals were discovered in $\mathcal{X}$.

In addition, a further positive increment is observed when impostors from the Gallery view are added while learning the metric. Both ($C$ = 5, GVI ($t$ = 15)) and ($C$ = 7, GVI ($t$ = 15)) attain higher differentiating capability than [14, 20–22], as they can now restrict impostors by taking into account the transform modals that a positive pair and its impostors undergo.

Interestingly, this impostor resistance can be enhanced further by using Cross views impostors (CVI). From Table 4 it is clear that even for the same number of modals, say $C$ = 7, using CVI further enhances the differentiating capability: ($C$ = 7, CVI ($t$ = 15)) surpasses ($C$ = 7, GVI ($t$ = 15)) and rank@1 becomes 52.81%. This increment provides strong evidence that CVI can maximize the similarity of a positive pair more than GVI by taking into account both the transform modal and the various changes a given query and Gallery sample undergo in different views.

Finally, Figure 4 compares the rank@1 performance when the modal centers are chosen randomly with that obtained when the centers are found using our method in Section 2.2. The rank@1 accuracy with random centers is poor, because such centers are obtained by simply choosing the top $C$ persons without taking into account their reliability, stability, and neighbor IDs.

3.6. Efficiency

We computed the run time of our IRM3 approach with MK-LFDA [44], XQDA [11], and K-LFDA [45] (with a linear kernel) on CUHK03, using 1260 training persons and 100 test identities. All algorithms are implemented in MATLAB and run on a server with 6 CPUs (Xeon(R) E5-2620), each CPU having 6 cores, and a total memory of 256 GB. In Table 5, the training time of MK-LFDA [44] is shorter than that of XQDA [11] but longer than that of K-LFDA [45]. However, at test time, when the kernel weights do not have to be learned, MK-LFDA [44] is faster than both XQDA and K-LFDA. These timing results support the applicability of our proposed method to real-time applications in public spaces.

4. Conclusion

This paper presents a metric learning approach that exploits both multimodal transforms and Cross views impostors to improve the metric's capability to differentiate among different persons and to reject a large number of diverse real-world impostors. In the real world, pedestrian images are mostly multimodal, and in public spaces several persons share similar clothing; our IRM3 is learned to tackle exactly these issues of reidentification and person tracking in public spaces. Extensive experiments on three challenging datasets (VIPeR, CUHK01, and CUHK03) demonstrate the effectiveness of our IRM3 metric, which outperforms many previous state-of-the-art metrics. In future work we intend to extend our approach to testing in real-world scenarios and to address further issues of real-time deployment.

Conflicts of Interest

The authors declare that they have no conflicts of interest.