Abstract

RGB-infrared (RGB-IR) person reidentification is a challenging problem in computer vision due to the large crossmodality discrepancy between RGB and IR images. Most traditional methods carry out only feature alignment, which ignores the uniqueness of the modality difference and makes it difficult to eliminate the huge gap between RGB and IR. In this paper, a novel AGF network is proposed for the RGB-IR re-ID task based on the idea of global and local alignment. The AGF network distinguishes pedestrians across modalities globally by combining pixel alignment and feature alignment and highlights more structural information of a person locally by weighting channels with SE-ResNet-50. It consists of three modules: an AlignGAN module, a crossmodality paired-image generation module, and a feature alignment module. First, at the pixel level, RGB images are converted into IR images through the pixel alignment strategy to directly reduce the crossmodality difference between RGB and IR images. Second, at the feature level, crossmodality paired images are generated by exchanging the modality-specific features of RGB and IR images so that global set-level and fine-grained instance-level alignment can be performed simultaneously. Finally, the SE-ResNet-50 network is used to replace the commonly used ResNet-50 network; by automatically learning the importance of different channel features, it strengthens the ability of the network to extract fine-grained structural information of persons across modalities. Extensive experiments on the SYSU-MM01 dataset demonstrate that the proposed method favorably outperforms state-of-the-art methods. In addition, we evaluate the proposed method on a stronger baseline, and the results show that an RGB-IR re-ID method achieves better performance when built on a stronger baseline.

1. Introduction

Person reidentification (re-ID) is the task of retrieving the same target person across multiple disjoint camera views. It is widely used in video surveillance, security, and intelligent city applications and is an important problem in video surveillance. Due to its importance, re-ID has attracted more and more attention in computer vision [1-6]. However, re-ID depends on good lighting conditions, which are not always available in the real world. For example, at night or in dark environments, visible-light cameras cannot capture effective appearance information. Fortunately, most surveillance cameras can automatically switch from visible (RGB) mode to near-infrared (IR) mode, which makes it possible to study the RGB-IR crossmodality matching problem in real scenes.

Although research on RGB-IR re-ID in the real world is very meaningful, it faces severe challenges. First, RGB and IR images differ greatly in channel composition: RGB images have three channels, while IR images have a single channel. Second, RGB and IR images are captured in different wavelength ranges, which means that it is difficult to identify the same person according to color information. In addition, different poses, illumination, and viewpoint changes may even cause intraclass distances to be larger than interclass distances, which is also a great challenge in RGB-IR re-ID.

The idea of global and local alignment is to obtain global information first and then highlight finer-grained information as a supplement through local alignment; the two complement each other and can achieve better results. Inspired by the global and local alignment methods in re-ID, we introduce this idea into the study of RGB-IR re-ID so as to better address its major challenges.

To reduce the large crossmodality difference between RGB and IR images, existing RGB-IR re-ID methods mainly rely on feature alignment. However, matching RGB and IR images directly in a shared feature space makes it difficult to eliminate the huge difference between the two modalities. To address this problem, this paper uses the Alignment Generative Adversarial Network (AlignGAN) [7], which combines pixel alignment and feature alignment, to generate the input images needed by our framework. AlignGAN consists of three components: a pixel generator, a feature generator, and a joint discriminator. It can reduce the crossmodality difference in the pixel space and the intramodality difference in the feature space. At the same time, because of the joint discriminator, identity consistency can also be maintained. Using the output images of AlignGAN as the input images of our framework, we can reduce both the crossmodality difference between RGB and IR images and the intramodality difference caused by different poses, lighting, viewpoint changes, and occlusion.

After performing pixel alignment in the pixel space, it is necessary to align features in the feature space. Most existing works focus only on global set-level alignment between the entire RGB and IR sets when learning feature alignment, which leads to misalignment of some instances. To solve this problem, we adopt the joint set-level and instance-level alignment re-ID (JSIA-ReID) method [8]. First, the modality-specific features and modality-invariant features of RGB and IR images are separated by two modality-specific encoders and a modality-invariant encoder, and the modality-invariant encoder maps images of different modalities into a shared feature space to perform set-level alignment. Then, new paired images are generated by exchanging modality-specific features, and instance-level alignment is performed directly by minimizing the distance between paired images. In this way, feature alignment between RGB and IR images is carried out effectively in the feature space.

In addition, ResNet-50 is usually adopted in RGB-IR crossmodality re-ID research. However, in a standard deep network, features are fused mainly in the spatial dimension; although the receptive field of the feature maps gradually grows as the layers deepen, such networks are still clearly insufficient at fusing nonlocal, channel-wise information. For these reasons, we use the SE-ResNet-50 network [9] as the backbone network to automatically learn the importance of different channel features by focusing on the relationship between channels and making effective use of contextual information. In essence, it performs an attention operation along the channel dimension. This attention mechanism enables the model to pay more attention to the channel features carrying the most information while suppressing unimportant channel features.

The major contributions of this work can be summarized as follows.
(1) We propose a network framework (AGF) based on pixel alignment and feature alignment and jointly model the two alignment strategies for the RGB-IR re-ID task. For pixel alignment, we use the AlignGAN method to generate the input images required by our framework, which reduces the crossmodality gap. For feature alignment, we adopt a method that combines set-level and instance-level alignment, generating paired images by separating and exchanging the modality-specific features of RGB and IR images; this addresses set-level and instance-level alignment at the same time.
(2) In terms of the network, we use the SE-ResNet-50 network to replace the commonly used ResNet-50 network. It not only obtains global features to better exploit contextual information but also automatically learns the importance of different channel features, which improves the performance of the network. To the best of our knowledge, this is the first time that SENet has been successfully applied to RGB-IR re-ID.
(3) Extensive experimental results on the SYSU-MM01 dataset demonstrate that the proposed model performs favorably against the state-of-the-art methods known to us, achieving improvements of 5.7% in rank 1 and 4.0% in mAP, respectively.

2.1. Person Reidentification

Person re-ID aims to match pedestrian images across disjoint visible cameras. It plays a great role in real-world video surveillance, public safety, and intelligent city applications, so it has attracted more and more attention recently. According to the underlying idea, existing approaches can be divided into representation learning-based, metric learning-based, and local feature-based re-ID methods. Representation learning-based methods [10] automatically extract robust person representation features from the original images using a convolutional neural network (CNN), so as to better verify whether persons in different images are the same. The goal of metric learning-based methods is to make the distance between positive sample pairs smaller than that between negative sample pairs; commonly used metric learning losses include the contrastive loss [11], triplet loss [12-14], quadruplet loss [15], and trihard loss [16]. However, both representation learning-based and metric learning-based methods perform feature extraction and distance metric learning for image retrieval globally; although they have improved re-ID performance, it is difficult for them to make further breakthroughs. To address this problem, local feature-based re-ID methods have gradually emerged. Their main idea is that global features can distinguish different persons globally, while local features can highlight more detailed information locally, and the combination of the two can achieve better results. Local feature-based methods mainly reidentify persons by introducing image segmentation [17], pose estimation [18-20], attribute description [21], and so on. Miao et al. [20] solved the occlusion problem in re-ID by introducing pose estimation under the assumption that both probe and gallery images may be occluded. Lin et al. [21] proposed an attribute-person recognition (APR) network, which effectively combines local and global features by adding attribute descriptions to re-ID and thus improves performance. With the rapid development of deep learning, re-ID has made more and more breakthroughs, and video sequence-based, GAN-based, and crossmodality re-ID methods have gradually appeared. Video sequence-based methods consider not only the content information of images but also the motion information between frames in a video, so as to improve re-ID accuracy. Wu et al. [22] proposed the exploit-the-unknown-gradually (EUG) method, which gradually selects unlabeled samples with the most reliable pseudolabels and adds them to the labeled data to update the CNN continuously, better solving the one-shot video-based person re-ID problem. Subsequently, Wu et al. [23] extended EUG by treating a large number of low-confidence samples as index data and introducing them into training, which further optimized the method and achieved better results. GAN-based re-ID methods can address the problems caused by camera changes [24], dataset differences [25], different pedestrian poses [26], and so on, and are therefore very powerful. Crossmodality re-ID methods study person re-ID across different modalities; at present, RGB-IR re-ID is the most widely studied setting.
This paper builds on RGB-IR re-ID research and is inspired by the combination of global and local feature alignment in local feature-based re-ID methods: person re-ID is carried out globally through joint pixel alignment and feature alignment, while person structure details are highlighted locally by applying a channel attention mechanism through SENet.

2.2. RGB-IR Person Reidentification

This paper focuses on crossmodality alignment, which has already been widely studied in the general computer vision field. For example, for retrieval between text and images, Zhao et al. [27] transformed the multiview problem into a single-view hashing problem through an end-to-end deep learning framework. For retrieval between vision and audio, Wu et al. [28] proposed a dual attention matching (DAM) module, which uses the global feature of one modality to query the local features of the other modality in a bidirectional way and attends to global and local feature alignment at the same time. However, these studies address retrieval in other areas of computer vision and cannot be directly applied to RGB-IR re-ID. RGB-IR person re-ID attempts to match RGB and IR images of pedestrians under disjoint cameras. In addition to the recognition difficulties caused by different poses, illumination, viewpoint changes, and occlusions in traditional re-ID, the crossmodality difference between RGB and IR images brings new challenges. In [29], Wu et al. collected a large RGB-IR crossmodality dataset named SYSU-MM01; they discussed three different network structures and proposed a deep zero-padding method to train a one-stream network towards automatically evolving domain-specific nodes. Besides one-stream methods, two-stream methods are also very effective. In [30], Ye et al. proposed a two-stage framework including feature learning and metric learning (TONE+HCML): the features of the two modalities are first extracted separately and unified through a shared layer, and metric learning is then used to further improve performance. In [31], Ye et al. proposed a dual-path end-to-end feature learning framework consisting of two parts: a dual-path network for feature extraction and a bidirectional dual-constrained top-ranking loss for feature learning. Compared with HCML, it has the advantage of direct end-to-end learning without additional metric learning. In [32], Ye et al. further improved the bidirectional dual-constrained top-ranking loss of [31] into a bidirectional center-constrained top-ranking loss; comparing anchors to centers instead of anchors to samples not only reduces the computational cost but also preserves the ability to handle both crossmodality and intramodality variations. In [33], Ye et al. proposed a novel modality-aware collaborative ensemble (MACE) learning method, which handles the modality discrepancy at both the feature level and the classifier level. At the feature level, MSTN learns better features by mining shareable information in middle-level convolution blocks, which is very important for fine-grained recognition tasks. At the classifier level, both modality-sharable and modality-specific classifiers are introduced to guide the feature learning, and ensemble learning and collaborative learning strategies are introduced to enable better collaborative ensemble learning among different classifiers. Recently, many studies have started from GANs, which provides a new idea for RGB-IR re-ID. In [34], Dai et al. introduced a crossmodality generative adversarial network (cmGAN), which reduces the crossmodality difference between RGB and IR images. Most of these methods mainly use feature alignment to bridge the gap between RGB and IR images.
Recently, new studies have taken both pixel alignment and feature alignment into account. In [35], Wang et al. used an image-level subnetwork to convert visible (infrared) images into infrared (visible) images to reduce the modality discrepancy and then used a feature-level subnetwork to embed features and reduce the appearance discrepancy. In [7], Wang et al. proposed an end-to-end alignment generative adversarial network (AlignGAN) based on pixel-level and feature-level constraints. The model is composed of a pixel generator, a feature generator, and a joint discriminator; by playing a min-max game among the three components, it can not only alleviate crossmodality and intramodality differences but also maintain identity consistency. Inspired by [7], we combine pixel alignment and feature alignment and use AlignGAN to generate new IR images from the RGB images in the SYSU-MM01 dataset. The newly generated IR images and the original SYSU-MM01 dataset together constitute a new dataset, which is the dataset used in our entire framework. Thus, starting from the dataset, we carry out pixel alignment and feature alignment to bridge the crossmodality gap. Recently, Ye et al. [36] proposed a Homogeneous Augmented Tri-Modal (HAT) learning method, which addresses trimodal feature learning from both multimodal classification and multiview retrieval perspectives. They show that learning from grayscale images generated from visible images effectively enforces the network to mine structure relations across multiple modalities. As far as we know, this work has achieved the best results to date; it provides a new idea for bridging the modality gap in VI-ReID and is worth learning from.

2.3. Disentangled Representation Learning

Disentangled representation learning aims to extract the essential factors from data to form a more meaningful representation. In single-modality re-ID, disentangled representation learning is generally applied to extract illumination-invariant features [37] or to separate foreground, background, and pose factors [38]. In RGB-IR crossmodality re-ID, however, the coexistence of crossmodality and intramodality discrepancies makes it particularly challenging to disentangle the common identity information from the remaining attributes of RGB and IR images. In [39], Choi et al. proposed a hierarchical crossmodality disentanglement (Hi-CMD) method, which automatically disentangles ID-discriminative factors and ID-excluded factors from RGB and IR images to reduce crossmodality and intramodality differences. In [40], considering that existing research embeds different modalities into a common feature space to reduce the crossmodality difference but ignores the specific features of each modality, Lu et al. proposed a crossmodality shared-specific feature transfer algorithm (cm-SSFT). First, images are input into a two-stream feature extractor to obtain shared and specific features; then, intramodality and intermodality affinities are modeled based on the shared-specific transfer network (SSTN) to exploit the potential of the modality-specific characteristics. In [8], Wang et al. proposed to decompose RGB and IR images into modality-specific features and modality-invariant features in order to address the misalignment of some instances between RGB and IR images. By separating and exchanging modality-specific features, paired images that share the same modality-invariant features but differ in modality-specific features are generated, so that instance-level alignment can be directly performed by minimizing the distance between each pair of paired images. In recent years, disentangled representation learning has been widely used in RGB-IR re-ID; the specific information of different modalities can be better mined through disentangled representations, which provides a powerful tool for reducing crossmodality and intramodality differences. The framework of this paper is based on [8], which generates crossmodality paired images and simultaneously performs global set-level and fine-grained instance-level feature alignment; combined with the preceding pixel alignment of the data, it reduces crossmodality and intramodality differences more effectively. Moreover, disentangled representation learning has recently become popular in audio-visual event analysis. Wu and Yang [41] addressed the modality uncertainty caused by audio-visual asynchrony by exchanging crossmodality signals between different video and audio clips and introduced contrastive learning to bring temporal differences into the aggregated features, achieving better temporal localization performance. The extension of disentangled representation learning to audio-visual events further illustrates the effectiveness and popularity of separating and exchanging modality-specific features.

2.4. Deep Architectures

Convolutional neural networks have achieved great success in visual recognition, and many improvements have been made on top of the original architectures. VGGNets [42] and Inception [43] show that increasing the depth of a network can significantly improve its learning ability. ResNet [44] proved that introducing residual blocks allows deeper and stronger networks to be learned with better results. ResNeXt [45] and Xception [46] used grouped convolution to increase cardinality. Deformable convolution [47, 48] was designed to enhance geometric modeling ability. SENet [9] learns the importance of different channels by assigning them different weights. Since we consider global context information and the relationships among channels, SE-ResNet-50 is adopted as our CNN backbone.

3. AGF Network

In this section, we introduce the details of the AGF network proposed for RGB-IR crossmodality re-ID. As shown in Figure 1, our proposed AGF consists of an AlignGAN module, a crossmodality paired-image generation module, and a feature alignment module. First, following [7], we generate IR images that can be confused with real ones; the purpose is to directly reduce the huge crossmodality gap between RGB and IR images at the pixel level. This step is the preliminary stage of our framework and can be simply understood as image preprocessing. Then, the IR images generated by AlignGAN and all the images in the SYSU-MM01 dataset are fed into the framework as input images. Through the generation module and the feature alignment module, paired images are finally generated from unpaired images by separating and exchanging features, so that global set-level and instance-level alignment are performed simultaneously [8]. In addition, we use the SE-ResNet-50 network [9] as the CNN backbone, attending to channels, learning the importance of different channels, and improving network performance.

3.1. AlignGAN Module

In this module, our aim is to generate the required images as the input images of the whole framework with the AlignGAN method of [7], reducing the crossmodality gap at the source of the data. As shown in Figure 2, AlignGAN consists of three parts: a pixel alignment module, a feature alignment module, and a joint discriminator module. The pixel alignment module reduces the crossmodality difference between RGB and IR images by converting real RGB images into fake IR images; the generated fake IR images keep the original RGB identity information unchanged while having IR style. The feature alignment module reduces the intramodality difference by encoding the real IR images and the fake IR images generated by the pixel alignment module into a shared space, thereby mitigating the intramodality difference caused by different poses, viewpoint changes, lighting, and so on. The joint discriminator module discriminates the authenticity of the inputs while maintaining identity consistency: it takes an image-feature pair as input and outputs either 0 or 1, where 0 means fake and 1 means real, and it outputs 1 only when a paired real IR image and IR feature are given as input.
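As a rough illustration of this image-feature pairing (a sketch only, not the exact AlignGAN architecture; the layer sizes and names below are assumptions), such a joint discriminator can be written in PyTorch as follows:

import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    # Sketch: takes a (single-channel IR image, identity feature) pair and predicts
    # whether the pair is a real IR image together with its matching IR feature.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.img_branch = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(64 + feat_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, ir_image, feature):
        z = self.img_branch(ir_image)                      # encode the image
        return self.head(torch.cat([z, feature], dim=1))   # close to 1 only for (real IR image, IR feature) pairs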

3.2. Crossmodality Paired-Image Generation Module

In this module, images are decomposed into modality-specific features and modality-invariant features, and paired images are generated by exchanging the modality-specific features of unpaired images. The two paired images share the same modality-invariant features, such as pose, but differ in modality-specific features, such as clothing color. The crossmodality paired-image generation module is composed of three encoders and two decoders.

The encoders are responsible for disentangling the features of RGB and IR images. Specifically, the modality-invariant encoder is responsible for learning the content information of RGB and IR images, while the two modality-specific encoders are responsible for learning the style information of RGB images and IR images, respectively. The modality-specific features of the RGB and IR images are given in equation (1), and the modality-invariant features are given in equation (2).
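For illustration, denote an RGB image by x_r and an IR image by x_i, the modality-invariant encoder by E^c, and the modality-specific encoders by E^s_r and E^s_i (this notation is assumed here and may differ from that of [8]). The disentangled features can then be written as

f_r^s = E_r^s(x_r), \quad f_i^s = E_i^s(x_i),
f_r^c = E^c(x_r), \quad f_i^c = E^c(x_i),

where f^s and f^c denote the modality-specific (style) and modality-invariant (content) features, respectively (cf. equations (1) and (2)).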

The decoders are responsible for generating paired images by exchanging modality-specific features. Specifically, a fake IR image is generated from the content features of a real RGB image and the style features of a real IR image; it contains the content information of the RGB image and the style information of the IR image and is therefore paired with the real RGB image. Similarly, a fake RGB image can be generated to be paired with the real IR image. The whole process is expressed by equation (3).
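Under the same assumed notation, with G_r and G_i denoting the RGB and IR decoders, the exchange can be written as

x_{r→i} = G_i(f_r^c, f_i^s), \quad x_{i→r} = G_r(f_i^c, f_r^s),

where x_{r→i} is the generated IR image paired with the real RGB image x_r, and x_{i→r} is the generated RGB image paired with the real IR image x_i (cf. equation (3)).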

In order to generate more realistic paired images, three measures are taken in [8], as follows:

First, a reconstruction loss is constructed so that the disentangled features can reconstruct their original images, as shown in equation (4), where the reconstruction error is measured with the L1 distance.
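A plausible form of this loss, under the assumed notation above, is

L_{rec} = \lVert G_r(f_r^c, f_r^s) - x_r \rVert_1 + \lVert G_i(f_i^c, f_i^s) - x_i \rVert_1,

i.e., each image should be recovered from its own content and style features (cf. equation (4)).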

Second, a cycle-consistency loss is introduced to guarantee that the generated images keep the original modality-invariant features and can be translated back to their original version. The cycle-consistency loss is shown in equation (5), where the cycle-reconstructed images are given in equation (6).
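Under the assumed notation, the cycle-reconstructed images and the corresponding loss take a form such as

\hat{x}_r = G_r(E^c(x_{r→i}), f_r^s), \quad \hat{x}_i = G_i(E^c(x_{i→r}), f_i^s),
L_{cyc} = \lVert \hat{x}_r - x_r \rVert_1 + \lVert \hat{x}_i - x_i \rVert_1,

so that translating an image to the other modality and back recovers the original (cf. equations (5) and (6)).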

Finally, because the reconstruction loss and cycle-consistency loss tend to blur the images, an adversarial loss is applied to make them more realistic. Specifically, two discriminators are used to distinguish real images from generated images in the RGB and IR modalities, while the encoders and decoders try to make the generated images indistinguishable from the real ones, so that the generated images become more realistic. The expression of the GAN loss is shown in equation (7).
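For illustration only, writing D_r and D_i for the modality discriminators, a standard adversarial objective of this kind is

L_{gan} = \mathbb{E}[\log D_r(x_r)] + \mathbb{E}[\log(1 - D_r(x_{i→r}))] + \mathbb{E}[\log D_i(x_i)] + \mathbb{E}[\log(1 - D_i(x_{r→i}))];

in practice, the least-squares variant (LSGAN [49]) is used to stabilize training, as noted in Section 4.2 (cf. equation (7)).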

3.3. Feature Alignment Module
3.3.1. Set-Level Feature Alignment

In the crossmodality paired-image generation module, the modality-invariant encoder is trained to explicitly remove modality-specific features. The set-level encoder shares its weights with the modality-invariant encoder; thus, it maps images of different modalities, with their modality-specific features removed, into the shared feature space and reduces the modality difference at the set level.

3.3.2. Instance-Level Feature Alignment

The instance-level encoder aligns the paired images pair by pair to directly solve the instance misalignment problem. Specifically, the instance-level encoder maps the set-level aligned features into a new feature space and then aligns each pair of crossmodality paired images by minimizing the Kullback-Leibler divergence between them. The instance-level feature alignment loss is shown in equation (8),

where the two probability vectors are the predicted probabilities of the two paired images over all identities, obtained by applying a classifier implemented with a fully connected layer to their features in the new feature space.
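One plausible form of this loss, using the assumed notation above and writing p(·) for the identity probabilities predicted from the instance-level features, is

L_{ins} = \mathrm{KL}\big(p(x_r) \,\|\, p(x_{r→i})\big) + \mathrm{KL}\big(p(x_i) \,\|\, p(x_{i→r})\big),

so that each real image and its generated crossmodality counterpart are forced to produce consistent identity predictions (cf. equation (8)).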

In addition, identity-discriminative feature learning, which includes a classification loss and a triplet loss, is employed to overcome the intramodality difference. In the classification loss, the classifier predicts the probability that an input feature vector belongs to its ground-truth identity; in the triplet loss, a positive pair consists of feature vectors belonging to the same person, a negative pair consists of feature vectors belonging to different persons, and a margin parameter controls the gap between them.
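For completeness, the standard forms of these two losses (given here only as a sketch) are

L_{cls} = -\mathbb{E}[\log p(y \mid f)],
L_{tri} = \mathbb{E}\big[\max\big(0, \lVert f_a - f_p \rVert_2 - \lVert f_a - f_n \rVert_2 + m\big)\big],

where f is a feature vector with ground-truth identity y, (f_a, f_p) is a positive pair from the same person, (f_a, f_n) is a negative pair from different persons, and m is the margin.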

Thus, the overall loss can be formulated as in equation (11).
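A plausible grouping of the terms introduced above, with the weights left as placeholders, is

L = L_{cls} + L_{tri} + \lambda_1 L_{rec} + \lambda_2 L_{cyc} + \lambda_3 L_{gan} + \lambda_4 L_{ins},

which corresponds in structure to equation (11).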

Three of the loss weights are set to fixed values, and the remaining one is decided by grid search.

3.4. Network Module

In a traditional convolutional neural network, each convolution kernel operates only on a local receptive field, so each unit of the convolution output cannot use contextual information outside that region. In fact, every pixel of an image may be related to the others, and a network with only local receptive fields ignores the relationships among global pixels, which makes the results unsatisfactory. SENet, however, can not only obtain global features and make effective use of contextual information but also enable information interaction among channels by adding processing between adjacent layers, so that the model automatically learns the importance of different channel features and further improves accuracy. A detailed description of SE-ResNet-50, a specific instance of the SENet architecture, is given in Table 1.

The main purpose of the SE module is to improve the sensitivity of the model to channel-wise features. The module is lightweight and can be plugged into existing network structures, improving performance with only a small increase in computation. It consists of two parts: a squeeze part and an excitation part. The squeeze part compresses each feature map along its spatial dimensions into a single value that represents the global distribution of that feature channel, so that even lower layers can obtain a global receptive field and thus global features. With this channel information, the correlation between channels must then be established: the excitation part predicts the importance of each channel and applies it to the corresponding channel, thereby weighting the different channels. Essentially, the SE module performs an attention or gating operation on the channel dimension. This attention mechanism allows the model to pay more attention to the channel features carrying the most information while suppressing unimportant channel features. A diagram illustrating the SE block structure is shown in Figure 3 [9].
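As a minimal sketch of this squeeze-and-excitation operation (following the structure in [9]; the reduction ratio r = 16 is the commonly used default and is an assumption here), the block can be written in PyTorch as follows:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Minimal squeeze-and-excitation block: global pooling (squeeze) followed by
    # a two-layer bottleneck that produces per-channel weights (excitation).
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                     # squeeze: (N, C) channel descriptors
        w = self.fc(s).view(x.size(0), -1, 1, 1)   # excitation: weights in (0, 1)
        return x * w                               # reweight the channels

In SE-ResNet-50, such a block is inserted into each residual block, so that the channel reweighting is applied before the shortcut addition.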

In the RGB-IR crossmodality scenario, the color information of pedestrian clothing is almost unavailable; thus, the model should pay more attention to the structure information of a person so as to better complete the RGB-IR re-ID task. The SE module can obtain the global features of RGB and IR images through the squeeze operation, and weighting channels through the excitation operation can enlarge the dependence on texture and shape features while restraining the model's dependence on color features. Therefore, adding the SE module to the network helps extract finer-grained contour features of pedestrians and reinforces the ability of the network to mine structure relations across the RGB and IR modalities, making it robust to color variations. For these reasons, the SE-ResNet-50 network [9], obtained by applying the SE module to ResNet-50, is used as the CNN backbone so as to capture both a global understanding of the images and the relationships among channels in RGB-IR re-ID.

4. Experiment

4.1. Dataset and Evaluation Protocol
4.1.1. Dataset

We evaluate our model on the standard benchmark SYSU-MM01.

SYSU-MM01 [29] is a popular RGB-IR re-ID dataset, which includes 491 identities from 4 RGB cameras and 2 IR cameras. The training set includes 19659 RGB images and 12792 IR images of 395 persons, and the test set includes 96 persons. According to [29], there are two test modes, namely, all-search mode and indoor-search mode. For the all-search mode, all images are used. For the indoor-search mode, only indoor images from 1st, 2nd, 3rd, and 6th cameras are used. For both modes, the single-shot and multishot settings are adopted, respectively, in which single-shot setting randomly selects one image of a person to form the gallery set, while multishot setting randomly selects ten images of a person to form the gallery set. In both modes, IR images are used as probe set and RGB images as gallery set.

In this paper, the dataset used for training consists of two parts: all the images in the SYSU-MM01 dataset, and IR images converted by the AlignGAN model from the RGB images captured by the four RGB cameras (1st, 2nd, 4th, and 5th) of SYSU-MM01. Our dataset is summarized in Table 2.

4.1.2. Evaluation Protocols

The cumulative matching characteristic (CMC) and mean average precision (mAP) are used as evaluation metrics. Following [29], the results on SYSU-MM01 are evaluated with the official code, which averages over 10 repeated random splits of the gallery and probe sets.
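As an illustration of how these two metrics are computed for a single query set (a simplified sketch; the official evaluation code of [29] additionally handles camera filtering and repeats the gallery sampling 10 times):

import numpy as np

def cmc_map(dist, probe_ids, gallery_ids, max_rank=20):
    # dist: (num_probe, num_gallery) distance matrix; smaller means more similar.
    num_q = dist.shape[0]
    cmc = np.zeros(max_rank)
    aps = []
    for q in range(num_q):
        order = np.argsort(dist[q])                               # gallery sorted by distance
        matches = (gallery_ids[order] == probe_ids[q]).astype(int)
        if matches.sum() == 0:
            continue                                              # no correct match for this probe
        first_hit = np.nonzero(matches)[0][0]
        if first_hit < max_rank:
            cmc[first_hit:] += 1                                  # hit counted at every rank >= first_hit
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())   # average precision for this probe
    return cmc / num_q, float(np.mean(aps))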

4.2. Implementation Details
4.2.1. Network Architecture

In the generation module, [8] constructs modality-specific encoders with two strided convolution layers followed by a global average pooling layer and a fully connected layer. For the decoders, four residual blocks with adaptive instance normalization (AdaIN) and two upsampling-with-convolution layers are used; the parameters of AdaIN are dynamically generated from the modality-specific features. For the GAN loss, LSGAN [49] is used to stabilize training. In the feature learning module, we use SE-ResNet-50 as our CNN backbone, taking the first two layers of SE-ResNet-50 as our set-level encoder and the remaining layers as our instance-level encoder.

4.2.2. Training Strategies

Our model is implemented in PyTorch. The input image size of the GAN is set to [128, 64], and that of the re-ID network is set to [256, 128]. Random horizontal flipping is applied for data augmentation. The whole training process runs for 649 epochs, and PK sampling is used for both the GAN and the re-ID network. The Adam optimizer is adopted for optimizing the GAN. In the crossmodality paired-image generation module, the learning rates of the generator and the discriminator are both set to 0.0001, and in the feature alignment module, the learning rates of set-level and instance-level alignment are set to 0.00045.
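A minimal sketch of this optimizer configuration (the module names are placeholders, and the Adam betas here are the PyTorch defaults rather than the values used in the original setting):

import torch
import torch.nn as nn

def build_optimizers(generator: nn.Module, discriminator: nn.Module,
                     set_encoder: nn.Module, instance_encoder: nn.Module):
    # Generation module: generator and discriminator both use lr = 1e-4.
    gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    dis_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    # Feature alignment module: set-level and instance-level encoders use lr = 4.5e-4.
    feat_opt = torch.optim.Adam(
        list(set_encoder.parameters()) + list(instance_encoder.parameters()), lr=4.5e-4)
    return gen_opt, dis_opt, feat_opt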

4.3. Comparison with State-of-the-Arts

To prove the effectiveness of our method, we compare it with the most related methods, including state-of-the-art RGB-IR crossmodality re-ID methods (zero padding [29], BCTR [30], BDTR [31], eBDTR [32], cmGAN [34], the method of [35], and MAC [50]), AlignGAN [7], which considers both pixel-level and feature-level differences, JSIA-ReID [8], which performs set-level and instance-level alignment simultaneously, and some feature learning methods (HOG [51], LOMO [52], and the one-stream and two-stream networks of [29]). The experimental results are shown in Table 3. Our method clearly outperforms most existing state-of-the-art methods. On the SYSU-MM01 dataset, compared with JSIA-ReID [8], we consistently perform better, with improvements of 5.7% in rank 1 and 4.0% in mAP; specifically, we achieve 43.8% rank 1 accuracy and 40.9% mAP. This demonstrates the effectiveness of our model for the RGB-IR re-ID task.

4.4. Model Analysis
4.4.1. CNN Analysis

For a fair comparison, most frameworks adopt ResNet-50 as the CNN backbone in RGB-IR re-ID research. However, because CNNs play an important role in the development of person reidentification, we believe that there is still considerable room to study the CNN backbone in crossmodality person reidentification. Therefore, we choose ResNeXt-50, IBN-ResNet-50, SE-ResNet-50, and ResNet-50 for comparative analysis. These networks are closely related to ResNet-50 yet each has its own characteristics, so they are worth exploring and analyzing. First, ResNeXt-50 introduces cardinality into the original ResNet, replacing single-path convolution with multipath parallel convolution, and improves accuracy without increasing parameter complexity through grouped convolutions. Second, IBN-ResNet-50 is obtained by applying IBN-Net to ResNet-50; IBN-Net inherits the advantages of IN and BN, can easily learn the visual appearance in shallow layers and the content information in deep layers, and is very suitable for crossdomain transfer learning. Finally, SE-ResNet-50 is generated by embedding the SE module into ResNet-50. Unlike ordinary convolutional neural networks, which pay attention only to spatial information and ignore channel information, SENet adds an attention mechanism to channels by modeling the correlation between feature channels and improves accuracy by strengthening important features. The comparison results are shown in Table 4.

The comparison results in Table 4 show that SE-ResNet-50 is clearly superior to the other three networks. Taking the all-search and single-shot mode as an example, SE-ResNet-50 achieves 39.6% rank 1 and 37.1% mAP on the SYSU-MM01 dataset, and 43.8% rank 1 and 40.9% mAP on our dataset. The results on both datasets show that SE-ResNet-50 has the best effect. This indicates that SE-ResNet-50 as the CNN backbone allows our framework to perform at its best, and that assigning different weights to different channels is very effective for RGB-IR re-ID.

4.4.2. Ablation Study

To further analyze the effectiveness of the proposed methods, we conducted an ablation experiment: starting from the baseline without any of the proposed modules, we added the components under study one by one. For example, the AlignGAN module is added alone to preprocess the data, and the SE-ResNet-50 network is added alone to attend to the relationships between channels. Finally, all modules are added to the basic network, and the effectiveness of each component is verified by the experimental results.

As shown in Table 5, when all modules are removed, that is, when the baseline is the framework of JSIA-ReID [8], the rank 1 score is 38.1%. After adding the SE-ResNet-50 module alone, the rank 1 score is 39.6%. Using the AlignGAN module alone to convert RGB images into IR images, which are added to the classic SYSU-MM01 dataset to form a new dataset, yields a rank 1 score of 41.2%. This proves that each of the two modules has a clear effect on RGB-IR crossmodality re-ID. Finally, using the AlignGAN module and the SE-ResNet-50 module together, our method obtains a rank 1 score of 43.8%, which shows that the two modules can be effectively combined and work well for RGB-IR crossmodality re-ID matching.

4.5. Baseline Analysis

To verify whether an RGB-IR re-ID method shows stronger performance on a stronger baseline, in this subsection we evaluate the method proposed in this paper when configured with the stronger AGW baseline [53]. The evaluation results are shown in Table 6.

The results in Table 6 show that the proposed method achieves a more significant improvement when configured with a stronger baseline. On the large-scale SYSU-MM01 dataset, compared with our method built on JSIA-ReID, the method built on AGW achieves 8.9% mAP improvement under the all-search query setting and 8.3% mAP improvement under the indoor-search query setting. Our method (with the AGW baseline) even achieves performance comparable to MACE [33], while being relatively simple, without the need to design complex networks or elaborate classifier ensemble learning. From the results in Table 6, we can draw two important conclusions: (1) the RGB-IR re-ID method proposed in this paper is very effective; converting RGB images into fake IR images with the AlignGAN method for pixel alignment effectively bridges the crossmodality gap between RGB and IR images, and adding the SE module to the network not only obtains the global features of the images but also weights the channels, which helps extract the structural features of pedestrians and plays an important role in RGB-IR re-ID; (2) an RGB-IR re-ID method shows stronger performance on a stronger baseline; the AGW baseline is very powerful, which provides important insights for further study of RGB-IR re-ID.

4.6. Visualization of Images

To better show the effects of the pixel alignment module and the crossmodality paired-image generation module, in this part we visualize the fake IR images generated by the pixel alignment module and the crossmodality paired images generated by the crossmodality paired-image generation module. From Figure 4, we can first see that the fake IR images generated by the pixel alignment module have IR style while keeping the content information (view, posture, etc.) of the corresponding real RGB images; therefore, the generated fake IR images can reduce the huge crossmodality gap between RGB and IR images. Second, we can see that given a person's crossmodality unpaired image, whether a real RGB or IR image or a fake IR image generated by the AlignGAN module, our method can stably generate crossmodality paired images.

Objectively speaking, however, the generated images are not clear enough. As shown in Figure 4, for example, the legs in the image generated for person A under the umbrella are very blurry, and the contour of the whole person is very blurry in the image generated for person B against the background of steps, which intuitively shows the influence of occlusion and complex backgrounds on RGB-IR re-ID. It also shows that in RGB-IR re-ID research, besides the huge crossmodality differences between modalities, the problems of occlusion, viewpoint change, and complex backgrounds faced by traditional re-ID remain urgent. We can learn from mature methods in the re-ID field, such as the attribute-person recognition (APR) network proposed by Lin et al. [21] and the pose estimation-based re-ID proposed by Miao et al. [20], and apply them to RGB-IR re-ID so as to address the problems of occlusion and complex backgrounds. In a word, we still have a long way to go in the research of RGB-IR re-ID.

5. Conclusions

In this paper, we propose AGF, a new method based on global and local alignment. First, RGB images are converted into IR images using the AlignGAN model, which reduces the crossmodality difference. Then, the newly generated IR images and all the images in SYSU-MM01 are fed into the framework, paired images are generated by separating and exchanging the modality-specific features of unpaired images, and set-level and instance-level alignment are performed at the same time. Finally, we are, to our knowledge, the first to apply the SE-ResNet-50 network to RGB-IR crossmodality re-ID, obtaining a clear improvement. The experimental results on the SYSU-MM01 dataset show the effectiveness of the proposed method. In addition, we verify that an RGB-IR re-ID method shows better performance on a stronger baseline, which demonstrates the importance of designing a strong baseline for RGB-IR re-ID.

Data Availability

The data used in this work are publicly available from reference [29], as cited in the paper.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 11801511).