Abstract

RGB-IR cross-modality person re-identification (ReID) can be seen as a multicamera retrieval problem that aims to match pedestrian images captured by visible and infrared cameras. Most existing methods focus on reducing modality differences through feature representation learning; however, they ignore the huge difference between the two modalities in pixel space. Unlike these methods, in this paper we use the pixel and feature alignment network (PFANet) to reduce modality differences in pixel space while aligning features in feature space. Our model contains three components: a feature extractor, a generator, and a joint discriminator. As in previous methods, the generator and the joint discriminator are used to generate high-quality cross-modality images; however, we make substantial improvements to the feature extraction module. Firstly, we fuse batch normalization and global attention (BNG), which attends to channel information while enabling information interaction between channels and spatial positions. Secondly, to alleviate the modality difference in feature space, we propose the modal mitigation module (MMM). Then, by jointly training the entire model, our model is able to not only mitigate the cross-modality and intramodality variations but also learn identity-consistent features. Finally, extensive experimental results show that our model outperforms other methods. On the SYSU-MM01 dataset, our model achieves a rank-1 accuracy of and an mAP of .

1. Introduction

Person ReID can be viewed as a cross-camera image retrieval problem, which aims at matching pedestrian images in a query set to those in a gallery set captured by different cameras. Its main challenge lies in the interclass and intraclass variations caused by different lighting, poses, occlusions, and views. Most existing methods [1–5] mainly focus on matching RGB images captured by visible cameras, which can be formulated as an image matching problem under a single modality. However, these methods cannot be applied to images taken in poor lighting conditions, because visible cameras cannot capture pictures with discriminative features in the dark. In practical application scenarios, however, cameras should operate around the clock.

Since visible cameras are of limited use for security work at night, cameras that can switch to an infrared mode are being widely deployed in intelligent monitoring systems. In visible mode and infrared mode, RGB images and infrared images are collected, respectively, and these belong to two different modalities. RGB images have three channels while IR images have only one, so ReID in a cross-modality setting becomes extremely challenging; it is essentially a cross-channel retrieval problem. First, identities that are easy to distinguish in visible images can be difficult to distinguish in infrared images. In addition, the appearance of the same person varies greatly across modalities, which is known as the modality discrepancy.

To address visible-infrared person ReID, several approaches [6–10] have been proposed, aiming to mitigate modality differences by aligning features or pixel distributions. Feature alignment methods [6, 8, 10] mainly focus on bridging the gap between RGB and IR images in feature space, but it is difficult to match RGB and IR images in a shared space because of the large cross-modality differences between them. Different from existing methods that directly match RGB and IR images, we use generative adversarial networks to generate fake IR images from real RGB images and then match the generated images through a feature alignment network. The generated fake IR images reduce the modality difference between the RGB and IR images. Although the generated fake IR images are very similar to real ones, intraclass differences remain due to pose variations, viewpoint changes, and occlusions.

Inspired by the above discussion, in this paper we propose a pixel and feature alignment network (PFANet) that simultaneously mitigates cross-modality differences in pixel space and intramodality variation in feature space. As shown in Figure 1, to reduce the modality difference, we apply a generator to produce fake IR images. Then, to alleviate the intramodality variation, a feature extraction module (F) is designed to encode fake and real IR images into a shared feature space by exploiting identity-based classification and triplet losses. The batch normalization and global (BNG) attention is added to the feature extraction network (F), which lets the network learn which channels are more important and enables interaction between channels and spatial positions. Furthermore, to mitigate the modality difference in feature space, a modal mitigation module (MMM) is proposed, which significantly reduces the difference between the two modalities. Finally, to learn identity-consistent recognition, a joint discriminator whose input is an image-feature pair is utilized.

The major contributions of this work can be summarized as follows:
(i) We propose a generative adversarial network to generate cross-modality images, which alleviates modality differences in pixel space. This model consists of a generator and a joint discriminator; by playing a min-max game, our model is able to not only reduce the cross-modality and intramodality variations but also learn identity-consistent features.
(ii) We design a batch normalization and global (BNG) attention, which consists of channel attention and global attention. In the channel attention, we measure the importance of each channel by applying the scale factor of BN to the channel dimension and suppressing insignificant features. The global attention module reduces information attenuation and amplifies the features of global dimension interaction.
(iii) We apply a modal mitigation module (MMM) to mitigate the modality distribution gap. Instance normalization (IN) is utilized to mitigate modality differences on a single instance, and channel attention is used to guide the learning of IN, which mitigates modality differences while preserving identity information.

2.1. RGB-IR Person ReID

RGB-IR cross-modality person ReID can be seen as a multicamera retrieval problem that aims to match pedestrian images captured by visible and infrared cameras, which are widely used in video surveillance, public security, and smart cities. Compared with RGB-RGB single-modality person ReID, which deals only with RGB images, the key challenge in this task is to mitigate the large differences between the two modalities. To address the challenge caused by differences in modality distributions, a variety of approaches to cross-modality person re-identification have been proposed. Some early work focused on the channel mismatch between RGB images and IR images, since RGB images have three channels whereas IR images have only one. Wu et al. [10] proposed a deep zero-padding network and contributed a new ReID dataset, SYSU-MM01. In [11], a dual-path network with a bi-directional dual-constrained top-ranking loss was introduced to learn modality-aligned feature representations for RGB-IR ReID. Feng et al. [12] proposed a framework for solving heterogeneous matching problems using modality-specific networks. Ye et al. [13] proposed a dual-stream network with feature learning and metric learning to convert two heterogeneous modalities into a consistent space where the modalities share a metric. Dai et al. [6] introduced a cross-modality generative adversarial network (cmGAN) to reduce the distribution differences between RGB and IR features. Most of the above approaches focus on reducing intermodality differences through feature alignment, while ignoring the large cross-modality differences in pixel space.

Unlike these approaches, the model proposed in this paper combines feature alignment and pixel alignment, effectively reducing intramodality and cross-modality variations. By jointly training the model, it learns identity-consistent features.

2.2. GAN in Person ReID

A generative adversarial network (GAN) consists of a generator and a discriminator and builds on the idea of game theory: the generator tries to generate an image that deceives the discriminator, while the discriminator tries to decide whether an image is real or generated. Through repeated adversarial training, generative adversarial networks are able to learn deep representations of data in a self-supervised manner. GANs can generate high-quality images, perform image enhancement, generate images from text, and convert images from one domain to another [14, 15]. The GAN was first proposed in 2014 [16]. Since then, researchers have proposed a variety of task-specific GAN structures, such as CycleGAN [14], Pix2Pix [17], and StarGAN [15]. Many works in person re-identification also apply GANs to improve accuracy. Li et al. [18] proposed a network that allows querying images of different resolutions to handle cross-resolution person ReID. Wang et al. [19] designed an end-to-end alignment generative adversarial network (AlignGAN) for the RGB-IR ReID task. JSIA-ReID [20] implemented joint pixel- and feature-level alignment in a unified GAN framework.

In our work, we apply a GAN to generate cross-modality images that mitigate modality differences between RGB and IR image data in pixel space.

2.3. Attention Mechanisms

There is an important feature in the human visual system that allows people to selectively focus on things of interest in order to capture valuable information. Inspired by the human visual system, many works have attempted to employ attention mechanisms to improve the performance of CNNs.

Attention mechanisms enable a network to focus on regions of interest, such as the human body, and better extract useful information. SENet [21] integrated spatial information into the channel-wise feature responses and computed the corresponding attention with two MLP layers. Later, the bottleneck attention module (BAM) [22] built independent spatial and channel submodules in parallel and embedded them into each bottleneck block. Considering the relationship between any two positions of the feature map, nonlocal attention [23] was proposed to capture such long-range dependencies. The convolutional block attention module (CBAM) [24] sequentially cascaded channel attention and spatial attention. However, these works ignore the information carried by the weights adjusted during training. We therefore highlight salient features using the variance of the trained model weights, which also amplifies cross-dimensional interactions and captures important features across all three dimensions. We propose a new attention module (BNG) to address this problem. A modal mitigation module (MMM) is further designed to mitigate the modality distribution gap, using channel attention to guide the learning of instance normalization (IN) so that modality differences are reduced while identity information is preserved.

3. The Proposed Method

In this section, we introduce the proposed PFANet in detail. Our network is presented in three parts: (1) the RGB-IR image generation module, (2) the BNG attention module, and (3) the modal mitigation module. To reduce cross-modality variation, we apply generative adversarial networks to convert RGB images into fake IR images, which have an IR style while maintaining their original identities.

Then, the features of the two modalities are extracted for feature alignment. The BNG attention is designed to make the network focus on channel and spatial information. In addition, the modal mitigation module (MMM) is proposed to mitigate the differences between the two modalities. The main output of PFANet during testing is the feature used for person ReID.

3.1. RGB-IR Images Generation Module

There is a large cross-modality difference between RGB and IR images, which significantly increases the difficulty of cross-modality pedestrian re-identification. To reduce cross-modality variation, we apply generative adversarial networks to convert RGB images into fake IR images, which have an IR style while maintaining their original identities. The generated fake IR images can mitigate the modality differences between RGB and IR images. The module consists of a generator that generates a fake IR image from an RGB image and a joint discriminator that determines whether an image is real or generated. The input of the generator is the real RGB images, and its output is the fake IR images. The discriminator outputs one if its input image is real and zero if it is generated. The goal of the generator is to make the generated image as similar as possible to a real image, and the goal of the discriminator is to distinguish as well as possible between real and generated inputs. Unlike ordinary discriminators, the input to our discriminator is a pair consisting of an IR image and its ReID feature map. The generator and discriminator play the min-max game as in [16], so the model can make the fake IR image as realistic as possible.
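To make the idea of an image-feature pair input concrete, the following PyTorch sketch shows one plausible joint discriminator. The layer sizes, channel counts, and PatchGAN-style output are assumptions for illustration, not the authors' exact architecture; only the pairing of an IR image with its ReID feature map follows the text.

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """Illustrative sketch of a joint discriminator that scores an
    (IR image, ReID feature map) pair. Layer sizes are assumptions,
    not the authors' exact architecture."""
    def __init__(self, feat_channels=2048):
        super().__init__()
        # Encode the IR image down toward the spatial size of the ReID feature map.
        self.img_enc = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Fuse image and feature evidence, then predict real/fake per patch.
        self.head = nn.Sequential(
            nn.Conv2d(256 + feat_channels, 512, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 3, padding=1),
        )

    def forward(self, ir_image, reid_feat):
        img_feat = self.img_enc(ir_image)
        # Resize the ReID feature map so it can be concatenated channel-wise.
        reid_feat = nn.functional.interpolate(reid_feat, size=img_feat.shape[-2:])
        return self.head(torch.cat([img_feat, reid_feat], dim=1))
```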

The adversarial loss for generating IR images is defined over such image-feature pairs.
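As a hedged sketch only (the notation is ours and may differ from the paper's equations (1)-(4)), an image-feature adversarial objective of this kind can be written as:

```latex
% Sketch, not the paper's exact equations: G is the RGB-to-IR generator,
% F the feature extractor, D the joint discriminator, x_v / x_i real RGB / IR images.
\begin{align}
\mathcal{L}_{adv}^{G} &= \mathbb{E}_{x_v}\Big[\log\big(1 - D\big(G(x_v),\,F(G(x_v))\big)\big)\Big],\\
\mathcal{L}_{adv}^{D} &= -\,\mathbb{E}_{x_i}\Big[\log D\big(x_i,\,F(x_i)\big)\Big]
  - \mathbb{E}_{x_v}\Big[\log\big(1 - D\big(G(x_v),\,F(G(x_v))\big)\big)\Big].
\end{align}
```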

Here, the two feature maps in the pair are the ReID features extracted from the real IR image and from the generated fake IR image, respectively. Equation (1) is used to train the generator; under this loss constraint, the generator produces more realistic IR images. Equations (3) and (4) are used to train the discriminator, which differs from traditional discriminators in that its input is an image-feature pair. This has two advantages. First, through the min-max game [16], the fake IR image becomes closer to the real IR image, and the feature distribution of the fake IR images becomes more similar to that of real IR features. Second, the discriminator helps maintain identity consistency through the corresponding image constraint. Although the adversarial loss can ensure that a fake IR image resembles a real IR image, there is no guarantee that the generated fake IR images retain the structure and content of the original RGB images. To deal with this problem, we introduce a second generator that maps IR images back to RGB images, together with a corresponding discriminator, and we introduce a cycle-consistency loss.

The cycle-consistency loss enables the generated image to remain consistent with the input real RGB image. We use the L1 norm instead of the L2 norm because the L1 norm allows the generator to produce sharper image edges. Specifically, we feed a real RGB image into the generator to produce a fake IR image and then use the second generator to reconstruct an RGB image from the fake IR image. We do the same in the opposite direction for IR images.
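For reference, a standard CycleGAN-style L1 cycle-consistency loss matching this description can be sketched as follows; the generator symbols are our own notation, not reproduced from the paper:

```latex
% L1 cycle-consistency over both directions (sketch in our own notation).
\begin{equation}
\mathcal{L}_{cyc} =
  \mathbb{E}_{x_v}\big\|\,G_{i\to v}\big(G_{v\to i}(x_v)\big) - x_v\,\big\|_1
+ \mathbb{E}_{x_i}\big\|\,G_{v\to i}\big(G_{i\to v}(x_i)\big) - x_i\,\big\|_1 .
\end{equation}
```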

Now, the overall loss of the generator can be defined as the adversarial loss plus a weighted cycle-consistency loss, where the weight of the cycle loss is set to 10 as in [14]. By using this loss during adversarial training, we can generate high-quality IR images.
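Under these definitions, the generator objective described here takes the usual weighted form (a sketch; only the weight value 10 is stated in the text):

```latex
% Generator objective: adversarial term plus weighted cycle-consistency term.
\begin{equation}
\mathcal{L}_{G} = \mathcal{L}_{adv}^{G} + \lambda_{cyc}\,\mathcal{L}_{cyc},
\qquad \lambda_{cyc} = 10 .
\end{equation}
```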

3.2. The BNG Attention Module

Our proposed BNG attention is an efficient and lightweight attention mechanism. BNG attention can be embedded at the end of any convolutional block; for the residual network ResNet-50, it is embedded at the end of each residual structure. The structure of BNG is shown in Figure 2.

BNG attention consists of two submodules. As shown in Figure 2(a), the channel attention submodule uses the weight information of the trained model to highlight salient features. We obtain its scale factor from batch normalization (BN [25]), where μ and σ are the mean and standard deviation of the mini-batch and γ and β are the trainable parameters used to fit the data distribution.
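For completeness, the standard BN transform [25] from which the scale factor γ is taken is:

```latex
% Batch normalization: \mu_B and \sigma_B^2 are the mini-batch mean and variance,
% \gamma and \beta the trainable scale and shift, \epsilon a small constant.
\begin{equation}
\mathrm{BN}(x) = \gamma\,\frac{x - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}} + \beta .
\end{equation}
```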

The channel attention weights each channel by its BN scale factor: the scale factor reflects the importance of each channel, and the attention weights are obtained by normalizing the scale factors over all channels. We measure the importance of each channel by applying the scale factor of BN to the channel dimension and suppressing insignificant features. Since channel attention only focuses on channel information, there is no global space-channel information interaction; to address this, we design a global attention module. It reduces information attenuation and amplifies the features of global dimension interaction. Inspired by CBAM [24], channel attention and spatial attention are applied in turn; the main structure is shown in Figure 2(b). Given the input feature map, the channel attention map and the spatial attention map are applied sequentially by element-wise multiplication to produce the intermediate state and the output.
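One consistent way to write the BN-scale channel attention and the sequential composition described above is sketched below; the notation (M_c, M_s, F_1-F_3) is ours and the sigmoid is an assumption:

```latex
% Channel weights from normalized BN scale factors, then channel and spatial
% attention applied in turn (\otimes is element-wise multiplication).
\begin{align}
M_c(F) &= \sigma\big(W_\gamma \cdot \mathrm{BN}(F)\big),
\qquad W_{\gamma,i} = \frac{\gamma_i}{\sum_{j}\gamma_j},\\
F_2 &= M_c(F_1)\otimes F_1,
\qquad F_3 = M_s(F_2)\otimes F_2 .
\end{align}
```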

The channel attention submodule uses a 3D arrangement to preserve information across the three dimensions and then uses a two-layer MLP that amplifies channel-spatial dependencies across dimensions. The channel attention submodule is illustrated in Figure 3.
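A minimal PyTorch sketch of such a permute-then-MLP channel submodule is given below; the reduction ratio of 4 and the final sigmoid are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """Permute to (B, H, W, C), apply a two-layer MLP over the channel
    dimension, then permute back; a sketch of the described submodule."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        attn = x.permute(0, 2, 3, 1)            # 3D arrangement: (B, H, W, C)
        attn = self.mlp(attn)                   # cross-dimension interaction
        attn = attn.permute(0, 3, 1, 2)         # back to (B, C, H, W)
        return x * torch.sigmoid(attn)
```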

In the spatial attention submodule, two convolutional layers are used to fuse the spatial information. The size of the convolution kernel is set to . Since max-pooling reduces information and has a negative influence, we remove the max-pooling operation to retain more features. The same reduction ratio as in the channel attention submodule is adopted, as in BAM. The spatial attention submodule without group convolution is shown in Figure 4.
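Similarly, below is a sketch of the spatial submodule with two convolutions and no max-pooling; since the kernel size and reduction ratio are not reproduced above, the values 7 and 4 used here are assumptions.

```python
import torch
import torch.nn as nn

class GlobalSpatialAttention(nn.Module):
    """Two convolutions that squeeze and restore the channel dimension to
    fuse spatial information; max-pooling is deliberately not used."""
    def __init__(self, channels, reduction=4, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size, padding=pad),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size, padding=pad),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        return x * torch.sigmoid(self.conv(x))
```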

3.3. Modal Mitigation Module (MMM)

To mitigate the modality distribution gap, a modal mitigation module (MMM) is designed. For an input image X, we denote the features extracted by the convolution block as F and feed them into the MMM, where H, W, and C represent the height, width, and number of channels of the feature map, respectively. Instance normalization (IN) is used to mitigate modality differences on a single instance [27]; it computes the mean and variance within a single instance and reduces the difference between the two data distributions. However, using IN directly may have a negative impact on the ReID task: because the feature distribution is changed significantly, some identity information may be lost.

To overcome these shortcomings, we use channel attention to guide the learning of IN, which mitigates modality differences while preserving identity information. Specifically, we feed the feature into a two-layer MLP that first downsamples and then upsamples the channel dimension, and we apply an activation function to produce a channel mask that supervises the IN operation. The mask represents the identity-related channels and is combined with the instance-normalized result of the input.
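One plausible way to combine the channel mask m with the instance-normalized feature, consistent with this description but not necessarily the paper's exact rule, is:

```latex
% m: channel mask (identity-related channels close to 1); IN(F): instance-
% normalized feature; \odot: channel-wise multiplication. The gating rule is
% an assumption.
\begin{equation}
\hat{F} = m \odot F + (1 - m)\odot \mathrm{IN}(F).
\end{equation}
```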

Similar to SENet [21], the channel-wise mask is generated by two bias-free fully connected (FC) layers with learnable parameters, followed by a ReLU activation and a sigmoid activation, respectively, applied to the globally average-pooled features. In order to balance performance and reduce the number of parameters, the downsampling ratio is set to .

The formula for instance normalization computes the mean and the standard deviation of each dimension within a single instance. To avoid dividing by zero, a small constant ε is added to the denominator, and the normalization is applied to the j-th dimension of the feature map.
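A self-contained PyTorch sketch of the MMM as described is shown below; the SE-style mask follows the text, while the reduction ratio of 16 and the gating rule are assumptions.

```python
import torch
import torch.nn as nn

class ModalMitigationModule(nn.Module):
    """Channel-attention-guided instance normalization (sketch).
    The mask decides, per channel, how much of the IN output to keep."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=False)
        self.fc = nn.Sequential(                      # SE-style, bias-free MLP
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, _, _ = x.shape
        m = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)   # channel mask via GAP
        # Assumed gating: identity-related channels keep the original feature,
        # the remaining channels are instance-normalized.
        return m * x + (1 - m) * self.inorm(x)
```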

3.4. Loss Function

In this section, we introduce the losses used when training the generator to produce a fake IR image. On the one hand, the fake IR image should be classified into the same identity class as the corresponding RGB image; on the other hand, it should satisfy the triplet loss [28] under the corresponding identity constraint. We define these two losses as an identity classification loss and a triplet loss, where the classification loss is computed from the predicted probability of the fake IR image belonging to its ground-truth identity; the ground-truth identity of the fake IR image is the same as that of the original RGB image.
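For reference, standard forms of these two constraints on a fake IR image (our notation; not necessarily identical to the paper's equations (14) and (15)) are:

```latex
% Cross-entropy identity loss and margin-based triplet loss [28]; f_a, f_p, f_n
% are anchor/positive/negative features, d a distance, m the margin.
\begin{align}
\mathcal{L}_{id} &= -\log p\big(y \mid x_{v\to i}\big),\\
\mathcal{L}_{tri} &= \big[\,d(f_a, f_p) - d(f_a, f_n) + m\,\big]_{+} .
\end{align}
```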

Although the generated images can reduce cross-modality differences, there are still large intramodality differences caused by lighting, human pose, and view. We therefore minimize the distance between fake IR images and real IR images in a shared space via identity-based classification and triplet losses, computed over the union of the fake and real IR images, where the classification loss again uses the predicted probability that an input belongs to its ground-truth identity. In summary, the overall loss of our model is a weighted combination of the adversarial losses in equations (1) and (2) and the classification and triplet losses in equations (14) and (15), with corresponding weighting coefficients.
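The overall objective can then be read as a weighted sum of these terms; the weights are written symbolically here because their values are not reproduced above:

```latex
% Weighted combination of generation and ReID losses (weights \lambda_k symbolic).
\begin{equation}
\mathcal{L} = \lambda_{1}\,\mathcal{L}_{adv}^{G} + \lambda_{2}\,\mathcal{L}_{adv}^{D}
  + \lambda_{3}\,\mathcal{L}_{id} + \lambda_{4}\,\mathcal{L}_{tri} + \lambda_{5}\,\mathcal{L}_{cyc} .
\end{equation}
```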

4. Experiments

4.1. Datasets and Settings

We evaluate our model on SYSU-MM01 [10]. SYSU-MM01 is a very popular RGB-IR ReID dataset; it contains pedestrian images captured by six cameras, including two infrared cameras (camera3 and camera6) and four visible-light cameras (camera1, camera2, camera4, and camera5). For each pedestrian, there are at least 400 RGB images and IR images with different poses and viewpoints. Among them, 296 IDs are used for training, 99 IDs for validation, and 96 IDs for testing. Following [29], there are two test modes, i.e., all-search mode and indoor-search mode. For the all-search mode, all images are used. For the indoor-search mode, only indoor images from the 1st, 2nd, 3rd, and 6th cameras are used. Both modes employ single-shot and multishot settings, in which 1 or 10 images of a person are randomly selected to form the gallery set. Both modes use IR images as the probe set and RGB images as the gallery set.

Evaluation protocols: we use the cumulative matching characteristics (CMC) and mean average precision (mAP) as evaluation metrics. Following [29], the results on SYSU-MM01 are evaluated using the official code, averaged over 10 repeated random splits of the gallery and probe sets.

Implementation details: we use ResNet-50 [30] pretrained on ImageNet as the CNN backbone, take the output of its pool5 layer as the feature map, and apply average pooling to obtain the feature vector V. We add BNG attention to each layer of residual blocks in ResNet-50 and the MMM module after the third and fourth layers. For the triplet loss, we use an FC layer to map the feature vector V into a 256-dimensional embedding vector. For the classification loss, the classifier takes the feature vector V as input and consists of a 256-dimensional fully connected (FC) layer, followed by batch normalization [25], dropout, and ReLU as the middle layers, and an FC layer with the identity-number logits as the output layer. The dropout rate is set to 0.5. We implement the model in PyTorch; the images are augmented by horizontal flipping, and the batch size is set to 72 (9 people, each with 4 RGB images and 4 IR images). The learning rate of the generator and discriminator modules is set to 0.0002, optimized with the Adam optimizer. The learning rate of the classifier and the embedder is set to 0.2 and that of the CNN backbone to 0.02, optimized with SGD.
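The embedding and classification heads described in this paragraph can be sketched in PyTorch as follows; the module and variable names are ours, and the default identity count of 296 simply mirrors the SYSU-MM01 training split mentioned above.

```python
import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """Embedding head for the triplet loss and classifier head for the
    identity loss, built on the pooled ResNet-50 feature vector V."""
    def __init__(self, feat_dim=2048, embed_dim=256, num_ids=296, p_drop=0.5):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)          # for triplet loss
        self.classifier = nn.Sequential(                     # for identity loss
            nn.Linear(feat_dim, embed_dim),
            nn.BatchNorm1d(embed_dim),
            nn.Dropout(p_drop),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, num_ids),                   # identity logits
        )

    def forward(self, v):                 # v: (B, 2048) pooled feature vector
        return self.embed(v), self.classifier(v)
```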

4.2. Comparison with the Other Methods

In this section, we compare our method with several cross-modality person ReID methods: (1) methods with different structures and loss functions, namely two-stream [10], one-stream [10], zero-padding [10], BCTR [13], BDTR [13], D-HSME [26], and DGD + MSR [12], which learn modality-invariant features and align them in feature space, and (2) cmGAN [6] and JSIA [20], which use generative adversarial networks (GANs) to generate cross-modality IR images and mitigate modality differences in pixel space. The experimental results are shown in Table 1.

In Table 1, we report results under various evaluation protocols, i.e., all-search/indoor-search and single-shot/multishot. First, for the same method, indoor-search performs better than all-search, because images have less background variation in indoor mode and matching is easier. Second, the rank scores of single-shot are lower than those of multishot, but the mAP scores of single-shot are higher than those of multishot. This is because, in multishot mode, there are ten images of a person in the gallery set, while in single-shot mode there is only one. Consequently, under the multishot mode it is much easier to hit one image but harder to hit all images; the situation is reversed under the single-shot mode.

R1, R10, and R20 denote Rank-1, Rank-10, and Rank-20 accuracy, and mAP denotes the mean average precision score; our model shows good performance. Compared with JSIA, our model achieves gains of over on Rank-1 and on mAP in the single-shot setting of all-search mode. In the single-shot setting of indoor-search mode, our model achieves a rank-1 accuracy of and an mAP of . In the multishot setting of indoor search, our model achieves a rank-1 accuracy of and an mAP of , which are higher than JSIA by and , respectively.

4.3. Ablation Study

In this section, we design ablation experiments to test the effectiveness of the BNG module and MMM module. Our ablation experiments are performed on the dataset SYSU-MM01 and use the single-shot setting of all-search mode.

Influence of BNG module: the results of ablation experiments for BNG attention are shown in Table 2. Compared with the baseline model (B), by adding BNG attention, the rank-1 accuracy and mAP are improved by and , proving the effectiveness of BNG attention.

Influence of MMM module: as shown in Table 2, the model with MMM (B + MMM) achieves a rank-1 accuracy of and an mAP of , which are higher than those of the baseline (B) by and , respectively. It is proved that our proposed MMM module has good performance.

4.4. Visualization of Generated Images

For a more intuitive understanding of the generator, we show the learned fake IR images in Figure 5. The first row contains the real RGB images, the middle row the fake IR images produced by the generator, and the last row the real IR images. We can observe that the fake IR images have similar content (e.g., pose and view) and maintain the identity of the corresponding real RGB images while having an IR style. Therefore, the generated fake IR images can bridge the gap between RGB and IR images and reduce cross-modality variation in pixel space.

5. Conclusion

In this paper, we proposed a new pixel and feature alignment network (PFANet) for the RGB-IR ReID task. The model consisted of a feature extractor, a generator, and a joint discriminator. The BNG attention and the MMM module were designed in the feature extraction module. Through these two modules, the model not only mitigated modality differences but also paid attention to channel and global information. The cross-modality IR images were generated by the generator, which could bridge the gap between RGB and IR images and reduce cross-modality variation. Ablation experiments verified the effectiveness of each module. Extensive experiments on the SYSU-MM01 dataset illustrated that our model achieved state-of-the-art performance.

Data Availability

The SYSU-MM01 data used to support the findings of this study have been deposited in the “Rgb-infrared cross-modality person re-identification” repository (http://isee.sysu.edu.cn/project/RGBIRReID.html).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (grant nos. 51906217, 61906168, and 62176237), the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China (grant no. LZJWZ22E090001), the Zhejiang Provincial Natural Science Foundation of China (grant no. LQ20F020024), and the Hangzhou AI Major Scientific and Technological Innovation Project (2022AIZD0061).