Advances in Processing, Mining, and Learning Complex Data: From Foundations to Real-World Applications
Research Article | Open Access
Transferable Feature Representation for Visible-to-Infrared Cross-Dataset Human Action Recognition
Recently, infrared human action recognition has attracted increasing attention because it has advantages over visible light recognition, such as robustness to illumination change and shadows. However, infrared action data remain scarce, which degrades the performance of infrared action recognition. Motivated by the idea of transfer learning, we propose an infrared human action recognition framework that uses auxiliary data from visible light to address the shortage of infrared action data. We first construct a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework to map the infrared data and visible light data into a common feature space, where Kernel Manifold Alignment (KEMA) and a dual aligned-to-generalized encoder (AGE) model are employed to represent the features. Then, a support vector machine (SVM) is trained on both the infrared data and the visible light data and used to classify the features derived from infrared data. The proposed method is evaluated on InfAR, a publicly available infrared human action dataset. To build up auxiliary data, we set up a novel visible light action dataset, XD145. Experimental results show that the proposed method achieves state-of-the-art performance compared with several transfer learning and domain adaptation methods.
Human action recognition aims to recognize an ongoing action from a video clip and has received great attention in recent years due to its wide applications, including video surveillance, video labeling, video content retrieval, human-computer interaction, and sports video analysis. Over the past decades, significant progress has been made in action recognition, and most state-of-the-art approaches have focused on visible light videos [7–9]. In addition, many visible light action datasets have been constructed for action recognition, such as KTH, HMDB51, and UCF101. Generally speaking, human action recognition in visible light has been well addressed and successfully applied in some applications. However, illumination change, shadow, background clutter, and occlusion remain great challenges for visible light action recognition.
With the development of sensor technology, human actions can be captured by thermal infrared cameras instead of visible light ones. Compared with visible light action recognition, infrared action recognition can mitigate the aforementioned challenges. For example, infrared thermal imaging is robust to illumination change because it captures humans well under poor lighting conditions, when a person can barely be seen in visible light videos; this is very useful for night surveillance or human-computer interaction (HCI) in dim environments. In addition, as the temperatures of shadows, background clutter, and occluding objects are relatively low compared with those of humans or moving objects, these challenges are well suppressed in infrared videos. With these properties, infrared action recognition can be adopted in more applications and can outperform visible light action recognition under such conditions. Therefore, infrared action recognition may become an active topic in computer vision.
Infrared and visible action data lie in different feature spaces, and traditional approaches for visible light action recognition cannot be directly applied to infrared action recognition because of the modality gap between them. However, methods for infrared action recognition are limited. Furthermore, InfAR is so far the only publicly available infrared dataset for action recognition. As a result, the performance of infrared action recognition in previous works is preliminary and leaves considerable room for improvement. To address these issues, if a large number of previously annotated videos from various visible light video datasets can be transferred to the infrared domain for recognition, a considerable amount of time-consuming human effort, such as collecting and hand labeling a large number of infrared action videos, can be saved. In addition, as infrared and visible light videos may contain complementary information, infrared action recognition performance can be improved if the knowledge from visible light and infrared data is properly integrated. Nevertheless, there are at least two obstacles to integrating these two kinds of data. First, infrared and visible light videos are captured by different sensors; the strong modality gap between them will degrade recognition performance unless an effective transferable feature representation is available. Second, in real-world scenarios, infrared videos are usually limited while visible ones are abundant, and this imbalanced data distribution will also degrade classification performance.
To tackle these problems, a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework is proposed for the infrared action recognition task in this paper. More specifically, we focus on adapting, aligning, and generalizing representations from different domains to a single common feature space in order to bring the original target domain (infrared action data) and the auxiliary source domain (visible light action data) into the same feature space. We then learn a single classifier in that semantically meaningful aligned and generalized feature space across datasets. In this way, the modality gap between the two datasets is reduced. To make better use of the data in the generalized feature space, we adopt a semisupervised technique so that both the labeled and unlabeled data are considered in our method. In more detail, Kernel Manifold Alignment (KEMA) is adapted to cross-dataset action recognition to generate aligned representations, and then cross-domain generalized features are learnt by training two novel aligned-to-generalized encoders (AGE) on the source and target datasets in parallel. To build up source domain data, we set up a new visible light action dataset called XD145. The main contributions of this paper can be summarized as follows:
(i) We propose a novel Cross-Dataset Feature Alignment and Generalization (CDFAG) framework to address the visible-to-infrared action recognition problem. It efficiently reduces the modality gap across datasets and generates aligned and generalized feature representations in a common space with low intraclass diversity and high interclass variance.
(ii) To the best of our knowledge, this is the first time Kernel Manifold Alignment (KEMA) has been applied to infrared action recognition to generate aligned representations in a common latent space.
(iii) We design a novel aligned-to-generalized encoder (AGE) model to learn generalized feature representations after feature alignment by KEMA.
(iv) We achieve state-of-the-art results in visible-to-infrared action recognition compared with several transfer learning, domain adaptation, and deep learning based methods.
(v) Since only a limited number of action videos from existing benchmark visible light datasets share the class labels of the InfAR dataset, we construct a new visible light action dataset called XD145 to build up auxiliary source domain data. This dataset could also serve as a benchmark visible light action dataset.
The rest of the paper is organized as follows. In Section 2, we review some background and related works. In Section 3, we explain details of our proposed method. Section 4 presents the experimental results of our proposed method on visible-to-infrared action recognition and cross-dataset action recognition, and finally Section 5 draws the conclusion and future research lines.
2. Related Work
In this section, we present the background and related works. We briefly review the concepts, methods in transfer learning and domain adaptation, their benefits in cross-domain action recognition, and the development status of infrared action recognition.
2.1. Transfer Learning and Domain Adaptation
The classical pattern recognition and machine learning tasks [21–23] mainly adopt a robust classifier learnt from annotated training data and assume that the testing data and the training data belong to the same feature space or distribution. However, this assumption is unrealistic in real-world applications because of the high cost of manually labeling training samples and because of environmental restrictions. Therefore, sufficient training data that share the same feature space or distribution with the testing data cannot always be guaranteed, even when feature selection methods [24, 25] are used, because these do not consider the distribution gap. In this case, the potential discriminability of the trained model can be limited by the insufficient training data. Of the several schools of thought addressing this problem, two prominent ones are transfer learning and domain adaptation. Transfer learning methods are closely related but not equivalent to domain adaptation: transfer learning aims to transfer knowledge from a source domain to a target domain, while domain adaptation methods essentially solve transfer learning problems. Surveys of the field show that the type of knowledge being transferred can be roughly classified into four categories: instance transfer, feature-representation transfer, parameter transfer, and relational-knowledge transfer. Our proposed method falls into feature-representation transfer, since it adapts the representations from different domains to a single common latent space. The literature on feature-representation transfer can be roughly divided into three kinds of adaptation problems, namely supervised, unsupervised, and semisupervised adaptation, depending on the availability of labels in the different domains.
Semisupervised domain adaptation has attracted much attention in recent years. For example, a reconstruction-based domain adaptation method called latent sparse domain transfer (LSDT) was proposed for visual categorization of heterogeneous data via subspace learning and sparse representation. An $\ell_{2,1}$-norm based discriminative robust kernel transfer learning (DKTL) method was proposed to address the distribution mismatch problem of image classification across domains. Although the methods in [28, 29] achieve good performance in cross-domain image classification, to the best of our knowledge, the more challenging problem of visible-to-infrared cross-dataset action recognition has not been studied. The extreme learning machine has been used to address the visual knowledge adaptation problem for video event recognition and object recognition. It should be noted that the proposed method is based on feature-representation domain adaptation, which is essentially different from the extreme learning machine methods, which are classifier-based domain adaptation approaches. Apart from the methods mentioned above, manifold alignment is an important kind of semisupervised domain adaptation, which concurrently matches the corresponding samples and preserves the geometry of each domain through the graph Laplacian. Aligning data manifolds boils down to finding projections to a common latent space. The semisupervised Kernel Manifold Alignment (KEMA) method has been successfully applied to multimodal visual object recognition, multisubject facial expression recognition, and multitemporal remote sensing image classification. Nevertheless, KEMA has not been applied to cross-domain human action recognition. Therefore, we study the effectiveness of KEMA and adapt it to visible-to-infrared action recognition to obtain aligned feature representations across datasets. The main focus of our work is on the use of KEMA for feature alignment in visible-to-infrared action recognition, an application that earlier KEMA work did not address.
2.2. Cross-Domain Action Recognition via Transfer Learning
With the development of action recognition, applying transfer learning to action recognition datasets generated by different sensors, such as visible light cameras, infrared cameras, RGB-D cameras, wearable sensors, or other sensor modalities, has received great interest in recent years. As video sequences are the most common type of action data, differences in how the camera is operated and deployed to capture actions lead to different issues, for example, various camera viewing angles, cluttered backgrounds, illumination changes, and different light spectra such as the visible and infrared spectra, all of which contribute to significant variance in the captured videos. Therefore, action recognition, and especially cross-domain action recognition, is a challenging problem.
Transferring knowledge for cross-view action recognition is a prevailing line of work. For example, Zheng et al. proposed learning a pair of dictionaries simultaneously from video pairs taken at different views to encourage each video pair to have the same sparse representation. Zhang et al. proposed a linear transformation that maps the source view to the target view via a virtual path. Wu et al. proposed a method to discover a discriminative common representation space where source and target views are linked and knowledge is transferred between them. Sui et al. introduced two different projection matrices to map the action data from two different views into a common space with low intraclass diversity and high interclass variance while reducing the mismatch between them. Zu and Zhang introduced a method called Canonical Sparse Cross-view Correlation Analysis to address the multiview feature extraction problem. Different from the above-mentioned cross-view action recognition problems, we tackle the visible-to-infrared action recognition problem, which is essentially a cross-dataset action recognition problem.
In recent works about cross-dataset action recognition, Bian et al.  proposed a transfer topic model (TTM) which utilized information from the auxiliary domain to assist recognition tasks in the target domain. Zhu and Shao  introduced a weakly supervised cross-domain dictionary learning (WSCDDL) approach which learns a reconstructive, discriminative, and domain-adaptive dictionary pair and the corresponding classifier parameters to address cross-domain image classification, action recognition, and event recognition problems. Tang et al.  improved the accuracy of action recognition in RGB videos by activating the borrowing of visual knowledge across different video modalities such as RGB videos, the depth maps, and the skeleton data of actions. Liu et al.  proposed a simple to complex action transfer learning model (SCA-TLM) to leverage the abundant labeled simple actions to improve the performance of complex action recognition. Although these works can achieve promising results in their related fields, visible-to-infrared action recognition has not attracted much attention until now.
With the revival of neural networks in recent years, many neural network based transfer learning methods have been proposed as well. Kan et al. proposed the bishifting autoencoder network (BAE) to alleviate the discrepancy between source and target domains and evaluated its effectiveness in face recognition. Xu et al. tackled the cross-dataset action recognition problem by training a pair of many-to-one encoders in parallel to map raw features from the source and target datasets to the same space. Although the dual many-to-one encoder can generalize features well across datasets, it requires a large number of labeled training samples from both source and target datasets to learn domain-invariant features and does not use auxiliary domain data as an aid. In addition, the inputs of the encoders are raw action features, named action bank features, and no feature alignment is performed in advance. Different from the above-mentioned neural network based transfer learning methods, we take both feature alignment and auxiliary domain data into consideration and propose a novel aligned-to-generalized encoder (AGE) model to map the aligned feature representations to the same generalized feature space with low intraclass diversity and high interclass variance.
2.3. Infrared Action Recognition
Most current research efforts in action recognition have been devoted to visible light videos, while infrared action recognition has attracted much less attention. Recently, however, increasing efforts have been devoted to infrared action recognition. For example, Han and Bhanu proposed an efficient spatiotemporal representation for human repetitive action recognition under thermal infrared scenarios. Han and Bhanu also introduced a hierarchical scheme that combines color and thermal images to improve human silhouette detection. Eum et al. used HOG features and a support vector machine to realize infrared action recognition at night. However, these works focus on simple actions in relatively simple environments with limited infrared data. In addition, there was no publicly available infrared action recognition dataset until Gao et al. built the first one, called InfAR. In that work, state-of-the-art action recognition pipelines, including widely used low-level local descriptors, were evaluated on the InfAR dataset. Gao et al. then extended their previous work and used several state-of-the-art pipelines based on low-level features and deep convolutional neural networks to evaluate their new infrared action recognition dataset (InfAR). However, the best reported recognition accuracy is relatively low and leaves considerable room for improving performance on the InfAR dataset.
Transfer learning has seldom been applied to infrared action recognition. One example is the work of Zhu and Guo, who proposed applying the adaptive support vector machine (A-SVM) to adapt an existing visible light action classifier to classify infrared actions and achieved preliminary results on their own dataset. Although the A-SVM based method performs better than direct matching, A-SVM is essentially a classifier-transfer method that does not consider the max-margin property of the adapted classifier on target instances; therefore, it suffers accuracy degradation as a result of overfitting.
Our proposed method differs from the above-mentioned approaches in that it more comprehensively projects and aligns data from different domains in a nonlinear way through kernelization to generate aligned representations, and it learns dual aligned-to-generalized encoders to obtain cross-domain generalized features while considering both discriminability and domain adaptability at the same time. In our proposed CDFAG, the classifier learned across the source and target domains becomes more discriminative against the modality gap because it integrates both source and target domain knowledge, whereas a majority of previous transfer learning methods focus on the incomplete target domain without using data from other domains as an aid to improve the original recognition system. We detail our proposed method in Section 3.
3. The Proposed Method
In this section, we detail our proposed CDFAG. An overview of the CDFAG is presented in Figure 1. Actually, the proposed CDFAG is semisupervised as both the labeled and the unlabeled data are used in source and target training sets. Our proposed CDFAG consists of three stages. In the first stage, feature extraction and encoding are accomplished on both the source and target datasets, where improved dense trajectories (iDTs) features are extracted, encoded, and reduced to a low-dimensional subspace. In the second stage, aligned features of source and target domains are generated by Kernel Manifold Alignment, then a pair of aligned-to-generalized encoders are trained on the source and target datasets in parallel guided by the centroids of training aligned instances from each class, and after that the output values of the encoders are extracted as the ultimate generalized representations. Finally, a support vector machine is built on the generalized features extracted from both the source and target datasets and then used to classify the new features extracted from unseen samples of the target dataset.
3.1.1. Feature Extraction and Encoding
In this paper, we choose improved dense trajectories (iDTs) features with trajectory shape, HOF, MBHx, and MBHy as the low-level action video representation. The total length of the feature vector is 330. Specifically, we use the implementation released on the website of Wang (https://lear.inrialpes.fr/people/wang/improved_trajectories/) for iDTs with the default parameter setting. For iDTs, a large number of local trajectory descriptors may lead to high computational complexity and memory consumption. To cope with this issue, we adopt the Locality-constrained Linear Coding (LLC) scheme to represent the iDTs by multiple bases, which yields a lower quantization error while preserving local smooth sparsity. Taking both efficiency and the reconstruction error into consideration, the LLC encoding scheme is applied to the iDTs with 5 local bases, and the codebook size is set to 4000 for all training-testing partitions. Thus, the dimension of the encoded iDTs features is 4000. To reduce the complexity of constructing the codebook, only 200 local iDTs are randomly selected from each video sequence.
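The encoding step described above can be sketched as follows. This is a minimal NumPy illustration of the approximated (nearest-bases) form of LLC, not the authors' implementation: the function names `llc_encode` and `encode_video`, the regularization constant `reg`, and the use of max pooling to obtain one vector per video are assumptions that the text does not specify, and the codebook is assumed to have been learnt beforehand (for example by k-means).

```python
import numpy as np

def llc_encode(descriptors, codebook, k=5, reg=1e-4):
    """Approximated LLC: code each descriptor over its k nearest codebook bases.

    descriptors: (n, d) local features; codebook: (m, d) bases.
    Returns an (n, m) code matrix with at most k non-zeros per row.
    """
    n, m = descriptors.shape[0], codebook.shape[0]
    codes = np.zeros((n, m))
    # Squared Euclidean distances between every descriptor and every basis.
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ codebook.T
        + (codebook ** 2).sum(axis=1)[None, :]
    )
    nearest = np.argsort(d2, axis=1)[:, :k]
    for i in range(n):
        idx = nearest[i]
        z = codebook[idx] - descriptors[i]      # shift the local bases to the descriptor
        cov = z @ z.T                           # local covariance, shape (k, k)
        cov += reg * max(np.trace(cov), 1e-12) * np.eye(k)  # regularize for stability
        w = np.linalg.solve(cov, np.ones(k))
        codes[i, idx] = w / w.sum()             # enforce the sum-to-one constraint
    return codes

def encode_video(descriptors, codebook, k=5):
    """Assumed pooling step: max-pool the LLC codes into one video-level vector."""
    return llc_encode(descriptors, codebook, k).max(axis=0)
```

With a codebook of 4000 bases, `encode_video` would return a 4000-dimensional vector per video, which matches the dimension stated in the text.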
3.1.2. Principal Component Analysis
After LLC encoding, the feature representations are still high dimensional and strongly correlated. To obtain a more compact feature representation, we use principal component analysis (PCA) to preprocess these features. In our method, we retain the top principal components such that the cumulative sum of their eigenvalues covers over 99% of the sum of all eigenvalues. In our experiments, this reduces the feature dimension to between 500 and 600, depending on the dataset.
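As a rough illustration of this selection rule, the sketch below fits PCA with an SVD and keeps the smallest number of components whose eigenvalues reach the requested share of the total. The function names and the `energy` parameter are illustrative, not taken from the paper.

```python
import numpy as np

def fit_pca(X, energy=0.99):
    """Fit PCA on X (n_samples, n_features), keeping enough components
    for the cumulative eigenvalue ratio to reach `energy`."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = s ** 2                                  # proportional to covariance eigenvalues
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_comp = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return mean, vt[:n_comp]                          # components stored as rows

def apply_pca(X, mean, components):
    """Project samples onto the retained principal components."""
    return (X - mean) @ components.T
```

The mean and components would be fitted on training data only and then reused unchanged for test samples.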
3.2. Feature Alignment by Kernel Manifold Alignment
In this section, we detail the feature alignment method based on Kernel Manifold Alignment (KEMA). An illustration of how feature alignment functions is shown in Figure 2.
To fix notation, we consider $K$ input domains. The data instances of each domain belong to $c$ different classes. Let $X^i = [x_1^i, \ldots, x_{n_i}^i] \in \mathbb{R}^{d_i \times n_i}$ represent the $i$th input domain, where $n_i$ is the number of samples in the $i$th domain. The idea of kernelization is to map the input data instance $x$ into a high dimensional Hilbert space $\mathcal{H}$ with the mapping function $\phi(\cdot): x \mapsto \phi(x) \in \mathcal{H}$ such that the mapped data is better suited for solving our problem linearly. The kernel trick is adopted in our method to avoid a high computational load. Therefore, we define a kernel function $\kappa(x_r, x_s) = \langle \phi(x_r), \phi(x_s) \rangle_{\mathcal{H}}$ computing the similarity between mapped instances without having to compute $\phi(\cdot)$ explicitly. Many common types of kernel functions can be adopted in KEMA, such as the RBF kernel, the linear kernel, and the polynomial kernel. In this paper, we use the RBF kernel. Considering multiple data modalities here, we would have to map the $K$ datasets to $K$ Hilbert spaces $\mathcal{H}_i$ of dimension $d_{\mathcal{H}_i}$, with $\phi_i(\cdot): x \mapsto \phi_i(x) \in \mathcal{H}_i$, $i = 1, \ldots, K$.
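For concreteness, an RBF kernel matrix between two sets of samples from the same domain can be computed as in the sketch below. The bandwidth rule (median of the pairwise distances) is an assumption made for this illustration; the text does not state how the bandwidth is chosen.

```python
import numpy as np

def rbf_kernel(A, B, sigma=None):
    """RBF kernel matrix between rows of A (n, d) and rows of B (m, d).

    If sigma is None, a median-distance heuristic is used (an assumption,
    not a setting taken from the paper).
    """
    d2 = (
        (A ** 2).sum(axis=1)[:, None]
        - 2.0 * A @ B.T
        + (B ** 2).sum(axis=1)[None, :]
    )
    d2 = np.maximum(d2, 0.0)                    # clip tiny negatives from round-off
    if sigma is None:
        positive = d2[d2 > 0]
        sigma = np.sqrt(np.median(positive)) if positive.size else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

Each domain gets its own kernel matrix, so the source and target features may have different dimensions.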
3.2.2. Kernel Manifold Alignment (KEMA)
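The KEMA method aims to construct domain-specific projection functions, $f^1, f^2, \ldots, f^K$, with $f^i(x) = F^{i\top}\phi_i(x)$, to project the data in Hilbert space from all domains to a new common latent space, in which the topology of the instances of each domain is preserved, instances from the same class lie close to each other, and instances from different classes lie far from each other. To do so, KEMA aims to find a data projection matrix $F$ that minimizes the following cost function:
$$\{f^1, \ldots, f^K\} = \arg\min_{f^1, \ldots, f^K} \frac{\mu\,\mathrm{TOP} + (1-\mu)\,\mathrm{SIM}}{\mathrm{DIS}}, \tag{5}$$
where TOP, SIM, and DIS denote the topology, class similarity, and class dissimilarity, respectively. $\mu$ is a parameter balancing the contribution of the similarity and the topology terms. As $\mu \in [0, 1]$, we can see that when $\mu > 0.5$, more importance is given to topology and vice versa. The three terms are defined as follows:
(1) Minimizing a topology-preservation term, TOP, which aims to preserve the local topology of each data domain:
$$\mathrm{TOP} = \sum_{i=1}^{K} \sum_{r,s=1}^{n_i} W_g^i(r,s)\, \lVert f^i(x_r^i) - f^i(x_s^i) \rVert^2,$$
where $W_g^i(r,s)$, the entry of the similarity matrix $W_g^i$, represents the similarity of $x_r^i$ and $x_s^i$ and can be computed from the pairwise distances within domain $i$ (for example, with a nearest-neighbor graph). $L_g = D_g - W_g$ is the graph Laplacian matrix issued from $W_g$, the block diagonal matrix of the $W_g^i$, while $D_g$ is the diagonal row sum matrix defined as $D_g(r,r) = \sum_s W_g(r,s)$.
(2) Minimizing a class similarity term, SIM, which encourages instances with the same class label to be close to each other in the new latent space:
$$\mathrm{SIM} = \sum_{i,j=1}^{K} \sum_{r=1}^{n_i} \sum_{s=1}^{n_j} W_s^{ij}(r,s)\, \lVert f^i(x_r^i) - f^j(x_s^j) \rVert^2,$$
where $W_s^{ij}(r,s)$ in the similarity matrix $W_s$ is set to 1 if two instances from domains $i$ and $j$ share the same class label and 0 otherwise (including the case when the label information is not available). The corresponding diagonal row sum matrix is defined as $D_s(r,r) = \sum_s W_s(r,s)$ and the graph Laplacian matrix as $L_s = D_s - W_s$.
(3) Maximizing a class dissimilarity term, DIS, which encourages instances with different class labels to be separated in the new latent space:
$$\mathrm{DIS} = \sum_{i,j=1}^{K} \sum_{r=1}^{n_i} \sum_{s=1}^{n_j} W_d^{ij}(r,s)\, \lVert f^i(x_r^i) - f^j(x_s^j) \rVert^2,$$
where $W_d^{ij}(r,s)$ in the dissimilarity matrix $W_d$ is set to 1 if two instances from domains $i$ and $j$ are from different classes and 0 otherwise (including the case when the label information is not available). The corresponding diagonal row sum matrix is defined as $D_d(r,r) = \sum_s W_d(r,s)$ and the graph Laplacian matrix as $L_d = D_d - W_d$.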
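It is straightforward that the solution of (5) boils down to finding the smallest eigenvalues of the following generalized eigenvalue decomposition:
$$\Phi\,(\mu L_g + (1-\mu) L_s)\,\Phi^\top F = \lambda\, \Phi L_d \Phi^\top F, \tag{6}$$
where $\Phi$ is a matrix containing the matrices $\Phi^i = [\phi_i(x_1^i), \ldots, \phi_i(x_{n_i}^i)]$ in a block diagonal form and $F$ contains the row eigenvectors for the particular domain defined in Hilbert space $\mathcal{H}_i$, where $F = [F^1, F^2, \ldots, F^K]^\top$, $d_{\mathcal{H}} = \sum_{i=1}^{K} d_{\mathcal{H}_i}$, and $\lambda$ denotes the eigenvalues of the generalized eigenvalue decomposition problem. Note that $\Phi$ and $F$ are high dimensional and cannot be explicitly computed. Therefore, the eigenvectors are expressed as a linear combination of mapped instances using the Riesz representation theorem, $F^i = \Phi^i \Lambda^i$, and in matrix notation $F = \Phi \Lambda$. In (6), by substituting $F = \Phi\Lambda$, multiplying both sides by $\Phi^\top$ on the left, and replacing the dot products with the corresponding kernel matrices, $\mathbf{K}^i = \Phi^{i\top} \Phi^i$, the final solution is obtained:
$$\mathbf{K}\,(\mu L_g + (1-\mu) L_s)\,\mathbf{K} \Lambda = \lambda\, \mathbf{K} L_d \mathbf{K} \Lambda,$$
where $\mathbf{K}$ is a matrix containing the kernel matrices $\mathbf{K}^i$ in a block diagonal form. The block structure of the projection matrix $\Lambda$ is as follows:
$$\Lambda = \begin{bmatrix} \boldsymbol{\Lambda}^1 \\ \Lambda^2 \\ \vdots \\ \Lambda^K \end{bmatrix}, \qquad \Lambda^i \in \mathbb{R}^{n_i \times n}, \quad n = \sum_{i=1}^{K} n_i,$$
where the eigenvectors for the first domain are highlighted in bold.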
Once the projection matrix $\Lambda$ is obtained, the instance $x_j^i$ from the $i$th domain can be projected to the new latent space by first mapping it to its corresponding kernel form and then applying the corresponding projection vector defined therein:
$$P(x_j^i) = \Lambda^{i\top} \mathbf{k}_j^i,$$
where $\mathbf{k}_j^i$ is a vector of kernel evaluations between instance $x_j^i$ and all instances from domain $i$ used to define the projections $\Lambda^i$. Similar to eigenvalue decomposition based methods, the data can be projected onto a lower-dimensional subspace by simply preserving the first $d$ columns of $\Lambda^i$, where $n = \sum_{i=1}^{K} n_i$ is the total number of samples involved in the kernel matrices and $d \le n$. In this sense, KEMA leaves some control over the dimensionality of the latent space for feature alignment.
In this paper, the number of input domains is set to $K = 2$ because there is only one target domain (infrared dataset) and one source domain (visible light dataset) in our experiments.
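The alignment procedure of this section can be sketched end to end as below. This is an illustrative NumPy reading of the kernelized eigenproblem, not the reference KEMA code: the function names, the k-nearest-neighbor topology graph, the convention that label -1 marks an unlabeled sample, and the small ridge `reg` added to the right-hand matrix (which is otherwise singular) are all assumptions made for the sketch.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal row-sum matrix."""
    return np.diag(W.sum(axis=1)) - W

def build_laplacians(features, labels, n_neighbors=5):
    """features: list of (n_i, d_i) arrays; labels: list of int arrays (-1 = unlabeled)."""
    sizes = [len(X) for X in features]
    n = sum(sizes)
    y = np.concatenate(labels)
    # Topology graph: symmetric kNN graph inside each domain (block diagonal).
    Wg = np.zeros((n, n))
    start = 0
    for X, n_i in zip(features, sizes):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(d2, np.inf)            # a sample is not its own neighbor
        k = min(n_neighbors, n_i - 1)
        nn = np.argsort(d2, axis=1)[:, :k]
        block = np.zeros((n_i, n_i))
        block[np.repeat(np.arange(n_i), k), nn.ravel()] = 1.0
        Wg[start:start + n_i, start:start + n_i] = np.maximum(block, block.T)
        start += n_i
    # Class graphs: a pair counts only when both samples are labeled.
    both = (y >= 0)[:, None] & (y >= 0)[None, :]
    same = y[:, None] == y[None, :]
    Ws = (both & same).astype(float)
    Wd = (both & ~same).astype(float)
    return laplacian(Wg), laplacian(Ws), laplacian(Wd)

def kema_fit(K, Lg, Ls, Ld, mu=0.5, n_dims=10, reg=1e-6):
    """Solve K (mu Lg + (1 - mu) Ls) K v = lambda K Ld K v for the smallest eigenvalues.

    K is the block-diagonal kernel matrix over all domains.
    """
    n = K.shape[0]
    A = K @ (mu * Lg + (1.0 - mu) * Ls) @ K
    B = K @ Ld @ K
    A, B = 0.5 * (A + A.T), 0.5 * (B + B.T)     # remove round-off asymmetry
    B = B + reg * (np.trace(B) / n) * np.eye(n) # ridge: K Ld K is singular
    C = np.linalg.cholesky(B)                   # B = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T                     # equivalent standard symmetric problem
    eigvals, V = np.linalg.eigh(0.5 * (M + M.T))  # eigenvalues in ascending order
    Lam = C_inv.T @ V                           # map back to generalized eigenvectors
    return eigvals[:n_dims], Lam[:, :n_dims]

def kema_project(Lam, K_new, offset):
    """Project m samples of one domain; K_new is (n_i, m) kernel evaluations
    against that domain's training samples, which start at row `offset` of Lam."""
    n_i = K_new.shape[0]
    return K_new.T @ Lam[offset:offset + n_i]
```

In the two-domain setting of this paper, the source samples would occupy the first block of the kernel matrix and the target samples the second, so target test samples are projected with `offset` equal to the number of source training samples.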
3.3. Feature Generalization by Aligned-to-Generalized Encoders
Due to the large modality gap between infrared and visible light data, a unified subspace may not exist when only KEMA is used to align features from both domains. More specifically, KEMA assumes that a unified aligned space for both the source and the target domain exists. This assumption is too strict and may be invalid in some cases. Therefore, we relax it and learn transferable feature representations across the infrared and visible domains in a hierarchical way. Given the aligned representations, the aligned-to-generalized encoders (AGE) model is adopted to force the outputs to be identical for input aligned instances from the same action class. The AGE is trained under the guidance of a single target representation per action class, so that intraclass diversity is minimized and generalized representations are generated across datasets. In this section, we present the architecture and details of the proposed AGE.
3.3.1. Target Output Generation
The centroid of each action class is used as the target output, which is computed by averaging over the aligned feature representations of the instances in each class. Let $x_{s,i}^c$ and $x_{t,j}^c$ denote the aligned representations of the $i$th and $j$th training instances from class $c$ in the source and target dataset, respectively; the target output for instances from class $c$ is then defined as follows:
$$t^c = \frac{1}{N_s^c + N_t^c} \left( \sum_{i=1}^{N_s^c} x_{s,i}^c + \sum_{j=1}^{N_t^c} x_{t,j}^c \right),$$
where $N_s^c$ and $N_t^c$ denote the total number of instances of class $c$ from the source and target datasets.
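The target outputs can be computed with a few lines of NumPy. The sketch assumes that the centroid of a class pools the labeled aligned instances of that class from both datasets, and that label -1 marks unlabeled samples; the function names are illustrative.

```python
import numpy as np

def class_centroids(Zs, ys, Zt, yt):
    """Per-class centroid over labeled aligned instances from source and target.

    Zs, Zt: (n, d) aligned features; ys, yt: int labels, -1 for unlabeled.
    Returns a dict mapping class id to a (d,) centroid.
    """
    Z = np.vstack([Zs, Zt])
    y = np.concatenate([ys, yt])
    return {int(c): Z[y == c].mean(axis=0) for c in np.unique(y[y >= 0])}

def target_outputs(y, centroids):
    """Stack the centroid of each labeled instance's class as its training target."""
    return np.stack([centroids[int(c)] for c in y])
```

Both encoders then receive the same target for a given class, whichever dataset the input instance comes from.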
3.3.2. Encoders Training
At the feature generalization stage, a pair of aligned-to-generalized encoders are trained on the source and target datasets in parallel. For instances with the same action class label, the target outputs of the two aligned-to-generalized encoders are identical, which forces the two encoders to generalize over varying inputs and guides the outputs for instances of the same class to be similar. In this sense, aligned instances with the same class label from the two datasets are mapped to the same feature space [36, 53] with low intraclass diversity, and generalized and discriminative representations of the instances are generated across datasets.
In this section, we demonstrate the benefits of using a pair of aligned-to-generalized encoders for feature generalization in visible-to-infrared action recognition. The architecture of the aligned-to-generalized encoders (AGE) is illustrated in Figure 3. The AGE are essentially fully connected feedforward neural networks with an input layer, a hidden layer, and an output layer. Although intuition suggests that a deeper network architecture with more than one hidden layer can learn more robust and discriminative representations, it has been shown that carefully configured and trained single-hidden-layer neural networks can also achieve good performance in many tasks [19, 54], which is validated in our experiments as well. In addition, the architectures of the AGE trained on the source and target datasets are identical.
After feature extraction and encoding, the training videos from both datasets are represented by iDTs features encoded by LLC, which are then reduced to a low-dimensional subspace via PCA. Then, the KEMA method is applied to map the raw feature representations of instances from the two datasets to the common latent space to obtain aligned features, as shown in Figure 3. For simplicity, we denote one training aligned instance as $x$ with dimension $M$. Therefore, both aligned-to-generalized encoders have an input layer of size $M$, a hidden layer of size $H$, and an output layer of size $L$, where $L$ is the length of the ultimate output feature vector and $H$ and $L$ are user-defined parameters. In this work, we empirically restrict $L$ to be equal to $M$ and experiment with hidden layer sizes ranging from 50 to 500 in Section 4.4.3.
Although both aligned-to-generalized encoders have the same architecture, they have different parameters. The goal of training the two encoders in parallel is to find a mapping between training aligned instances and target generalized outputs. Take the source encoder as an example; the mapping is accomplished via $f$ and $g$, which are defined as follows:
$$h = f(x) = \sigma(W_1 x + b_1), \qquad o = g(h) = \sigma(W_2 h + b_2), \tag{11}$$
where $\sigma$ is the activation function and $W_1$, $b_1$, $W_2$, and $b_2$ are the parameters for $f$ and $g$, respectively. The encoder parameters are indicated in Figure 3. Given a hidden layer size $H$, the weights and biases in both encoders are initialized with random numbers drawn from a uniform distribution ranging between 0 and 1. The sigmoid function was used as the activation function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
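Each encoder minimizes an objective function of the following form:
$$J(W_1, b_1, W_2, b_2) = \frac{1}{2N} \sum_{i=1}^{N} \lVert g(f(x_i)) - t_i \rVert^2,$$
where $f$ and $g$ are defined in (11), $W_1$, $b_1$, $W_2$, and $b_2$ are the weights and biases of the encoder, $t_i$ is the target output of the class of the $i$th instance $x_i$, and $N$ is the number of training instances.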
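Stochastic gradient descent was used to minimize the objective function by iteratively updating the weights and biases. For example, $W_1$ is updated as follows:
$$\Delta W_1^{(\tau)} = -\eta \frac{\partial J^{(\tau)}}{\partial W_1} + \alpha\, \Delta W_1^{(\tau-1)}, \qquad W_1^{(\tau+1)} = W_1^{(\tau)} + \Delta W_1^{(\tau)},$$
where $\Delta W_1^{(\tau)}$ is the update to $W_1$ at the $\tau$th iteration, $J^{(\tau)}$ is the objective function value at the $\tau$th iteration, $\eta$ denotes the learning rate, and $\alpha$ denotes the momentum. $b_1$, $W_2$, and $b_2$ are updated in a similar way.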
In our method, the trained values of the output layer are extracted as the ultimate generalized features. When the objective function is minimized, the output values of the encoder are approximate solutions staying close to the target outputs instead of being identical to the predefined target outputs. Therefore, the final features extracted from aligned instances of the same action class, from both the source and the target datasets, would lie in the same cluster with small intraclass diversity and high interclass variance. This phenomenon will be illustrated in Section 4.4.2.
To make it clear, the proposed CDFAG is summarized in Algorithm 1.
Because infrared data are scarce, directly using a neural network for classification may lead to overfitting. We therefore use the AGE only as a feature extractor rather than as a classifier. In our experiments, we use a multiclass support vector machine (SVM) as the classifier rather than softmax in the AGE, because the SVM classifier has been shown to obtain better results [55, 56]. To perform visible-to-infrared action classification, an SVM classifier with an RBF kernel is trained on generalized features extracted from both the visible light (source) and infrared (target) datasets and tested on generalized features extracted from unseen instances of the infrared (target) dataset, as shown at the bottom of Figure 1. In Section 4, we show that this classification scheme can effectively classify unseen action data in the target dataset, which can be attributed to the successful knowledge transfer from the source domain to the target domain by our proposed CDFAG. The testing procedure of our proposed CDFAG is summarized in Algorithm 2.
4. Experimental Results
In this section, we present our experimental results on the benchmark dataset. We will start with describing the individual datasets, followed by details of our experimental settings.
The InfAR dataset (https://sites.google.com/site/gaochenqiang/publication/infrared-action-dataset/) and the XD145 dataset (the dataset will be available at: https://sites.google.com/site/yangliuxdu/) are used for the visible-to-infrared action recognition task, where the XD145 dataset is used as the source domain and the InfAR dataset is used as the target domain.
(A) InfAR. The InfAR dataset consists of 600 video sequences captured by infrared thermal imaging cameras. As shown in Figure 4, the dataset includes 12 action classes: fight, handclapping, handshake, hug, jog, jump, punch, push, skip, walk, wave 1 (one-hand wave), and wave 2 (two-hand wave). Each action class has 50 video clips, and each clip lasts 4 s on average. The frame rate is 25 fps and the resolution is 293 × 256. Each video contains one or multiple actions performed by one or several persons. Some of them are interactions between multiple persons, as shown in Figure 4.
(B) XD145. We build a visible light action dataset, named XD145, following the approach used to construct action recognition datasets from the visible spectrum. To match the target domain, XD145 has the same action categories as InfAR, as shown in Figure 5. The XD145 action dataset consists of 600 video sequences captured by visible light cameras, with 50 video clips for each action class. All actions were performed by 30 different volunteers. Each clip lasts 5 s on average. The frame rate is 25 fps and the resolution is 320 × 240. As shown in Figure 5, background, pose, and viewpoint variations are considered when constructing the dataset in order to make it more representative of real-world scenarios.
Figure 6 illustrates sample actions from the InfAR and the XD145 datasets. As the figure shows, these action videos are captured in two different light spectra and exhibit large intraclass variance and a considerable modality gap.
4.2. Experimental Settings
In all experiments, each dataset is randomly split into training and testing sets. For evaluation, the average precision (AP) is used, which is the average of the recognition precisions of all actions. For each evaluation, we repeat the experiments with the same setting 5 times and report the average accuracy. In KEMA, we use RBF kernels with the bandwidth fixed as half of the median distance between the samples of the specific video (labeled and unlabeled). By doing so, we allow different kernels in each domain, thus tailoring the similarity function to the data structure observed. The trade-off parameter is set according to the experimental analysis in Section 4.4.3. To build the graph Laplacians, we use a k-NN graph. We validate the optimal dimension of the common latent space as well as the optimal parameters of the SVM classifier. Since the RBF kernel is adopted in KEMA, LibSVM is used to train the classifiers in our experiments with an RBF kernel as well. The optimal parameters are determined by 5-fold cross-validation. When performing stochastic gradient descent for encoder training, the learning rate and momentum are those in (14). Each encoder is trained for about 1000 iterations. All experiments are conducted with MATLAB R2016b on a 64-bit Windows 10 PC with a 4-core 3.60 GHz Intel i7 CPU and 16 GB of memory.
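The bandwidth rule used for the KEMA kernels (half of the median distance between samples, computed separately per domain) can be illustrated as follows. This is a hedged sketch rather than the authors' code: the Euclidean distance, the kernel parameterization exp(−d² / (2σ²)), the toy sample points, and the function names are all assumptions made for the example.

```python
import math
from itertools import combinations
from statistics import median


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def median_heuristic_bandwidth(samples):
    """Half of the median pairwise distance between the given samples."""
    distances = [euclidean(a, b) for a, b in combinations(samples, 2)]
    return 0.5 * median(distances)


def rbf_kernel(a, b, bandwidth):
    """RBF kernel under the assumed form exp(-d^2 / (2 * bandwidth^2))."""
    return math.exp(-euclidean(a, b) ** 2 / (2.0 * bandwidth ** 2))


# Toy samples standing in for source-domain and target-domain features.
source_samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
target_samples = [(0.0, 0.0), (0.0, 2.0), (0.0, 4.0)]

# Each domain gets its own bandwidth, and hence its own kernel.
sigma_source = median_heuristic_bandwidth(source_samples)
sigma_target = median_heuristic_bandwidth(target_samples)
```

Computing the bandwidth per domain is what allows the similarity function to adapt to the scale of each dataset's features.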
4.3. Action Recognition Results with Raw Features
As a baseline, we evaluate the original features separately on our newly constructed visible light action dataset XD145 and on the infrared action dataset InfAR. For each evaluation, we repeat the experiments with the same settings 5 times. Each time, 25/20/15/10/5 of the 50 samples in each action category are randomly selected as the training set, and samples drawn from the remaining ones are used as the testing set. The averages are reported as the final result, as shown in Table 1.
From Table 1, we observe that the recognition accuracies on both datasets grow as the number of training samples increases. This agrees with the usual intuition in action recognition that good results require an adequate amount of labeled training samples and discriminative features. Although the same feature is adopted on both datasets, the accuracies on InfAR are much lower than those on XD145 under the same experimental setting. This may be because the videos in the two datasets are captured by different sensors and exhibit different appearance and motion information for the same action class, as shown in Figure 6. Since the iDTs feature is good at describing appearance and motion in visible light action videos, its effectiveness and strength may not be revealed in infrared videos. Therefore, using existing visible light action data as an aid for enhancing infrared action recognition is urgently needed.
4.4. Visible-to-Infrared Action Recognition Results
In this section, we evaluate our proposed method on visible-to-infrared action recognition. Our experiments are mainly divided into four parts. Firstly, a classifier trained on instances from both source and target datasets without feature alignment and generalization is utilized to predict actions in target dataset. Secondly, the classification results with aligned features obtained by KEMA are also provided. Then, we evaluate our proposed CDFAG in visible-to-infrared action recognition. Lastly, we compare our proposed CDFAG with several state-of-the-art methods.
4.4.1. Visible-to-Infrared Action Recognition
To find out whether a modality gap exists between the source and target datasets, we first train a classifier on samples from both source and target datasets, without feature alignment and generalization, to predict actions in the target dataset. We call this method No Adaptation (NA). In addition, the classification results with aligned features obtained by KEMA are provided to show the effectiveness of feature alignment. For each evaluation, we repeat the experiments with the same parameter settings 5 times. Each time, 45/40/30/25/20/15/10/5 samples per action category from the source dataset, combined with 25/20/15/10/5 samples per action category from the target dataset, are randomly selected as the training set, in order to validate the impact of the number of training samples on recognition accuracy. Then, 20 samples per action category from the target dataset are used as the testing set. The averages are reported as the final result, as shown in Table 2.
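The sampling protocol above can be sketched as follows. The sketch assumes that the 20 target test clips per class are disjoint from the target training clips, which the text implies but does not state outright. The `evaluate` callback is a hypothetical placeholder for training a classifier and returning its score, and all function names are illustrative.

```python
import random
from statistics import mean

NUM_CLASSES = 12       # action categories shared by XD145 and InfAR
CLIPS_PER_CLASS = 50   # clips per category in each dataset


def sample_source(n_train, rng):
    """Per class, randomly pick source (visible light) training clip indices."""
    return {
        c: rng.sample(range(CLIPS_PER_CLASS), n_train)
        for c in range(NUM_CLASSES)
    }


def split_target(n_train, n_test, rng):
    """Per class, pick disjoint target (infrared) training and testing indices."""
    split = {}
    for c in range(NUM_CLASSES):
        indices = list(range(CLIPS_PER_CLASS))
        rng.shuffle(indices)
        split[c] = (indices[:n_train], indices[n_train:n_train + n_test])
    return split


def run_protocol(evaluate, n_source, n_target, n_test=20, repeats=5, seed=0):
    """Repeat the random split `repeats` times and average the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        source_train = sample_source(n_source, rng)
        target_split = split_target(n_target, n_test, rng)
        # `evaluate` is a hypothetical callback: it would train on the
        # selected source and target clips and score on the target test clips.
        scores.append(evaluate(source_train, target_split))
    return mean(scores)
```

Running `run_protocol` once for each pair of source and target training sizes would fill a grid of averaged scores of the kind reported in Table 2.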